Re: [devel] [PATCH 1 of 5] NTF: Add support cloud resilience for NTF Agent [#1180]

2016-02-12 Thread minh chau
Hi Praveen,

Please find my comments inline [Minh]

Thanks,
Minh

On 12/02/16 22:10, praveen malviya wrote:
> Hi Minh,
>
> Please find some initial comments and questions marked with [Praveen].
>
> One question regarding the approach.
> In most of the changed APIs, from saNtfInitialize() to
> saNtfNotificationReadFinalize(), clients are being recovered. Why was
> recovering them not considered in the MDS callback itself, in the
> function ntfa_update_ntfsv_state(), when the server state changes
> from DOWN to UP? It would save changes in the APIs for recovering the
> clients. Clients are already marked invalid in the mds_callback itself.
>
[Minh] From memory, I did try to recover clients in the MDS callback
(svc_evt) that indicates the server is UP. However, recovery needs to send
messages (reintroducing client id, read id, ...) to the server, which
ends up triggering another MDS enc callback inside the first callback. As
a result, sending the *recovery* messages fails: there are two MDS
callbacks but only one return. To solve this, I would need to start a
thread to recover clients in the background. That would cause another
problem where up calls from the client could be blocked, since the
recovery thread needs to lock the shared resource handles.
Another practical reason: a client can give up sending/reading/receiving
notifications after receiving TRY_AGAIN a couple of times because the
server is down, and then do nothing until it finalizes the handle. The
server may be restarted only after the client has stopped its TRY_AGAIN
loop. In such cases, there is no need to recover the client handles.
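
To make the trade-off concrete, here is a minimal sketch of the "recover lazily inside the API call" approach described above. It is not the patch code: only the `valid` flag and the NTFA_NTFSV_* state names come from the patch, while `ClientHandle`, `ApiResult`, `recover_client()`, `on_server_down()` and `api_entry_check()` are hypothetical names used for illustration. Treating every state other than UP as "retry later" is also an assumption.

```cpp
#include <cassert>

// State names mirror the ntfa_ntfsv_state_t enum added by the patch.
enum NtfsvState {
  NTFA_NTFSV_NONE = 0,
  NTFA_NTFSV_DOWN,
  NTFA_NTFSV_NO_ACTIVE,
  NTFA_NTFSV_NEW_ACTIVE,
  NTFA_NTFSV_UP
};

// Hypothetical return codes; the real agent returns SaAisErrorT values.
enum ApiResult { API_OK, API_TRY_AGAIN, API_BAD_HANDLE };

// Hypothetical, reduced client record; only 'valid' comes from the patch.
struct ClientHandle {
  bool valid = true;         // false once the server no longer knows this client
  int recover_attempts = 0;  // illustration only, counts recovery runs
};

// Hypothetical stand-in for re-sending initialize/subscribe/reader messages.
// It runs in the caller's thread, so no send happens inside an MDS callback.
inline bool recover_client(ClientHandle& c) {
  ++c.recover_attempts;
  c.valid = true;
  return true;
}

// MDS service-event path: only flips the flag, sends nothing.
inline void on_server_down(ClientHandle& c) { c.valid = false; }

// Check performed at the top of each API entry point.
inline ApiResult api_entry_check(NtfsvState server, ClientHandle& c) {
  if (server != NTFA_NTFSV_UP) return API_TRY_AGAIN;  // assumed: retry later
  if (!c.valid && !recover_client(c)) return API_BAD_HANDLE;
  return API_OK;
}
```

A client that never calls the API again after giving up on TRY_AGAIN never triggers `recover_client()`, which matches the second reason given above.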
> Thanks,
> Praveen
>
> On 23-Dec-15 9:32 AM, Minh Hon Chau wrote:
>>   osaf/libs/agents/saf/ntfa/ntfa.h |   31 +-
>>   osaf/libs/agents/saf/ntfa/ntfa_api.c  |  672 
>> +++--
>>   osaf/libs/agents/saf/ntfa/ntfa_mds.c  |6 +-
>>   osaf/libs/agents/saf/ntfa/ntfa_util.c |  465 ++-
>>   4 files changed, 1022 insertions(+), 152 deletions(-)
>>
>>
>> The patch contains support for cloud resilience feature
>> in NTF Agent code. Please refer README.HYDRA for content
>> of the changes
>>
>> diff --git a/osaf/libs/agents/saf/ntfa/ntfa.h 
>> b/osaf/libs/agents/saf/ntfa/ntfa.h
>> --- a/osaf/libs/agents/saf/ntfa/ntfa.h
>> +++ b/osaf/libs/agents/saf/ntfa/ntfa.h
>> @@ -91,6 +91,7 @@ typedef struct ntfa_filter_hdl_rec {
>>   typedef struct subscriberList {
>>   SaNtfHandleT subscriberListNtfHandle;
>>   SaNtfSubscriptionIdT subscriberListSubscriptionId;
>> +ntfsv_filter_ptrs_t filters; /* remember the filters used by 
>> this subscriber */
>>   struct subscriberList *prev;
>>   struct subscriberList *next;
>>   } ntfa_subscriber_list_t;
>> @@ -100,6 +101,10 @@ typedef struct ntfa_reader_hdl_rec {
>>   unsigned int reader_id;/* handle value returned by NTFS for 
>> this client */
>>   SaNtfHandleT ntfHandle;
>>   unsigned int reader_hdl;/* READER handle from handle mgr */
>> +
>> +ntfsv_filter_ptrs_t filters; /* remember the filters used by 
>> this reader */
>> +SaNtfSearchCriteriaT searchCriteria; /* remember the 
>> searchCriteria for recovery */
>> +
>>   struct ntfa_reader_hdl_rec *next;/* next pointer for the 
>> list in ntfa_cb_t */
>>   struct ntfa_client_hdl_rec *parent_hdl;/* Back Pointer to 
>> the client instantiation */
>>   } ntfa_reader_hdl_rec_t;
>> @@ -114,24 +119,35 @@ typedef struct ntfa_client_hdl_rec {
>>   ntfa_reader_hdl_rec_t *reader_list;
>>   SYSF_MBX mbx;/* priority q mbx b/w MDS & Library */
>>   struct ntfa_client_hdl_rec *next;/* next pointer for the 
>> list in ntfa_cb_t */
>> +bool valid;/* handle is valid if it's known by NTF 
>> server, used for headless hydra */
>> +SaVersionT version; /* the API version is being used by client, 
>> used for recover after headless */
>>   } ntfa_client_hdl_rec_t;
>>
>>   /*
>>* The NTFA control block is the master anchor structure for all NTFA
>>* instantiations within a process.
>>*/
>> +typedef enum {
>> +NTFA_NTFSV_NONE = 0,
>> +NTFA_NTFSV_DOWN,
>> +NTFA_NTFSV_NO_ACTIVE,
>> +NTFA_NTFSV_NEW_ACTIVE,
>> +NTFA_NTFSV_UP
>> +}ntfa_ntfsv_state_t;
>> +
>>   typedef struct {
>>   pthread_mutex_t cb_lock;/* CB lock */
>>   ntfa_client_hdl_rec_t *client_list;/* NTFA client handle 
>> database */
>>   ntfa_reader_hdl_rec_t *reader_list;
>>   MDS_HDL mds_hdl;/* MDS handle */
>>   MDS_DEST ntfs_mds_dest;/* NTFS absolute/virtual address */
>> -int ntfs_up;/* Indicate that MDS subscription
>> - * is complete */
>> +
>>   /* NTFS NTFA sync params */
>>   int ntfs_sync_awaited;
>>   NCS_SEL_OBJ ntfs_sync_sel;
>>   SaUint32T ntf_var_data_limit;/* max allowed 
>> variableDataSize */
>> +/* NTF Server state */
>> +ntfa_ntfsv_state_t ntfa_ntfsv_state;
>>   } ntfa_cb_t;
>>
>>   /* ntfa_saf_api.c */
>> @@ -149,7 +165,7 @@ extern void ntfsv_ntfa_evt_free(struct n
>>
>>   /* ntfa_init.c */
>>   e

Re: [devel] [PATCH 0 of 5] Review Request for amf: Add support for cloud resilience [#1620] V2

2016-02-15 Thread minh chau
Hi Nagu,

One thing that would help us reproduce your problems: could you attach 
to the ticket the models you are using for the tests?

Thanks,
Minh

On 15/02/16 19:32, Nagendra Kumar wrote:
> Hi Gary,
>   I am using the patch tar sent by Minh (9 Feb on devel list) and I am using
> these on the same change set #7280 mentioned by Minh. So, please contact him for
> any clarifications.
>
> Are you finding a mismatch between the traces attached to ticket #1620 (for many
> test cases) and the source code of Amfd and Amfnd?
>
> BTW, I am attaching the tar sent by Minh and how I applied the patches on top of
> #7280. Please note that I have taken 010_log_1179.patch from my repo, as the tar
> sent by Minh did not have the correct log patch for 1179. So, ideally, the Amf
> patches should be the same, please check that. I enabled the cloud feature
> (IMMSV_SC_ABSENCE_ALLOWED) manually.
> ==
> patch -p1 < /tmp/sf_cloud_resilience_integration/777_osaftimer_2.diff
> patch -p1 <  ../OpensafHeadless/patches/010_log_1179.patch
> patch -p1 < /tmp/sf_cloud_resilience_integration/1620_README_V2.diff
> patch -p1 < /tmp/sf_cloud_resilience_integration/1620_amfd_V3.diff
> patch -p1 < /tmp/sf_cloud_resilience_integration/1620_amfnd_V3.diff
> patch -p1 < /tmp/sf_cloud_resilience_integration/1180_ntf_agent.diff
> patch -p1 < /tmp/sf_cloud_resilience_integration/1180_ntf_libs_common.diff
> patch -p1 < /tmp/sf_cloud_resilience_integration/1180_ntf_readme.diff
> patch -p1 < /tmp/sf_cloud_resilience_integration/180_ntf_test.diff
> patch -p1 < /tmp/sf_cloud_resilience_integration/1180_ntf_tools.diff
> patch -p1 < /tmp/sf_cloud_resilience_integration/1620_common_libs_V2.diff
> patch -p1 < /tmp/sf_cloud_resilience_integration/1620_config.diff
> patch -p1 < /tmp/sf_cloud_resilience_integration/1621_ckpt.diff
> patch -p1 < /tmp/sf_cloud_resilience_integration/1625_imm_1.diff
> patch -p1 < /tmp/sf_cloud_resilience_integration/1625_imm_2.diff
> patch -p1 < /tmp/sf_cloud_resilience_integration/1625_imm_3.diff
> patch -p1 < /tmp/sf_cloud_resilience_integration/1625_imm_4.diff
> patch -p1 < /tmp/sf_cloud_resilience_integration/1625_imm_5.diff
> patch -p1 < /tmp/sf_cloud_resilience_integration/1625_imm_6.diff
> patch -p1 < /tmp/sf_cloud_resilience_integration/1625_imm_7_compile_err.diff
> patch -p1 < /tmp/sf_cloud_resilience_integration/1646_clm.patch
> ==
>
> I manually compared installed Amfd and Amfnd binary files with binary files 
> created in source code repo while compiling. I compiled again and they are 
> the same. All the patches are applied.
> So, please check from your side and confirm me if I am making any mistake?
>
> Thanks
> -Nagu
>> -Original Message-
>> From: Gary Lee [mailto:gary@dektech.com.au]
>> Sent: 15 February 2016 13:36
>> To: Nagendra Kumar
>> Cc: minh chau; hans.nordeb...@ericsson.com; Praveen Malviya; opensaf-
>> de...@lists.sourceforge.net
>> Subject: Re: [devel] [PATCH 0 of 5] Review Request for amf: Add support for
>> cloud resilience [#1620] V2
>>
>> Hi Nagu
>>
>> I think we need to make sure we’re all looking at the same source code.
>>
>> I have trouble recreating some of the problems you’ve seen, but I see other
>> problems.
>>
>> Perhaps we can set up a fork of opensaf-staging on source forge, and check
>> in the patches?
>>
>> Thanks
>> Gary
>>
>>
>>> On 15 Feb 2016, at 4:00 PM, Gary Lee  wrote:
>>>
>>> Hi Nagu
>>>
>>> Just wanted to confirm that when you attach gdb to a process, the process is amf_demo, and not amfnd?
>>> Thanks
>>> Gary
>>>
>>>> On 13 Feb 2016, at 1:35 AM, Nagendra Kumar wrote:
>>>> TC #27: Same configuration as TC #24:
>>>> Add a new Csi in running demo appl: Keep gdb in SU1 comp in
>>>> amf_csi_set_callback and add new csi to existing si. Stop controller, and
>>>> then respond from gdb. Start controller. Only Act assignment is given to SU1
>>>> component. Standby csi assignment is not given to SU2 component:
>>>> safCSIComp=safComp=AmfDemo\,safSu=SU1\,safSg=AmfDemo\,safApp=AmfDemo1,safCsi=AmfDemo,safSi=AmfDemo,safApp=AmfDemo1
>>>>
>>>> safCSIComp=safComp=AmfDemo\,safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safCsi=AmfDemo,safSi=AmfDemo,safApp=AmfDemo1
>>>>
>>>> safCSIComp=safComp=AmfDemo\,safSu=SU1\,safSg=AmfDemo\,safApp=AmfDemo1,safCsi=AmfDemo1,safSi=AmfDemo,safApp=AmfDemo1
>>>>

[devel] [PATCH 0 of 5] Review Request for amf: Add support for cloud resilience [#1620] V2

2016-02-19 Thread minh chau

Hi Nagu,

Thanks for your testing.
Below is our investigation of TC1 - TC31, which seem to be the important 
ones, plus some patches with which we are trying to fix the issues.


1. IMM one payload limitation (TC #1, #6, #7, #8, #9, #10, #11)
Discussion is ongoing. Hitting the limitation causes a mismatch of 
objects in amfd/imm vs amfnd:
- amfd tries to tolerate the mismatch or, in the worst case, amfd orders 
a node reboot of the last payload in the cluster to avoid a cyclic amfd crash


2. Suspicious setting of SU oper state in amfnd (TC #13, #24)
According to the trace in TC #24, after unlock-in on the nodegroup, amfnd 
changes the SU oper state to DISABLED, which is wrong since no fault has 
happened on the SU.
We can raise a ticket on the non-headless code base. For now, though, we 
think the patch 1620_amfnd_dont_disabled_healthy_su.diff can fix TC #24.
TC #13 has a similar problem, but the provided trace is only for amfd, so 
we don't know where amfnd changed the SU oper state to DISABLED. We have 
not been able to reproduce this problem in TC #13 so far.


3. Problem in TC #16
It seems the fault lies in the base code when the system is not 
headless. The admin state should not stay in the unlocked state. We can 
raise a ticket on the current non-headless code later.


4. Amfnd coredump at "di.cc:850: avnd_di_susi_resp_send" (TC #18, #22, #26)
We have a patch for this, please apply fix.patch

5. Amfnd crashes at "avnd_comp_cmplete_all_assignment" (TC #14, #17, 
#19, #20, #21)

We have a patch for this, please apply fix.patch

6. Support Nodegroup handling in delayed failover (TC #23)
At the time we developed AMF cloud resilience, nodegroup had not been 
pushed yet, so we missed it.

Please apply the patch 1620_amfd_adjust_interm_admin_state.diff

7. Problem in TC #25
We think it's not really a fault. Please see our opinion at the end of this email.

8. Delayed failover needs to check csi level (TC #27)
The fault is reproducible; however, it seems a rare use case where the user 
creates an extra csi just before deciding to go headless. The fix is ongoing.


9. Recover non-existent csi (TC #28)
The CSI had been deleted in IMM, but there is a delay at the application, 
so its assignment object is still in amfnd at the time of recovery. The 
patch simply skips re-creating this non-existent csi; please apply 
1620_amfd_ignore_nonexisted_csi.diff


10. Delayed si dep issue (TC #29, #30)
We have a patch for this, please apply 
1620_amfd_add_su_op_list_delayed_sidep.diff


11. About TC #31: is the test case itself faulty?
The trace shows that A sponsors C, while the test locks B and expects C to be removed
Line 15815: Feb 12 20:00:25.403574 osafamfd [7989:imm.cc:0837] TR 
safDepend=safSi=A\,safApp=Test\,safSi=C,safApp=Test(51)
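
The reasoning in item 11 can be checked with a toy model of direct SI dependencies. This is only an illustrative sketch, not amfd code: `SponsorMap` and `loses_assignment_when_locked()` are hypothetical names, and tolerance timers and transitive dependencies are deliberately ignored.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical model: dependent SI name -> set of its direct sponsor SIs.
using SponsorMap = std::map<std::string, std::set<std::string>>;

// Returns true only if 'locked_si' is a direct sponsor of 'dependent',
// i.e. locking it could cause the dependent's assignment to be removed.
inline bool loses_assignment_when_locked(const SponsorMap& deps,
                                         const std::string& dependent,
                                         const std::string& locked_si) {
  auto it = deps.find(dependent);
  return it != deps.end() && it->second.count(locked_si) > 0;
}
```

With the dependency from the trace (A sponsors C), locking B leaves C untouched in this model, while locking A would affect it, which is why the expectation in the test case looks questionable.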


Test cases we are unable to reproduce: TC #2, #12, #13, #14 (#13, #14 
should be fixed by the attached patches)


The tests reported in TC #32 to TC #40 use the Npm and Nway models, which 
we have not planned to support since we have not seen headless users using 
those models, so they are expected to be buggy. We added this limitation to 
the README; support could be an enhancement in the future.


So we're still working on some remaining TCs (#2, #3, #4, #15, #27, #41, 
#42), alarm/notification related issues, and testing the patches.
If you find any other problems (or if you are not too busy to help 
reproduce TC #2, #12, #13, #14), the traces are very helpful, though it 
would be even better if we could have your testing model (so we can 
quickly know which attr is on/off).


Thanks,
Minh


Opinion on TC#25

TC25:
- At the time SC1 restarts, amfd adjusts the assignment. amfd decides to 
remove the QUIESCED assignment of safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1, 
because gdb is still holding the quiesced csi_set_callback

Feb 11 19:46:23.329326 osafamfd [28309:sgproc.cc:2328] >> 
avd_sg_su_si_del_snd: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'


- amfnd on PL3 receives the su_si_del msg and buffers it
Feb 11 19:46:23.574645 osafamfnd [16881:su.cc:0376] >> 
avnd_evt_avd_info_su_si_assign_evh: 
'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Feb 11 19:46:23.574657 osafamfnd [16881:susm.cc:0189] >> 
avnd_su_siq_rec_buf: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Feb 11 19:46:23.574667 osafamfnd [16881:sidb.cc:0937] >> 
avnd_su_siq_rec_add: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'


- gdb releases csi_set_callback, the quiesced assignment sequence 
continues, finishes, and is reported to amfd
Feb 11 19:46:27.327908 osafamfnd [16881:di.cc:0816] >> 
avnd_di_susi_resp_send: Sending Resp 
su=safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1, 
si=safSi=AmfDemo,safApp=AmfDemo1, curr_state=3, prv_state=1


Then amfnd pulls out the buffered su_si_del and continues with the 
assignment removal sequence. This sequence finishes and amfnd reports to amfd
Feb 11 19:46:27.329483 osafamfnd [16881:di.cc:0857] TR Sending. 
msg_id'3', node_id'131855', msg_act'4', 
su'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1', si'', ha_state'3', 
error'1', single_csi'0'


- At amfd, upon receiving the report of quiesced assignment completion, 
amfd decides to remove the quiesced assignment of SU1
Feb 11 19:46:27.086796 osa
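
The buffering behaviour in the TC #25 walkthrough above (avnd_su_siq_rec_buf / avnd_su_siq_rec_add in the trace) can be pictured as a small per-SU queue. The sketch below is a simplified model under that assumption, not the amfnd implementation; `SuAssignQueue` and its members are hypothetical names.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Hypothetical per-SU queue: one assignment message is acted on at a time,
// later ones wait until the component has answered the current callback.
struct SuAssignQueue {
  bool busy = false;                  // a callback response is still pending
  std::deque<std::string> buffered;   // messages waiting behind it
  std::vector<std::string> executed;  // order in which messages were started

  // A message arrives from amfd.
  void receive(const std::string& msg) {
    if (busy) {
      buffered.push_back(msg);  // e.g. su_si_del while quiesced is pending
      return;
    }
    busy = true;
    executed.push_back(msg);
  }

  // The component responded; finish the current message and start the next.
  void complete_current() {
    busy = false;
    if (!buffered.empty()) {
      std::string next = buffered.front();
      buffered.pop_front();
      receive(next);
    }
  }
};
```

In this model, the su_si_del that arrives while gdb holds the quiesced callback is only started after the quiesced response, so amfd sees two reports in sequence, as in the trace.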

Re: [devel] [PATCH 0 of 5] Review Request for amf: Add support for cloud resilience [#1620] V2

2016-02-22 Thread minh chau

Hi Nagu,

Attached patch is for TC 41, 42.
We have noticed one bug in sidep, will update it soon.

Thanks,
Minh

On 22/02/16 23:48, Hans Nordebäck wrote:

Hi,

please see enclosed patch for TC #1, #6, #7, #8, #9, #10 and #11.
/Thanks HansN



Re: [devel] [PATCH 2 of 5] amfd: Add support for cloud resilience at director [#1620] V2

2016-02-25 Thread minh chau
Hi Praveen,

We got lost in the #1620 emails, sorry for the late reply.
V4 has been sent out, though your comments here are still valid for V4.
So please find our comments inline, marked with [Gary] and [Minh].

Thanks,
Gary/Minh

On 11/02/16 22:15, praveen malviya wrote:
> Hi Minh,
>
> Please find some initial comments on this patch marked with [Praveen].
>
> Thanks,
> Praveen
>
> On 20-Jan-16 9:03 AM, Minh Hon Chau wrote:
>> osaf/services/saf/amf/amfd/cluster.cc|   69 -
>>   osaf/services/saf/amf/amfd/comp.cc   |8 +-
>>   osaf/services/saf/amf/amfd/csi.cc|  107 +++
>>   osaf/services/saf/amf/amfd/imm.cc|   58 
>>   osaf/services/saf/amf/amfd/include/cb.h  |5 +
>>   osaf/services/saf/amf/amfd/include/cluster.h |1 +
>>   osaf/services/saf/amf/amfd/include/csi.h |2 +
>>   osaf/services/saf/amf/amfd/include/db_template.h |1 +
>>   osaf/services/saf/amf/amfd/include/evt.h |3 +
>>   osaf/services/saf/amf/amfd/include/mds.h |7 +-
>>   osaf/services/saf/amf/amfd/include/msg.h |2 +-
>>   osaf/services/saf/amf/amfd/include/node.h|3 +
>>   osaf/services/saf/amf/amfd/include/proc.h|7 +
>>   osaf/services/saf/amf/amfd/include/sg.h  |   16 +-
>>   osaf/services/saf/amf/amfd/include/si.h  |1 +
>>   osaf/services/saf/amf/amfd/include/susi.h|3 +
>>   osaf/services/saf/amf/amfd/include/timer.h   |1 +
>>   osaf/services/saf/amf/amfd/include/util.h|2 +-
>>   osaf/services/saf/amf/amfd/main.cc   |   24 +
>>   osaf/services/saf/amf/amfd/mds.cc|4 +-
>>   osaf/services/saf/amf/amfd/ndfsm.cc  |  325 
>> ++-
>>   osaf/services/saf/amf/amfd/ndmsg.cc  |   18 +-
>>   osaf/services/saf/amf/amfd/ndproc.cc |  103 +++-
>>   osaf/services/saf/amf/amfd/node.cc   |   17 +-
>>   osaf/services/saf/amf/amfd/sg.cc |   57 
>>   osaf/services/saf/amf/amfd/sg_2n_fsm.cc  |  140 +
>>   osaf/services/saf/amf/amfd/sg_nored_fsm.cc   |6 +
>>   osaf/services/saf/amf/amfd/sg_npm_fsm.cc |   24 +
>>   osaf/services/saf/amf/amfd/sg_nway_fsm.cc|   24 +
>>   osaf/services/saf/amf/amfd/sg_nwayact_fsm.cc |6 +
>>   osaf/services/saf/amf/amfd/sgproc.cc |   47 ++-
>>   osaf/services/saf/amf/amfd/si.cc |   43 ++-
>>   osaf/services/saf/amf/amfd/siass.cc  |  121 
>>   osaf/services/saf/amf/amfd/su.cc |   19 +-
>>   34 files changed, 1207 insertions(+), 67 deletions(-)
>>
>>
>> Outlined changes:
>> . node_up_msg event handling has changed so that amfd can collect
>> the sync information sent from amfnd
>> . Node Sync timer is introduced as a window of amfnd sync from headless
>> . Failover may happen during headless; adjust_delayed_failover()
>> balances the assignment in terms of active/standby availability
>> . SI dependencies can also change due to assignment removal during
>> headless; adjust_delayed_sidep() updates the si dependencies
>>
>> diff --git a/osaf/services/saf/amf/amfd/cluster.cc 
>> b/osaf/services/saf/amf/amfd/cluster.cc
>> --- a/osaf/services/saf/amf/amfd/cluster.cc
>> +++ b/osaf/services/saf/amf/amfd/cluster.cc
>> @@ -25,6 +25,7 @@
>>   #include 
>>   #include 
>>   #include 
>> +#include 
>>
>>   /* Singleton cluster object */
>>   static AVD_CLUSTER _avd_cluster;
>> @@ -52,6 +53,7 @@ AVD_CLUSTER *avd_cluster = &_avd_cluster
>>   void avd_cluster_tmr_init_evh(AVD_CL_CB *cb, AVD_EVT *evt)
>>   {
>>   TRACE_ENTER();
>> +AVD_SU *su = nullptr;
>>   saflog(LOG_NOTICE, amfSvcUsrName, "Cluster startup timeout, 
>> assigning SIs to SUs");
>>
>>   osafassert(evt->info.tmr.type == AVD_TMR_CL_INIT);
>> @@ -74,19 +76,84 @@ void avd_cluster_tmr_init_evh(AVD_CL_CB
>>* system that are not NCS specific.
>>*/
>>
>> +/* The SI Dependency could be broken due to failover or 
>> instantiation/
>> + * termination failure during headless.
>> + * adjust_delayed_sidep() removes SI(s) assignment which has any
>> + * unassigned sponsored SI.
>> + *
>> + */
>> +if (cb->scs_absence_max_duration > 0) {
>> +adjust_delayed_sidep();
>> +}
>> +
>>   for (std::map::const_iterator it = 
>> sg_db->begin();
>>   it != sg_db->end(); it++) {
>>   AVD_SG *i_sg = it->second;
>>   if ((i_sg->list_of_su.empty() == true) || 
>> (i_sg->sg_ncs_spec == true)) {
>>   continue;
>>   }
>> -i_sg->realign(cb, i_sg);
>> +
>> +/* If hydra is enabled and su failover happened during 
>> headless,
>> + * currently only the active assignment is removed but the 
>> standby
>> + * assignment has not been switched to active.
>> + * adjust_delayed_failover() finds the standby assignment being
>> +  

Re: [devel] [PATCH 03 of 15] amfd: Add support for cloud resilience at director [#1620]

2016-02-25 Thread minh chau
Hi,

I made a mistake when rebasing the patch: I reverted the fix for #1595 
(the last diff).
I'll remove that last diff in the next version.

Thanks,
Minh

On 25/02/16 19:44, Minh Hon Chau wrote:
>   osaf/services/saf/amf/amfd/cluster.cc|   48 +++-
>   osaf/services/saf/amf/amfd/comp.cc   |8 +-
>   osaf/services/saf/amf/amfd/csi.cc|  105 +++
>   osaf/services/saf/amf/amfd/imm.cc|   58 
>   osaf/services/saf/amf/amfd/include/cb.h  |5 +
>   osaf/services/saf/amf/amfd/include/cluster.h |1 +
>   osaf/services/saf/amf/amfd/include/csi.h |2 +
>   osaf/services/saf/amf/amfd/include/db_template.h |1 +
>   osaf/services/saf/amf/amfd/include/evt.h |3 +
>   osaf/services/saf/amf/amfd/include/mds.h |7 +-
>   osaf/services/saf/amf/amfd/include/msg.h |2 +-
>   osaf/services/saf/amf/amfd/include/node.h|3 +
>   osaf/services/saf/amf/amfd/include/proc.h|7 +
>   osaf/services/saf/amf/amfd/include/sg.h  |1 -
>   osaf/services/saf/amf/amfd/include/si.h  |1 +
>   osaf/services/saf/amf/amfd/include/susi.h|3 +
>   osaf/services/saf/amf/amfd/include/timer.h   |1 +
>   osaf/services/saf/amf/amfd/main.cc   |   24 +
>   osaf/services/saf/amf/amfd/mds.cc|4 +-
>   osaf/services/saf/amf/amfd/ndfsm.cc  |  325 
> ++-
>   osaf/services/saf/amf/amfd/ndmsg.cc  |   18 +-
>   osaf/services/saf/amf/amfd/ndproc.cc |  103 +++-
>   osaf/services/saf/amf/amfd/node.cc   |   17 +-
>   osaf/services/saf/amf/amfd/sgproc.cc |   47 ++-
>   osaf/services/saf/amf/amfd/si.cc |   43 ++-
>   osaf/services/saf/amf/amfd/siass.cc  |  121 
>   osaf/services/saf/amf/amfd/su.cc |   19 +-
>   27 files changed, 911 insertions(+), 66 deletions(-)
>
>
> Outlined changes:
> . node_up_msg event handling has changed so that amfd can collect
> the sync information sent from amfnd
> . Node Sync timer is introduced as a window of amfnd sync from headless
>
> diff --git a/osaf/services/saf/amf/amfd/cluster.cc 
> b/osaf/services/saf/amf/amfd/cluster.cc
> --- a/osaf/services/saf/amf/amfd/cluster.cc
> +++ b/osaf/services/saf/amf/amfd/cluster.cc
> @@ -25,6 +25,7 @@
>   #include 
>   #include 
>   #include 
> +#include 
>   
>   /* Singleton cluster object */
>   static AVD_CLUSTER _avd_cluster;
> @@ -52,6 +53,7 @@ AVD_CLUSTER *avd_cluster = &_avd_cluster
>   void avd_cluster_tmr_init_evh(AVD_CL_CB *cb, AVD_EVT *evt)
>   {
>   TRACE_ENTER();
> + AVD_SU *su = nullptr;
>   saflog(LOG_NOTICE, amfSvcUsrName, "Cluster startup timeout, assigning 
> SIs to SUs");
>   
>   osafassert(evt->info.tmr.type == AVD_TMR_CL_INIT);
> @@ -80,13 +82,57 @@ void avd_cluster_tmr_init_evh(AVD_CL_CB
>   if ((i_sg->list_of_su.empty() == true) || (i_sg->sg_ncs_spec == 
> true)) {
>   continue;
>   }
> - i_sg->realign(cb, i_sg);
> +
> + if (i_sg->sg_fsm_state == AVD_SG_FSM_STABLE)
> + i_sg->realign(cb, i_sg);
> + }
> +
> + if (cb->scs_absence_max_duration > 0) {
> + TRACE("check if any SU is auto repair enabled");
> +
> + for (std::map::const_iterator it = 
> su_db->begin();
> + it != su_db->end(); it++) {
> +
> + su = it->second;
> +
> + if (su->list_of_susi == nullptr &&
> + su->su_on_node != nullptr &&
> + su->su_on_node->saAmfNodeOperState == 
> SA_AMF_OPERATIONAL_ENABLED) {
> + su_try_repair(su);
> + }
> + }
>   }
>   
>   done:
>   TRACE_LEAVE();
>   }
>   
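> +/*****************************************************************************
> + *  Name  : avd_node_sync_tmr_evh
> + *
> + *  Description   : This is node sync timer expiry routine handler
> + *
> + *  Arguments : cb -  AvD cb
> + *  evt-  ptr to the received event
> + *
> + *  Return Values : NCSCC_RC_SUCCESS/NCSCC_RC_FAILURE
> + *
> + *  Notes : None.
> + *****************************************************************************/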
> +void avd_node_sync_tmr_evh(AVD_CL_CB *cb, AVD_EVT *evt)
> +{
> + TRACE_ENTER();
> +
> + osafassert(evt->info.tmr.type == AVD_TMR_NODE_SYNC);
> + LOG_NO("NodeSync timeout");
> +
> + // Setting true here to indicate the node sync window has closed
> + // Further node up message will be treated specially
> + cb->node_sync_window_closed = true;
> +
> + TRACE_LEAVE();
> +}
> +
>   static void ccb_apply_modify_hdlr(struct CcbUtilOperationData *opdata)
>   {
>   const SaImmAttrModificationT_2 *attr_mod;
> diff --git a/osaf/services/saf/amf/amfd/comp.cc 
> b/osa

Re: [devel] [PATCH 1 of 5] NTF: Add support cloud resilience for NTF Agent [#1180]

2016-02-28 Thread minh chau
Hi Vu,

Please see comments in line with [Minh]

Thanks,
Minh

On 25/02/16 17:57, Vu Minh Nguyen wrote:
> Hi Minh,
>
> I have few comments below [Vu] and one question.
>
> I see that, in some places, the NTF APIs do not always return TRY_AGAIN
> if both SCs are down.
> I am not sure whether I understand this correctly.
>
> E.g: In `saNtfNotificationSend` API
> When the client thread comes to code line `ntfa_mds_msg_sync_send()`,
> headless occurs. If this is the case, the API may return TIMEOUT.
[Minh] I think we should return TIMEOUT here, respecting the fact that the 
MDS call timed out. The client should also try again on TIMEOUT (#1607).
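
From the client side, this means a retry loop should treat TIMEOUT the same way as TRY_AGAIN. A minimal sketch follows, with hypothetical names throughout: `Rc` stands in for SaAisErrorT, and `call_with_retry()`, `op` and `backoff` are illustration-only helpers, not part of the NTF API. Note that after a TIMEOUT the client cannot tell whether the request reached the server, so a retried send may be delivered twice.

```cpp
#include <cassert>
#include <functional>

// Hypothetical result codes standing in for SaAisErrorT values.
enum Rc { RC_OK, RC_TIMEOUT, RC_TRY_AGAIN, RC_BAD_HANDLE };

// Calls 'op' until it returns something other than TIMEOUT/TRY_AGAIN, or
// until 'max_attempts' calls have been made. 'backoff' is invoked between
// attempts (for example a short sleep). If max_attempts <= 0, 'op' is never
// called and RC_TRY_AGAIN is returned.
inline Rc call_with_retry(const std::function<Rc()>& op, int max_attempts,
                          const std::function<void()>& backoff) {
  Rc rc = RC_TRY_AGAIN;
  for (int i = 0; i < max_attempts; ++i) {
    rc = op();
    if (rc != RC_TIMEOUT && rc != RC_TRY_AGAIN) break;  // final answer
    if (i + 1 < max_attempts) backoff();                // wait, then retry
  }
  return rc;
}
```

A caller would wrap its saNtfNotificationSend() invocation in `op` and map the real return value onto these codes; how long to keep retrying is left to the application.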
> Regards, Vu.
>
>
>> -Original Message-
>> From: Minh Hon Chau [mailto:minh.c...@dektech.com.au]
>> Sent: Wednesday, December 23, 2015 11:02 AM
>> To: lennart.l...@ericsson.com; praveen.malv...@oracle.com;
>> vu.m.ngu...@dektech.com.au
>> Cc: opensaf-devel@lists.sourceforge.net
>> Subject: [PATCH 1 of 5] NTF: Add support cloud resilience for NTF Agent
> [#1180]
>> osaf/libs/agents/saf/ntfa/ntfa.h  |   31 +-
>> osaf/libs/agents/saf/ntfa/ntfa_api.c  |  672
> +++
>> --
>> osaf/libs/agents/saf/ntfa/ntfa_mds.c  |6 +-
>> osaf/libs/agents/saf/ntfa/ntfa_util.c |  465 ++-
>> 4 files changed, 1022 insertions(+), 152 deletions(-)
>>
>>
>> The patch contains support for cloud resilience feature
>> in NTF Agent code. Please refer README.HYDRA for content
>> of the changes
>>
>> diff --git a/osaf/libs/agents/saf/ntfa/ntfa.h
> b/osaf/libs/agents/saf/ntfa/ntfa.h
>> --- a/osaf/libs/agents/saf/ntfa/ntfa.h
>> +++ b/osaf/libs/agents/saf/ntfa/ntfa.h
>> @@ -91,6 +91,7 @@ typedef struct ntfa_filter_hdl_rec {
>> typedef struct subscriberList {
>>  SaNtfHandleT subscriberListNtfHandle;
>>  SaNtfSubscriptionIdT subscriberListSubscriptionId;
>> +ntfsv_filter_ptrs_t filters; /* remember the filters used by this
>> subscriber */
>>  struct subscriberList *prev;
>>  struct subscriberList *next;
>> } ntfa_subscriber_list_t;
>> @@ -100,6 +101,10 @@ typedef struct ntfa_reader_hdl_rec {
>>  unsigned int reader_id; /* handle value returned by NTFS for this
> client
>> */
>>  SaNtfHandleT ntfHandle;
>>  unsigned int reader_hdl;/* READER handle from handle mgr */
>> +
>> +ntfsv_filter_ptrs_t filters; /* remember the filters used by this
> reader
>> */
>> +SaNtfSearchCriteriaT searchCriteria; /* remember the searchCriteria
> for
>> recovery */
>> +
>>  struct ntfa_reader_hdl_rec *next;   /* next pointer for the list
> in
>> ntfa_cb_t */
>>  struct ntfa_client_hdl_rec *parent_hdl; /* Back Pointer to the
> client
>> instantiation */
>> } ntfa_reader_hdl_rec_t;
>> @@ -114,24 +119,35 @@ typedef struct ntfa_client_hdl_rec {
>>  ntfa_reader_hdl_rec_t *reader_list;
>>  SYSF_MBX mbx;   /* priority q mbx b/w MDS & Library */
>>  struct ntfa_client_hdl_rec *next;   /* next pointer for the list
> in
>> ntfa_cb_t */
>> +bool valid; /* handle is valid if it's known by NTF
> server,
>> used for headless hydra */
>> +SaVersionT version; /* the API version is being used by client, used
> for
>> recover after headless */
>> } ntfa_client_hdl_rec_t;
>>
>> /*
>>   * The NTFA control block is the master anchor structure for all NTFA
>>   * instantiations within a process.
>>   */
>> +typedef enum {
>> +NTFA_NTFSV_NONE = 0,
>> +NTFA_NTFSV_DOWN,
>> +NTFA_NTFSV_NO_ACTIVE,
>> +NTFA_NTFSV_NEW_ACTIVE,
>> +NTFA_NTFSV_UP
>> +}ntfa_ntfsv_state_t;
>> +
>> typedef struct {
>>  pthread_mutex_t cb_lock;/* CB lock */
>>  ntfa_client_hdl_rec_t *client_list; /* NTFA client handle
> database
>> */
>>  ntfa_reader_hdl_rec_t *reader_list;
>>  MDS_HDL mds_hdl;/* MDS handle */
>>  MDS_DEST ntfs_mds_dest; /* NTFS absolute/virtual address */
>> -int ntfs_up;/* Indicate that MDS subscription
>> - * is complete */
>> +
>>  /* NTFS NTFA sync params */
>>  int ntfs_sync_awaited;
>>  NCS_SEL_OBJ ntfs_sync_sel;
>>  SaUint32T ntf_var_data_limit;   /* max allowed variableDataSize */
>> +/* NTF Server state */
>> +ntfa_ntfsv_state_t ntfa_ntfsv_state;
>> } ntfa_cb_t;
>>
>> /* ntfa_saf_api.c */
>> @@ -149,7 +165,7 @@ extern void ntfsv_ntfa_evt_free(struct n
>>
>> /* ntfa_init.c */
>> extern unsigned int ntfa_startup(void);
>> -extern unsigned int ntfa_shutdown(void);
>> +extern unsigned int ntfa_shutdown(bool forced);
>>
>> /* ntfa_hdl.c */
>> extern SaAisErrorT ntfa_hdl_cbk_dispatch(ntfa_cb_t *, ntfa_client_hdl_rec_t *, SaDispatchFlagsT);
>> @@ -159,6 +175,7 @@ extern ntfa_notification_hdl_rec_t *ntfa
>> extern ntfa_filter_hdl_rec_t *ntfa_filter_hdl_rec_add(ntfa_client_hdl_rec_t **hdl_rec);
>> extern void ntfa_hdl_list_del(ntfa_client_hdl_rec_t **);
>> extern uint32_t ntfa_hdl_rec_del(ntfa_client_hdl_rec_t **, ntfa_client_hdl_rec_t *);
>> +extern void ntfa_hdl_rec_force_del(ntfa_client_

Re: [devel] [PATCH 04 of 15] amfnd: Add support for cloud resilience at node director [#1620]

2016-03-01 Thread minh chau
Hi Praveen,

If node_up of amfnd comes after the node sync timer expires, amfd will 
send a reboot message to that amfnd, regardless of SUSI states.
Sending a reboot message in avd_comp_pres_state_set() when a comp is 
inst/term-failed has already been in the code base of #1620.
The change in #1620 is the one that marks *node->reboot = true* in 
avd_comp_pres_state_set(), which should be called when amfd recreates 
compcsi(s) from all amfnd after headless.
Then in avd_node_up_evh(), the node that is marked "reboot" as true 
will be rebooted.
In other words, if any comps drop into inst/term-failed state during 
headless and are subject to node-failfast, the node hosting the 
inst/term-failed su will be rebooted after headless.
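
To make the flow above concrete, here is a minimal sketch. It is not the 
amfd code: the types and the functions on_comp_pres_state() and 
on_node_up() are made up for illustration, and only the idea (mark the 
node while comp states are rebuilt after headless, then check the mark 
at node_up) is taken from the description above.

```cpp
#include <string>

// Hypothetical, simplified stand-ins for the amfd data model.
enum class PresenceState { Instantiated, InstantiationFailed, TerminationFailed };

struct Node {
  std::string name;
  bool reboot = false;  // mirrors the idea of node->reboot in #1620
};

// Called while comp states are rebuilt from the amfnd sync data after
// headless. A comp found in inst/term-failed marks its hosting node.
void on_comp_pres_state(Node& node, PresenceState pres) {
  if (pres == PresenceState::InstantiationFailed ||
      pres == PresenceState::TerminationFailed) {
    node.reboot = true;
  }
}

// Called on node_up. Returns true if a reboot message should be sent.
bool on_node_up(const Node& node, bool node_sync_timer_expired) {
  // A late node_up is rebooted regardless of SUSI or comp states.
  if (node_sync_timer_expired) return true;
  return node.reboot;
}
```

With this shape, a node whose comps all come back Instantiated and whose 
node_up arrives in time is left alone; either a failed comp or a late 
node_up leads to a reboot message.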

I guess one of your questions on V3 is related to this.

Thanks,
Minh

On 02/03/16 17:17, praveen malviya wrote:
> Hi Minh,
>
> One query on patch 03.
> From headless state when first controller joins,   in 
> avd_cluster_tmr_init_evh(), SG is being realigned. During realignment 
> AMF will take care of new assignments but not of those SUSIs whose 
> FSMs are in transition state.
> Is AMF rebooting the node which hosts SUs of these SUSIs after node 
> sync timer expires? I am seeing reboot message being sent from 
> avd_comp_pres_state_set() but I think that is not for this purpose.
>
>
> Thanks,
> Praveen
>
> On 25-Feb-16 2:14 PM, Minh Hon Chau wrote:
>> osaf/services/saf/amf/amfnd/clc.cc  |  100 +++--
>>   osaf/services/saf/amf/amfnd/clm.cc  |   11 +-
>>   osaf/services/saf/amf/amfnd/comp.cc |   42 ++-
>>   osaf/services/saf/amf/amfnd/compdb.cc   |   45 ++-
>>   osaf/services/saf/amf/amfnd/di.cc   |  419 
>> +++-
>>   osaf/services/saf/amf/amfnd/err.cc  |  112 +-
>>   osaf/services/saf/amf/amfnd/evt.cc  |2 +
>>   osaf/services/saf/amf/amfnd/hcdb.cc |8 +-
>>   osaf/services/saf/amf/amfnd/include/avnd_cb.h   |   13 +-
>>   osaf/services/saf/amf/amfnd/include/avnd_comp.h |   17 +-
>>   osaf/services/saf/amf/amfnd/include/avnd_di.h   |4 +
>>   osaf/services/saf/amf/amfnd/include/avnd_evt.h  |2 +
>>   osaf/services/saf/amf/amfnd/include/avnd_mds.h  |4 +-
>>   osaf/services/saf/amf/amfnd/include/avnd_proc.h |1 +
>>   osaf/services/saf/amf/amfnd/include/avnd_su.h   |4 +-
>>   osaf/services/saf/amf/amfnd/include/avnd_tmr.h  |1 +
>>   osaf/services/saf/amf/amfnd/include/avnd_util.h |4 +
>>   osaf/services/saf/amf/amfnd/main.cc |  103 +-
>>   osaf/services/saf/amf/amfnd/mds.cc  |   19 +-
>>   osaf/services/saf/amf/amfnd/sidb.cc |9 +-
>>   osaf/services/saf/amf/amfnd/su.cc   |   39 +-
>>   osaf/services/saf/amf/amfnd/susm.cc |  103 +++--
>>   osaf/services/saf/amf/amfnd/tmr.cc  |1 +
>>   osaf/services/saf/amf/amfnd/util.cc |  153 -
>>   24 files changed, 1059 insertions(+), 157 deletions(-)
>>
>>
>> Outline changes:
>> . amfnd does not reboot if amfd is down
>> . componentRestart and suRestart is supported, the node reboot if
>> any escalation to component/su failover
>> . SC absence timer is introduced, node will reboot if timeout
>> . amfnd sends sync information if amfd is up after headless
>>
>> diff --git a/osaf/services/saf/amf/amfnd/clc.cc 
>> b/osaf/services/saf/amf/amfnd/clc.cc
>> --- a/osaf/services/saf/amf/amfnd/clc.cc
>> +++ b/osaf/services/saf/amf/amfnd/clc.cc
>> @@ -454,7 +454,7 @@ uint32_t avnd_evt_comp_pres_fsm_evh(AVND
>>
>>   if ((is_uninst == true) &&
>>   (comp->pres == SA_AMF_PRESENCE_INSTANTIATING))
>> -avnd_su_pres_state_set(comp->su, 
>> SA_AMF_PRESENCE_INSTANTIATING);
>> +avnd_su_pres_state_set(cb, comp->su, 
>> SA_AMF_PRESENCE_INSTANTIATING);
>>
>>   done:
>>   TRACE_LEAVE2("%u", rc);
>> @@ -767,7 +767,7 @@ uint32_t avnd_comp_clc_fsm_run(AVND_CB *
>>   TRACE("Term state is NODE_FAILOVER, event '%s'", 
>> pres_state_evt[ev]);
>>   switch (ev) {
>>   case AVND_COMP_CLC_PRES_FSM_EV_CLEANUP_SUCC:
>> -avnd_comp_pres_state_set(comp, 
>> SA_AMF_PRESENCE_UNINSTANTIATED);
>> +avnd_comp_pres_state_set(cb, comp, 
>> SA_AMF_PRESENCE_UNINSTANTIATED);
>>   if (all_app_comps_terminated()) {
>>   AVND_SU *tmp_su;
>>   cb->term_state = 
>> AVND_TERM_STATE_NODE_FAILOVER_TERMINATED;
>> @@ -924,8 +924,10 @@ uint32_t avnd_comp_clc_st_chng_prc(AVND_
>>
>>   TRACE_1("Component restart not through admin operation");
>>   /* inform avd of the change in restart count */
>> -avnd_di_uns32_upd_send(AVSV_SA_AMF_COMP, 
>> saAmfCompRestartCount_ID,
>> +if (cb->is_avd_down == false) {
>> +avnd_di_uns32_upd_send(AVSV_SA_AMF_COMP, 
>> saAmfCompRestartCount_ID,
>>   &comp->name, comp->err_info.restart_cnt);
>> +}
>>   }
>>   /* reset the admin-oper flag to false */
>>   if ((

Re: [devel] [PATCH 04 of 15] amfnd: Add support for cloud resilience at node director [#1620]

2016-03-01 Thread minh chau
Hi Praveen,

Please see comments in line [Minh]

Thanks,
Minh

On 02/03/16 18:12, praveen malviya wrote:
>
>
> On 02-Mar-16 12:26 PM, minh chau wrote:
>> Hi Praveen,
>>
>> If node_up of amfnd comes after node sync timer expires, amfd will send
>> reboot message to that amfnd, regardless of susi states.
>> Sending reboot message in avd_comp_pres_state_set() if comp is
>> inst/term-failed has already been in code base of #1620.
>> The change in #1620 that marks *node->reboot = true* in
>> avd_comp_pres_state_set(), which should be called when amfd recreate
>> compcsi(s) from all amfnd after headless
>> Then in avd_node_up_evh(), the node that is marked "reboot" as true,
>> will be rebooted.
>> In other words, if any comps are dropped into inst/term-failed state
>> during headless subjected to node-failfast, the node hosting
>> inst/term-failed su will be rebooted after headless.
>>
>> I guess one of your questions on V3 has related to this.
>>
> I did not mean the case of inst and term failed of su.
> The SUSI may be in transition state because of admin operations and 
> system becomes headless. So when first controller comes up, realign 
> logic will not be helpful and will not correct the state.
[Minh] You are right, and that's the reason we need patch 06 (Support 
delayed failover), which will help to move the SUSI to the correct 
state.

Thanks,
Minh
>
> Thanks,
> Praveen
>> Thanks,
>> Minh
>>
>> On 02/03/16 17:17, praveen malviya wrote:
>>> Hi Minh,
>>>
>>> One query on patch 03.
>>> From headless state when first controller joins,   in
>>> avd_cluster_tmr_init_evh(), SG is being realigned. During realignment
>>> AMF will take care of new assignments but not of those SUSIs whose
>>> FSMs are in transition state.
>>> Is AMF rebooting the node which hosts SUs of these SUSIs after node
>>> sync timer expires? I am seeing reboot message being sent from
>>> avd_comp_pres_state_set() but I think that is not for this purpose.
>>>
>>>
>>> Thanks,
>>> Praveen
>>>
>>> On 25-Feb-16 2:14 PM, Minh Hon Chau wrote:
>>>> osaf/services/saf/amf/amfnd/clc.cc |  100 +++--
>>>>   osaf/services/saf/amf/amfnd/clm.cc  |   11 +-
>>>>   osaf/services/saf/amf/amfnd/comp.cc |   42 ++-
>>>>   osaf/services/saf/amf/amfnd/compdb.cc   |   45 ++-
>>>>   osaf/services/saf/amf/amfnd/di.cc   |  419
>>>> +++-
>>>>   osaf/services/saf/amf/amfnd/err.cc  |  112 +-
>>>>   osaf/services/saf/amf/amfnd/evt.cc  |2 +
>>>>   osaf/services/saf/amf/amfnd/hcdb.cc |8 +-
>>>>   osaf/services/saf/amf/amfnd/include/avnd_cb.h   |   13 +-
>>>>   osaf/services/saf/amf/amfnd/include/avnd_comp.h |   17 +-
>>>>   osaf/services/saf/amf/amfnd/include/avnd_di.h   |4 +
>>>>   osaf/services/saf/amf/amfnd/include/avnd_evt.h  |2 +
>>>>   osaf/services/saf/amf/amfnd/include/avnd_mds.h  |4 +-
>>>>   osaf/services/saf/amf/amfnd/include/avnd_proc.h |1 +
>>>>   osaf/services/saf/amf/amfnd/include/avnd_su.h   |4 +-
>>>>   osaf/services/saf/amf/amfnd/include/avnd_tmr.h  |1 +
>>>>   osaf/services/saf/amf/amfnd/include/avnd_util.h |4 +
>>>>   osaf/services/saf/amf/amfnd/main.cc |  103 +-
>>>>   osaf/services/saf/amf/amfnd/mds.cc  |   19 +-
>>>>   osaf/services/saf/amf/amfnd/sidb.cc |9 +-
>>>>   osaf/services/saf/amf/amfnd/su.cc   |   39 +-
>>>>   osaf/services/saf/amf/amfnd/susm.cc |  103 +++--
>>>>   osaf/services/saf/amf/amfnd/tmr.cc  |1 +
>>>>   osaf/services/saf/amf/amfnd/util.cc |  153 -
>>>>   24 files changed, 1059 insertions(+), 157 deletions(-)
>>>>
>>>>
>>>> Outline changes:
>>>> . amfnd does not reboot if amfd is down
>>>> . componentRestart and suRestart is supported, the node reboot if
>>>> any escalation to component/su failover
>>>> . SC absence timer is introduced, node will reboot if timeout
>>>> . amfnd sends sync information if amfd is up after headless
>>>>
>>>> diff --git a/osaf/services/saf/amf/amfnd/clc.cc
>>>> b/osaf/services/saf/amf/amfnd/clc.cc
>>>> --- a/osaf/services/saf/amf/amfnd/clc.cc
>>>> +++ b/osaf/services/saf/amf/amfnd/clc.

Re: [devel] [PATCH 04 of 15] amfnd: Add support for cloud resilience at node director [#1620]

2016-03-02 Thread minh chau
For instance, an application can configure that one component restart 
leads to node failover, and this escalation path should work during 
headless the same way as in non-headless.
But if the escalation path needs comp/su failover, amfnd will *disable* 
the faulty comp/su, and recovery/repair shall be done once the system 
controllers come back. In general, during headless, the fault 
isolation/recovery of amfnd should work as before until it needs the 
system controllers' presence. 'Delayed failover' means that moving the 
workload of the standby to active (and many other HA transitions) will 
be delayed until the controllers are back.
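
A rough sketch of that rule, only to illustrate the idea. The enum 
values and the functions needs_director() and decide() are hypothetical 
names for this example, not amfnd code, and it assumes that only 
comp/su failover requires the system controllers.

```cpp
// Hypothetical recovery actions an escalation path can end in.
enum class Recovery {
  ComponentRestart, SuRestart, ComponentFailover, SuFailover,
  NodeFailover, NodeFailfast
};

// What the node director does with a detected fault.
enum class Handling {
  RecoverLocally,      // works the same with or without controllers
  RecoverViaDirector,  // controllers present: normal failover
  DisableAndDefer      // headless: isolate now, recover/repair later
};

// Assumption for this sketch: only comp/su failover needs amfd.
bool needs_director(Recovery r) {
  return r == Recovery::ComponentFailover || r == Recovery::SuFailover;
}

Handling decide(Recovery r, bool headless) {
  if (!needs_director(r)) return Handling::RecoverLocally;
  // Isolation is immediate in both cases; only the recovery that
  // needs the controllers is postponed while headless.
  return headless ? Handling::DisableAndDefer : Handling::RecoverViaDirector;
}
```

For example, decide(Recovery::SuFailover, true) gives DisableAndDefer, 
while decide(Recovery::NodeFailover, true) is still handled locally.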

Thanks,
Minh
On 02/03/16 23:32, Anders Widell wrote:
> Isolation should happen immediately, but it is the recovery and repair 
> actions that can sometimes be postponed until the system controllers 
> are back.
>
> regards,
> Anders Widell
>
> On 03/02/2016 12:18 PM, Mathivanan Naickan Palanivelu wrote:
>> Thanks for the explanation. My query was independent of the mail 
>> thread and
>> Was generic to understand what 'delayed failover' terminology meant 
>> during the fault scenarios!
>> I probably wanted to state that a solution that does not isolates the 
>> faulty resource once a fault is detected,
>> would be against the general requirements of the fault management cycle!
>>
>> Cheers,
>> Mathi.
>>
>>
>>> -Original Message-
>>> From: Gary Lee [mailto:gary@dektech.com.au]
>>> Sent: Wednesday, March 02, 2016 4:30 PM
>>> To: Mathivanan Naickan Palanivelu
>>> Cc: minh.c...@dektech.com.au; opensaf-devel@lists.sourceforge.net;
>>> Nagendra Kumar; Praveen Malviya; hans.nordeb...@ericsson.com
>>> Subject: Re: [devel] [PATCH 04 of 15] amfnd: Add support for cloud 
>>> resilience
>>> at node director [#1620]
>>>
>>> Hi Mathi
>>>
>>> I think Minh has previously said "delayed failover" isn't the best
>>> description of what patch 6 is doing.
>>> Minh has previously described it better as "adjust HA assignment";
>>> moving transient states to states that
>>> realign() can work with. The transient states aren't necessarily
>>> caused by a component error. The SCs could have just disappeared in
>>> the middle of an operation. An alternative is to reboot payloads
>>> associated with the transient states which seems unnecessary given the
>>> payload is otherwise healthy.
>>>
>>> Thanks
>>> Gary
>>>
>>>
>>> Quoting Mathivanan Naickan Palanivelu :
>>>
>>>> Hi All,
>>>>
>>>> What is 'delayed failover'? That sounds against the principles of
>>>> 'software fault isolation'!?
>>>>
>>>> Thanks,
>>>> Mathi.
>>>>
>>>> - minh.c...@dektech.com.au wrote:
>>>>
>>>>> Hi Praveen,
>>>>>
>>>>> Please see comments in line [Minh]
>>>>>
>>>>> Thanks,
>>>>> Minh
>>>>>
>>>>> On 02/03/16 18:12, praveen malviya wrote:
>>>>>>
>>>>>> On 02-Mar-16 12:26 PM, minh chau wrote:
>>>>>>> Hi Praveen,
>>>>>>>
>>>>>>> If node_up of amfnd comes after node sync timer expires, amfd will
>>>>> send
>>>>>>> reboot message to that amfnd, regardless of susi states.
>>>>>>> Sending reboot message in avd_comp_pres_state_set() if comp is
>>>>>>> inst/term-failed has already been in code base of #1620.
>>>>>>> The change in #1620 that marks *node->reboot = true* in
>>>>>>> avd_comp_pres_state_set(), which should be called when amfd
>>>>> recreate
>>>>>>> compcsi(s) from all amfnd after headless
>>>>>>> Then in avd_node_up_evh(), the node that is marked "reboot" as
>>>>> true,
>>>>>>> will be rebooted.
>>>>>>> In other words, if any comps are dropped into inst/term-failed
>>>>> state
>>>>>>> during headless subjected to node-failfast, the node hosting
>>>>>>> inst/term-failed su will be rebooted after headless.
>>>>>>>
>>>>>>> I guess one of your questions on V3 has related to this.
>>>>>>>
>>>>>> I did not mean the case of inst and term failed of su.
>>>>>> The SUSI may be in transition

Re: [devel] [PATCH 01 of 15] amfd: Add support for cloud resilience at common libs [#1620]

2016-03-02 Thread minh chau

Hi Nagu, Praveen

Patches 09 to 14 are fixes for bugs, and you also need them on top of 
patch #4.
The problems you reported should not happen if you have them. They are 
needed regardless of whether we *reboot node if transient states* or 
*adjust transient states* (delayed failover).

Patch 09 -> Return TRY_AGAIN for pg track start/stop in headless
Patch 10 -> Resend pg information to directors after headless
Patch 11 -> There are two fixes in this patch: (11_1) Fix mapping su, 
and (11_2) fix amfnd coredump given that we allow comp/su failover 
(patch #5). I split them
Patch 12 -> Do not disable healthy SU
Patch 13 -> It's for one payload limitation
Patch 14 -> It's for transient state at csi level, written on top of 
patch #6.

So you also need patches 09, 10, 11_1, 12, 13 on top of patch #4, and 
they need to be reviewed and pushed together with #1->#4 as well.

Patches #5 #6 #7 #8 take a different view from "immediate escalation" 
and "reboot node if transient states".

We will look at your assignment_recovery.patch.

I also attach patch 1620_amfd_adjust_csi_V2.diff, which is to fix the 
issue in TC #27, but it also depends on the conclusion of how to deal 
with transient states after headless.

Thanks,
Minh

On 03/03/16 02:12, Nagendra Kumar wrote:


#1 I have applied patches #1 to #4 only. With these patches (without 
patch #6), I expected most of the following tests to pass, but they 
failed (listed below).


I could not test other scenarios (including alarms and notifications), 
because I haven’t applied patch #6. I think there should be a simple 
patch replacing patch #6, which handles transient state as ‘reboot the 
node‘ if Amf finds SUSI in transient state on that node.


I am attaching a concept patch(assignment_recovery.patch), which pass 
some of the scenarios and we are testing and enhancing it.


As Praveen has suggested, to keep it simple we need to reboot the node 
that is in a transient state.


This patch reduces complexity and improves maintainability.

So, ACK for patch #1-#4 along with the attached patch.

Please note that the attached patch has been created on patch #6 of 
yours, so please apply #1 to #4 and then #6 and then the attached patch.


Currently the patch is for 2N red model. We are working to make for 
Nway Act and No red model (and possibly for Nway and NpM), we will 
publish it tomorrow.


TC #1:

Configuration(Comp recovery is comp failover, saAmfSutDefSUFailover as 
false) and logs attached(TC 1) in the ticket.


1. Start SC-1, PL-3 and PL-4. SU1 Act on PL-3 and SU2 Standby on SC-2.

2. Stop SC-1 and kill demo. It goes for comp failover as configured. 
Ideally, node should reboot.


3. Start SC-1. After cluster timer expires, PL-4 got the following 
error messages:


Mar  2 08:01:15 PM_PL-4 osafamfnd[20050]: CR SU-SI record addition 
failed, SU= safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 : 
SI=safSi=AmfDemo,safApp=AmfDemo1


Mar  2 08:01:15 PM_PL-4 osafamfnd[20050]: CR SU-SI record addition 
failed, SU= safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 : 
SI=safSi=AmfDemo1,safApp=AmfDemo1


There is no assignment given for SU1. SU2 has Standby assignments:

safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo,safApp=AmfDemo1

saAmfSISUHAState=STANDBY(2)

saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)

safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo1,safApp=AmfDemo1

saAmfSISUHAState=STANDBY(2)

Other problems:

a.) Further commands for locking SU1/SU2 fail with an SG unstable error.


b.) immlist of SU2 gives the result below; the Standby assignment count 
is printed as 4, which is wrong:


saAmfSUNumCurrStandbySIs SA_UINT32_T  4 (0x4)

saAmfSUNumCurrActiveSIs SA_UINT32_T  0 (0x0)

c.) Even if SC-2 joins and you do a failover/switchover of SC-1, the 
result is still the same as above.


TC #2: After execution of TC #1, stop PL-3. In worst case, SU2 
assignment should change to Act, which is not happening. After 
stopping of PL-4 also, the same problems as TC #1. logs attached(TC 2).


TC #3: After TC #2, start PL-3 and start SC-2.

SU1 is instantiated, but no assignment and the same 
problem as above.


When stop PL-4, SU1 gets assignments, the following 
logs comes at SC-2:


Mar  2 09:06:18 PM_SC-2 osafamfd[8518]: ER avd_ckpt_siass: 
safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 safSi=AmfDemo,safApp=AmfDemo1 
does not exist


Mar  2 09:06:18 PM_SC-2 osafamfd[8518]: ER avd_ckpt_siass: 
safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 safSi=AmfDemo1,safApp=AmfDemo1 
does not exist


Mar  2 09:06:21 PM_SC-2 kernel: [ 3290.784933] tipc: Resetting link 
<1.1.2:eth0-1.1.4:eth0>, peer not responding


Mar  2 09:06:21 PM_SC-2 kernel: [ 3290.784947] tipc: Lost link 
<1.1.2:eth0-1.1.4:eth0> on network plane A


Mar  2 09:06:21 PM_SC-2 kernel: [ 3290.784956] tipc: Lost contact with 
<1.1.4>


Start PL-4, SU2 gets Standby assignments and everything works fine 
after that.


TC #4: Similar pro

Re: [devel] [PATCH 01 of 15] amfd: Add support for cloud resilience at common libs [#1620]

2016-03-02 Thread minh chau
Hi Nagu, Praveen,

I have been trying your patch, with the test case below:
Setup 2N model, PL4 host SU4 (act), PL5 host SU5(stb)
1. issue admin command shutdown SG
2. Hanging quiescing csi_set callback
3. Stop both SCs
4. Stop PL4
5. Restart both SCs

I have seen this error after SCs come back also:
SC-2 osafamfd[477]: ER avd_ckpt_siass: 
safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon 
safSi=AmfDemoTwon,safApp=AmfDemoTwon does not exist

From the trace file, after amfd sends the reboot message to PL5, 
realign() is called. realign() then creates a duplicated SUSI for SU5, 
and this duplicated SUSI is not checkpointed to SC-2.
When PL5 reboots, node_fail() of the 2N SG calls 
AVD_SU::delete_all_susis() to delete all SUSIs of SU5. Now the 2 
duplicated SUSIs are deleted and checkpointed, and the second one 
causes "ER avd_ckpt_siass: ... does not exist".
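
The sequence can be reproduced with a toy model. This is only a sketch 
under simplifying assumptions: Active, Standby, ckpt_add() and 
ckpt_del() are hypothetical names, not the amfd classes, and the 
standby is modelled as a plain set of (SU, SI) pairs.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

using SuSi = std::pair<std::string, std::string>;  // (SU name, SI name)

// Toy standby director: holds whatever the active checkpointed.
struct Standby {
  std::set<SuSi> susis;
  int errors = 0;  // counts "does not exist" style failures
  void ckpt_add(const SuSi& s) { susis.insert(s); }
  void ckpt_del(const SuSi& s) {
    if (susis.erase(s) == 0) ++errors;  // record was already gone
  }
};

// Toy active director: may hold duplicates of the same SUSI.
struct Active {
  std::vector<SuSi> susis;
  // peer == nullptr models an add that is never checkpointed.
  void add(const SuSi& s, Standby* peer) {
    susis.push_back(s);
    if (peer != nullptr) peer->ckpt_add(s);
  }
  // Deletes every SUSI of the SU and checkpoints each deletion.
  void delete_all(const std::string& su, Standby& peer) {
    for (auto it = susis.begin(); it != susis.end();) {
      if (it->first == su) {
        peer.ckpt_del(*it);
        it = susis.erase(it);
      } else {
        ++it;
      }
    }
  }
};
```

Adding the same SUSI once with checkpoint and once without, then 
calling delete_all(), leaves the standby with one failed deletion, 
which is the same pattern as the error seen at SC-2.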

This error should also happen with lock/shutdown of 
SG/SU/Node/NodeGroup, and the NodeGroup gets stuck in SHUTTING_DOWN.
I think these kinds of issues will be fixed by you eventually, but 
looking through the concept patch, its complexity/maintainability is 
similar to patch #6. Both have to scan through all SU/SI to determine 
the transient SUSIs. The difference is the decision to be made: one 
reboots the node, the other adjusts the state. Rebooting the node seems 
to lose availability, though?
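
To show why the two approaches share the same scan, here is a small 
sketch. Everything in it is hypothetical (the Fsm values, the SuSi 
struct, handle_transient()), and "adjust" is reduced to a placeholder 
that sets the state to Assigned, which is much simpler than what 
patch #6 has to do.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical SUSI FSM states; anything but Assigned is transient.
enum class Fsm { Assigned, Assigning, Modifying, Unassigning };

struct SuSi {
  std::string su;
  std::string node;  // node hosting the SU
  Fsm fsm;
};

enum class Policy { RebootNode, AdjustState };

struct Outcome {
  std::vector<std::string> nodes_to_reboot;
  int adjusted = 0;
};

bool is_transient(const SuSi& s) { return s.fsm != Fsm::Assigned; }

// Both policies walk all SUSIs; only the decision per hit differs.
Outcome handle_transient(std::vector<SuSi>& susis, Policy policy) {
  Outcome out;
  for (auto& s : susis) {
    if (!is_transient(s)) continue;
    if (policy == Policy::RebootNode) {
      auto& v = out.nodes_to_reboot;
      if (std::find(v.begin(), v.end(), s.node) == v.end()) {
        v.push_back(s.node);  // healthy SUs on this node go down too
      }
    } else {
      s.fsm = Fsm::Assigned;  // placeholder for the real adjustment
      ++out.adjusted;
    }
  }
  return out;
}
```

With one transient SUSI on PL4 and a healthy SU on the same node, 
RebootNode returns PL4 in nodes_to_reboot, so the healthy SU is taken 
down as well, while AdjustState touches only the transient SUSI.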

Thanks,
Minh


On 03/03/16 11:32, minh chau wrote:
> Hi Nagu, Praveen
>
> From patch 09 to patch 14, they are fixes for bugs that you also need 
> on top of patches #4.
> The problems you reported should not happen if you have them. They are 
> regardless whether we *reboot node if transient states* or *adjust 
> transient states* (delayed failover).
>
> Patch 09 -> Return TRY_AGAIN for pg track start/stop in headless
> Patch 10 -> Resend pg information to directors after headless
> Patch 11 -> There are two fixes in this patch: (11_1) Fix mapping su, 
> and (11_2) fix amfnd coredump given that we allow comp/su failover 
> (patch #5). I split them
> Patch 12 -> Do not disable healthy SU
> Patch 13 -> It's for one payload limitation
> Patch 14 -> It's for transient state at csi level, written on top of 
> patch #6.
>
> So you also need patch 09, 10, 11_1, 12, 13 on top of patch #4, and 
> they need to be reviewed and pushed together with #1->#4 as well.
>
> The patch #5 #6 #7 #8 are on different view from "immediate 
> escalation" and "reboot node if transient states".
>
> We will look at your assignment_recovery.patch.
>
> I also attach patch 1620_amfd_adjust_csi_V2.diff, which is to fix the 
> issue in TC #27, but it also depends on conclusion of how to deal with 
> transient states after headless.
>
> Thanks,
> Minh
>
> On 03/03/16 02:12, Nagendra Kumar wrote:
>>
>> #1 I have applied patches #1 to #4 only. With this patches(not having 
>> patch #6), I thought to have passed most of the following tests, but 
>> they got failed(Listed below).
>>
>> I could not test other scenarios (including alarms and 
>> notifications), because I haven’t applied patch #6. I think there 
>> should be a simple patch replacing patch #6, which handles transient 
>> state as ‘reboot the node‘ if Amf finds SUSI in transient state on 
>> that node.
>>
>> I am attaching a concept patch(assignment_recovery.patch), which pass 
>> some of the scenarios and we are testing and enhancing it.
>>
>> As Praveen has suggested that we need to reboot the node which is 
>> undergoing in transient state to make it simple.
>>
>> This patch reduces complexity and maintainability.
>>
>> So, ACK for patch #1-#4 along with the attached patch.
>>
>> Please note that the attached patch has been created on patch #6 of 
>> yours, so please apply #1 to #4 and then #6 and then the attached patch.
>>
>> Currently the patch is for 2N red model. We are working to make for 
>> Nway Act and No red model (and possibly for Nway and NpM), we will 
>> publish it tomorrow.
>>
>> TC #1:
>>
>> Configuration(Comp recovery is comp failover, saAmfSutDefSUFailover 
>> as false) and logs attached(TC 1) in the ticket.
>>
>> 1. Start SC-1, PL-3 and PL-4. SU1 Act on PL-3 and SU2 Standby on SC-2.
>>
>> 2. Stop SC-1 and kill demo. It goes for comp failover as configured. 
>> Ideally, node should reboot.
>>
>> 3. Start SC-1. After cluster timer expires, PL-4 got the following 
>> error messages:
>>
>> Mar  2 08:01:15 PM_PL-4 osafamfnd[20050]: CR SU-SI record addition 
>> failed, SU= safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 : 
>> SI=safSi=AmfDemo,safApp=AmfDemo1
>>
>> Mar  2 08:01:15 PM_PL-4 osafamfnd[20050]: CR SU-SI

Re: [devel] [PATCH 01 of 15] amfd: Add support for cloud resilience at common libs [#1620]

2016-03-03 Thread minh chau
Hi Praveen,

Please see my comments in line with [Minh]

Thanks,
Minh

On 04/03/16 00:41, praveen malviya wrote:
> Hi Minh,
>
> The second version of the patches you had published handles immediate 
> escalation only(1 to 4) but it does not performs 'immediate escalation'
> during the transient phases.
[Minh] The patch version is important to be sure we have the same 
view. The latest version is V4 (not V2), which has immediate escalation 
in amfnd. As I understand it, by 'perform "immediate escalation" during 
transient phases' you mean "reboot the node that has a transient SUSI", 
and that was suggested after V4 was published. As far as I am 
concerned, we agree to push "immediate escalation" (amfnd) in the base 
patches (#1 to #4) and separate "delayed failover" (amfd) into another 
patch. From there, we will review and see whether or not "delayed 
failover" is necessary.
>
> So, the concept patch is not for "delayed failover" approach but for 
> doing 'Immediate escalation' during transient states also.
> The 'immediate escalation' approach becomes **complete** with the 
> concept patch. Ofcourse, as mentioned before i would update the 
> concept patch further.
>
> Regarding the scanning of SUSIs in SG, it is scanned just to know the 
> active and standby SU but not to handle the transition state at susi 
> level. After rebooting the node, existing node-failover functionality 
> of SG FSM will take care of things at SUSI level including si deps for 
> all red models. In fact, later on the patch can be evolved to call 
> existing SG FSM code.
[Minh] As mentioned in the previous email, I understand the concept 
patch is work in progress and the issues will be fixed eventually (I 
would rather call it a completion). But my question was about the 
*value* it gives in the end. Many healthy applications will lose 
availability because a node reboots due to a transient SUSI in another 
(unimportant) application, and that node reboot is unexpected per 
configuration.
I understand that the complexity/maintainability of the AMF code is 
important for maintainers, but are there any other reasons that support 
"immediate escalation"? If that is the only reason, the concept patch 
seems to sacrifice availability to gain lower complexity and better 
maintainability of the code. But if we all agree that availability is 
most important, then complexity/maintainability is just a matter of 
coding?
>
> I think in the version1 of patches, I had given comments for SI deps 
> and delayed fail-over getting mixed and the way SI dependecy has been 
> scanned. I never got the responses of those comments and other 
> comments of v1 on amfd patch. Those are important comments and needs 
> to be addressed.
[Minh] We have received 2 emails with comments on V2 so far, and all of 
them have been responded to. In V4 we have corrected the patches 
according to some of your comments.

Below are the dates and times the responses were sent:

Date: Fri, 12 Feb 2016 11:13:03 +1100
From: minh chau
Subject: Re: [devel] [PATCH 1 of 5] amfd: Add README file for cloud
 resilience support [#1620] V2
To: praveen malviya,
hans.nordeb...@ericsson.com,gary@dektech.com.au,
nagendr...@oracle.com
Cc:opensaf-devel@lists.sourceforge.net


Date: Fri, 26 Feb 2016 14:41:18 +1100
From: minh chau
Subject: Re: [devel] [PATCH 2 of 5] amfd: Add support for cloud
 resilience at director [#1620] V2
To: praveen malviya,
hans.nordeb...@ericsson.com,gary@dektech.com.au,
nagendr...@oracle.com
Cc:opensaf-devel@lists.sourceforge.net
>
> Regarding the approach taken in delayed_failover() functionality, I do 
> not know whether it has been explored or not, but it does not use 
> existing SG FSM code. Using the existing code will keep it simple too.
[Minh] Which SG FSM code do you think should be used? If there is any 
inappropriate code, can we all go through it and optimize it?
>
> Thanks,
> Praveen
>
> On 03-Mar-16 1:20 PM, minh chau wrote:
>> Hi Nagu, Praveen,
>>
>> I have been trying your patch, with the test case below:
>> Setup 2N model, PL4 host SU4 (act), PL5 host SU5(stb)
>> 1. issue admin command shutdown SG
>> 2. Hanging quiescing csi_set callback
>> 3. Stop both SCs
>> 4. Stop PL4
>> 5. Restart both SCs
>>
>> I have seen this error after SCs come back also:
>> SC-2 osafamfd[477]: ER avd_ckpt_siass:
>> safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon
>> safSi=AmfDemoTwon,safApp=AmfDemoTwon does not exist
>>
>>  From trace file, after amfd sends reboot message to PL5, realign() is
>> called. Then realign() creates duplicated SUSI for SU5, this duplicated
>> SUSI is not checked point at SC-2.
>> PL5 reboot, node_fail() of 2N SG calls AVD_SU::delete_all_susis() to
>> delete all SUSI of SU5. Now 2 duplicated SUSI are deleted and
>

Re: [devel] [PATCH 01 of 15] amfd: Add support for cloud resilience at common libs [#1620]

2016-03-04 Thread minh chau
Hi Nagu,

From your test description of TC #1, it says SU2 is hosted on SC-2, and 
after SC-1 comes back, SU2 on PL-4 gets the assignment.
This description is a symptom of the su mapping issue, which is 
addressed in patch 11_1.

But looking at the trace file, SU2 is actually hosted on PL-4 (not as 
in the description). The errors relating to SU-SI, "record addition 
failed", "avd_ckpt_siass: ... does not exist", ..., occur because the 
patch set you are applying is missing #5 #6 #7 #8 (delayed failover) or 
the concept patch (reboot transient state).

After headless, amfd needs something to adjust the transient states or 
reboot the whole PL.

How about if you have #5 #6 #7 #8 applied?

Thanks,
Minh

On 04/03/16 17:45, Nagendra Kumar wrote:
> Hi,
>   I have conducted the same 9 test cases sent on Mar 2 (in review 
> response) with the patches #1-#4along with attached patches(#9-#13).
>
> The summary of the results: All the 9 test cases have failed except in TC #2, 
> in which stopping PL-4 has worked.
>
> ==
> TC #1: Configuration(Comp recovery is comp failover, saAmfSutDefSUFailover as 
> false) and logs attached(New TC 1) in the ticket.
> 1. Start SC-1, PL-3 and PL-4. SU1 Act on PL-3 and SU2 Standby on SC-2.
> 2. Stop SC-1 and kill demo. It goes for comp failover as configured. Ideally, 
> node should reboot.
> 3. Start SC-1. After cluster timer expires, PL-4 got the following error 
> messages:
>
> Mar  4 10:10:15 PM_PL-4 osafamfnd[10290]: CR SU-SI record addition failed, 
> SU= safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 : SI=safSi=AmfDemo,safApp=AmfDemo1
> Mar  4 10:10:15 PM_PL-4 osafamfnd[10290]: CR SU-SI record addition failed, 
> SU= safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 : 
> SI=safSi=AmfDemo1,safApp=AmfDemo1
>
> There is no assignment given for SU1. SU2 has Standby assignments:
> safSISU=safSu=PL-4\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed3,safApp=OpenSAF
>  saAmfSISUHAState=ACTIVE(1)
>  saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
> safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo,safApp=AmfDemo1
>  saAmfSISUHAState=STANDBY(2)
>  saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
> safSISU=safSu=PL-3\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed2,safApp=OpenSAF
>  saAmfSISUHAState=ACTIVE(1)
>  saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>  saAmfSISUHAState=ACTIVE(1)
>  saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
> safSISU=safSu=SC-1\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed1,safApp=OpenSAF
>  saAmfSISUHAState=ACTIVE(1)
>  saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
> safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo1,safApp=AmfDemo1
>  saAmfSISUHAState=STANDBY(2)
>  saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
>
> Other problems: a.) Further command for locking SU1/SU2 fails in SG unstable 
> error.
>  b.) Immlist if SU2 gives the below result, 
> Standby assignment it prints as 4, which is wrong:
> saAmfSUNumCurrStandbySIs   SA_UINT32_T  4 (0x4)
> saAmfSUNumCurrActiveSIsSA_UINT32_T  0 (0x0)
>  c.) Even if SC-2 joins, and you do 
> failover/switchover of SC-1, still same as above.
>
> TC #2: After execution of TC #1, stop PL-3. In worst case, SU2 assignment 
> should change to Act, which is not happening.  SU2 still holds Standby 
> assignment:
> safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo1,safApp=AmfDemo1
>  saAmfSISUHAState=STANDBY(2)
>  saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
> safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo,safApp=AmfDemo1
>  saAmfSISUHAState=STANDBY(2)
>  saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
>
> Failure message same as above TC #1:
> Mar  4 10:40:18 PM_PL-4 osafamfnd[12749]: CR SU-SI record addition failed, 
> SU= safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 : SI=safSi=AmfDemo,safApp=AmfDemo1
> Mar  4 10:40:18 PM_PL-4 osafamfnd[12749]: CR SU-SI record addition failed, 
> SU= safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 : 
> SI=safSi=AmfDemo1,safApp=AmfDemo1
>
> But after stopping of PL-4, Assignments are gone, which is good. I am able to 
> lock/unlock the SU1.
> The configuration and logs attached(New TC 2).
>
> TC #3: After TC #2(before stopping PL-4), start PL-3 and start SC-2.
>
>  SU1 is instantiated, but no assignment and the same problem 
> as above.
>
>  When stop PL-4, SU1 gets Act assignments, the following logs 
> comes at SC-2:
>
> Mar  4 10:59:22 PM_SC-2 osafamfd[11449]: ER avd_ckpt_siass: 
> safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 safSi=AmfDemo,safApp=AmfDemo1 does 
> not exist
> Mar  4 10:59:22 PM_SC-2 osafamfd[11449]: ER avd_ckpt_siass: 
> safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 safSi=AmfDemo1,safApp=AmfD

Re: [devel] [PATCH 0 of 5] Review Request for Add cloud resilience support [#1180] V2

2016-03-04 Thread minh chau
Hi Lennart,

The important change that I think should be looked at is the fix for a
bug Praveen found in Unsubscribe()/ReadFinalize(), which were not
aligned with the README.

Thanks,
Minh

On 03/03/16 23:58, Lennart Lund wrote:
> Hi Minh,
>
> Ack.
>
> I have not done a very deep analyze of this I assume it's the same code that 
> has already been tested for some time and that there are no significant 
> changes (and that I have reviewed once). Please tell me if there are any 
> changes that you think I should take a closer look at.
> I have applied all the patches, built and run the legacy tests in the 
> non-resilience configuration and all tests PASS.
>
> Regards
> Lennart
>
>> -----Original Message-----
>> From: Minh Hon Chau [mailto:minh.c...@dektech.com.au]
>> Sent: den 1 mars 2016 08:30
>> To: Lennart Lund; praveen.malv...@oracle.com; Vu Minh Nguyen; Minh
>> Chau H
>> Cc: opensaf-devel@lists.sourceforge.net
>> Subject: [PATCH 0 of 5] Review Request for Add cloud resilience support
>> [#1180] V2
>>
>> Summary: ntf: Add cloud resilience support [#1180] V2
>> Review request for Trac Ticket(s): 1180
>> Peer Reviewer(s): Lennart, Praveen, Vu
>> Pull request to: NTF maintainers
>> Affected branch(es): default
>> Development branch: default
>>
>> 
>> Impacted area   Impact y/n
>> 
>>  Docs                    n
>>  Build system            n
>>  RPM/packaging           n
>>  Configuration files     n
>>  Startup scripts         n
>>  SAF services            y
>>  OpenSAF services        n
>>  Core libraries          n
>>  Samples                 n
>>  Tests                   n
>>  Other                   n
>>
>>
>> Comments (indicate scope for each "y" above):
>> -
>> This V2 has revised comments:
>> - Update description of checkNtfServerState
>> - Not using conditional operator in ntfa_mds_svc_evt
>> - Update Unsubscribe() ReadFinalize() aligned with README
>> - Add lock/unlock ntfa_cb.cb_lock for client recovery
>> - Update ntftest options: -ve is for tag mode only, -vpe works
>>
>>
>> changeset 884d1bdbea715fbc81941a0941c2d3f799a4395e
>> Author:  Minh Hon Chau 
>> Date:Tue, 01 Mar 2016 18:25:15 +1100
>>
>>  NTF: Add support cloud resilience for NTF libs common [#1180]
>>
>>  The patch contains support for cloud resilience feature in NTF
>> libs common
>>  which are mostly used in Agent code
>>
>> changeset 6d941afbcd475e1ecf58c6f9586e5ff60a7a3319
>> Author:  Minh Hon Chau 
>> Date:Tue, 01 Mar 2016 18:26:53 +1100
>>
>>  NTF: Add support cloud resilience for NTF Agent [#1180] V2
>>
>>  The patch contains support for cloud resilience feature in NTF
>> Agent code.
>>  Please refer README.HYDRA for content of the changes
>>
>> changeset ddd2369c000c3648466c06b8babad4b5884a0058
>> Author:  Minh Hon Chau 
>> Date:Tue, 01 Mar 2016 18:27:03 +1100
>>
>>  NTF: Add wrapper for usage of NTF API in ntftools to handle
>> TRY_AGAIN
>>  [#1180]
>>
>>  Since NTF support the SC outage which the NTF client has to
>> handle TRY_AGAIN
>>  return code, the patch adds wrapper for APIs being used in
>> ntftools that
>>  shall receives TRY_AGAIN when both SCs are down.
>>
>> changeset 67286bb9852bcfde837c009801826202f2905a5f
>> Author:  Minh Hon Chau 
>> Date:Tue, 01 Mar 2016 18:27:09 +1100
>>
>>  NTF: Add new README file for description of cloud resilience
>> support [#1180]
>>  V2
>>
>>  Add description regarding general solution and API
>> implementation for cloud
>>  resilience support in NTF
>>
>> changeset ad9d91747c80faf1defd86a539f6238a997150b0
>> Author:  Minh Hon Chau 
>> Date:Tue, 01 Mar 2016 18:27:18 +1100
>>
>>  NTF: Add tests for NTF cloud resilience feature [#1180] V2
>>
>>  The patch adds new test cases to ntftest for cloud resilience
>> feature.
>>
>>
>> Complete diffstat:
>> --
>>   osaf/libs/agents/saf/ntfa/ntfa.h  |31 +-
>>   osaf/libs/agents/saf/ntfa/ntfa_api.c  |   678
>> +++--
>>   osaf/libs/agents/saf/ntfa/ntfa_mds.c  |14 +-
>>   osaf/libs/agents/saf/ntfa/ntfa_util.c |   465 +++-
>>   osaf/li

Re: [devel] Proof Of Concept patch reusing SG FSM code for better handling of transient nodes during headless state(was Re: [PATCH 01 of 15] amfd: Add support for cloud resilience at common libs [#162

2016-03-09 Thread minh chau
fter first
>> controller comes up and completed the admin operation as it does now in
>> normal cluster. Also UNLOCK operation was successful in all the cases.
>>
>>  In the delayed_failover approach (06-08), the problem was HA state
>> of SU for each SI was not considered and each SUSI was assumed assigned.
>> Because of this, original state of SU and hence SG FSM could not be
>> resumed.
>>
>> Approach in this SG FSM recovery patch:
>>  It recovers each SUSI FSM state and using this it resumes SG in
>> same FSM state as it was before controllers went down.Thus it will use
>> the original SG FSM code.
>>
>> Some benefits of this approach:
>>  1) Existing code of SG FSM can be used.
>>  2) Does not require node reboot in transition state.
>>  3) SG FSM code for each model already handles faults, si deps and
>> all admin operation so always any issue will just require deducing the
>> SG FSM state at the time of controller down and resuming SG in the same
>> state.
>>  4)There are FIVE SG FSM states in our code out of which STABLE
>> state of SG is not applicable for transition state. So there are only
>> FOUR SG fsm states to be resumed.
>>
>> Note: For testing admin op, cluster was freshly started for each lock
>> and shutdown operation as assignment counter related changes is not done
>> in this patch.
>>
>> Thanks,
>> Praveen.
>>
>>
>>
>> On 04-Mar-16 9:11 AM, minh chau wrote:
>>> Hi Praveen,
>>>
>>> Please see my comments in line with [Minh]
>>>
>>> Thanks,
>>> Minh
>>>
>>> On 04/03/16 00:41, praveen malviya wrote:
>>>> Hi Minh,
>>>>
>>>> The second version of the patches you had published handles immediate
>>>> escalation only(1 to 4) but it does not performs 'immediate 
>>>> escalation'
>>>> during the transient phases.
>>> [Minh] The patch version is important to be sure we are in the same
>>> view. The latest version is V4 (not V2) that has immediate 
>>> escalation in
>>> amfnd. Perform "immediate escalation during transient phases" you mean
>>> to me is "reboot node that has transient SUSI", and it is suggested
>>> after V4 were published. As far as concerns, we agree to push 
>>> "immediate
>>> escalation" (amfnd) to base patches (#1 to #4) and separate "delayed
>>> failover" (amfd) to another patch. Then from there, we will review and
>>> see whether or not "delayed failover" is necessary
>>>>
>>>> So, the concept patch is not for "delayed failover" approach but for
>>>> doing 'Immediate escalation' during transient states also.
>>>> The 'immediate escalation' approach becomes **complete** with the
>>>> concept patch. Ofcourse, as mentioned before i would update the
>>>> concept patch further.
>>>>
>>>> Regarding the scanning of SUSIs in SG, it is scanned just to know the
>>>> active and standby SU but not to handle the transition state at susi
>>>> level. After rebooting the node, existing node-failover functionality
>>>> of SG FSM will take care of things at SUSI level including si deps for
>>>> all red models. In fact, later on the patch can be evolved to call
>>>> existing SG FSM code.
>>> [Minh] As mentioned in previous email, I understand the concept 
>>> patch is
>>> under going and issues will be fixed eventually (I would rather say a
>>> completion). But my question was on the *value* it gives at the end.
>>> Many healthy applications will claim losing availability since a node
>>> reboot because of a transient SUSI in another (unimportant) one, and
>>> node reboot is unexpected per configuration
>>> As I understand the complexity/maintainability of AMF code is important
>>> for maintainers, but is there any other reasons that support "immediate
>>> escalation"? If it's the case, the concept patch seems to sacrifice
>>> availability to gain less complexity/maintainability of code. But if we
>>> all agree with availability is most important, then
>>> complexity/maintainability is just matter of coding?
>>>>
>>>> I think in the version1 of patches, I had given comments for SI deps
>>>> and delayed fail-over getting mixed and the way SI dependecy has been
>>>> scanned. I never got the responses of those comments and othe

Re: [devel] [PATCH 0 of 5] Review Request for Add cloud resilience support [#1180] V2

2016-03-09 Thread minh chau
Hi Lennart,

Thanks for your finding. In the future I think we need to add more
ntftest cases that use the NTF API from several threads concurrently.
We have only a few so far.
I will publish the next version of the patch.

Thanks,
Minh

On 09/03/16 22:57, Lennart Lund wrote:
> Hi Minh,
>
> I found a function checkNtfServerState() that read the global 
> ntfa_ntfsv_state variable. This function does not protect the variable with a 
> mutex. The corresponding  ntfa_update_ntfsv_state() is called when mutex is 
> locked but the checkNtfServerState() is not, this is not thread safe.
>
> Regards
> Lennart
>
>> -----Original Message-----
>> From: minh chau [mailto:minh.c...@dektech.com.au]
>> Sent: den 4 mars 2016 09:40
>> To: Lennart Lund; praveen.malv...@oracle.com; Vu Minh Nguyen
>> Cc: opensaf-devel@lists.sourceforge.net
>> Subject: Re: [PATCH 0 of 5] Review Request for Add cloud resilience support
>> [#1180] V2
>>
>> Hi Lennart,
>>
>> The important change I think should be looked at, is a bug Praveen has
>> found in Unsubscribed()/ReadFinalize(), which are not aligned with README.
>>
>> Thanks,
>> Minh
>>
>> On 03/03/16 23:58, Lennart Lund wrote:
>>> Hi Minh,
>>>
>>> Ack.
>>>
>>> I have not done a very deep analyze of this I assume it's the same code that
>> has already been tested for some time and that there are no significant
>> changes (and that I have reviewed once). Please tell me if there are any
>> changes that you think I should take a closer look at.
>>> I have applied all the patches, built and run the legacy tests in the non-
>> resilience configuration and all tests PASS.
>>> Regards
>>> Lennart
>>>
>>>> -----Original Message-----
>>>> From: Minh Hon Chau [mailto:minh.c...@dektech.com.au]
>>>> Sent: den 1 mars 2016 08:30
>>>> To: Lennart Lund; praveen.malv...@oracle.com; Vu Minh Nguyen; Minh
>>>> Chau H
>>>> Cc: opensaf-devel@lists.sourceforge.net
>>>> Subject: [PATCH 0 of 5] Review Request for Add cloud resilience support
>>>> [#1180] V2
>>>>
>>>> Summary: ntf: Add cloud resilience support [#1180] V2
>>>> Review request for Trac Ticket(s): 1180
>>>> Peer Reviewer(s): Lennart, Praveen, Vu
>>>> Pull request to: NTF maintainers
>>>> Affected branch(es): default
>>>> Development branch: default
>>>>
>>>> 
>>>> Impacted area   Impact y/n
>>>> 
>>>>  Docs                    n
>>>>  Build system            n
>>>>  RPM/packaging           n
>>>>  Configuration files     n
>>>>  Startup scripts         n
>>>>  SAF services            y
>>>>  OpenSAF services        n
>>>>  Core libraries          n
>>>>  Samples                 n
>>>>  Tests                   n
>>>>  Other                   n
>>>>
>>>>
>>>> Comments (indicate scope for each "y" above):
>>>> -
>>>> This V2 has revised comments:
>>>> - Update description of checkNtfServerState
>>>> - Not using conditional operator in ntfa_mds_svc_evt
>>>> - Update Unsubscribe() ReadFinalize() aligned with README
>>>> - Add lock/unlock ntfa_cb.cb_lock for client recovery
>>>> - Update ntftest options: -ve is for tag mode only, -vpe works
>>>>
>>>>
>>>> changeset 884d1bdbea715fbc81941a0941c2d3f799a4395e
>>>> Author:Minh Hon Chau 
>>>> Date:  Tue, 01 Mar 2016 18:25:15 +1100
>>>>
>>>>NTF: Add support cloud resilience for NTF libs common [#1180]
>>>>
>>>>The patch contains support for cloud resilience feature in NTF
>>>> libs common
>>>>which are mostly used in Agent code
>>>>
>>>> changeset 6d941afbcd475e1ecf58c6f9586e5ff60a7a3319
>>>> Author:Minh Hon Chau 
>>>> Date:  Tue, 01 Mar 2016 18:26:53 +1100
>>>>
>>>>NTF: Add support cloud resilience for NTF Agent [#1180] V2
>>>>
>>>>The patch contains support for cloud resilience feature in NTF
>>>> Agent code.
>>>>Please refer README.HYDRA for content of the changes
>>>>
>>>> changeset ddd2369c000c3648466c06b8babad4b5884a0058
>>>> Author:Minh Hon Chau 
>>>> Da

Re: [devel] Proof Of Concept patch reusing SG FSM code for better handling of transient nodes during headless state(was Re: [PATCH 01 of 15] amfd: Add support for cloud resilience at common libs [#162

2016-03-10 Thread minh chau
To clarify my doubts: this SG FSM code works in the non-headless case
because it prevents the user from issuing a new admin op while a
previous admin op is still in progress, since the SG shares one FSM
state. Moreover, in faulty cases the events happen one after another in
time order, e.g. an SI admin op is issued, then an SU fault happens
during assignment. But after headless, everything has already happened,
so how will this FSM code be called in the right order (or run
concurrently) on all entities while sharing one FSM state?

Thanks,
Minh

On 10/03/16 16:13, minh chau wrote:
> Hi Praveen
>
> Thanks for PoC patch, I have been reading your patch, and here is my 
> understanding, please correct me if I am wrong.
> The approach of the patch in general is trying to pretend there's no 
> headless gap, the operations before headless will resume after SC 
> comes back.
> To achieve this, node director now has to give more information about 
> ha/assigning state so that director can resume sg fsm state.
>
> The PoC patch seems to add more code for SI than the others, so I 
> tried to play with it a bit
> Below is my initial testing for 3 favorite test cases and findings:
>
> 1- Setup 2N app (act SU on PL4, stb SU on PL5). Stop SCs, stop PL4. 
> Restart SC-1.
> I got this error
> 2016-03-10 13:11:35 PL-5 osafamfnd[418]: CR SU-SI record addition 
> failed, SU= safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon : 
> SI=safSi=AmfDemoTwon,safApp=AmfDemoTwon
>
> -> There's no uncompleted admin op, so SG FSM state set as REALIGN, 
> realign() will be called accordingly. At this moment, realign() is not 
> able to bring the remaining STANDBY to ACTIVE (unless it's modified)
>
> 2- Setup 2N app (as 1-). Lock SI, delay csi_set callback for quiesced 
> at "Assigning" state. Stop SCs, release csi_set cb at "Assigned" 
> state. Restart SC-1
>I got 2 SUs: 1 STANDBY, 1 QUIESCED
>
> 3- Setup 2N app (as 1-). Lock SI, delay csi_set callback for quiesced 
> at "Assigning" state. Stop SCs, restart SC-1, now release csi_set cb 
> at "Assigned" state.
>I got 2 SUs: 1 STANDBY, 1 QUIESCED
>
> -> I think 2- and 3- have the same root cause, after setting SG FSM 
> state as SI_OPER, the corresponding SG FSM code should be called is 
> si_admin_down(). I have tried to get it called in resume_sg_fsm() but 
> it is not working, it requires @admin_si to be set and needs to be 
> cleared at the end.
>
> I have a doubt that these admin operation SG FSM code, all of those 
> are normally started from the top sequence where are originated from 
> IMM admin callback. Now these SG FSM code are called in the way which 
> it is not supposed to be. I suspect there will be (many) changes in SG 
> FSM code to get it work after headless.
> Another thing, uncompleted admin op could be left over from headless, 
> but there could be a node reboot due to error in the other nodes 
> during headless. In such cases, wondering if these SG FSM currently 
> can handle this or it could be stuck somewhere down the track.
>
> The PoC patch is at very early stage I think, and at this moment I 
> don't know if the approach is working until it goes to the end of the 
> road.
> I suggest to test the completed PoC patch for 2N as below:
>
> - For each @entity in SI/SU/SG/node/nodegroup
> - For each @admin supported for this @entity
>  - Issue @admin command
>  - For each @callback for ACTIVE/STANDBY/QUIESCED/QUIESCING 
> received at component
>  - For each @delay of Assigning, Assigned in @callback
> - Test 1: Stop SC, release @delay, start SC
> - Test 2: Stop SC, release @delay, stop PL, start SC
> - Test 3: Stop SC, start SC, release @delay
> - Test 4: Stop SC, stop PL start SC, release @delay
> Check if after headless, amf-state looks right
>
> So the test (I hope) will scan through all SG FSM code of 2N
>
> One minor clarification for delayed_failover approach: Amfd comes back 
> from headless can not know what was happening during headless: SUSI 
> could (or not) be completed, some of SUs were fail-overed, nodes could 
> be rebooted, ... . Therefore, delayed_failover() works like a garbage 
> collector, pick up inappropriate SUSI states, set them back to the 
> right ones. Then SG FSM can start afterwards as STABLE state. In 
> maintainability argument, it's likely a "plug-in" on top of current SG 
> FSM code, and it's separated from SG FSM code. It currently works for 
> the above test I suggested, though there could be something left to be 
> improved.
>
> For now, I don't know which approach is better than the other

Re: [devel] Proof Of Concept patch reusing SG FSM code for better handling of transient nodes during headless state(was Re: [PATCH 01 of 15] amfd: Add support for cloud resilience at common libs [#162

2016-03-11 Thread minh chau


On 11/03/16 17:23, praveen malviya wrote:
>
>
> On 10-Mar-16 5:31 PM, minh chau wrote:
>> To clarify my doubts: These sg fsm code are working in non-headless in
>> the way it prevents user issue a new admin op while the previous admin
>> op has been in progress, because sg shares one fsm state. Moreover, in
>> faulty cases, they are happening subsequently in timing order, like: si
>> admin op is issued, then faulty su happens during assignment. But after
>> headless, everything had already happened so how these fsm code will be
>> called in the right order (or running concurrently) on all entities
>> while sharing one fsm state.
>>
> When SG is unstable because of some admin operation or because of 
> faults, no other admin operation is entertained in normal cluster and 
> this will be valid after headless state if resume state of SG is not a 
> stable state. Regarding faults, it has its own cycle from fault 
> isolation, recovery and repair. Current AMF code (including SG FSM 
> code like sg->node_fail()), handles faults during admin operations 
> also and this will also handle the same case in headless state also.
>
> If some faults happen during headless state it will reflect after 
> headless state in form of standard states of AMF entities like this:
> 1) AMFND will not be able to provide SUSI as it would have deleted 
> then based on nature of faults.
> 2) Operation state of entities will be disabled. etc
> 3) Presence state will be other than instantiated.
> Based on nature of faults, resume state of SG will go one step ahead 
> of what it was before headless state was observed.
>
> Regarding the testing over this patch, it is only for presenting the 
> idea or approach and is too crude to be used of testing. Anyways I 
> have attached in the ticket #1620 app.xml and poc.tgz, I used one 
> controller and one payload for testing. poc.tgz contains AMFD trace 
> from controller and AMFND trace from payload. Traces are for 
> successful LOCK operation on SG,SU and SI with quiesced state in 
> separate directory.
>
I thought I should give feedback on the PoC patch, since it was
introduced as having passed some tests.
Note that there are various configurations in which multiple
(in-service/spare) SUs and SIs belong to an SG, and in which some admin
ops on them had completed while another was still in progress just as
the cluster went headless.
Anyway, please go ahead and let us know when it's ready.
It would be great to have the patch at least a week before code freeze
day, since it takes time to verify it with our test cases and
troubleshoot if needed.

Thanks,
Minh
>
> Thanks,
> Praveen
>> Thanks,
>> Minh
>>
>> On 10/03/16 16:13, minh chau wrote:
>>> Hi Praveen
>>>
>>> Thanks for PoC patch, I have been reading your patch, and here is my
>>> understanding, please correct me if I am wrong.
>>> The approach of the patch in general is trying to pretend there's no
>>> headless gap, the operations before headless will resume after SC
>>> comes back.
>>> To achieve this, node director now has to give more information about
>>> ha/assigning state so that director can resume sg fsm state.
>>>
>>> The PoC patch seems to add more code for SI than the others, so I
>>> tried to play with it a bit
>>> Below is my initial testing for 3 favorite test cases and findings:
>>>
>>> 1- Setup 2N app (act SU on PL4, stb SU on PL5). Stop SCs, stop PL4.
>>> Restart SC-1.
>>> I got this error
>>> 2016-03-10 13:11:35 PL-5 osafamfnd[418]: CR SU-SI record addition
>>> failed, SU= safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon :
>>> SI=safSi=AmfDemoTwon,safApp=AmfDemoTwon
>>>
>>> -> There's no uncompleted admin op, so SG FSM state set as REALIGN,
>>> realign() will be called accordingly. At this moment, realign() is not
>>> able to bring the remaining STANDBY to ACTIVE (unless it's modified)
>>>
>>> 2- Setup 2N app (as 1-). Lock SI, delay csi_set callback for quiesced
>>> at "Assigning" state. Stop SCs, release csi_set cb at "Assigned"
>>> state. Restart SC-1
>>>I got 2 SUs: 1 STANDBY, 1 QUIESCED
>>>
>>> 3- Setup 2N app (as 1-). Lock SI, delay csi_set callback for quiesced
>>> at "Assigning" state. Stop SCs, restart SC-1, now release csi_set cb
>>> at "Assigned" state.
>>>I got 2 SUs: 1 STANDBY, 1 QUIESCED
>>>
>>> -> I think 2- and 3- have the same root cause, after setting SG FSM
>>> state as SI_OPER, the corresponding SG FSM code should be called is
>>> si_admin_dow

Re: [devel] [PATCH 01 of 15] amfd: Add support for cloud resilience at common libs [#1620]

2016-03-14 Thread minh chau
Hi Nagu, Praveen

Since #1-#4 have been acked, can you please push them?
#5 and #11_2 allow comp/su failover during headless, so we may have to
revisit them later.
However, patches #9, #10, #11_1, #12 and #13 are bug fixes that do not
relate to *delayed failover* and are needed for #1-#4. Can you please
have a look?

Thanks,
Minh

On 03/03/16 02:12, Nagendra Kumar wrote:
>
> #1 I have applied patches #1 to #4 only. With this patches(not having 
> patch #6), I thought to have passed most of the following tests, but 
> they got failed(Listed below).
>
> I could not test other scenarios (including alarms and notifications), 
> because I haven’t applied patch #6. I think there should be a simple 
> patch replacing patch #6, which handles transient state as ‘reboot the 
> node‘ if Amf finds SUSI in transient state on that node.
>
> I am attaching a concept patch(assignment_recovery.patch), which pass 
> some of the scenarios and we are testing and enhancing it.
>
> As Praveen has suggested that we need to reboot the node which is 
> undergoing in transient state to make it simple.
>
> This patch reduces complexity and maintainability.
>
> So, ACK for patch #1-#4 along with the attached patch.
>
> Please note that the attached patch has been created on patch #6 of 
> yours, so please apply #1 to #4 and then #6 and then the attached patch.
>
> Currently the patch is for 2N red model. We are working to make for 
> Nway Act and No red model (and possibly for Nway and NpM), we will 
> publish it tomorrow.
>
> TC #1:
>
> Configuration(Comp recovery is comp failover, saAmfSutDefSUFailover as 
> false) and logs attached(TC 1) in the ticket.
>
> 1. Start SC-1, PL-3 and PL-4. SU1 Act on PL-3 and SU2 Standby on SC-2.
>
> 2. Stop SC-1 and kill demo. It goes for comp failover as configured. 
> Ideally, node should reboot.
>
> 3. Start SC-1. After cluster timer expires, PL-4 got the following 
> error messages:
>
> Mar  2 08:01:15 PM_PL-4 osafamfnd[20050]: CR SU-SI record addition 
> failed, SU= safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 : 
> SI=safSi=AmfDemo,safApp=AmfDemo1
>
> Mar  2 08:01:15 PM_PL-4 osafamfnd[20050]: CR SU-SI record addition 
> failed, SU= safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 : 
> SI=safSi=AmfDemo1,safApp=AmfDemo1
>
> There is no assignment given for SU1. SU2 has Standby assignments:
>
> safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo,safApp=AmfDemo1
>
> saAmfSISUHAState=STANDBY(2)
>
> saAmfSISUHAReadinessState=READY_FOR_ASSIGNMENT(1)
>
> safSISU=safSu=SU2\,safSg=AmfDemo\,safApp=AmfDemo1,safSi=AmfDemo1,safApp=AmfDemo1
>
> saAmfSISUHAState=STANDBY(2)
>
> Other problems: a.) Further command for locking SU1/SU2 fails in SG 
> unstable error.
>
> b.) Immlist if SU2 gives the below 
> result, Standby assignment it prints as 4, which is wrong:
>
> saAmfSUNumCurrStandbySIs SA_UINT32_T  4 (0x4)
>
> saAmfSUNumCurrActiveSIs SA_UINT32_T  0 (0x0)
>
> c.) Even if SC-2 joins, and you do 
> failover/switchover of SC-1, still same as above.
>
> TC #2: After execution of TC #1, stop PL-3. In worst case, SU2 
> assignment should change to Act, which is not happening. After 
> stopping of PL-4 also, the same problems as TC #1. logs attached(TC 2).
>
> TC #3: After TC #2, start PL-3 and start SC-2.
>
> SU1 is instantiated, but no assignment and the same 
> problem as above.
>
> When stop PL-4, SU1 gets assignments, the following 
> logs comes at SC-2:
>
> Mar  2 09:06:18 PM_SC-2 osafamfd[8518]: ER avd_ckpt_siass: 
> safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 safSi=AmfDemo,safApp=AmfDemo1 
> does not exist
>
> Mar  2 09:06:18 PM_SC-2 osafamfd[8518]: ER avd_ckpt_siass: 
> safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1 safSi=AmfDemo1,safApp=AmfDemo1 
> does not exist
>
> Mar  2 09:06:21 PM_SC-2 kernel: [ 3290.784933] tipc: Resetting link 
> <1.1.2:eth0-1.1.4:eth0>, peer not responding
>
> Mar  2 09:06:21 PM_SC-2 kernel: [ 3290.784947] tipc: Lost link 
> <1.1.2:eth0-1.1.4:eth0> on network plane A
>
> Mar  2 09:06:21 PM_SC-2 kernel: [ 3290.784956] tipc: Lost contact with 
> <1.1.4>
>
> Start PL-4, SU2 gets Standby assignments and everything works fine 
> after that.
>
> TC #4: Similar problems exist in the following test cases:
>
> a.)Configuration same as TC #1 except saAmfSutDefSUFailover as true.
>
> After killing demo, PL-3 went for reboot.
>
> But the problem is the same as shown in TC #1, TC #2 
> and TC #3.
>
> b.) Configuration same as TC #1 except with  saAmfCtDefRecoveryOnError 
> as 2 and saAmfCtDefDisableRestart as 1.
>
> But the problem is the same as shown in TC #1, TC #2 
> and TC #3.
>
> c.)Configuration same as TC #1 except with  saAmfCtDefRecoveryOnError 
> as 2 and saAmfCtDefDisableRestart as 1 and saAmfSutDefSUFailover as 1.
>
> After killing demo, PL-3 went for reboot.
>
> But the problem is the same as shown in TC #1, TC #2 
> and 

Re: [devel] [PATCH 2 of 5] NTF: Add support cloud resilience for NTF Agent [#1180] V3

2016-03-14 Thread minh chau
Hi Lennart,

The current agent code uses both the ncshm handle protection and the
global cb_lock mutex, and this happens in most of the APIs with the
patterns below:

// take lock
// doing something
// if not success
//     unlock
//     goto done
// unlock
// continue doing something
// done:

// take lock
// doing something
// if not success
//     goto done
// done:
//     unlock

These two patterns are now used for both the ncshm handle and cb_lock,
mixed together. Because of this, there is a lot of lock releasing at the
end of each function, which makes the code harder to maintain and makes
it easy to introduce mistakes that lead to deadlock.
For cloud resilience, if ntfa_ntfsv_state is protected by cb_lock inside
the calling functions, which are the APIs, that introduces even more
duplication of the above patterns, not only around checkNtfServerState()
but also wherever only ntfa_ntfsv_state needs to be checked (Finalize,
Unsubscribe, ReadFinalize).
Your comment is also right: there is a risk of overlapping protection,
and this should be avoided in normal cases.
But I think in the NTF API functions it would be better if:
 - cb_lock is taken out of the API functions; access to cb resources is
written in separate functions, in which cb_lock is used
 - Only the ncshm handle protection is used in the API functions, which
then do not need to worry about cb_lock

Done this way, I think the API code would look clearer and have fewer
duplicated patterns.
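
To illustrate the idea only, here is a minimal sketch, not the actual
agent code. Every name prefixed with sketch_ is hypothetical, and the
real ntfa_cb layout and ntfa_ntfsv_state type are not reproduced. The
server state is read and written only through small functions that own
cb_lock, so the API-style function has no lock/unlock paths of its own:

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins for the server state and the control block. */
typedef enum { SKETCH_NTFSV_DOWN = 0, SKETCH_NTFSV_UP = 1 } sketch_ntfsv_state_t;

static struct {
    pthread_mutex_t cb_lock;
    sketch_ntfsv_state_t ntfsv_state;
} sketch_cb = { PTHREAD_MUTEX_INITIALIZER, SKETCH_NTFSV_DOWN };

/* Writer: lock and unlock are paired inside one short scope.
 * Must NOT be called while the caller already holds cb_lock,
 * because the mutex is not recursive. */
void sketch_set_ntfsv_state(sketch_ntfsv_state_t new_state)
{
    int rc = pthread_mutex_lock(&sketch_cb.cb_lock);
    assert(rc == 0);
    sketch_cb.ntfsv_state = new_state;
    rc = pthread_mutex_unlock(&sketch_cb.cb_lock);
    assert(rc == 0);
    (void)rc;
}

/* Reader: returns a copy taken under the lock (same caller rule). */
sketch_ntfsv_state_t sketch_get_ntfsv_state(void)
{
    sketch_ntfsv_state_t state;
    int rc = pthread_mutex_lock(&sketch_cb.cb_lock);
    assert(rc == 0);
    state = sketch_cb.ntfsv_state;
    rc = pthread_mutex_unlock(&sketch_cb.cb_lock);
    assert(rc == 0);
    (void)rc;
    return state;
}

/* API-style function: no cb_lock handling, no unlock-before-goto.
 * Returns -1 where a real API would report TRY_AGAIN, 0 otherwise. */
int sketch_api_call(void)
{
    if (sketch_get_ntfsv_state() != SKETCH_NTFSV_UP)
        return -1;
    return 0;
}
```

The price of this layout is the rule written in the comments: the
accessor functions must never be called from a region that already
holds cb_lock, which is the hidden risk discussed in this thread.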

I think we had a quick discussion and agreed that a ticket should be
raised for this matter, but we haven't come up with a generic solution.
The above is not a final solution, but I hope it makes the code a bit
clearer. That is how I was trying to make the cloud resilience
variables thread safe. If it sounds right to you, I think we can leave
the rest for another ticket. Otherwise, I can also change back to the
way cb_lock has been used in the above patterns so far. Then we may
have further discussion on this.

Thanks,
Minh



On 14/03/16 22:59, Lennart Lund wrote:
> Hi Minh,
>
> I see that you are using mutexes inside the checkNtfServerState(). I don't 
> think this is a good solution since the same mutex  is used directly in the 
> function calling checkNtfServerState(). The mutex usage in the 
> checkNtfServerState() is hidden and there is a risk that this function may be 
> placed within a protected area in the calling function. It is better to write 
> a note in the checkNtfServerState() function head clearly telling that this 
> function is not thread safe and has to be protected with the cb_lock mutex 
> and then use the mutex in the calling function.
>
> Also I found a lot more unprotected usage of the global ntfa_cb structure but 
> that is not directly related to the resilience update e.g a lot of functions 
> taking a pointer to this structure. I think a ticket for this should be 
> written now. The fix for checkNtfServerState() will just make sure that the 
> resilience patch don't add even more thread related issues.
>
> Thanks
> Lennart
>
>> -----Original Message-----
>> From: Minh Hon Chau [mailto:minh.c...@dektech.com.au]
>> Sent: den 14 mars 2016 03:08
>> To: Lennart Lund; praveen.malv...@oracle.com; Vu Minh Nguyen; Minh
>> Chau H
>> Cc: opensaf-devel@lists.sourceforge.net
>> Subject: [PATCH 2 of 5] NTF: Add support cloud resilience for NTF Agent
>> [#1180] V3
>>
>>   osaf/libs/agents/saf/ntfa/ntfa.h  |   31 +-
>>   osaf/libs/agents/saf/ntfa/ntfa_api.c  |  702
>> +++--
>>   osaf/libs/agents/saf/ntfa/ntfa_mds.c  |   14 +-
>>   osaf/libs/agents/saf/ntfa/ntfa_util.c |  465 +-
>>   4 files changed, 1057 insertions(+), 155 deletions(-)
>>
>>
>> The patch contains support for cloud resilience feature
>> in NTF Agent code. Please refer README.HYDRA for content
>> of the changes
>>
>> diff --git a/osaf/libs/agents/saf/ntfa/ntfa.h
>> b/osaf/libs/agents/saf/ntfa/ntfa.h
>> --- a/osaf/libs/agents/saf/ntfa/ntfa.h
>> +++ b/osaf/libs/agents/saf/ntfa/ntfa.h
>> @@ -91,6 +91,7 @@ typedef struct ntfa_filter_hdl_rec {
>>   typedef struct subscriberList {
>>  SaNtfHandleT subscriberListNtfHandle;
>>  SaNtfSubscriptionIdT subscriberListSubscriptionId;
>> +ntfsv_filter_ptrs_t filters; /* remember the filters used by this
>> subscriber */
>>  struct subscriberList *prev;
>>  struct subscriberList *next;
>>   } ntfa_subscriber_list_t;
>> @@ -100,6 +101,10 @@ typedef struct ntfa_reader_hdl_rec {
>>  unsigned int reader_id; /* handle value returned by NTFS
>> for this client */
>>  SaNtfHandleT ntfHandle;
>>  unsigned int reader_hdl;/* READER handle from handle
>> mgr */
>> +
>> +ntfsv_filter_ptrs_t filters

Re: [devel] [PATCH 2 of 5] NTF: Add support cloud resilience for NTF Agent [#1180] V3

2016-03-19 Thread minh chau
Hi Lennart,

Please see my comment in line with [Minh]

Thanks,
Minh

On 15/03/16 21:05, Lennart Lund wrote:
> Hi Minh,
>
> I still think it is better to have all the mutex handling in the API 
> function. Mutex handling is actually not taken away from the API function by 
> doing it in a function called from the API function it's only hidden and the 
> risk of creating deadlocks increase. However in this case it is under control 
> and anyway the usage of global variables within the function is hidden as 
> well. I give it to you to decide.
[Minh] The reason I suggest handling the cb_lock mutex in a separate 
function is that I have seen the deadlock reported in #1521, which can 
be generalized as below:
saNtfFinalize() in thread 1        Unsubscribe() in thread 2
// take handle
// lock cb_lock
// give handle
                                   // take handle
                                   // lock cb_lock
                                   -> I'm locked
// destroy handle
-> I'm locked too

I think similar deadlock scenarios still exist in other APIs 
(readFinalize vs readNext, ...). The reason is that locking cb (or taking a 
handle) is not always followed by the matching unlock of cb (or give 
handle). Also, the current API code mixes cb_lock and handle usage with 
many "goto" statements in the middle of functions, which increases the risk 
of running into the above scenario. Keeping the cb_lock handling in a 
separate function should help.
The cb_lock is used to protect cb resources; 
subscriptionListAdd()/subscriberListItemRemove() are examples of this. 
However, the code does not use cb_lock consistently to protect cb 
resources (ntfa_reader_hdl_rec_add/ntfa_reader_hdl_rec_del, for instance), 
which increases the risk of the above deadlock.
I guess the hidden deadlock you mentioned happens when a function that 
locks cb_lock itself is called after cb is already locked. But this case 
should not happen provided the code is checked in advance; for example, cb 
should not be locked just before calling subscriptionListAdd().

So I think that protecting the cb variables of #1180 in a separate function 
is not an ideal fit with the current API code, but it's acceptable in the 
scope of cloud resilience.
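
A minimal sketch of the "cb_lock in a separate function" idea, assuming 
hypothetical names (demo_cb, demo_subscriber_add, demo_subscriber_remove) 
that are stand-ins and not the actual ntfa code. Lock and unlock are paired 
inside each helper, so an early exit in an API function cannot leave 
cb_lock held:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the agent control block and a subscriber list. */
typedef struct demo_subscriber {
	unsigned int subscription_id;
	struct demo_subscriber *next;
} demo_subscriber_t;

typedef struct {
	pthread_mutex_t cb_lock;
	demo_subscriber_t *subscribers;
} demo_cb_t;

static demo_cb_t demo_cb = {
	.cb_lock = PTHREAD_MUTEX_INITIALIZER,
	.subscribers = NULL
};

/* Push a node on the list. The caller must NOT already hold cb_lock. */
void demo_subscriber_add(demo_subscriber_t *node)
{
	pthread_mutex_lock(&demo_cb.cb_lock);
	node->next = demo_cb.subscribers;
	demo_cb.subscribers = node;
	pthread_mutex_unlock(&demo_cb.cb_lock);
}

/* Unlink the entry with the given id; returns true if it was found. */
bool demo_subscriber_remove(unsigned int subscription_id)
{
	bool found = false;

	pthread_mutex_lock(&demo_cb.cb_lock);
	for (demo_subscriber_t **pp = &demo_cb.subscribers; *pp != NULL;
	     pp = &(*pp)->next) {
		if ((*pp)->subscription_id == subscription_id) {
			*pp = (*pp)->next;
			found = true;
			break;
		}
	}
	pthread_mutex_unlock(&demo_cb.cb_lock);
	return found;
}
```

The mutex here is a plain (non-recursive) one, so calling either helper 
while already holding cb_lock would block; that is the "hidden" case that 
has to be checked in advance.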

>
> I don't think handling a cb_lock in functions called from the API functions 
> is a solution for the future. Instead the global cb structure and other 
> global structures e.g. client structure , reader structure etc. should be 
> removed as global variables.
> Instead all variables should be owned by the functionality that actually use 
> them cf. private variables in C++.
> E.g. Client handles and client structures shall be owned by a client handler 
> that can do all needed operations related to a client. If this handler must 
> use locks it shall own its own mutex.
> There is no need to lock a whole cb structure with a lot of unrelated 
> variables if one client variable is read or changed. The client handler has 
> no interest in how MDS is initiated and what handles needed for that purpose 
> etc...
> When writing C-code, variables that must be shared by several functions 
> within a handler, can be isolated within that handler by giving the handler 
> its own file and not exposing the variables in any .h file. The handler .h 
> file shall only expose its interface which consists of functions but no 
> variables.
>
> This will make the API functions clean from any handling of mutexes.
[Minh] This raises a good point about how to keep the API functions clean 
of mutex handling. We may need some sort of cb handler (or 
manager) that only exposes interfaces to be called from the API functions. 
These interfaces give access (add/remove/...) to each specific cb 
resource (client_rec, filters, ...) and take/release cb_lock at their 
entry/exit points. The bodies of the interfaces consist of the actual 
functions that operate on the cb resource.
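
A rough sketch of such a handler, assuming a hypothetical client_handler 
module (none of these names exist in ntfa). The state and its mutex are 
file-private, and only functions are exposed, so the API functions never 
touch a lock directly:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* ---- would live in a hypothetical client_handler.h: functions only ---- */
bool client_handler_add(unsigned long long handle);
bool client_handler_remove(unsigned long long handle);
size_t client_handler_count(void);

/* ---- would live in client_handler.c: state is private to this file ---- */
#define MAX_CLIENTS 8

static pthread_mutex_t client_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long long clients[MAX_CLIENTS];
static size_t n_clients;

/* Register a client handle; fails when the table is full. */
bool client_handler_add(unsigned long long handle)
{
	bool ok = false;

	pthread_mutex_lock(&client_lock);
	if (n_clients < MAX_CLIENTS) {
		clients[n_clients++] = handle;
		ok = true;
	}
	pthread_mutex_unlock(&client_lock);
	return ok;
}

/* Remove a client handle; returns false if it is not registered. */
bool client_handler_remove(unsigned long long handle)
{
	bool ok = false;

	pthread_mutex_lock(&client_lock);
	for (size_t i = 0; i < n_clients; i++) {
		if (clients[i] == handle) {
			/* fill the hole with the last entry */
			clients[i] = clients[--n_clients];
			ok = true;
			break;
		}
	}
	pthread_mutex_unlock(&client_lock);
	return ok;
}

/* Number of registered clients. */
size_t client_handler_count(void)
{
	pthread_mutex_lock(&client_lock);
	size_t n = n_clients;
	pthread_mutex_unlock(&client_lock);
	return n;
}
```

The fixed-size array is only to keep the sketch short; the point is that 
the lock protects just the client data and is never visible outside the 
handler.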

Another refactoring that I think would be useful: the data structures 
related to client_rec/reader_rec/subscriber_rec should reflect the 
object orientation described in the specification, where a client can be a 
producer/reader/subscriber. Currently a client_rec contains a list of 
readers, but the list of subscribers is declared globally outside 
client_rec. Also, the reader_list declared in the ntfa_cb_t structure seems 
not to be used at all. I think an object-oriented refactoring would 
support the idea of implementing a cb handler as above.

If this sounds right to you, there seem to be some things to do
>
> Thanks
> Lennart
>
>> -Original Message-
>> From: minh chau [mailto:minh.c...@dektech.com.au]
>> Sent: den 15 mars 2016 02:05
>> To: Lennart Lund; praveen.malv...@oracle.com; Vu Minh Nguyen
>> Cc

Re: [devel] [PATCH 2 of 5] NTF: Add support cloud resilience for NTF Agent [#1180] V3

2016-03-19 Thread minh chau
Hi Lennart,

I guess it's an ACK from you? Vu and Praveen gave ACK on V2, and if you 
don't have any comments on V3, can you please push it?

Thanks,
Minh

On 17/03/16 23:09, Lennart Lund wrote:
> Hi Minh
>
> See my comments inline [Lennart]
>
> Thanks
> Lennart
>
>> -----Original Message-
>> From: Minh Chau H
>> Sent: den 16 mars 2016 15:58
>> To: Lennart Lund; praveen.malv...@oracle.com; Vu Minh Nguyen
>> Cc: opensaf-devel@lists.sourceforge.net; Anders Widell; Minh Chau H
>> Subject: Re: [PATCH 2 of 5] NTF: Add support cloud resilience for NTF Agent
>> [#1180] V3
>>
>> Hi Lennart,
>>
>> Please see my comment in line with [Minh]
>>
>> Thanks,
>> Minh
>>
>> On 15/03/16 21:05, Lennart Lund wrote:
>>> Hi Minh,
>>>
>>> I still think it is better to have all the mutex handling in the API 
>>> function.
>> Mutex handling is actually not taken away from the API function by doing it 
>> in
>> a function called from the API function it's only hidden and the risk of
>> creating deadlocks increase. However in this case it is under control and
>> anyway the usage of global variables within the function is hidden as well. I
>> give it to you to decide.
>> [Minh] The reason I suggest that handling cb_lock mutex in separate
>> function, because I have seen the deadlock reported in #1521, it should
>> be generalized as below:
>>  saNtfFinalize() in thread 1        Unsubscribe() in thread 2
>>  // take handle
>>  // lock cb_lock
>>  // give handle
>>                                     // take handle
>>                                     // lock cb_lock
>>                                     -> I'm locked
>>  // destroy handle
>>  -> I'm locked too
>>
>> I think the similar deadlock scenarios still there in other APIs
>> (readFinalize vs readNext , ...), the reason is when locking cb (or take
>> handle), the next is not unlocking cb (or give handle) respectively. And
>> the fact that in current APIs code, a mix of cb_lock/handle usage where
>> a lot of "go to" in the middle of function  would increase the risk of
>> running into the above scenario. Separating cb_lock in function should
>> be helpful.
>> The cb_lock is used to protect cb resource,
>> subscriptionListAdd()/subscriberListItemRemove() are example of this.
>> However, the code is not using cb_lock in consistent way to protect cb
>> resource, ntfa_reader_hdl_rec_add/ntfa_reader_hdl_rec_del for instance,
>> which increases the risk of above deadlock.
>> The hidden deadlock you mentioned I guess, it happens because *cb_lock -
>> separate - function* called after locking cb. But this case should not
>> happen given that it's *glanced* over in advance, like locking cb should
>> not be called just before calling subscriptionListAdd()
>>
>> So I think that protects cb variables of #1180 in separate function is
>> not ideal to fit with current APIs code but it's acceptable in the scope
>> of cloud resilience.
>>
> [Lennart] I am OK with your decision
>>> I don't think handling a cb_lock in functions called from the API functions 
>>> is
>> a solution for the future. Instead the global cb structure and other global
>> structures e.g. client structure , reader structure etc. should be removed as
>> global variables.
>>> Instead all variables should be owned by the functionality that actually use
>> them cf. private variables in C++.
>>> E.g. Client handles and client structures shall be owned by a client handler
>> that can do all needed operations related to a client. If this handler must 
>> use
>> locks it shall own its own mutex.
>>> There is no need to lock a whole cb structure with a lot of unrelated
>> variables if one client variable is read or changed. The client handler has 
>> no
>> interest in how MDS is initiated and what handles needed for that purpose
>> etc...
>>> When writing C-code, variables that must be shared by several functions
>> within a handler, can be isolated within that handler by giving the handler 
>> its
>> own file and not exposing the variables in any .h file. The handler .h file 
>> shall
>> only expose its interface which consists of functions but no variables.
>>> This will make the API functions clean from any handling of mutexes.
>> [Minh] This raises a good idea that how to make API functi

Re: [devel] [PATCH 2 of 5] NTF: Add support cloud resilience for NTF Agent [#1180] V3

2016-03-20 Thread minh chau
Hi Praveen, Vu

Any comment on V3?

Thanks,
Minh

On 18/03/16 19:01, Lennart Lund wrote:
> Hi Minh
>
> Ack, sorry for being a bit unclear
>
> Thanks
> Lennart
>
>> -Original Message-
>> From: minh chau [mailto:minh.c...@dektech.com.au]
>> Sent: den 18 mars 2016 07:36
>> To: Lennart Lund; praveen.malv...@oracle.com; Vu Minh Nguyen
>> Cc: opensaf-devel@lists.sourceforge.net; Anders Widell
>> Subject: Re: [PATCH 2 of 5] NTF: Add support cloud resilience for NTF Agent
>> [#1180] V3
>>
>> Hi Lennart,
>>
>> I guess it's an ACK from you? Vu and Praveen gave ACK on V2 and if don't
>> have any comment on V3, can you please push it?
>>
>> Thanks,
>> Minh
>>
>> On 17/03/16 23:09, Lennart Lund wrote:
>>> Hi Minh
>>>
>>> See my comments inline [Lennart]
>>>
>>> Thanks
>>> Lennart
>>>
>>>> -----Original Message-
>>>> From: Minh Chau H
>>>> Sent: den 16 mars 2016 15:58
>>>> To: Lennart Lund; praveen.malv...@oracle.com; Vu Minh Nguyen
>>>> Cc: opensaf-devel@lists.sourceforge.net; Anders Widell; Minh Chau H
>>>> Subject: Re: [PATCH 2 of 5] NTF: Add support cloud resilience for NTF
>> Agent
>>>> [#1180] V3
>>>>
>>>> Hi Lennart,
>>>>
>>>> Please see my comment in line with [Minh]
>>>>
>>>> Thanks,
>>>> Minh
>>>>
>>>> On 15/03/16 21:05, Lennart Lund wrote:
>>>>> Hi Minh,
>>>>>
>>>>> I still think it is better to have all the mutex handling in the API 
>>>>> function.
>>>> Mutex handling is actually not taken away from the API function by doing
>> it in
>>>> a function called from the API function it's only hidden and the risk of
>>>> creating deadlocks increase. However in this case it is under control and
>>>> anyway the usage of global variables within the function is hidden as well.
>> I
>>>> give it to you to decide.
>>>> [Minh] The reason I suggest that handling cb_lock mutex in separate
>>>> function, because I have seen the deadlock reported in #1521, it should
>>>> be generalized as below:
>>>>   saNtfFinalize() in thread 1        Unsubscribe() in thread 2
>>>>   // take handle
>>>>   // lock cb_lock
>>>>   // give handle
>>>>                                      // take handle
>>>>                                      // lock cb_lock
>>>>                                      -> I'm locked
>>>>   // destroy handle
>>>>   -> I'm locked too
>>>>
>>>> I think the similar deadlock scenarios still there in other APIs
>>>> (readFinalize vs readNext , ...), the reason is when locking cb (or take
>>>> handle), the next is not unlocking cb (or give handle) respectively. And
>>>> the fact that in current APIs code, a mix of cb_lock/handle usage where
>>>> a lot of "go to" in the middle of function  would increase the risk of
>>>> running into the above scenario. Separating cb_lock in function should
>>>> be helpful.
>>>> The cb_lock is used to protect cb resource,
>>>> subscriptionListAdd()/subscriberListItemRemove() are example of this.
>>>> However, the code is not using cb_lock in consistent way to protect cb
>>>> resource, ntfa_reader_hdl_rec_add/ntfa_reader_hdl_rec_del for
>> instance,
>>>> which increases the risk of above deadlock.
>>>> The hidden deadlock you mentioned I guess, it happens because *cb_lock
>> -
>>>> separate - function* called after locking cb. But this case should not
>>>> happen given that it's *glanced* over in advance, like locking cb should
>>>> not be called just before calling subscriptionListAdd()
>>>>
>>>> So I think that protects cb variables of #1180 in separate function is
>>>> not ideal to fit with current APIs code but it's acceptable in the scope
>>>> of cloud resilience.
>>>>
>>> [Lennart] I am OK with your decision
>>>>> I don't think handling a cb_lock in functions called from the API 
>>>>> functions
>> is
>>>> a solution for the future. Instead the global cb structure and other glob

Re: [devel] [opensaf:tickets] #1620 amf: add support for 'cloud resilience' feature

2016-03-21 Thread minh chau
Hi Nagu,

Please see my comment inline

Thanks,
Minh

On 21/03/16 20:48, Nagendra Kumar wrote:
>
> TC #9 and TC #10: Configuration is SC-1, PL-3 and PL-4. SU1(Act) on 
> PL-3 and SU2(Std) on PL-4.
> Stop controller and then stop PL-3, start PL-3 and start controller. 
> SU2 will be Act and SU1 will be Standby.
> Stop controller and then stop PL-3, start PL-3 and start controller. 
> Amfd sends reboot to all nodes including itself.
> Two things:
> 1. Why there were "SU data inconsistency detected", when there are two 
> payloads.
>
[Minh] PL-4 is the last veteran node, so IMM won't perform a sync. I 
think Zoran mentioned this while reviewing #1625 on the mailing list.
@Zoran: Can you please confirm whether I misunderstood, or whether 
something has changed?
>
> 2. Since SG belonging to SU1 and SU2 are spread on PL-3 and PL-4, so 
> why to reboot SC-1 ?
>
[Minh] I have checked with Hans N; in this case the cluster needs to be 
rebooted. When the cluster has more PLs, there could be inconsistency 
between amfd and the amfnd(s) on nodes other than PL-4.
> 
>
> *[tickets:#1620] amf: add support for 'cloud resilience' feature*
>
> *Status:* review
> *Milestone:* 5.0.FC
> *Created:* Mon Dec 07, 2015 07:47 AM UTC by Gary Lee
> *Last Updated:* Mon Mar 21, 2016 08:47 AM UTC
> *Owner:* Minh Hon Chau
> *Attachments:*
>
>   * New TC 1.rar (245.9 kB; application/octet-stream)
>   * New TC 2.rar (268.3 kB; application/octet-stream)
>   * New TC 3.rar (398.2 kB; application/octet-stream)
>   * New TC 4.a.rar (411.5 kB; application/octet-stream)
>   * New TC 4.b.rar (410.1 kB; application/octet-stream)
>   * New TC 5.rar (250.1 kB; application/octet-stream)
>   * TC 1.rar (233.9 kB; application/octet-stream)
>   * TC 2.rar (243.5 kB; application/octet-stream)
>   * TC 21.rar (127.3 kB; application/octet-stream)
>   * TC 22.rar (315.3 kB; application/octet-stream)
>   * TC 24.rar (250.3 kB; application/octet-stream)
>   * TC 25.rar (251.6 kB; application/octet-stream)
>   * TC 26.rar (272.7 kB; application/octet-stream)
>   * TC 27.rar (238.6 kB; application/octet-stream)
>   * TC 28.rar (198.3 kB; application/octet-stream)
>   * TC 29.rar (275.6 kB; application/octet-stream)
>   * TC 30.rar (265.5 kB; application/octet-stream)
>   * TC 31.rar (259.4 kB; application/octet-stream)
>   * TC 32.rar (450.2 kB; application/octet-stream)
>   * TC 33.rar (246.3 kB; application/octet-stream)
>   * TC 43.rar (1.1 kB; application/octet-stream)
>   * TC 45.rar (195.3 kB; application/octet-stream)
>   * TC 46.rar (248.1 kB; application/octet-stream)
>   * TC 47.rar (475.1 kB; application/octet-stream)
>   * TC 48.rar (247.0 kB; application/octet-stream)
>   * TC 49.rar

Re: [devel] AMF PR doc for #1533.

2016-04-07 Thread minh chau
Ack with minor comment:
On the first page, do we need to update "Release 4.7 Programmer's Reference 
October 2015"?

Thanks,
Minh

On 07/04/16 16:12, praveen malviya wrote:
> Hi All,
>
> Please find attached AMF PR doc updated for #1533.
> Only one minor change is done in section 2.2.10.1 Scope of Admin 
> operation on Nodegroup:
>
> Earlier text was:
> "Delete CCB operation is allowed on node group in LOCKED or UNLOCKED 
> admin state and ..."
>
> Now since nodegroup can be deleted in lock-in state:
> "Delete CCB operation is allowed on node group in LOCKED, UNLOCKED and 
> LOCK-IN admin state and ."
>
>
> Thanks,
> Praveen
>
> On 01-Mar-16 6:17 PM, Hans Nordebäck wrote:
>> ack, code review only/Thanks HansN
>>
>> On 01/20/2016 11:00 AM, praveen.malv...@oracle.com wrote:
>>> osaf/services/saf/amf/amfd/include/node.h |   3 +++
>>>   osaf/services/saf/amf/amfd/node.cc|   8 
>>>   osaf/services/saf/amf/amfd/nodegroup.cc   |  20 +++-
>>>   3 files changed, 10 insertions(+), 21 deletions(-)
>>>
>>>
>>> diff --git a/osaf/services/saf/amf/amfd/include/node.h
>>> b/osaf/services/saf/amf/amfd/include/node.h
>>> --- a/osaf/services/saf/amf/amfd/include/node.h
>>> +++ b/osaf/services/saf/amf/amfd/include/node.h
>>> @@ -144,6 +144,9 @@ class AVD_AVND {
>>> bool clm_change_start_preceded; /* to indicate there was CLM start
>>> cbk before CLM completed cb. */
>>> bool recvr_fail_sw; /* to indicate there was node reboot because
>>> of node failover/switchover.*/
>>> AVD_AMF_NG *admin_ng; /* points to the nodegroup on which admin
>>> operation is going on.*/
>>> +
>>> +  //Member functions.
>>> +  void node_sus_termstate_set(bool term_state) const;
>>>private:
>>> void initialize();
>>> // disallow copy and assign
>>> diff --git a/osaf/services/saf/amf/amfd/node.cc
>>> b/osaf/services/saf/amf/amfd/node.cc
>>> --- a/osaf/services/saf/amf/amfd/node.cc
>>> +++ b/osaf/services/saf/amf/amfd/node.cc
>>> @@ -1151,9 +1151,9 @@ void avd_node_admin_lock_unlock_shutdown
>>>*
>>>* @param node
>>>*/
>>> -static void node_sus_termstate_set(AVD_AVND *node, bool term_state)
>>> +void AVD_AVND::node_sus_termstate_set(bool term_state) const
>>>   {
>>> -for (const auto& su : node->list_of_su) {
>>> +for (const auto& su : list_of_su) {
>>>   if (su->saAmfSUPreInstantiable == true)
>>>   su->set_term_state(term_state);
>>>   }
>>> @@ -1326,7 +1326,7 @@ static void node_admin_op_cb(SaImmOiHand
>>>   goto done;
>>>   }
>>> -node_sus_termstate_set(node, true);
>>> +node->node_sus_termstate_set(true);
>>>   node_admin_state_set(node, 
>>> SA_AMF_ADMIN_LOCKED_INSTANTIATION);
>>>   if (node->node_info.member == false) {
>>> @@ -1374,7 +1374,7 @@ static void node_admin_op_cb(SaImmOiHand
>>>   goto done;
>>>   }
>>> -node_sus_termstate_set(node, false);
>>> +node->node_sus_termstate_set(false);
>>>   node_admin_state_set(node, SA_AMF_ADMIN_LOCKED);
>>>   if (node->node_info.member == false) {
>>> diff --git a/osaf/services/saf/amf/amfd/nodegroup.cc
>>> b/osaf/services/saf/amf/amfd/nodegroup.cc
>>> --- a/osaf/services/saf/amf/amfd/nodegroup.cc
>>> +++ b/osaf/services/saf/amf/amfd/nodegroup.cc
>>> @@ -29,7 +29,6 @@ static AVD_AMF_NG *ng_create(SaNameT *dn
>>>   //TODO: Make  below function members.
>>>   static void ng_admin_unlock_inst(AVD_AMF_NG *ng);
>>>   static void ng_unlock(AVD_AMF_NG *ng);
>>> -static void node_sus_termstate_set(AVD_AVND *node, bool term_state);
>>>   /**
>>>* Lookup object in db using dn
>>> @@ -601,7 +600,7 @@ static void ng_ccb_apply_delete_hdlr(Ccb
>>>   if ((node->saAmfNodeAdminState ==
>>> SA_AMF_ADMIN_LOCKED_INSTANTIATION) ||
>>>   (any_ng_in_locked_in_state(node) == true))
>>>   continue;
>>> -node_sus_termstate_set(node, false);
>>> +node->node_sus_termstate_set(false);
>>>   }
>>>   //Instantiate SUs on nodes of NG. AMFD takes care of
>>> assignment after instantiation.
>>>   ng_admin_unlock_inst(ng);
>>> @@ -932,19 +931,6 @@ static void ng_unlock(AVD_AMF_NG *ng)
>>>   }
>>>   /**
>>> - * Set term_state for all pre-inst SUs hosted on the specified node
>>> - *
>>> - * @param node
>>> - */
>>> -static void node_sus_termstate_set(AVD_AVND *node, bool term_state)
>>> -{
>>> -for (const auto& su : node->list_of_su) {
>>> -if (su->saAmfSUPreInstantiable == true)
>>> -su->set_term_state(term_state);
>>> -}
>>> -}
>>> -
>>> -/**
>>>* perform unlock-instantiation on NG with honoring saAmfSURank.
>>>*
>>>* @param cb
>>> @@ -1079,7 +1065,7 @@ static void ng_admin_op_cb(SaImmOiHandle
>>>   node->admin_ng = ng;
>>>   if (node->saAmfNodeAdminState != SA_AMF_ADMIN_LOCKED)
>>>   continue;
>>> -node_sus_termstate_set(node, true);

Re: [devel] [PATCH 1 of 1] amfnd: return TRY_AGAIN for saAmfProtectionGroupTrack and saAmfProtectionGroupTrackStop while headless [#1718]

2016-04-07 Thread minh chau
Ack from me

Thanks,
Minh

On 07/04/16 13:53, Gary Lee wrote:
>   osaf/services/saf/amf/amfnd/pg.cc |  18 ++
>   1 files changed, 18 insertions(+), 0 deletions(-)
>
>
> return TRY_AGAIN for saAmfProtectionGroupTrack and 
> saAmfProtectionGroupTrackStop
> while headless, since protection group tracking requires amfd's presence
>
> diff --git a/osaf/services/saf/amf/amfnd/pg.cc 
> b/osaf/services/saf/amf/amfnd/pg.cc
> --- a/osaf/services/saf/amf/amfnd/pg.cc
> +++ b/osaf/services/saf/amf/amfnd/pg.cc
> @@ -147,6 +147,15 @@ uint32_t avnd_evt_ava_pg_start_evh(AVND_
>   
>   TRACE_ENTER();
>   
> + // if headless, return TRY_AGAIN to application
> + if (cb->is_avd_down == true) {
> + LOG_NO("Director is down. Return try again for PG start.");
> + rc = avnd_amf_resp_send(cb, AVSV_AMF_PG_START, 
> SA_AIS_ERR_TRY_AGAIN,
> + 0, &api_info->dest, &evt->mds_ctxt, 
> nullptr, false);
> + TRACE_LEAVE();
> + return rc;
> + }
> +
>   /*
>* Update pg db
>*/
> @@ -235,6 +244,15 @@ uint32_t avnd_evt_ava_pg_stop_evh(AVND_C
>   
>   TRACE_ENTER();
>   
> + // if headless, return TRY_AGAIN to application
> + if (cb->is_avd_down == true) {
> + LOG_NO("Director is down. Return try again for PG stop.");
> + rc = avnd_amf_resp_send(cb, AVSV_AMF_PG_STOP, 
> SA_AIS_ERR_TRY_AGAIN,
> + 0, &api_info->dest, &evt->mds_ctxt, 
> nullptr, false);
> + TRACE_LEAVE();
> + return rc;
> + }
> +
>   /* populate the track key */
>   key.mds_dest = api_info->dest;
>   key.req_hdl = pg_stop->hdl;
>


--
___
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel


Re: [devel] [PATCH 4 of 6] ntfa: support for returning SA_AIS_ERR_UNAVAILABLE on non-member node[#1639] V2

2016-04-07 Thread minh chau
Hi Praveen,

I see the latest ntfa_api.c code does not preserve SA_AIS_ERR_UNAVAILABLE 
when recovery of a client/reader/subscriber fails due to a non-SA_AIS_OK rc 
in the returned msg. Can you check whether this V2 was pushed?
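
For reference, the mapping I expect from the V2 diff quoted below can be 
sketched like this. The enum and function names are hypothetical stand-ins 
(not SaAisErrorT and not the ntfa code), and the numeric values are 
illustrative only:

```c
/* Hypothetical stand-in for the few SaAisErrorT codes that matter here. */
typedef enum {
	DEMO_AIS_OK,
	DEMO_AIS_ERR_TIMEOUT,
	DEMO_AIS_ERR_BAD_HANDLE,
	DEMO_AIS_ERR_UNAVAILABLE
} demo_ais_rc_t;

/*
 * Map the rc carried in the server's recovery response to the rc that is
 * returned to the application: OK and UNAVAILABLE pass through unchanged,
 * every other failure is reported as BAD_HANDLE.
 */
demo_ais_rc_t demo_map_recovery_rc(demo_ais_rc_t server_rc)
{
	if (server_rc == DEMO_AIS_OK || server_rc == DEMO_AIS_ERR_UNAVAILABLE)
		return server_rc;
	return DEMO_AIS_ERR_BAD_HANDLE;
}
```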

Thanks,
Minh
On 29/03/16 20:02, praveen.malv...@oracle.com wrote:
>   osaf/libs/agents/saf/ntfa/ntfa.h  |2 +
>   osaf/libs/agents/saf/ntfa/ntfa_api.c  |  226 
> -
>   osaf/libs/agents/saf/ntfa/ntfa_mds.c  |   52 +++
>   osaf/libs/agents/saf/ntfa/ntfa_util.c |3 +
>   4 files changed, 272 insertions(+), 11 deletions(-)
>
>
> V2 changes:
> -Rebased over #1180 (Cloud resilience patch).
> -During headless state, OpenSAF may get stopped on payload with NTF app 
> running.
>   Since OpenSAF is not running on the payload, any A.01.02 NTF client should 
> not be served on
>   this node and this client should not be recovered. After first controller 
> comes up, A.01.02
>   client will not be recovered and application will get 
> SA_AIS_ERR_UNAVAILABLE upon which an
>   app can call saNtfFinalize() for freeing the resources.
>
>
> Changes include:
> -maintain SAF version.
> -minor version is updated from 01 to 02.
> -ntfa will get NTFSV_CLM_NODE_STATUS_CALLBACK from NTFS for membership status 
> of node.
> -check is included in all apis, excluding saNtfFinalize(), to return 
> SA_AIS_ERR_UNAVAILABLE
>   if node loses CLM membership.
>
> diff --git a/osaf/libs/agents/saf/ntfa/ntfa.h 
> b/osaf/libs/agents/saf/ntfa/ntfa.h
> --- a/osaf/libs/agents/saf/ntfa/ntfa.h
> +++ b/osaf/libs/agents/saf/ntfa/ntfa.h
> @@ -120,6 +120,7 @@ typedef struct ntfa_client_hdl_rec {
>   SYSF_MBX mbx;   /* priority q mbx b/w MDS & Library */
>   struct ntfa_client_hdl_rec *next;   /* next pointer for the list in 
> ntfa_cb_t */
>   bool valid; /* handle is valid if it's known by NTF server, 
> used for headless hydra */
> + bool is_stale_client;  /* Status of client based on the CLM status of 
> node.*/
>   SaVersionT version; /* the API version is being used by client, used 
> for recover after headless */
>   } ntfa_client_hdl_rec_t;
>   
> @@ -148,6 +149,7 @@ typedef struct {
>   SaUint32T ntf_var_data_limit;   /* max allowed variableDataSize */
>   /* NTF Server state */
>   ntfa_ntfsv_state_t ntfa_ntfsv_state;
> + SaClmClusterChangesT clm_node_state; /*Reflects CLM status of this 
> node(for future use).*/
>   } ntfa_cb_t;
>   
>   /* ntfa_saf_api.c */
> diff --git a/osaf/libs/agents/saf/ntfa/ntfa_api.c 
> b/osaf/libs/agents/saf/ntfa/ntfa_api.c
> --- a/osaf/libs/agents/saf/ntfa/ntfa_api.c
> +++ b/osaf/libs/agents/saf/ntfa/ntfa_api.c
> @@ -966,7 +966,8 @@ SaAisErrorT reinitializeClient(ntfa_clie
>   }
>   if ((rc = o_msg->info.api_resp_info.rc) != SA_AIS_OK) {
>   TRACE("info.api_resp_info.rc:%u", o_msg->info.api_resp_info.rc);
> - rc = SA_AIS_ERR_BAD_HANDLE;
> + if (rc != SA_AIS_ERR_UNAVAILABLE)
> + rc = SA_AIS_ERR_BAD_HANDLE;
>   goto done;
>   }
>   
> @@ -1033,7 +1034,8 @@ SaAisErrorT recoverReader(ntfa_client_hd
>   osafassert(o_msg != NULL);
>   if ((rc = o_msg->info.api_resp_info.rc) != SA_AIS_OK) {
>   TRACE("o_msg->info.api_resp_info.rc:%u", 
> o_msg->info.api_resp_info.rc);
> - rc = SA_AIS_ERR_BAD_HANDLE;
> + if (rc != SA_AIS_ERR_UNAVAILABLE)
> + rc = SA_AIS_ERR_BAD_HANDLE;
>   goto done;
>   }
>   
> @@ -1108,7 +1110,8 @@ SaAisErrorT recoverSubscriber(ntfa_clien
>   
>   if ((rc = o_msg->info.api_resp_info.rc) != SA_AIS_OK) {
>   TRACE("o_msg->info.api_resp_info.rc:%u", 
> o_msg->info.api_resp_info.rc);
> - rc = SA_AIS_ERR_BAD_HANDLE;
> + if (rc != SA_AIS_ERR_UNAVAILABLE)
> + rc = SA_AIS_ERR_BAD_HANDLE;
>   goto done;
>   }
>   
> @@ -1229,7 +1232,8 @@ SaAisErrorT saNtfInitialize(SaNtfHandleT
>   if ((version->releaseCode == NTF_RELEASE_CODE) && 
> (version->majorVersion <= NTF_MAJOR_VERSION) &&
>   (0 < version->majorVersion)) {
>   version->majorVersion = NTF_MAJOR_VERSION;
> - version->minorVersion = NTF_MINOR_VERSION;
> + if (version->minorVersion != NTF_MINOR_VERSION_0)
> + version->minorVersion = NTF_MINOR_VERSION;
>   } else {
>   TRACE("version FAILED, required: %c.%u.%u, supported: 
> %c.%u.%u\n",
> version->releaseCode, version->majorVersion, 
> version->minorVersion,
> @@ -1276,6 +1280,10 @@ SaAisErrorT saNtfInitialize(SaNtfHandleT
>   if (SA_AIS_OK != o_msg->info.api_resp_info.rc) {
>   rc = o_msg->info.api_resp_info.rc;
>   TRACE("NTFS return FAILED");
> + /*Check CLM membership of node.*/
> + if (rc == SA_AIS_ERR_UNAVAILABLE) {
> + TRACE("Node not CLM member or stale client");
> +  

Re: [devel] NTF PR doc update for #1639.

2016-04-07 Thread minh chau
Hi,

Ack with comments:
- The statement of 5) seems to be included in 2), maybe can combine 2) 
and 5)
- Since ERR_UNAVAILABLE could be returned in A.02.01, PR doc should 
recommend what the client should do when receiving this error code, e.g. 
try the API again since the node has not joined the cluster, or ...
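
To make the question concrete, here is one possible client policy, purely 
as an illustration and not a recommendation taken from the PR doc or the 
spec. All names are hypothetical; scripted_call only replays canned return 
codes in place of a real NTF API call:

```c
#include <stddef.h>

/* Hypothetical return codes and follow-up actions, for illustration only. */
typedef enum {
	CLIENT_OK,
	CLIENT_ERR_TRY_AGAIN,
	CLIENT_ERR_UNAVAILABLE,
	CLIENT_ERR_OTHER
} client_rc_t;

typedef enum {
	ACTION_DONE,                /* call succeeded */
	ACTION_GIVE_UP,             /* still TRY_AGAIN after max_tries */
	ACTION_FINALIZE_AND_REINIT, /* handle is unusable on this node */
	ACTION_FAIL                 /* any other error */
} client_action_t;

typedef client_rc_t (*api_call_fn)(void *ctx);

/* Retry on TRY_AGAIN up to max_tries; stop at once on UNAVAILABLE. */
client_action_t call_with_policy(api_call_fn call, void *ctx, int max_tries)
{
	for (int i = 0; i < max_tries; i++) {
		switch (call(ctx)) {
		case CLIENT_OK:
			return ACTION_DONE;
		case CLIENT_ERR_TRY_AGAIN:
			continue; /* a real client would sleep before retrying */
		case CLIENT_ERR_UNAVAILABLE:
			return ACTION_FINALIZE_AND_REINIT;
		default:
			return ACTION_FAIL;
		}
	}
	return ACTION_GIVE_UP;
}

/* Test double: replays a fixed sequence of return codes. */
typedef struct {
	const client_rc_t *script;
	size_t len;
	size_t pos;
} script_t;

client_rc_t scripted_call(void *ctx)
{
	script_t *s = ctx;
	return s->pos < s->len ? s->script[s->pos++] : CLIENT_ERR_TRY_AGAIN;
}
```

Whether the doc should advise finalizing the handle or retrying later on 
ERR_UNAVAILABLE is exactly the open point; the sketch just picks one.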

Thanks,
Minh
On 07/04/16 16:01, praveen malviya wrote:
> Hi,
>
> Please find attached NTFS PR doc updated for #1639.
> Details of changes (done on top of changes of #1180):
>
> a) Modified highest supported version in section "2.2.3: Compliance
> Report" as:
>
> "Notification service includes generic saNtf.h for versions up to
> A.03.01 but current implementation supports A.01.02."
>
> b)Added section :
>
> "3.2.6 Integration of NTFSv with CLMSv for return code
> SA_AIS_ERR_UNAVAILABLE
>
>  From OpenSAF 5.0 release, NTFSv gets integrated with CLMSv service.
> NTFSv, currently, conforms to A.01.01 SAF spec version. Integration of
> NTFSv service with CLMSv is not mentioned in A.01.01. In the later spec
> version viz A.02.01, section "3.13 Unavailability of the Notification
> Service API on a Non-Member Node" page no. 42 talks about integration of
> NTFSv with CLMSv. OpenSAF release 5.0, now supports this functionality
> for A.01.01 APIs via enhancement ticket
> https://sourceforge.net/p/opensaf/tickets/1639/
> This enhancement does not implement any new API of A.02.01.
>
> Important facts for existing users and new users :
> 1) Highest supported version for NTFSv is now A.01.02 (Minor version got
> updated).
> 2) A user initializing with A.01.01 (exact match) will not get
> SA_AIS_ERR_UNAVAILABLE as return code anytime for any API even on CLMSv
> Non-Member node. Such a user will get A.01.01 as returned version and
> not the highest supported version. This ensures backward compatibility
> for existing users.
> 3) An application trying to initialize other than A.01.01 will get
> returned version A.01.02 provided release Code and Major Version are
> respectively ‘A’ and ‘01’.
> 4) There is no impact on NTFSv integrated OpenSAF Middleware services
> (like AMF, PLMSv, CLMSv and SMFSv) as they initialize with NTFSv with
> A.01.01 version.
> 5) This enhancement introduces a minor deviation from spec that an
> application trying to initialize with A.01.01 will get A.01.01 as
> returned version not the highest supported version A.01.02.
>
> Thanks,
> Praveen
>
>
>




Re: [devel] [PATCH 1 of 1] AMFND: Do not disable healthy SU [#1721]

2016-04-11 Thread minh chau
Hi Praveen

Please see comments inline

Thanks,
Minh

On 11/04/16 16:13, praveen malviya wrote:
>
> On 07-Apr-16 7:17 PM, Minh Hon Chau wrote:
>>   osaf/services/saf/amf/amfnd/su.cc |  5 -
>>   1 files changed, 0 insertions(+), 5 deletions(-)
>>
>>
>> Currently avnd_su_curr_info_del() is called in three places:
>>
>> (1). su restart recovery
>>
>> (2). su restart by admin op
>>
>> (3). su is terminated by su_pres_msg
>>
>> In case (1), (2), the code that reset SU's oper_state as DISABLED won't
>> be called.
>
> In the surestart recovery, AMFND disables SU in 
> avnd_err_rcvr_su_restart(). This has been there historically. Here it 
> is not conveyed to AMFD for surestart recovery.
[Minh]
Yes, amfnd disables the SU in avnd_err_rcvr_su_restart(), but before 
amfnd reports the su_restart recovery, it sets the SU back to ENABLED:

void su_send_suRestart_recovery_msg(AVND_SU *su)
{
 su->oper = SA_AMF_OPERATIONAL_ENABLED;
 //Keep the su enabled for sending the message.
 avnd_di_oper_send(avnd_cb, su, AVSV_ERR_RCVR_SU_RESTART);
 su->oper = SA_AMF_OPERATIONAL_DISABLED;
}

Regarding the patch, its change is inside avnd_su_curr_info_del().
In the case of su restart recovery, the code removed in this patch will not 
be reached since the SU is marked as FAILED in avnd_err_rcvr_su_restart().
In the case of admin su restart, this code will not be reached either.
The remaining case is SU termination, but surely an SU can't be DISABLED 
while it's not FAILED?

>
> Thanks,
> Praveen
>  Only in (3), which lock-in SU (or node/ng) which SU is not
>> failed, that reset SU's oper_state as DISABLED. This will set local
>> variable @su->oper as DISABLED while SU OperationalState in amfd and
>> imm as ENABLED. This reset is not needed since if SU is healthy, its
>> oper state should be ENABLED. And this reset will cause SU won't be able
>> to recover after headless if there was a lock-in SU (node/ng) done 
>> before
>> headless.
>>
>> Patch removes this reset of SU as DISABLED.
>>
>> diff --git a/osaf/services/saf/amf/amfnd/su.cc 
>> b/osaf/services/saf/amf/amfnd/su.cc
>> --- a/osaf/services/saf/amf/amfnd/su.cc
>> +++ b/osaf/services/saf/amf/amfnd/su.cc
>> @@ -572,11 +572,6 @@ uint32_t avnd_su_curr_info_del(AVND_CB *
>>   su->su_restart_cnt = 0;
>>   avnd_di_uns32_upd_send(AVSV_SA_AMF_SU, 
>> saAmfSURestartCount_ID, &su->name, su->su_restart_cnt);
>>   /* stop su_err_esc_tmr TBD Later */
>> -
>> -/* disable the oper state (if pi su) */
>> -if (m_AVND_SU_IS_PREINSTANTIABLE(su) && (su->admin_op_Id != 
>> SA_AMF_ADMIN_RESTART)) {
>> -m_AVND_SU_OPER_STATE_SET(su, SA_AMF_OPERATIONAL_DISABLED);
>> -}
>>   }
>>
>>   /* scan & delete the current info store in each component */
>>
>




Re: [devel] Review request for NTF: Update PR doc for cloud resilience [#1707] V3

2016-04-12 Thread minh chau
Hi Lennart

I agree with your comments, except for the following one, which I copy 
here for easier follow-up:

3.2.5.2 Behavior after headless (client recovery)

After headless, client recovery procedure is only started when there is 
communication with NTF server required to re-establish, which could be 
executed as instance of consumer (subscriber, reader) or producer (sender)

[Lennart] Clients will be automatically recovered when the agent has 
established contact with the NTF server:
[Minh] Your comment does not seem to align with the current 
implementation; it sounds as if the agent starts a recovery thread right 
after contact with the server is established. My text is not clear 
either. I mean that recovery is only initiated when the client ID, 
subscription ID, or reader ID needs to be re-established with the server 
after headless. Many other APIs do not need the server's presence and 
remain operational with or without the server (allocate/free).
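
To illustrate the on-demand recovery described above, here is a minimal 
sketch. All names (client_t, server_up, recover_client, api_send, 
api_allocate) and the return codes are hypothetical stand-ins, not the 
agent's actual code: a call that needs the server recovers the client 
first if its handle was invalidated by headless, while a call that does 
not need the server proceeds regardless.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes standing in for SaAisErrorT values. */
typedef enum { RC_OK, RC_TRY_AGAIN, RC_BAD_HANDLE } rc_t;

/* Hypothetical client record: 'valid' is cleared when the server goes down. */
typedef struct {
    bool valid;
    int recover_count; /* number of recovery runs, for illustration only */
} client_t;

static bool server_up = false; /* stub for the agent's view of the server */

/* Stub: reintroduce the client to the server. */
static rc_t recover_client(client_t *c)
{
    assert(c != NULL);
    c->recover_count++;
    c->valid = true;
    return RC_OK;
}

/* An API that needs the server: recover lazily before doing the work. */
rc_t api_send(client_t *c)
{
    assert(c != NULL);
    if (!server_up)
        return RC_TRY_AGAIN; /* headless: the caller retries later */
    if (!c->valid) {
        rc_t rc = recover_client(c);
        if (rc != RC_OK)
            return rc;
    }
    return RC_OK; /* the actual send would happen here */
}

/* An API that does not need the server (for example allocate/free). */
rc_t api_allocate(client_t *c)
{
    assert(c != NULL);
    return RC_OK; /* works with or without the server */
}
```

In this sketch recovery runs at most once per headless period, and only 
for clients that actually call a server-dependent API afterwards.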

I guess it should be written again like this

3.2.5.2 Behavior after headless (client recovery)

After headless, client recovery procedure is only started when client 
needs to reintroduce itself to NTF server, which could be executed as 
instance of consumer (subscriber, reader) or producer (sender)


How does this sound to you?

Thanks,
Minh

On 11/04/16 21:49, Lennart Lund wrote:
>
> Hi Minh
>
> I still have some comment. Please see attached document
>
> Thanks
>
> Lennart
>
> *From:*minh chau [mailto:minh.c...@dektech.com.au]
> *Sent:* den 11 april 2016 08:20
> *To:* Lennart Lund; Vu Minh Nguyen; 'praveen malviya'
> *Cc:* opensaf-devel@lists.sourceforge.net; Jorge Pacheco Garcia
> *Subject:* Re: [devel] Review request for NTF: Update PR doc for cloud 
> resilience [#1707] V3
>
> Hi Lennart,
>
> I have changed a bit in item 3.2.5 with your idea, please have a look.
>
> Thanks,
> Minh
>
> ---
>
>
>   3.2.5 Support Loss of Both System Controller Nodes
>
> Loss of both system controller(SC) nodes means that the NTF service 
> has lost its server (director). This situation will be named as 
> “headless” in the rest of this document.
>
> After headless state, all information that existed in the server on 
> the SC-nodes is lost. When the SC node(s) are started again after 
> being in the headless state, the NTF agent will send information about 
> existing clients to the server so they can be restored.
>
> Recovery procedure will include the following client's information:
>
> ·ntfHandle
>
> ·Notification subscriptions including filters
>
> ·Reader handle
>
> However, all logged notifications will be lost. This means that if 
> reader handle is restored, reader will not be able to read 
> notifications before headless state.
>
>
> 3.2.5.1Behavior during headless
>
> The following APIs will return SA_AIS_ERR_TRY_AGAIN until NTF server 
> is available again, which are: saNtfInitialize, saNtfNotificationSend, 
> saNtfNotificationSubscribe, saNtfNotificationReadInitialize, and 
> saNtfNotificationReadNext.
>
> All remaining APIs (including SaNtfFinalize, 
> saNtfNotificationUnsubcribe and saNtfNotificationReadFinalize) can be 
> used during headless.
>
>
> 3.2.5.2 Behavior after headless (client recovery)
>
> After headless, client recovery procedure is only started when there 
> is communication with NTF server required to re-establish, which could 
> be executed as instance of consumer (subscriber, reader) or producer 
> (sender)
>
> ·As subscriber: Recovery is started as soon as NTF Agent detects that 
> NTF Server is up after headless. NTF Agent will send a dummy callback 
> to subscriber's mailbox to trigger saNtfDispatch call. From 
> saNtfDispatch, saNtfNotificationSubscribe, or 
> saNtfNotificationUnsubscribe, Agent will start recovery .
>
> ·As reader: Recovery is started if client calls 
> saNtfNotificationReadInitialize, saNtfNotificationReadNext, or 
> saNtfNotificationReadFinalize.
>
> ·As sender: Recovery is started if client calls saNtfNotificationSend .
>
> Once recovery succeeds, client can continue using existing handles to 
> read or send notifications as well as receive subscribed notifications.
>
> If recovery fails, the corresponding handle will be invalidated. This 
> mean that client will get SA_AIS_ERR_BAD_HANDLE. If this happen client 
> must initialize new handle.
>
> On 08/04/16 23:15, Lennart Lund wrote:
>
> Hi Minh,
>
> The intended reader of this document is a user of the NTF service. This 
> means that only information that has direct impact on the user should be 
> written here.
>
> The easiest way for me to explain what I me

Re: [devel] [PATCH 1 of 1] ntfa: return ERR_UNAVAILABLE on non-member node after headless state [#1744]

2016-04-12 Thread minh chau
Hi Praveen

NTF server also accepts an initialize request (here it comes from 
reinitializeClient() after headless) if NTF server has not yet 
initialized with CLM.
So after headless, this situation will most likely happen. The recovery 
would succeed, but what if NTF server afterwards notifies the agent that 
it is no longer a member? Could a subscriber be left waiting for 
notifications while the agent is not a member anymore?
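
For reference, the return-code handling that the patch below applies in 
reinitializeClient(), recoverReader() and recoverSubscriber() can be 
sketched as follows. The helper names map_recovery_rc() and 
should_force_delete_handle() are hypothetical, and the enum is a local 
stand-in so the sketch compiles without saAis.h; this is an illustration 
of the diff, not code from the agent.

```c
#include <stdbool.h>

/* Local stand-ins so the sketch compiles without saAis.h; the numeric
 * values here are arbitrary and not those of the real SaAisErrorT. */
typedef enum {
    SA_AIS_OK,
    SA_AIS_ERR_TRY_AGAIN,
    SA_AIS_ERR_BAD_HANDLE,
    SA_AIS_ERR_NO_RESOURCES,
    SA_AIS_ERR_UNAVAILABLE
} SaAisErrorT;

/* Hypothetical helper: map the server's reply to a recovery request onto
 * the code that the recover functions hand back to the API layer. */
SaAisErrorT map_recovery_rc(SaAisErrorT server_rc)
{
    if (server_rc == SA_AIS_OK)
        return SA_AIS_OK;
    if (server_rc == SA_AIS_ERR_UNAVAILABLE)
        return SA_AIS_ERR_UNAVAILABLE; /* non-member node or stale client */
    return SA_AIS_ERR_BAD_HANDLE;      /* any other failure */
}

/* Hypothetical helper: with the patch, only BAD_HANDLE makes the API
 * delete the handle record; on UNAVAILABLE the record is kept so that
 * the application can still call saNtfFinalize(). */
bool should_force_delete_handle(SaAisErrorT rc)
{
    return rc == SA_AIS_ERR_BAD_HANDLE;
}
```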

Thanks,
Minh

On 11/04/16 15:46, praveen.malv...@oracle.com wrote:
>   osaf/libs/agents/saf/ntfa/ntfa_api.c |  28 ++--
>   1 files changed, 18 insertions(+), 10 deletions(-)
>
>
> During headless state, OpenSAF may get stopped on payload with NTF app 
> running.
> Since OpenSAF is not running on the payload, any A.01.02 NTF client should 
> not be served on
> this node and this client should not be recovered. After first controller 
> comes up, A.01.02
> client will not be recovered and application will get SA_AIS_ERR_UNAVAILABLE 
> upon which an
> app can call saNtfFinalize() for freeing the resources.
>
> diff --git a/osaf/libs/agents/saf/ntfa/ntfa_api.c 
> b/osaf/libs/agents/saf/ntfa/ntfa_api.c
> --- a/osaf/libs/agents/saf/ntfa/ntfa_api.c
> +++ b/osaf/libs/agents/saf/ntfa/ntfa_api.c
> @@ -966,7 +966,8 @@ SaAisErrorT reinitializeClient(ntfa_clie
>   }
>   if ((rc = o_msg->info.api_resp_info.rc) != SA_AIS_OK) {
>   TRACE("info.api_resp_info.rc:%u", o_msg->info.api_resp_info.rc);
> - rc = SA_AIS_ERR_BAD_HANDLE;
> + if (rc != SA_AIS_ERR_UNAVAILABLE)
> + rc = SA_AIS_ERR_BAD_HANDLE;
>   goto done;
>   }
>   
> @@ -1033,7 +1034,8 @@ SaAisErrorT recoverReader(ntfa_client_hd
>   osafassert(o_msg != NULL);
>   if ((rc = o_msg->info.api_resp_info.rc) != SA_AIS_OK) {
>   TRACE("o_msg->info.api_resp_info.rc:%u", 
> o_msg->info.api_resp_info.rc);
> - rc = SA_AIS_ERR_BAD_HANDLE;
> + if (rc != SA_AIS_ERR_UNAVAILABLE)
> + rc = SA_AIS_ERR_BAD_HANDLE;
>   goto done;
>   }
>   
> @@ -1108,7 +1110,8 @@ SaAisErrorT recoverSubscriber(ntfa_clien
>   
>   if ((rc = o_msg->info.api_resp_info.rc) != SA_AIS_OK) {
>   TRACE("o_msg->info.api_resp_info.rc:%u", 
> o_msg->info.api_resp_info.rc);
> - rc = SA_AIS_ERR_BAD_HANDLE;
> + if (rc != SA_AIS_ERR_UNAVAILABLE)
> + rc = SA_AIS_ERR_BAD_HANDLE;
>   goto done;
>   }
>   
> @@ -1437,7 +1440,7 @@ SaAisErrorT saNtfDispatch(SaNtfHandleT n
>   if (!hdl_rec->valid) {
>   /* recovery */
>   if ((rc = recoverClient(hdl_rec)) != SA_AIS_OK) {
> - if ((rc == SA_AIS_ERR_BAD_HANDLE) || (rc == 
> SA_AIS_ERR_UNAVAILABLE)) {
> + if (rc == SA_AIS_ERR_BAD_HANDLE) {
>   ncshm_give_hdl(ntfHandle);
>   osafassert(pthread_mutex_lock(&ntfa_cb.cb_lock) 
> == 0);
>   ntfa_hdl_rec_force_del(&ntfa_cb.client_list, 
> hdl_rec);
> @@ -1445,6 +1448,11 @@ SaAisErrorT saNtfDispatch(SaNtfHandleT n
>   ntfa_shutdown(false);
>   goto done;
>   }
> + if (rc == SA_AIS_ERR_UNAVAILABLE) {
> + TRACE("Node not CLM member or stale client");
> + ncshm_give_hdl(ntfHandle);
> + goto done;
> + }
>   }
>   }
>   
> @@ -1807,7 +1815,7 @@ SaAisErrorT saNtfNotificationSend(SaNtfN
>   if ((rc = recoverClient(client_rec)) != SA_AIS_OK) {
>   ncshm_give_hdl(client_handle);
>   ncshm_give_hdl(notificationHandle);
> - if ((rc == SA_AIS_ERR_BAD_HANDLE) || (rc == 
> SA_AIS_ERR_UNAVAILABLE)) {
> + if (rc == SA_AIS_ERR_BAD_HANDLE) {
>   osafassert(pthread_mutex_lock(&ntfa_cb.cb_lock) 
> == 0);
>   ntfa_hdl_rec_force_del(&ntfa_cb.client_list, 
> client_rec);
>   
> osafassert(pthread_mutex_unlock(&ntfa_cb.cb_lock) == 0);
> @@ -2153,7 +2161,7 @@ SaAisErrorT saNtfNotificationSubscribe(c
>   if (notificationFilterHandles->alarmFilterHandle)
>   
> ncshm_give_hdl(notificationFilterHandles->alarmFilterHandle);
>   }
> - if (recovery_failed && ((rc == SA_AIS_ERR_BAD_HANDLE) || (rc == 
> SA_AIS_ERR_UNAVAILABLE))) {
> + if (recovery_failed && (rc == SA_AIS_ERR_BAD_HANDLE)) {
>   osafassert(pthread_mutex_lock(&ntfa_cb.cb_lock) == 0);
>   ntfa_hdl_rec_force_del(&ntfa_cb.client_list, client_hdl_rec);
>   osafassert(pthread_mutex_unlock(&ntfa_cb.cb_lock) == 0);
> @@ -3355,7 +3363,7 @@ SaAisErrorT saNtfNotificationUnsubscribe
>   
>   if (!client_hdl_rec->valid && getServerStat

Re: [devel] [PATCH 1 of 1] ntfa: return ERR_UNAVAILABLE on non-member node after headless state [#1744]

2016-04-12 Thread minh chau


On 12/04/16 21:49, praveen malviya wrote:
>
>
> On 12-Apr-16 3:56 PM, minh chau wrote:
>> Hi Praveen
>>
>> NTF server also accepts initialize request (and here it comes from
>> reinitializeClient() after headless) if NTF server has not initialized
>> with CLM.
>> So after headless, this situation will most likely happen. The recovery
>> would succeeds, but after that what if NTF server notifies the agent it
>> is not longer a member, could a subscriber be waiting for notification
>> while agent is not a member anymore?
>>
> There is only one event that can lead to this and that is OpenSAF stop 
> on the node as admin operations are not available in headless state. 
> But this is the limitation of whole headless solution in every service 
> as there is no recovery of CLM status of client node at each director 
> and also recovery of clients is being done very early at MDS up event 
> of the service.
>
[Minh] Actually, this situation also happens in non-headless. Lock a CLM 
node while a client is subscribed for notifications. This client will not 
be informed of SA_AIS_ERR_UNAVAILABLE if its filter does not match any 
notification. It has to wait until the CLM node is unlocked and a 
matching notification arrives; only then will saNtfDispatch return 
SA_AIS_ERR_UNAVAILABLE. But if the filter does not match, this client 
will keep waiting and can't finalize its handle.
If this situation is solved in non-headless, the problem stated above in 
headless should also be solved by the same solution.

Another issue, not related to this ticket: ntftool does not handle 
SA_AIS_ERR_UNAVAILABLE. ntfsubscriber ends up in an indefinite loop 
calling saNtfDispatch() when it receives SA_AIS_ERR_UNAVAILABLE.
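
A minimal sketch of a dispatch loop that avoids this indefinite loop, 
assuming a faked agent (struct fake_agent, fake_dispatch, fake_finalize 
and subscriber_loop are all hypothetical, and the enum is a local 
stand-in for SaAisErrorT). A real tool would wait in poll() on the 
selection object before each dispatch; the point here is only the exit 
condition.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins so the sketch compiles without saAis.h/saNtf.h;
 * the numeric values are arbitrary. */
typedef enum {
    SA_AIS_OK,
    SA_AIS_ERR_TRY_AGAIN,
    SA_AIS_ERR_BAD_HANDLE,
    SA_AIS_ERR_UNAVAILABLE
} SaAisErrorT;

/* Hypothetical fake agent: dispatch returns UNAVAILABLE from call
 * number 'unavailable_at' onwards. */
struct fake_agent {
    int dispatch_calls;
    int unavailable_at;
    int finalized;
};

static SaAisErrorT fake_dispatch(struct fake_agent *a)
{
    a->dispatch_calls++;
    return (a->dispatch_calls >= a->unavailable_at) ? SA_AIS_ERR_UNAVAILABLE
                                                    : SA_AIS_OK;
}

static SaAisErrorT fake_finalize(struct fake_agent *a)
{
    a->finalized++;
    return SA_AIS_OK;
}

/* Dispatch until the handle becomes unusable; returns the last code. */
SaAisErrorT subscriber_loop(struct fake_agent *a, int max_iterations)
{
    SaAisErrorT rc = SA_AIS_OK;
    assert(a != NULL);
    for (int i = 0; i < max_iterations; i++) {
        rc = fake_dispatch(a);
        if (rc == SA_AIS_ERR_UNAVAILABLE) {
            (void)fake_finalize(a); /* handle still exists: free it, then stop */
            break;
        }
        if (rc == SA_AIS_ERR_BAD_HANDLE)
            break;                  /* handle is already gone */
        /* SA_AIS_OK or SA_AIS_ERR_TRY_AGAIN: keep dispatching */
    }
    return rc;
}
```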

Thanks,
Minh
>
> Thanks,
> Praveen
>> Thanks,
>> Minh
>>
>> On 11/04/16 15:46, praveen.malv...@oracle.com wrote:
>>>   osaf/libs/agents/saf/ntfa/ntfa_api.c |  28 
>>> ++--
>>>   1 files changed, 18 insertions(+), 10 deletions(-)
>>>
>>>
>>> During headless state, OpenSAF may get stopped on payload with NTF app
>>> running.
>>> Since OpenSAF is not running on the payload, any A.01.02 NTF client
>>> should not be served on
>>> this node and this client should not be recovered. After first
>>> controller comes up, A.01.02
>>> client will not be recovered and application will get
>>> SA_AIS_ERR_UNAVAILABLE upon which an
>>> app can call saNtfFinalize() for freeing the resources.
>>>
>>> diff --git a/osaf/libs/agents/saf/ntfa/ntfa_api.c
>>> b/osaf/libs/agents/saf/ntfa/ntfa_api.c
>>> --- a/osaf/libs/agents/saf/ntfa/ntfa_api.c
>>> +++ b/osaf/libs/agents/saf/ntfa/ntfa_api.c
>>> @@ -966,7 +966,8 @@ SaAisErrorT reinitializeClient(ntfa_clie
>>>   }
>>>   if ((rc = o_msg->info.api_resp_info.rc) != SA_AIS_OK) {
>>>   TRACE("info.api_resp_info.rc:%u",
>>> o_msg->info.api_resp_info.rc);
>>> -rc = SA_AIS_ERR_BAD_HANDLE;
>>> +if (rc != SA_AIS_ERR_UNAVAILABLE)
>>> +rc = SA_AIS_ERR_BAD_HANDLE;
>>>   goto done;
>>>   }
>>> @@ -1033,7 +1034,8 @@ SaAisErrorT recoverReader(ntfa_client_hd
>>>   osafassert(o_msg != NULL);
>>>   if ((rc = o_msg->info.api_resp_info.rc) != SA_AIS_OK) {
>>>   TRACE("o_msg->info.api_resp_info.rc:%u",
>>> o_msg->info.api_resp_info.rc);
>>> -rc = SA_AIS_ERR_BAD_HANDLE;
>>> +if (rc != SA_AIS_ERR_UNAVAILABLE)
>>> +rc = SA_AIS_ERR_BAD_HANDLE;
>>>   goto done;
>>>   }
>>> @@ -1108,7 +1110,8 @@ SaAisErrorT recoverSubscriber(ntfa_clien
>>>   if ((rc = o_msg->info.api_resp_info.rc) != SA_AIS_OK) {
>>>   TRACE("o_msg->info.api_resp_info.rc:%u",
>>> o_msg->info.api_resp_info.rc);
>>> -rc = SA_AIS_ERR_BAD_HANDLE;
>>> +if (rc != SA_AIS_ERR_UNAVAILABLE)
>>> +rc = SA_AIS_ERR_BAD_HANDLE;
>>>   goto done;
>>>   }
>>> @@ -1437,7 +1440,7 @@ SaAisErrorT saNtfDispatch(SaNtfHandleT n
>>>   if (!hdl_rec->valid) {
>>>   /* recovery */
>>>   if ((rc = recoverClient(hdl_rec)) != SA_AIS_OK) {
>>> -if ((rc == SA_AIS_ERR_BAD_HANDLE) || (rc ==
>>> SA_AIS_ERR_UNAVAILABLE)) {
>>> +if (rc == SA_AIS_ERR_BAD_HANDLE) {
>>>   ncshm_give_hdl(ntfHandle);
>>> osafassert(pthread_mutex_lock(&ntfa_cb.cb_lock) == 0);
>

Re: [devel] [PATCH 1 of 1] ntfa: return ERR_UNAVAILABLE on non-member node after headless state [#1744]

2016-04-13 Thread minh chau


On 13/04/16 15:43, praveen malviya wrote:
>
>
> On 12-Apr-16 10:24 PM, minh chau wrote:
>>
>>
>> On 12/04/16 21:49, praveen malviya wrote:
>>>
>>>
>>> On 12-Apr-16 3:56 PM, minh chau wrote:
>>>> Hi Praveen
>>>>
>>>> NTF server also accepts initialize request (and here it comes from
>>>> reinitializeClient() after headless) if NTF server has not initialized
>>>> with CLM.
>>>> So after headless, this situation will most likely happen. The 
>>>> recovery
>>>> would succeeds, but after that what if NTF server notifies the 
>>>> agent it
>>>> is not longer a member, could a subscriber be waiting for notification
>>>> while agent is not a member anymore?
>>>>
>>> There is only one event that can lead to this and that is OpenSAF stop
>>> on the node as admin operations are not available in headless state.
>>> But this is the limitation of whole headless solution in every service
>>> as there is no recovery of CLM status of client node at each director
>>> and also recovery of clients is being done very early at MDS up event
>>> of the service.
>>>
>> [Minh] Actually, in non-headless this situation also happens. When
>> client is subscribing for notification, lock a clm node. This client
>> will not be informed error code SA_AIS_ERR_UNAVAILABLE if its filter
>> does not match to any notifications. It has to wait until clm node is
>> unlocked and there is notification to come, so saNtfDispatch will return
>> SA_AIS_ERR_UNAVAILABLE. But if filter does not match, this client will
>> be waiting and can't finalize handle.
>> If this situation is solved in non-headless, the problem stated above in
>> headless should also be solved by the same solution.
>>
> [Praveen] Not only in NTFSv: the same logic of waiting for an event to 
> get unblocked from poll() applies to applications of all the other 
> services as well, since all SAF services are integrated with CLMSv. I do 
> not know whether one should poll indefinitely or not, and in case of a 
> finite poll time, what an application must do after poll times out.
>
> But I think that, from a SAF perspective, this still cannot be classified 
> as a problem. The reason is that any such application's life cycle is 
> monitored by AMF, and AMF terminates such a process as part of CLM node 
> eviction. Also, CLM provides the tracker interface for this purpose.
> At the same time, I have observed that the AMF spec is particularly 
> clear about ERR_UNAVAILABLE, as it states in section 7.2.1 on page 243:
> 
> However, there are a few special situations in which processes may 
> call Availability Management Framework API functions.
> • An Availability Management Framework API function is called by a 
> process nearly at the same time when the node exits the cluster and 
> the Availability Management Framework area server on the node has not 
> yet terminated the process.
> ..
> =
> And for the above-mentioned cases AMF will return ERR_UNAVAILABLE. So it 
> seems ERR_UNAVAILABLE is meant for such special cases. So any 
> application must rely on its own subscription to CLMSv, or the admin will 
> have to take care of this.
> I will check other SAF documents, like the C programming doc and the 
> overview doc, to see if something in this context is mentioned.
[Minh] I think an application can be a pure NTF client that does not 
have to initialize with AMF, or maybe I don't understand your idea.
Let's look at this example: run a subscriber with filter "ABC", lock the 
CLM node, then unlock it again. Then some application in the cluster 
raises notification ABC.
With the current implementation, this subscriber is notified of 
ERR_UNAVAILABLE when notification ABC arrives in its mailbox, so it 
ends up losing notification ABC.
But if NTF reported ERR_UNAVAILABLE right after the CLM node is locked, 
this subscriber could finalize its handle with NTF earlier. It could 
then wait somehow until the CLM node is unlocked again, or it could 
initialize with CLMSv to learn when the node becomes a member again. 
After the unlock in the example above, this subscriber would be ready to 
receive notifications, and when notification ABC comes, it would receive 
it. I guess this is the idea mentioned in the NTF spec:

/"If the cluster node rejoins the cluster membership, processes 
executing on the cluster node will be able to reinitialize new library 
handles and use the entire set of Notification Service APIs that operate 
on these new handles; however, invocation of APIs that operate on 
handles acquired by any process before the cluster node left the 
membership will continue to fail with

Re: [devel] [PATCH 1 of 1] ntfa: return ERR_UNAVAILABLE on non-member node after headless state [#1744]

2016-04-20 Thread minh chau
Hi Praveen,

Would you consider a quick patch that posts a dummy callback to the 
client's mailbox after the Agent detects that it is a non-member, so that 
the NTF client can finalize its handle right after that? Otherwise, as per 
your explanation below, there will be an implicit dependency of the NTF 
user on AMF or CLM in this case, and that should be documented.
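
The dummy-callback idea can be sketched roughly as below, assuming a 
pipe-backed mailbox whose read end plays the role of the selection object 
the client polls on. The names mailbox_t, evt_type_t and the mailbox_* 
functions are hypothetical; the real agent's mailbox is a different 
mechanism. Posting EVT_DUMMY makes the client's poll() return, so the 
client calls dispatch and sees the error code there.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical mailbox backed by a pipe. */
typedef struct {
    int rd; /* client polls this end */
    int wr; /* agent writes events to this end */
} mailbox_t;

typedef enum { EVT_NOTIFICATION = 1, EVT_DUMMY = 2 } evt_type_t;

int mailbox_init(mailbox_t *mb)
{
    int fds[2];
    assert(mb != NULL);
    if (pipe(fds) != 0)
        return -1;
    mb->rd = fds[0];
    mb->wr = fds[1];
    return 0;
}

/* Agent side: post one event; a dummy event only serves to wake the client. */
int mailbox_post(mailbox_t *mb, evt_type_t type)
{
    unsigned char b = (unsigned char)type;
    assert(mb != NULL);
    return (write(mb->wr, &b, 1) == 1) ? 0 : -1;
}

/* Client side: true if there is something to dispatch within timeout_ms. */
bool mailbox_ready(const mailbox_t *mb, int timeout_ms)
{
    struct pollfd pfd;
    assert(mb != NULL);
    pfd.fd = mb->rd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    return poll(&pfd, 1, timeout_ms) == 1 && (pfd.revents & POLLIN) != 0;
}

/* Client side: take one event; returns its type, or -1 on error. */
int mailbox_take(mailbox_t *mb)
{
    unsigned char b;
    assert(mb != NULL);
    return (read(mb->rd, &b, 1) == 1) ? (int)b : -1;
}

void mailbox_close(mailbox_t *mb)
{
    assert(mb != NULL);
    close(mb->rd);
    close(mb->wr);
}
```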

Thanks,
Minh

On 14/04/16 07:01, minh chau wrote:
>
>
> On 13/04/16 15:43, praveen malviya wrote:
>>
>>
>> On 12-Apr-16 10:24 PM, minh chau wrote:
>>>
>>>
>>> On 12/04/16 21:49, praveen malviya wrote:
>>>>
>>>>
>>>> On 12-Apr-16 3:56 PM, minh chau wrote:
>>>>> Hi Praveen
>>>>>
>>>>> NTF server also accepts initialize request (and here it comes from
>>>>> reinitializeClient() after headless) if NTF server has not 
>>>>> initialized
>>>>> with CLM.
>>>>> So after headless, this situation will most likely happen. The 
>>>>> recovery
>>>>> would succeeds, but after that what if NTF server notifies the 
>>>>> agent it
>>>>> is not longer a member, could a subscriber be waiting for 
>>>>> notification
>>>>> while agent is not a member anymore?
>>>>>
>>>> There is only one event that can lead to this and that is OpenSAF stop
>>>> on the node as admin operations are not available in headless state.
>>>> But this is the limitation of whole headless solution in every service
>>>> as there is no recovery of CLM status of client node at each director
>>>> and also recovery of clients is being done very early at MDS up event
>>>> of the service.
>>>>
>>> [Minh] Actually, in non-headless this situation also happens. When
>>> client is subscribing for notification, lock a clm node. This client
>>> will not be informed error code SA_AIS_ERR_UNAVAILABLE if its filter
>>> does not match to any notifications. It has to wait until clm node is
>>> unlocked and there is notification to come, so saNtfDispatch will 
>>> return
>>> SA_AIS_ERR_UNAVAILABLE. But if filter does not match, this client will
>>> be waiting and can't finalize handle.
>>> If this situation is solved in non-headless, the problem stated 
>>> above in
>>> headless should also be solved by the same solution.
>>>
>> [Praveen]Not only in NTFSv, same logic of waiting for an event to get 
>> unblocked from poll() is valid for all the other services 
>> applications also as all SAF services are integrated with CLMSv. I do 
>> not know whether one should poll indefinitely or not and in case of 
>> finite poll time what an application must do after poll times out.
>>
>> But I think, from SAF perspective still this cannot be classified as 
>> a problem. The reason is any such application's life cycle is 
>> monitored by AMF and AMF terminates such process as part of CLM node 
>> eviction. Also CLM provides traker interface for this purpose only.
>> At the same time, I have observed that for ERR_UNAVAILABLE AMF spec 
>> is particularly more clear as it states on section 7.2.1 on page 243
>> 
>> However, there are a few special situations in which processes may 
>> call Availability Management Framework API functions.
>> • An Availability Management Framework API function is called by a 
>> process nearly at the same time when the node exits the cluster and 
>> the Availability Management Framework area server on the node has not 
>> yet terminated the process.
>> ..
>> =
>> And for above mentioned cases AMF will return ERR_UNAVAILABLE.So it 
>> seems ERR_UNAVAILABLE is meant for such special cases.So any 
>> application must rely on its own subscription to CLMSv. Or Admin will 
>> have to take care of this.
>> I will check other SAF documents like Cprogramming doc and overview 
>> doc if something in this context is mentioned.
> [Minh] I think application can be purely NTF client only which does 
> not have to initialize with AMF, or maybe I don't understand your idea.
> Let's look at this example: Running subscriber with filter "ABC", lock 
> CLM node, unlock CLM node again. Then some applications in cluster 
> raise notification ABC.
> With current implementation, this subscriber get notified 
> ERR_UNAVAILABLE when notification ABC coming to its mailbox, thus it 
> eventually lost this notification ABC.
> But if NTF notified ERR_UNAVAILABLE after locking CLM node, t

Re: [devel] [PATCH 1 of 1] ntfa: return ERR_UNAVAILABLE on non-member node after headless state [#1744]

2016-04-21 Thread minh chau
Hi,

The addon patch at least helps an existing NTF subscriber finalize 
quickly as soon as the node becomes a non-member; the subscriber should 
not be informed as late as when the node rejoins the cluster, as with the 
current #1744. The next step, how to detect that the NTF service is 
available again, depends on whether the client is a pure NTF application 
(like ntftool) or a SAF application.
So it's an ack from me for #1744 + addon patch.

Thanks,
Minh
On 21/04/16 17:27, praveen malviya wrote:
> Hi Minh,
>
> Return code ERR_UNAVAILABLE is not an indication for any client that 
> the node has lost CLM membership, because the same return code is given 
> to a stale client when the node becomes a member again.
> Also, when the node loses membership and a client gets ERR_UNAVAILABLE, 
> it will finalize all handles. After finalizing, the client again needs 
> an indication that the node has joined the membership (it cannot retry 
> in a while loop on ERR_UNAVAILABLE from sa<*>Initialize()). The client 
> can get such an indication only when it is also a CLM client using the 
> tracker interface. The tracker interface APIs work on non-member nodes 
> as well, but on a non-member node they give only local node 
> information, and such a client needs only this much information. So 
> upon receiving the CLM callback for the local node joining, this client 
> will call sa<*>Initialize() and this call will succeed. In this way, I 
> think, for a normal cluster it is the responsibility of the client 
> process to detect node membership status by becoming a CLM client also.
>
> In headless state, the CLM membership status of client nodes is not 
> remembered by directors, and they will have to rely on new CLM callbacks 
> after the first controller comes up. At the same time, a CLM client will 
> also get BAD_HANDLE after the first controller comes up. Considering this 
> situation, the attached patch 1744_addon.patch will give a dummy event to 
> the client so that it can call saNtfDispatch. It will work for both 
> headless and non-headless clusters. But this topic can be revisited a) 
> 5.1 when all services only on CLM indication or b) when we have more 
> clarity on CLM status of nodes during headless.
>
> The attached patch is on top of the #1744 patch, and it also fixes 
> ntfsubscribe to call saNtfFinalize() and exit on receiving 
> ERR_UNAVAILABLE. I would like to push 1744 before RC2.
>
> Thanks,
> Praveen
>
>
>
>
> On 21-Apr-16 5:38 AM, minh chau wrote:
>> Hi Praveen,
>>
>> Would you think about quick patch that notify client's mailbox a dummy
>> callback after Agent detect it's non-member, so NTF client can finalize
>> handle right after that. Otherwise as below your explanation, there will
>> be implicit dependency of NTF user on AMF or CLM in this case, and that
>> should be documented.
>
>>
>> Thanks,
>> Minh
>>
>> On 14/04/16 07:01, minh chau wrote:
>>>
>>>
>>> On 13/04/16 15:43, praveen malviya wrote:
>>>>
>>>>
>>>> On 12-Apr-16 10:24 PM, minh chau wrote:
>>>>>
>>>>>
>>>>> On 12/04/16 21:49, praveen malviya wrote:
>>>>>>
>>>>>>
>>>>>> On 12-Apr-16 3:56 PM, minh chau wrote:
>>>>>>> Hi Praveen
>>>>>>>
>>>>>>> NTF server also accepts initialize request (and here it comes from
>>>>>>> reinitializeClient() after headless) if NTF server has not
>>>>>>> initialized
>>>>>>> with CLM.
>>>>>>> So after headless, this situation will most likely happen. The
>>>>>>> recovery
>>>>>>> would succeeds, but after that what if NTF server notifies the
>>>>>>> agent it
>>>>>>> is not longer a member, could a subscriber be waiting for
>>>>>>> notification
>>>>>>> while agent is not a member anymore?
>>>>>>>
>>>>>> There is only one event that can lead to this and that is OpenSAF 
>>>>>> stop
>>>>>> on the node as admin operations are not available in headless state.
>>>>>> But this is the limitation of whole headless solution in every 
>>>>>> service
>>>>>> as there is no recovery of CLM status of client node at each 
>>>>>> director
>>>>>> and also recovery of clients is being done very early at MDS up 
>>>>>> event
>>>>>> of the service.
>>>>>>
>>>>> [Minh] Actually, in non-headless this situation also happens. When
>>>>> client is subscribing for notification, lock a clm node.

Re: [devel] [PATCH 1 of 1] amfnd: mark SU RESTARTING in comp FSM during restart of comp(s) [#1752]

2016-04-26 Thread minh chau
Tested the patch, ack with minor comment, please see inline
Thanks,
Minh

On 22/04/16 19:36, praveen.malv...@oracle.com wrote:
>   osaf/services/saf/amf/amfnd/clc.cc|  85 
> ++-
>   osaf/services/saf/amf/amfnd/include/avnd_su.h |   1 +
>   osaf/services/saf/amf/amfnd/susm.cc   |   2 +-
>   3 files changed, 86 insertions(+), 2 deletions(-)
>
>
> In reported problem, AMFD does not send state change notification for
> 1)SU presence state change from INSTANTIATED to RESTARTING and
> 2)SU presence state change from RESTARTING to INSTANTIATED
> when component restarts due to RESTART admin op on it or faults with 
> comp-restart recovery.
>
> As per AMF spec, presence state of SU will be RESTARTING when all of its comp 
> are in RESTARTING
> presence state. In this case when comp restarts due to fault or RESTART admin 
> op on it, AMFND
> is not marking SU's presence state RESTARTING and SU remains in INSTANTIATED 
> state. Since there
> is no change in presence state of SU, no state change notification for SU is 
> sent.
>
> Patch fixes the problem by marking SU's presence state to RESTARTING when:
> 1)SU consists of a single restartable component and this comp restarts due to 
> fault or RESTART
> admin op on it.
> 2)SU consists of all restartable components and all these components faults 
> simultaneously.
>
> diff --git a/osaf/services/saf/amf/amfnd/clc.cc 
> b/osaf/services/saf/amf/amfnd/clc.cc
> --- a/osaf/services/saf/amf/amfnd/clc.cc
> +++ b/osaf/services/saf/amf/amfnd/clc.cc
> @@ -993,7 +993,36 @@ uint32_t avnd_comp_clc_st_chng_prc(AVND_
>   if ((SA_AMF_PRESENCE_TERMINATING == prv_st) && 
> (SA_AMF_PRESENCE_TERMINATION_FAILED == final_st)) {
>   /* termination failed.. log it */
>   }
> -
> + //Instantiated -> Restarting.
> + if ((prv_st == SA_AMF_PRESENCE_INSTANTIATED) && (final_st == 
> SA_AMF_PRESENCE_RESTARTING)) {
> + /*
> +This presence state transition involving RESTARTING 
> state may originate
> +with or without any SU FSM event :
> + a)Without SU FSM event: when component is 
> restartable
> +and event is fault or RESTART admin op on 
> component.
> + b)With SU FSM event: when all comps are 
> restartable and RESTART
> +   admin op on SU.
> +In the case b) SU FSM takes care of moving presence 
> state of SU to
> +RESTARTING when all of its components (all 
> restartable) are
> +in RESTARTING state at any given point of time.
> +
> +In case a), SU FSM never gets triggered because 
> restart of component
> +is totally restricted to comp FSM. So in this case 
> comp FSM itself
> +will have to mark SU's presence state RESTARTING 
> whenever all the components
> +are in Restarting state. This can occur in:
> + -confs with single restartable component 
> because of fault with comp-restart
> +  recovery or RESTART admin op on comp. OR
> + -confs with all comp restartable when all comps 
> faults with comp-restart recovery.
> +So if I am here because of case a) check if I can 
> mark SU RESTARTING.
> +  */
> + if ((isRestartSet(comp->su) == false) &&
> + ((m_AVND_COMP_IS_FAILED(comp)) || 
> (comp->admin_oper == true)) &&
> + (su_evaluate_restarting_state(comp->su) == 
> true)) {
> + TRACE_1("Comp RESTARTING due to comp-restart 
> recovery or RESTART admin op");
> + avnd_su_pres_state_set(cb, comp->su, 
> SA_AMF_PRESENCE_RESTARTING);
> + }
> + 
> + }
>   /* restarting -> instantiated */
>   if ((SA_AMF_PRESENCE_RESTARTING == prv_st) && 
> (SA_AMF_PRESENCE_INSTANTIATED == final_st)) {
>   /* reset the comp failed flag & set the oper state to 
> enabled */
> @@ -1021,6 +1050,16 @@ uint32_t avnd_comp_clc_st_chng_prc(AVND_
>   rc = avnd_comp_csi_reassign(cb, comp);
>   if (NCSCC_RC_SUCCESS != rc)
>   goto done;
> + /*
> +Mark SU Instantiated when atleast one 
> component moves to instantiated state.
> +Single comp restarting case or fault of all 
> restartable comps with comp-restart
> +recovery.  For more details read in 
> transition from INSTANTIATED to RESTARTING.
> +   

Re: [devel] [PATCH 1 of 1] amfnd: mark SU RESTARTING in comp FSM during restart of comp(s) [#1752]

2016-04-26 Thread minh chau
Hi,

I guess you can change it to all_csis_in_restarting_state(su, ignored_csi = 
nullptr). For the fix of this ticket, we can specify the CSI that 
should be skipped.
Code specific to a particular type of component should be kept to a 
minimum in the comp CLC state change handling. In the current 
avnd_comp_clc_st_chng_prc(), some parts of the state change handling 
are very similar for NPI and PI, which makes it a long function. 
avnd_comp_clc_st_chng_prc() should only hold the abstract skeleton of 
state handling and call external methods of a PI/NPI SU class. That's 
just my suggestion and you can decide, since I'm not sure whether we 
plan to refactor it.
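
A rough sketch of the suggested signature, written in plain C (which has 
no default arguments, so callers pass NULL explicitly). The function name 
mirrors the one used in this thread, but the types csi_rec_t and su_t and 
the body are simplified assumptions, not the amfnd code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the amfnd records. */
typedef struct csi_rec {
    bool restarting;
    struct csi_rec *next;
} csi_rec_t;

typedef struct {
    csi_rec_t *csi_list;
} su_t;

/* True if every CSI record of the SU is in restarting state, skipping
 * 'ignored_csi' (which may be NULL to check all of them). */
bool all_csis_in_restarting_state(const su_t *su, const csi_rec_t *ignored_csi)
{
    assert(su != NULL);
    for (const csi_rec_t *csi = su->csi_list; csi != NULL; csi = csi->next) {
        if (csi == ignored_csi)
            continue; /* this one is marked RESTARTING later, so skip it */
        if (!csi->restarting)
            return false;
    }
    return true;
}
```

With this shape the comp FSM can pass the CSI that has not yet been 
marked, while existing callers pass NULL and keep their current behavior.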

Thanks,
Minh

On 26/04/16 20:51, praveen malviya wrote:
> Hi Minh,
>
> I thought of that but su_evaluate_restarting_state() is for PI SU. For 
> NPI SU all_csis_in_restarting_state() is there. But this function 
> cannot be used in comp FSM as the current COMPCSI will be marked 
> RESTARTING in comp_restart_init() very lately. At most for comp FSM , 
> the for loop in discussion can be moved in a function like 
> all_csis_in_restarting_execpt_given(su, csi).
>
> Thanks,
> Praveen
>
> On 26-Apr-16 3:24 PM, minh chau wrote:
>> Tested the patch, ack with minor comment, please see inline
>> Thanks,
>> Minh
>>
>> On 22/04/16 19:36, praveen.malv...@oracle.com wrote:
>>> osaf/services/saf/amf/amfnd/clc.cc|  85
>>> ++-
>>>   osaf/services/saf/amf/amfnd/include/avnd_su.h |   1 +
>>>   osaf/services/saf/amf/amfnd/susm.cc   |   2 +-
>>>   3 files changed, 86 insertions(+), 2 deletions(-)
>>>
>>>
>>> In reported problem, AMFD does not send state change notification for
>>> 1)SU presence state change from INSTANTIATED to RESTARTING and
>>> 2)SU presence state change from RESTARTING to INSTANTIATED
>>> when component restarts due to RESTART admin op on it or faults with
>>> comp-restart recovery.
>>>
>>> As per AMF spec, presence state of SU will be RESTARTING when all of
>>> its comp are in RESTARTING
>>> presence state. In this case when comp restarts due to fault or
>>> RESTART admin op on it, AMFND
>>> is not marking SU's presence state RESTARTING and SU remains in
>>> INSTANTIATED state. Since there
>>> is no change in presence state of SU, no state change notification for
>>> SU is sent.
>>>
>>> Patch fixes the problem by marking SU's presence state to RESTARTING
>>> when:
>>> 1)SU consists of a single restartable component and this comp restarts
>>> due to fault or RESTART
>>> admin op on it.
>>> 2)SU consists of all restartable components and all these components
>>> faults simultaneously.
>>>
>>> diff --git a/osaf/services/saf/amf/amfnd/clc.cc
>>> b/osaf/services/saf/amf/amfnd/clc.cc
>>> --- a/osaf/services/saf/amf/amfnd/clc.cc
>>> +++ b/osaf/services/saf/amf/amfnd/clc.cc
>>> @@ -993,7 +993,36 @@ uint32_t avnd_comp_clc_st_chng_prc(AVND_
>>>   if ((SA_AMF_PRESENCE_TERMINATING == prv_st) &&
>>> (SA_AMF_PRESENCE_TERMINATION_FAILED == final_st)) {
>>>   /* termination failed.. log it */
>>>   }
>>> -
>>> +//Instantiated -> Restarting.
>>> +if ((prv_st == SA_AMF_PRESENCE_INSTANTIATED) && (final_st ==
>>> SA_AMF_PRESENCE_RESTARTING)) {
>>> +/*
>>> +   This presence state transition involving RESTARTING
>>> state may originate
>>> +   with or without any SU FSM event :
>>> +   a)Without SU FSM event: when component is 
>>> restartable
>>> +   and event is fault or RESTART admin op on 
>>> component.
>>> +b)With SU FSM event: when all comps are
>>> restartable and RESTART
>>> +  admin op on SU.
>>> +   In the case b) SU FSM takes care of moving presence
>>> state of SU to
>>> +   RESTARTING when all of its components (all
>>> restartable) are
>>> +   in RESTARTING state at any given point of time.
>>> +
>>> +   In case a), SU FSM never gets triggered because
>>> restart of component
>>> +   is totally restricted to comp FSM. So in this case
>>> comp FSM itself
>>> +   will have to mark SU's presence state RESTARTING
>>> whenever all the components
>>> +   are in Restarting state. This can 

Re: [devel] [PATCH 1 of 1] amfnd: mark SU RESTARTING in comp FSM during restart of comps [#1752] V2

2016-04-27 Thread minh chau
Ack from me
Thanks,
Minh

On 27/04/16 16:17, praveen.malv...@oracle.com wrote:
>   osaf/services/saf/amf/amfnd/clc.cc|  73 
> ++-
>   osaf/services/saf/amf/amfnd/include/avnd_su.h |   2 +
>   osaf/services/saf/amf/amfnd/susm.cc   |  10 ++-
>   3 files changed, 81 insertions(+), 4 deletions(-)
>
>
> In reported problem, AMFD does not send state change notification for
> 1)SU presence state change from INSTANTIATED to RESTARTING and
> 2)SU presence state change from RESTARTING to INSTANTIATED
> when component restarts due to RESTART admin op on it or faults with 
> comp-restart recovery.
>
> As per AMF spec, presence state of SU will be RESTARTING when all of its comp 
> are in RESTARTING
> presence state. In this case when comp restarts due to fault or RESTART admin 
> op on it, AMFND
> is not marking SU's presence state RESTARTING and SU remains in INSTANTIATED 
> state. Since there
> is no change in presence state of SU, no state change notification for SU is 
> sent.
>
> Patch fixes the problem by marking SU's presence state to RESTARTING when:
> 1)SU consists of a single restartable component and this comp restarts due to 
> fault or RESTART
> admin op on it.
> 2)SU consists of all restartable components and all these components faults 
> simultaneously.
>
> diff --git a/osaf/services/saf/amf/amfnd/clc.cc 
> b/osaf/services/saf/amf/amfnd/clc.cc
> --- a/osaf/services/saf/amf/amfnd/clc.cc
> +++ b/osaf/services/saf/amf/amfnd/clc.cc
> @@ -993,7 +993,36 @@ uint32_t avnd_comp_clc_st_chng_prc(AVND_
>   if ((SA_AMF_PRESENCE_TERMINATING == prv_st) && 
> (SA_AMF_PRESENCE_TERMINATION_FAILED == final_st)) {
>   /* termination failed.. log it */
>   }
> -
> + //Instantiated -> Restarting.
> + if ((prv_st == SA_AMF_PRESENCE_INSTANTIATED) && (final_st == 
> SA_AMF_PRESENCE_RESTARTING)) {
> + /*
> +This presence state transition involving RESTARTING 
> state may originate
> +with or without any SU FSM event :
> + a)Without SU FSM event: when component is 
> restartable
> +and event is fault or RESTART admin op on 
> component.
> + b)With SU FSM event: when all comps are 
> restartable and RESTART
> +   admin op on SU.
> +In the case b) SU FSM takes care of moving presence 
> state of SU to
> +RESTARTING when all of its components (all 
> restartable) are
> +in RESTARTING state at any given point of time.
> +
> +In case a), SU FSM never gets triggered because 
> restart of component
> +is totally restricted to comp FSM. So in this case 
> comp FSM itself
> +will have to mark SU's presence state RESTARTING 
> whenever all the components
> +are in Restarting state. This can occur in:
> + -confs with single restartable component 
> because of fault with comp-restart
> +  recovery or RESTART admin op on comp. OR
> + -confs with all comp restartable when all comps 
> faults with comp-restart recovery.
> +So if I am here because of case a) check if I can 
> mark SU RESTARTING.
> +  */
> + if ((isRestartSet(comp->su) == false) &&
> + ((m_AVND_COMP_IS_FAILED(comp)) || 
> (comp->admin_oper == true)) &&
> + (su_evaluate_restarting_state(comp->su) == 
> true)) {
> + TRACE_1("Comp RESTARTING due to comp-restart 
> recovery or RESTART admin op");
> + avnd_su_pres_state_set(cb, comp->su, 
> SA_AMF_PRESENCE_RESTARTING);
> + }
> + 
> + }
>   /* restarting -> instantiated */
>   if ((SA_AMF_PRESENCE_RESTARTING == prv_st) && 
> (SA_AMF_PRESENCE_INSTANTIATED == final_st)) {
>   /* reset the comp failed flag & set the oper state to 
> enabled */
> @@ -1021,6 +1050,16 @@ uint32_t avnd_comp_clc_st_chng_prc(AVND_
>   rc = avnd_comp_csi_reassign(cb, comp);
>   if (NCSCC_RC_SUCCESS != rc)
>   goto done;
> + /*
> +Mark SU Instantiated when atleast one 
> component moves to instantiated state.
> +Single comp restarting case or fault of all 
> restartable comps with comp-restart
> +recovery.  For more details read in 
> transition from INSTANTIATED to RESTARTING.
> +  */
> +  

Re: [devel] [PATCH 1 of 1] AMFND: Resend pg information after headless [#1719]

2016-04-28 Thread minh chau
Hi,

#1719 has its milestone set for GA, can you please review it?

Thanks
Minh

On 16/04/16 22:01, opensaf-devel-requ...@lists.sourceforge.net wrote:
> Message: 8
> Date: Thu, 07 Apr 2016 11:35:50 +1000
> From: Minh Hon Chau
> Subject: [devel] [PATCH 1 of 1] AMFND: Resend pg information after
>   headless[#1719]
> To:nagendr...@oracle.com,gary@dektech.com.au,
>   hans.nordeb...@ericsson.com,praveen.malv...@oracle.com
> Cc:opensaf-devel@lists.sourceforge.net
> Message-ID: <539c79d7102ab84de4f9.1459992950@kvmu1404>
> Content-Type: text/plain; charset="us-ascii"
>
>   osaf/services/saf/amf/amfnd/di.cc |  36 
> +
>   osaf/services/saf/amf/amfnd/include/avnd_di.h |   1 +
>   osaf/services/saf/amf/amfnd/verify.cc |  38 
> +--
>   3 files changed, 38 insertions(+), 37 deletions(-)
>
>
> If SC comes back from headless, currently protection group information
> will be lost at amfd.
>
> Patch moves the function for resending protection group for failover into
> di.cc for common usage, then reuses this function to recover pg from headless
>
> diff --git a/osaf/services/saf/amf/amfnd/di.cc 
> b/osaf/services/saf/amf/amfnd/di.cc
> --- a/osaf/services/saf/amf/amfnd/di.cc
> +++ b/osaf/services/saf/amf/amfnd/di.cc
> @@ -1260,6 +1260,8 @@ void avnd_diq_rec_del(AVND_CB *cb, AVND_
>   avnd_diq_rec_send(cb, pending_rec);
>   }
>   }
> + /* resend pg start track */
> + avnd_di_resend_pg_start_track(cb);
>   }
>   
>   /* free the avnd message contents */
> @@ -1459,6 +1461,40 @@ uint32_t avnd_evt_avd_role_change_evh(AV
>   return rc;
>   }
>   
> +/
> +  Name  : avnd_di_resend_pg_start_track
> +
> +  Description   : This routing will get called on AVD fail-over or coming 
> back
> +  from headless to send the PG start messages to the new AVD.
> +
> +  Arguments : cb  - ptr to the AvND control block
> +
> +  Return Values : NCSCC_RC_SUCCESS/NCSCC_RC_FAILURE
> +
> +  Notes : None.
> +**/
> +uint32_t avnd_di_resend_pg_start_track(AVND_CB *cb)
> +{
> + uint32_t rc = NCSCC_RC_SUCCESS;
> + AVND_PG *pg = 0;
> + SaNameT csi_name;
> + TRACE_ENTER();
> +
> + memset(&csi_name, '\0', sizeof(SaNameT));
> +
> + while (nullptr != (pg = m_AVND_PGDB_REC_GET_NEXT(cb->pgdb, csi_name))) {
> + rc = avnd_di_pg_act_send(cb, &pg->csi_name, 
> AVSV_PG_TRACK_ACT_START, true);
> +
> + if (NCSCC_RC_SUCCESS != rc)
> + break;
> +
> + csi_name = pg->csi_name;
> + }
> +
> + TRACE_LEAVE();
> + return rc;
> +}
> +
>   /**
>* The SC absence timer expired. Reboot this node.
>* @param cb
> diff --git a/osaf/services/saf/amf/amfnd/include/avnd_di.h 
> b/osaf/services/saf/amf/amfnd/include/avnd_di.h
> --- a/osaf/services/saf/amf/amfnd/include/avnd_di.h
> +++ b/osaf/services/saf/amf/amfnd/include/avnd_di.h
> @@ -83,6 +83,7 @@ uint32_t avnd_diq_rec_send(struct avnd_c
>   uint32_t avnd_di_reg_su_rsp_snd(struct avnd_cb_tag *cb, SaNameT *su_name, 
> uint32_t ret_code);
>   uint32_t avnd_di_ack_nack_msg_send(struct avnd_cb_tag *cb, uint32_t rcv_id, 
> uint32_t view_num);
>   extern void avnd_di_uns32_upd_send(int class_id, int attr_id, const SaNameT 
> *dn, uint32_t value);
> +extern uint32_t avnd_di_resend_pg_start_track(struct avnd_cb_tag *);
>   void avnd_sync_sisu(struct avnd_cb_tag *cb);
>   void avnd_sync_csicomp(struct avnd_cb_tag *cb);
>   
> diff --git a/osaf/services/saf/amf/amfnd/verify.cc 
> b/osaf/services/saf/amf/amfnd/verify.cc
> --- a/osaf/services/saf/amf/amfnd/verify.cc
> +++ b/osaf/services/saf/amf/amfnd/verify.cc
> @@ -34,42 +34,6 @@
>   
>   #include "avnd.h"
>   
> -static uint32_t avnd_send_pg_start_on_fover(AVND_CB *cb);
> -
> -/
> -  Name  : avnd_send_pg_start_on_fover
> -
> -  Description   : This routing will get called on AVD fail-over to send the
> -  PG start messages to the new AVD.
> -
> -  Arguments : cb  - ptr to the AvND control block
> -
> -  Return Values : NCSCC_RC_SUCCESS/NCSCC_RC_FAILURE
> -
> -  Notes : None.
> -**/
> -static uint32_t avnd_send_pg_start_on_fover(AVND_CB *cb)
> -{
> - uint32_t rc = NCSCC_RC_SUCCESS;
> - AVND_PG *pg = 0;
> - SaNameT csi_name;
> - TRACE_ENTER();
> -
> - memset(&csi_name, '\0', sizeof(SaNameT));
> -
> - while (nullptr != (pg = m_AVND_PGDB_REC_GET_NEXT(cb->pgdb, csi_name))) {
> - rc = avnd_di_pg_act_send(cb, &pg->csi_name, 
> AVSV_PG_TRACK_ACT_START, true);
> -
> - if (NCSCC_RC_SUCCESS != rc)
> -   

Re: [devel] [PATCH 1 of 1] NTFA: Update server state NTFA_NTFSV_NEW_ACTIVE to NTFA_NTFSV_UP at failover [#1785]

2016-04-28 Thread minh chau
Hi Praveen,

Yes, this is the problem of an agent coming up during failover: the
server state is then set to NTFA_NTFSV_NEW_ACTIVE. From this state, the
server state will never be set to NTFA_NTFSV_UP, because the
implementation treats NTFA_NTFSV_NEW_ACTIVE as NTFA_NTFSV_UP, since in
both of those states the NTF service should be operational. Every state
other than NTFA_NTFSV_UP will get TRY_AGAIN.
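
A minimal sketch of the transition out of NTFA_NTFSV_NONE that the patch
introduces, assuming a simplified state enum (ServerState and the two
helper functions are hypothetical names, not the agent's real code):

```cpp
// Hypothetical, simplified model of the agent's view of the server.
enum class ServerState { None, Down, NewActive, Up };

// Transition taken while the agent is still in the None state.
// NewActive is folded into Up so the agent never stores NewActive.
ServerState next_state_from_none(ServerState changed) {
  return (changed == ServerState::NewActive) ? ServerState::Up : changed;
}

// Only Up lets an API call through; any other state means TRY_AGAIN.
bool api_must_try_again(ServerState current) {
  return current != ServerState::Up;
}
```

Without the folding step, an agent that stored NewActive would answer
TRY_AGAIN to every call.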

Thanks,
Minh

On 29/04/16 15:03, praveen malviya wrote:
> Hi Minh,
>
> Ntfa marks NTFA_NTFSV_NONE only when agent is created and in 
> ntfa_shutdown(). Is this problem being faced for a new agent coming up 
> during failover? In other cases agent is initialized so application 
> will get TRY_AGAIN for any API call or am I missing something here?
>
> Thanks,
> Praveen
>
> On 27-Apr-16 10:22 PM, Minh Hon Chau wrote:
>>  osaf/libs/agents/saf/ntfa/ntfa_util.c | 6 +-
>>  1 files changed, 5 insertions(+), 1 deletions(-)
>>
>>
>> If NTF client initializes and sends notification while failover 
>> happening, Agent
>> could possibly update server state as NTFA_NTFSV_NEW_ACTIVE. This 
>> server state
>> NTFA_NTFSV_NEW_ACTIVE currently treats as NTFA_NTFSV_UP, but in this 
>> scenario
>> server state is udpated as NTFA_NTFSV_NEW_ACTIVE which is not 
>> supported in
>> ntfa_update_ntfsv_state() and checkNtfServerState().
>>
>> Patch make sures server state will not dropped into 
>> NTFA_NTFSV_NEW_ACTIVE
>>
>> diff --git a/osaf/libs/agents/saf/ntfa/ntfa_util.c 
>> b/osaf/libs/agents/saf/ntfa/ntfa_util.c
>> --- a/osaf/libs/agents/saf/ntfa/ntfa_util.c
>> +++ b/osaf/libs/agents/saf/ntfa/ntfa_util.c
>> @@ -1504,7 +1504,11 @@ void ntfa_update_ntfsv_state(ntfa_ntfsv_
>>
>>  switch (ntfa_cb.ntfa_ntfsv_state){
>>  case NTFA_NTFSV_NONE:
>> -ntfa_cb.ntfa_ntfsv_state = changedState;
>> +if (changedState == NTFA_NTFSV_NEW_ACTIVE) {
>> +ntfa_cb.ntfa_ntfsv_state = NTFA_NTFSV_UP;
>> +} else {
>> +ntfa_cb.ntfa_ntfsv_state = changedState;
>> +}
>>  break;
>>  case NTFA_NTFSV_DOWN:
>>  if (changedState == NTFA_NTFSV_NEW_ACTIVE ||
>>
>


___
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel


Re: [devel] [PATCH 1 of 1] NTFA: Update server state NTFA_NTFSV_NEW_ACTIVE to NTFA_NTFSV_UP at failover [#1785]

2016-04-29 Thread minh chau
Hi Lennart,

I don't know how to test this, probably by repeatedly sending
notifications during failover.
The code makes it quite clear that the only place where the server state
changes to NTFA_NTFSV_NEW_ACTIVE is the switch-case of NTFA_NTFSV_NONE.
Can you please help to push this? This bug leaves the agent in a bad
state, since the agent has no way to get out of NTFA_NTFSV_NEW_ACTIVE.

Thanks,
Minh

On 30/04/16 00:38, Lennart Lund wrote:
> Ack
>
> But how to test this?
>
> Thanks
> Lennart
>
>> -Original Message-
>> From: Minh Hon Chau [mailto:minh.c...@dektech.com.au]
>> Sent: den 27 april 2016 18:52
>> To: Lennart Lund; praveen.malv...@oracle.com
>> Cc: opensaf-devel@lists.sourceforge.net
>> Subject: [PATCH 1 of 1] NTFA: Update server state
>> NTFA_NTFSV_NEW_ACTIVE to NTFA_NTFSV_UP at failover [#1785]
>>
>>   osaf/libs/agents/saf/ntfa/ntfa_util.c |  6 +-
>>   1 files changed, 5 insertions(+), 1 deletions(-)
>>
>>
>> If NTF client initializes and sends notification while failover happening, 
>> Agent
>> could possibly update server state as NTFA_NTFSV_NEW_ACTIVE. This
>> server state
>> NTFA_NTFSV_NEW_ACTIVE currently treats as NTFA_NTFSV_UP, but in this
>> scenario
>> server state is udpated as NTFA_NTFSV_NEW_ACTIVE which is not
>> supported in
>> ntfa_update_ntfsv_state() and checkNtfServerState().
>>
>> Patch make sures server state will not dropped into
>> NTFA_NTFSV_NEW_ACTIVE
>>
>> diff --git a/osaf/libs/agents/saf/ntfa/ntfa_util.c
>> b/osaf/libs/agents/saf/ntfa/ntfa_util.c
>> --- a/osaf/libs/agents/saf/ntfa/ntfa_util.c
>> +++ b/osaf/libs/agents/saf/ntfa/ntfa_util.c
>> @@ -1504,7 +1504,11 @@ void ntfa_update_ntfsv_state(ntfa_ntfsv_
>>
>>  switch (ntfa_cb.ntfa_ntfsv_state){
>>  case NTFA_NTFSV_NONE:
>> -ntfa_cb.ntfa_ntfsv_state = changedState;
>> +if (changedState ==
>> NTFA_NTFSV_NEW_ACTIVE) {
>> +ntfa_cb.ntfa_ntfsv_state =
>> NTFA_NTFSV_UP;
>> +} else {
>> +ntfa_cb.ntfa_ntfsv_state =
>> changedState;
>> +}
>>  break;
>>  case NTFA_NTFSV_DOWN:
>>  if (changedState == NTFA_NTFSV_NEW_ACTIVE
>> ||




Re: [devel] [PATCH 1 of 1] ntfa: Lower mds priority for initialize msg [#1818]

2016-05-15 Thread minh chau
Hi Praveen,

Please see comments in line.

Thanks,
Minh

On 13/05/16 17:17, praveen malviya wrote:
> Hi Minh,
>
> I am trying to understand the problem.
>
> As per these ntfd traces, in the down event at below at 3) ntfd clears 
> the client data so subsequent requests for Unsubscribe() and 
> Finalize() for same client will surely fail with reported error.
> One question : Why the down event at 3) below is coming before the 
> unsubscribe() and Finalize() requests keeping in mind that both 
> Unsubscribe() and Finalize() are sync calls.
[Minh]: The down event comes from the mds thread, while Unsubscribe() and
Finalize() come from client threads, so there is a chance those events
arrive in an unexpected order. In mds_mcm_user_event_callback() the
callbacks are sent with priority MDS_SEND_PRIORITY_MEDIUM, while the NTF
agent sends its messages with MDS_SEND_PRIORITY_HIGH. I'm not familiar
with the mds code, but the message priority doesn't look right here.
As stated in the patch description, this patch is not a complete
solution. It just gives the initialize msg a lower priority than the
others, which means the other msgs that belong to an existing handle's
life cycle are prioritized over initializing new handles. I think this
doesn't cause any harm to the agent's way of working. Or do you (and
Lennart) see any side effect of this patch?
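
A minimal sketch of the idea, assuming hypothetical message and priority
types (MsgType, SendPriority and choose_send_priority are made-up names;
the actual condition and priority value used by the patch are not
reproduced here):

```cpp
// Hypothetical message types and priorities; the real MDS and NTF
// definitions are not reproduced here.
enum class MsgType { Initialize, Subscribe, Unsubscribe, Finalize };
enum class SendPriority { Medium, High };

// Initialize requests start a new handle life cycle, so they yield to
// messages that belong to an already existing handle.
// Medium is an assumed value for the lowered priority.
SendPriority choose_send_priority(MsgType type) {
  return (type == MsgType::Initialize) ? SendPriority::Medium
                                       : SendPriority::High;
}
```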

>
> I have run suite 2 and got the same error. But that error comes when 
> ntfd gets Unsubscribe() request after Finalize() request.In that 
> situation this is an acceptable error.
>
> Messages from NTFD traces:
>
> 1)Client came up at:
>  Mar 10 14:53:01.402362 osafntfd [463:ntfs_mds.c:0078] T8 
> NTFSV_INITIALIZE_REQ
> Mar 10 14:53:01.402493 osafntfd [463:ntfs_evt.c:0263] >> 
> proc_initialize_msg: dest 2010f0331
> Mar 10 14:53:01.402515 osafntfd [463:NtfClient.cc:0043] T3 
> NtfClient::NtfClient NtfClient 28 created mdest: 564113889559345
>
> 2)Its subscription request comes at :
> Mar 10 14:53:01.403921 osafntfd [463:ntfs_evt.c:0315] >> 
> proc_subscribe_msg
> Mar 10 14:53:01.403929 osafntfd [463:ntfs_evt.c:0318] T4 
> subscriptionId: 111
> Mar 10 14:53:01.403992 osafntfd [463:NtfSubscription.cc:0045] T2 
> Subscription 111 created for client_id 28
>
> 3) Then came the down event for the same client at (2 times):
> Mar 10 14:53:01.405487 osafntfd [463:ntfs_evt.c:0101] >> 
> proc_ntfa_updn_mds_msg
> Mar 10 14:53:01.405495 osafntfd [463:NtfAdmin.cc:0504] >> 
> clientRemoveMDS: REMOVE mdsDest: 564113889559345
> Mar 10 14:53:01.405583 osafntfd [463:ntfs_evt.c:0101] >> 
> proc_ntfa_updn_mds_msg
> Mar 10 14:53:01.405591 osafntfd [463:NtfAdmin.cc:0504] >> 
> clientRemoveMDS: REMOVE mdsDest: 564113889559345
> Mar 10 14:53:01.405598 osafntfd [463:NtfAdmin.cc:0521] << clientRemoveMDS
>
> 4)After this reported error got in syslog during Unsubscribe() and 
> Finalize() requests as the clinet was removed already:
> Mar 10 14:53:01.405862 osafntfd [463:ntfs_mds.c:0105] T8 
> NTFSV_FINALIZE_REQ
> Mar 10 14:53:01.405875 osafntfd [463:ntfs_evt.c:0338] >> 
> proc_unsubscribe_msg: client_id 28, subscriptionId 111
> Mar 10 14:53:01.406044 osafntfd [463:NtfAdmin.cc:0553] ER 
> NtfAdmin::subscriptionRemoved client 28 not found
> Mar 10 14:53:01.406061 osafntfd [463:ntfs_evt.c:0341] << 
> proc_unsubscribe_msg
> Mar 10 14:53:01.406079 osafntfd [463:ntfs_evt.c:0291] >> 
> proc_finalize_msg: client_id 28
> Mar 10 14:53:01.406088 osafntfd [463:NtfAdmin.cc:0480] T2 
> NtfAdmin::clientRemoved client 28 not found
> Mar 10 14:53:01.406095 osafntfd [463:ntfs_com.c:0074] >> 
> client_removed_res_lib: clientId: 28, rv: 1
>
>
> Thanks,
> Praveen
>
>
>
>
> On 11-May-16 7:36 PM, Minh Hon Chau wrote:
>>  osaf/libs/agents/saf/ntfa/ntfa_mds.c |  9 -
>>  1 files changed, 8 insertions(+), 1 deletions(-)
>>
>>
>> When running ntftest suite, there's an issue that the messages of 
>> previous test
>> coming after some of messages of current test. This issue can also 
>> happen in
>> real application.
>>
>> The patch lowers mds priority of initialize msg so that the other 
>> messages have
>> a chance to reach ntf server earlier. This is not a complete 
>> solution, since
>> there could be a race condition between the other messages not just 
>> only with
>> initialize messages. If it happens, priority of those messages can be 
>> considered
>> in which race condition happens.
>>
>> diff --git a/osaf/libs/agents/saf/ntfa/ntfa_mds.c 
>> b/osaf/libs/agents/saf/ntfa/ntfa_mds.c
>> --- a/osaf/libs/agents/saf/ntfa/ntfa_mds.c
>> +++ b/osaf/libs/agents/saf/ntfa/ntfa_mds.c
>> @@ -1177,7 +1177,14 @@ uint32_t ntfa_mds_msg_sync_send(ntfa_cb_
>>  mds_info.info.svc_send.i_msg = (NCSCONTEXT)i_msg;
>>  mds_info.info.svc_send.i_to_svc = NCSMDS_SVC_ID_NTFS;
>>  mds_info.info.svc_send.i_sendtype = MDS_SENDTYPE_SNDRSP;
>> -mds_info.info.svc_send.i_priority = MDS_SEND_PRIORITY_HIGH;
>> /* fixme? */
>> +/* initialize_msg is lower priority than the others so that
>> + * life cycle of agent will be pritorized to complete
>> + */
>> +if (i_msg-

Re: [devel] [PATCH 1 of 1] ntf: To change log severity level from LOG_ER to LOG_NO [#1832]

2016-05-19 Thread minh chau
ack

On 19/05/16 13:47, Nhat Pham wrote:
>   osaf/services/saf/ntfsv/ntfs/NtfLogger.cc |  2 +-
>   1 files changed, 1 insertions(+), 1 deletions(-)
>
>
> During testing SC failover, the following ER log sometimes happens.
>
> ER Failed to log an alarm or security alarm notification (6)
>
> The severity of this log is not reasonable because the notification is not 
> lost.
> It is queued and logged next time. The severity should be changed to LOG_NO
>
> diff --git a/osaf/services/saf/ntfsv/ntfs/NtfLogger.cc 
> b/osaf/services/saf/ntfsv/ntfs/NtfLogger.cc
> --- a/osaf/services/saf/ntfsv/ntfs/NtfLogger.cc
> +++ b/osaf/services/saf/ntfsv/ntfs/NtfLogger.cc
> @@ -252,7 +252,7 @@ SaAisErrorT NtfLogger::logNotification(N
>  &logRecord);
>   if (SA_AIS_OK != errorCode)
>   {
> -LOG_ER("Failed to log an alarm or security alarm notification 
> (%d)", errorCode);
> +LOG_NO("Failed to log an alarm or security alarm notification 
> (%d)", errorCode);
>   if (errorCode == SA_AIS_ERR_LIBRARY || errorCode == 
> SA_AIS_ERR_BAD_HANDLE) {
>   LOG_ER("Fatal error SA_AIS_ERR_LIBRARY or 
> SA_AIS_ERR_BAD_HANDLE; exiting (%d)...", errorCode);
>   exit(EXIT_FAILURE);
>




Re: [devel] [PATCH 1 of 1] amfnd: fix COMP-FO recovery when cleanup time is more than sufailoverprob[#1839]

2016-05-20 Thread minh chau
Hi Praveen,

I have tested the patch, and it fixes the reported issue.
The problem seems to be a data race between the thread handling the clc
fsm and the timer expiry thread, which resets all variables of a su
while these variables are still being used elsewhere in the clc fsm.
This race condition is likely to cause other problems, but that is not
in the scope of this ticket.

I think the idea of the patch is to stop using @su_err_esc_level in
conditional statements in the clc fsm, since it is not reliable: it can
be reset at any time by the timer expiry thread. I still see one place
in avnd_comp_clc_inst_clean_hdler() where @su_err_esc_level is used in
an *if* statement, can you check whether it's safe?
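
A minimal single-threaded sketch of that idea, assuming a simplified SU
record (the struct, its fields and all function names are hypothetical;
it illustrates only the separate context flag, not the thread
synchronisation):

```cpp
// Hypothetical, simplified SU record.
struct Su {
  int err_esc_level = 0;          // reset whenever the probation timer expires
  bool failover_context = false;  // set when failover recovery starts
};

// Recovery starts: escalate and remember the failover context.
void start_comp_failover(Su& su) {
  su.err_esc_level = 2;
  su.failover_context = true;
}

// Timer expiry resets the escalation parameters only.
void on_probation_timer_expiry(Su& su) { su.err_esc_level = 0; }

// The cleanup-success handler checks the flag, not the escalation level.
bool should_request_failover(const Su& su) { return su.failover_context; }

// Repair is the point where the context is cleared again.
void on_repair(Su& su) { su.failover_context = false; }
```

Because the flag is only cleared on repair, a timer expiry during a long
cleanup no longer makes the handler forget why the cleanup was started.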

Thanks,
Minh
On 19/05/16 20:35, praveen.malv...@oracle.com wrote:
>   osaf/services/saf/amf/amfnd/clc.cc  |   2 +-
>   osaf/services/saf/amf/amfnd/err.cc  |  11 ++-
>   osaf/services/saf/amf/amfnd/susm.cc |  10 +-
>   3 files changed, 16 insertions(+), 7 deletions(-)
>
>
> In the reported problem, fault of comp leads to comp-failover escalation.But
> failover did not happen.
>
> As a part of compt-failover, amfnd launched cleanup if component. Before 
> clean up gets
> completed, su-failover time expires and AMND resets escalation parameters. In 
> this way
> AMFND loses that the context of cleanup is comp-failver recovery. When comp 
> gets cleaned up
> successfully, AMFND (not ware of context) does inform AMFD for failover of 
> assignments.
>
> Patch fixes problem by remembering the comp or su failover context using a 
> separate flag
> and not relying on escalation params.
>
> diff --git a/osaf/services/saf/amf/amfnd/clc.cc 
> b/osaf/services/saf/amf/amfnd/clc.cc
> --- a/osaf/services/saf/amf/amfnd/clc.cc
> +++ b/osaf/services/saf/amf/amfnd/clc.cc
> @@ -2291,7 +2291,7 @@ uint32_t avnd_comp_clc_terming_cleansucc
>   if (m_AVND_COMP_IS_FAILED(comp) && m_AVND_SU_IS_FAILED(su) &&
>   m_AVND_SU_IS_PREINSTANTIABLE(su) && (su->sufailover == 
> false) &&
>   (avnd_cb->oper_state != SA_AMF_OPERATIONAL_DISABLED) &&
> - (su->su_err_esc_level == AVND_ERR_ESC_LEVEL_2)) {
> + (m_AVND_SU_IS_FAILOVER(su))) {
>   /* yes, request director to orchestrate component failover */
>   rc = avnd_di_oper_send(cb, su, SA_AMF_COMPONENT_FAILOVER);
>   }
> diff --git a/osaf/services/saf/amf/amfnd/err.cc 
> b/osaf/services/saf/amf/amfnd/err.cc
> --- a/osaf/services/saf/amf/amfnd/err.cc
> +++ b/osaf/services/saf/amf/amfnd/err.cc
> @@ -759,6 +759,8 @@ uint32_t avnd_err_rcvr_comp_failover(AVN
>   if (!m_AVND_SU_IS_FAILED(su)) {
>   m_AVND_SU_FAILED_SET(su);
>   }
> + //Remember component-failover/su-failover context.
> + m_AVND_SU_FAILOVER_SET(failed_comp->su);
>   
>   /* update su oper state */
>   m_AVND_SU_OPER_STATE_SET(su, SA_AMF_OPERATIONAL_DISABLED);
> @@ -839,6 +841,9 @@ uint32_t avnd_err_rcvr_su_failover(AVND_
>   reset_suRestart_flag(su);
>   su->admin_op_Id = static_cast(0);
>   }
> + //Remember component-failover/su-failover context.
> + m_AVND_SU_FAILOVER_SET(failed_comp->su);
> +
>   LOG_NO("Terminating components of '%s'(abruptly & 
> unordered)",su->name.value);
>   /* Unordered cleanup of components of failed SU */
>   for (comp = 
> m_AVND_COMP_FROM_SU_DLL_NODE_GET(m_NCS_DBLIST_FIND_FIRST(&su->comp_list));
> @@ -932,6 +937,8 @@ uint32_t avnd_err_rcvr_node_switchover(A
>   {
>   reset_suRestart_flag(failed_su);
>   failed_su->admin_op_Id = static_cast(0);
> + //Remember su-failover context.
> + m_AVND_SU_FAILOVER_SET(failed_comp->su);
>   
>   LOG_NO("Terminating components of '%s'(abruptly & 
> unordered)",failed_su->name.value);
>   /* Unordered cleanup of components of failed SU */
> @@ -1075,6 +1082,8 @@ uint32_t avnd_err_su_repair(AVND_CB *cb,
>   if (all_comps_terminated_in_su(su) == true)
>   is_uninst = true;
>   
> + //Reset component-failover here. SU failover is reset as part of 
> REPAIRED admin op.
> + m_AVND_SU_FAILOVER_RESET(su);
>   /* scan & instantiate failed pi comps */
>   for (comp = 
> m_AVND_COMP_FROM_SU_DLL_NODE_GET(m_NCS_DBLIST_FIND_FIRST(&su->comp_list));
>comp; comp = 
> m_AVND_COMP_FROM_SU_DLL_NODE_GET(m_NCS_DBLIST_FIND_NEXT(&comp->su_dll_node))) 
> {
> @@ -1584,7 +1593,7 @@ uint32_t avnd_err_rcvr_node_failfast(AVN
>   bool is_no_assignment_due_to_escalations(AVND_SU *su)
>   {
>   TRACE_ENTER();
> - if (((sufailover_in_progress(su) == true) && (su->su_err_esc_level == 
> AVND_ERR_ESC_LEVEL_2)) ||
> + if ((sufailover_in_progress(su) == true) ||
>   (sufailover_during_nodeswitchover(su) == true) ||
>   (avnd_cb->term_state == 
> AVND_TERM_STATE_NODE_FAILOVER_TERMINATING) ||
>   (avnd_cb->term_state == 
> AVND_TERM_STATE_NODE_FAI

Re: [devel] [PATCH 1 of 1] ntfa: Lower mds priority for initialize msg [#1818]

2016-05-20 Thread minh . chau
Hi,

This patch was aimed at fixing the problem in the test that runs
multiple api life cycles in parallel. But now I realise the same problem
still exists due to mailbox priority at the ntfd side. I will have to
check and refloat another patch.

Thanks,
Minh

>
>
> On 16-May-16 7:04 AM, minh chau wrote:
>> Hi Praveen,
>>
>> Please see comments in line.
>>
>> Thanks,
>> Minh
>>
>> On 13/05/16 17:17, praveen malviya wrote:
>>> Hi Minh,
>>>
>>> I am trying to understand the problem.
>>>
>>> As per these ntfd traces, in the down event at below at 3) ntfd clears
>>> the client data so subsequent requests for Unsubscribe() and
>>> Finalize() for same client will surely fail with reported error.
>>> One question : Why the down event at 3) below is coming before the
>>> unsubscribe() and Finalize() requests keeping in mind that both
>>> Unsubscribe() and Finalize() are sync calls.
>> [Minh]: The down event comes from mds thread, and unsubcribed() &
>> finalize() come from client threads, so there could be a chance those
>> events come in unexpected order.
> [Praveen] Both Unsubscribe and Finalize() are sync call. So ntftest will
> call finalize() and when it completes then only mds thread will exit and
> down event will be generated. In that case only for Unsubsribe() such
> error can come as it may get timed in separate thread after completion
> of Finalize().
>
>
> Thanks,
> Praveen
>   In mds_mcm_user_event_callback() the
>> callback are sent with priority MDS_SEND_PRIORITY_MEDIUM while NTF agent
>> is sending msg with MDS_SEND_PRIORITY_HIGH. I'm not familiar with mds
>> code but the msg priority doesn't look right here.
>> As in patch description, this patch is not complete solution, it just
>> lowers the initialize msg than the others, that means the other msgs
>> which are in handle's life cycle are prioritized than starting
>> initializing new handles. I think this doesn't cause any harm to agent's
>> way of working. Or do you (and Lennart) see any side effect of this
>> patch?
>>
>>>
>>> I have run suite 2 and got the same error. But that error comes when
>>> ntfd gets Unsubscribe() request after Finalize() request.In that
>>> situation this is an acceptable error.
>>>
>>> Messages from NTFD traces:
>>>
>>> 1)Client came up at:
>>>  Mar 10 14:53:01.402362 osafntfd [463:ntfs_mds.c:0078] T8
>>> NTFSV_INITIALIZE_REQ
>>> Mar 10 14:53:01.402493 osafntfd [463:ntfs_evt.c:0263] >>
>>> proc_initialize_msg: dest 2010f0331
>>> Mar 10 14:53:01.402515 osafntfd [463:NtfClient.cc:0043] T3
>>> NtfClient::NtfClient NtfClient 28 created mdest: 564113889559345
>>>
>>> 2)Its subscription request comes at :
>>> Mar 10 14:53:01.403921 osafntfd [463:ntfs_evt.c:0315] >>
>>> proc_subscribe_msg
>>> Mar 10 14:53:01.403929 osafntfd [463:ntfs_evt.c:0318] T4
>>> subscriptionId: 111
>>> Mar 10 14:53:01.403992 osafntfd [463:NtfSubscription.cc:0045] T2
>>> Subscription 111 created for client_id 28
>>>
>>> 3) Then came the down event for the same client at (2 times):
>>> Mar 10 14:53:01.405487 osafntfd [463:ntfs_evt.c:0101] >>
>>> proc_ntfa_updn_mds_msg
>>> Mar 10 14:53:01.405495 osafntfd [463:NtfAdmin.cc:0504] >>
>>> clientRemoveMDS: REMOVE mdsDest: 564113889559345
>>> Mar 10 14:53:01.405583 osafntfd [463:ntfs_evt.c:0101] >>
>>> proc_ntfa_updn_mds_msg
>>> Mar 10 14:53:01.405591 osafntfd [463:NtfAdmin.cc:0504] >>
>>> clientRemoveMDS: REMOVE mdsDest: 564113889559345
>>> Mar 10 14:53:01.405598 osafntfd [463:NtfAdmin.cc:0521] <<
>>> clientRemoveMDS
>>>
>>> 4)After this reported error got in syslog during Unsubscribe() and
>>> Finalize() requests as the clinet was removed already:
>>> Mar 10 14:53:01.405862 osafntfd [463:ntfs_mds.c:0105] T8
>>> NTFSV_FINALIZE_REQ
>>> Mar 10 14:53:01.405875 osafntfd [463:ntfs_evt.c:0338] >>
>>> proc_unsubscribe_msg: client_id 28, subscriptionId 111
>>> Mar 10 14:53:01.406044 osafntfd [463:NtfAdmin.cc:0553] ER
>>> NtfAdmin::subscriptionRemoved client 28 not found
>>> Mar 10 14:53:01.406061 osafntfd [463:ntfs_evt.c:0341] <<
>>> proc_unsubscribe_msg
>>> Mar 10 14:53:01.406079 osafntfd [463:ntfs_evt.c:0291] >>
>>> proc_finalize_msg: client_id 28
>>> Mar 10 14:53:01.406088 osafntfd [463:NtfAdmin.cc:0480] T2
>>> NtfAdmin::clientRemoved client 28 not fo

Re: [devel] [PATCH 1 of 1] amfnd: fix COMP-FO recovery when cleanup time is more than sufailoverprob[#1839]

2016-05-23 Thread minh chau
Ack from me with a *missing place* to be updated

Thanks,
Minh
On 23/05/16 16:52, praveen malviya wrote:
> Hi All,
>
> I would like to push this patch today.
> Please provide feedback.
>
> Thanks,
> Praveen
>
> On 20-May-16 2:07 PM, praveen malviya wrote:
>>
>>
>> On 20-May-16 12:38 PM, minh chau wrote:
>>> Hi Praveen,
>>>
>>> I have tested the patch, it fixes the reported issue.
>>> The problem seems to be a data race between the thread handling the
>>> clc fsm and the timer expiry thread, which resets all variables of a su
>>> while these variables are being used elsewhere in the clc fsm. This race
>>> condition is likely to cause other problems, but that is not in the scope
>>> of this ticket.
>>>
>> I think timer expiry resets only escalation-related variables.
>> If there are other variables being reset then please raise a new ticket.
>>
>>> I think the idea of the patch is to avoid using @su_err_esc_level in
>>> conditional statements in the clc fsm, since it is not reliable: it can
>>> be reset at any time by the timer expiry thread. I still see one place in
>>> avnd_comp_clc_inst_clean_hdler() where @su_err_esc_level is used in an
>>> *if* statement, can you check whether it's safe?
>> I missed this place. I will change that before pushing the patch.
>>
>> Thanks,
>> Praveen
>>
>>>
>>> Thanks,
>>> Minh
>>> On 19/05/16 20:35, praveen.malv...@oracle.com wrote:
>>>>   osaf/services/saf/amf/amfnd/clc.cc |   2 +-
>>>>   osaf/services/saf/amf/amfnd/err.cc  |  11 ++-
>>>>   osaf/services/saf/amf/amfnd/susm.cc |  10 +-
>>>>   3 files changed, 16 insertions(+), 7 deletions(-)
>>>>
>>>>
>>>> In the reported problem, a fault of a comp leads to comp-failover
>>>> escalation, but the failover did not happen.
>>>>
>>>> As a part of comp-failover, AMFND launched cleanup of the component.
>>>> Before the cleanup completes, the su-failover time expires and AMFND
>>>> resets the escalation parameters. In this way AMFND loses the
>>>> information that the context of the cleanup is comp-failover recovery.
>>>> When the comp gets cleaned up successfully, AMFND (not aware of the
>>>> context) does not inform AMFD for failover of assignments.
>>>>
>>>> The patch fixes the problem by remembering the comp or su failover
>>>> context using a separate flag and not relying on the escalation params.
>>>>
>>>> diff --git a/osaf/services/saf/amf/amfnd/clc.cc
>>>> b/osaf/services/saf/amf/amfnd/clc.cc
>>>> --- a/osaf/services/saf/amf/amfnd/clc.cc
>>>> +++ b/osaf/services/saf/amf/amfnd/clc.cc
>>>> @@ -2291,7 +2291,7 @@ uint32_t avnd_comp_clc_terming_cleansucc
>>>>   if (m_AVND_COMP_IS_FAILED(comp) && m_AVND_SU_IS_FAILED(su) &&
>>>>   m_AVND_SU_IS_PREINSTANTIABLE(su) && (su->sufailover ==
>>>> false) &&
>>>>   (avnd_cb->oper_state != SA_AMF_OPERATIONAL_DISABLED) &&
>>>> -(su->su_err_esc_level == AVND_ERR_ESC_LEVEL_2)) {
>>>> +(m_AVND_SU_IS_FAILOVER(su))) {
>>>>   /* yes, request director to orchestrate component 
>>>> failover */
>>>>   rc = avnd_di_oper_send(cb, su, SA_AMF_COMPONENT_FAILOVER);
>>>>   }
>>>> diff --git a/osaf/services/saf/amf/amfnd/err.cc
>>>> b/osaf/services/saf/amf/amfnd/err.cc
>>>> --- a/osaf/services/saf/amf/amfnd/err.cc
>>>> +++ b/osaf/services/saf/amf/amfnd/err.cc
>>>> @@ -759,6 +759,8 @@ uint32_t avnd_err_rcvr_comp_failover(AVN
>>>>   if (!m_AVND_SU_IS_FAILED(su)) {
>>>>   m_AVND_SU_FAILED_SET(su);
>>>>   }
>>>> +//Remember component-failover/su-failover context.
>>>> +m_AVND_SU_FAILOVER_SET(failed_comp->su);
>>>> /* update su oper state */
>>>>   m_AVND_SU_OPER_STATE_SET(su, SA_AMF_OPERATIONAL_DISABLED);
>>>> @@ -839,6 +841,9 @@ uint32_t avnd_err_rcvr_su_failover(AVND_
>>>>   reset_suRestart_flag(su);
>>>>   su->admin_op_Id = static_cast(0);
>>>>   }
>>>> +//Remember component-failover/su-failover context.
>>>> +m_AVND_SU_FAILOVER_SET(failed_comp->su);
>>>> +
>>>>   LOG_NO("Terminating components
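
The idea discussed in this thread (do not key the failover decision on an
escalation level that a timer may reset, remember the context in a separate
flag instead) can be illustrated with a minimal sketch. It assumes a heavily
simplified model: Su, start_comp_failover() and the other names below are
hypothetical stand-ins, not the real AVND_SU structure or amfnd macros.

```cpp
// Hypothetical, simplified model of an SU record; not the real AVND_SU.
enum EscLevel { ESC_LEVEL_0, ESC_LEVEL_1, ESC_LEVEL_2 };

struct Su {
  EscLevel esc_level = ESC_LEVEL_0;  // may be reset by the timer at any time
  bool failed = false;
  bool failover_ctx = false;         // separate flag, not touched by the timer
};

// Error recovery decides on component failover and records the context.
void start_comp_failover(Su& su) {
  su.failed = true;
  su.esc_level = ESC_LEVEL_2;
  su.failover_ctx = true;
}

// Probation timer expiry: resets the escalation parameters only.
void on_probation_timer_expiry(Su& su) { su.esc_level = ESC_LEVEL_0; }

// Check keyed on the escalation level (unreliable after a timer reset).
bool should_request_failover_by_level(const Su& su) {
  return su.failed && su.esc_level == ESC_LEVEL_2;
}

// Check keyed on the remembered failover context.
bool should_request_failover_by_flag(const Su& su) {
  return su.failed && su.failover_ctx;
}
```

If the timer fires between the start of the cleanup and its completion, only
the flag-based check still reports that a failover has to be requested.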

Re: [devel] [PATCH 1 of 1] amfnd: ignore hc expiry in unhealthy state [#1858]

2016-06-01 Thread minh chau
Hi Nagu,

Ack from me, not tested because I could not reproduce.

Thanks,
Minh

On 01/06/16 22:32, nagendr...@oracle.com wrote:
>   osaf/services/saf/amf/amfnd/chc.cc |  10 ++
>   1 files changed, 10 insertions(+), 0 deletions(-)
>
>
> If the component is not in instantiated state, then hc expiry
> should be ignored.
>
> diff --git a/osaf/services/saf/amf/amfnd/chc.cc 
> b/osaf/services/saf/amf/amfnd/chc.cc
> --- a/osaf/services/saf/amf/amfnd/chc.cc
> +++ b/osaf/services/saf/amf/amfnd/chc.cc
> @@ -926,6 +926,16 @@ uint32_t avnd_comp_hc_rec_tmr_exp(AVND_C
>   
>   TRACE_ENTER2("%s - %s, sts: %u", comp->name.value, rec->key.key, 
> rec->status);
>   
> + /* There is a chance that the term command has been issued to comp and
> +the timer has expired and it is in mail box.
> +So, if the component is not in healthy state, then don't start HC. */
> +
> + if (!m_AVND_COMP_PRES_STATE_IS_INSTANTIATED(comp)) {
> + TRACE_1("'%s' not instantiated, not starting HC", 
> comp->name.value);
> + rec->status = AVND_COMP_HC_STATUS_STABLE;
> + return rc;
> + }
> +
>   if (m_AVND_COMP_HC_REC_IS_AMF_INITIATED(rec)) {
>   if (rec->status == AVND_COMP_HC_STATUS_STABLE)
>   if (comp->is_hc_cmd_configured &&
>
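
A minimal sketch of the guard added by this patch, under a simplified model:
a healthcheck timer expiry that was already queued is ignored if the
component is no longer instantiated. The types and the function name below
are hypothetical and only mirror the shape of the check, not the real amfnd
structures.

```cpp
// Hypothetical, simplified component and healthcheck record types.
enum class PresState { Instantiated, Terminating, Uninstantiated };
enum class HcStatus { Stable, WaitForResp };

struct Comp { PresState pres; };
struct HcRec { HcStatus status; int checks_started; };

// Handles a healthcheck timer expiry. Returns true if a new healthcheck
// was started, false if the expiry was ignored.
bool on_hc_timer_expiry(const Comp& comp, HcRec& rec) {
  if (comp.pres != PresState::Instantiated) {
    // The expiry may have been sitting in the mailbox while the component
    // was being terminated: mark the record stable and do nothing else.
    rec.status = HcStatus::Stable;
    return false;
  }
  rec.status = HcStatus::WaitForResp;
  ++rec.checks_started;
  return true;
}
```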


___
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel


Re: [devel] [PATCH 1 of 1] ntfa: Lower intialize req message [#1818] V2

2016-06-05 Thread minh chau

Hi Lennart,

I'm not sure what the "fixme?" comment was intended to fix, and it does not 
give any information about the bug to be fixed, so I have removed it.

Please help to push the attached patch.

Thanks,
Minh

On 03/06/16 23:53, Lennart Lund wrote:

Ack
Review only.

Comment: Is the "/* fixme? */" comment relevant? If not please remove it

Thanks
Lennart


-Original Message-
From: Minh Hon Chau [mailto:minh.c...@dektech.com.au]
Sent: den 30 maj 2016 05:43
To: Lennart Lund ;
praveen.malv...@oracle.com; Minh Hon Chau

Cc: opensaf-devel@lists.sourceforge.net
Subject: [PATCH 1 of 1] ntfa: Lower intialize req message [#1818] V2

  osaf/libs/agents/saf/ntfa/ntfa_mds.c |  11 ++-
  1 files changed, 10 insertions(+), 1 deletions(-)


When running life cycle APIs from multiple handles in multiple threads, ntfd
processes the previous NCSMDS_DOWN event from the last finalize after it
processes the following initialize. Because NCSMDS_DOWN is processed late,
this unexpectedly deletes all clients that are running.

The problem is seen because sometimes (1) an NCSMDS_DOWN from the last
finalize arrives after the next initialize req message at the mds callback.
Also, (2) there is another problem in ntfd, which sends
NTFSV_NTFS_EVT_NTFA_DOWN with lower priority than NTFSV_NTFS_NTFSV_MSG.
These different priorities also cause ntfd to process NCSMDS_DOWN after the
next initialize, even when NCSMDS_DOWN arrives before the initialize req
message at the mds callback.

At this stage, for problem (1), it is not certain whether this is an mds
issue, since all APIs have been sent with high priority. This patch lowers
the send priority of the initialize request msg, which gives all messages
following the last finalize response message a chance to reach ntfd. For
problem (2), given that NCSMDS_DOWN and the initialize req message arrive at
ntfd in the correct order at the mds callback, those events will now be sent
to ntfd's mailbox with the same priority (MDS_SEND_PRIORITY_MEDIUM =
NCS_IPC_PRIORITY_NORMAL). The unexpected client deletion described above
should no longer be seen. After this patch, if the problem is seen again, it
most likely comes from mds, which does not ensure that NCSMDS_DOWN and the
initialize req are sent from the Agent and received at NTFD in the right
order.

diff --git a/osaf/libs/agents/saf/ntfa/ntfa_mds.c
b/osaf/libs/agents/saf/ntfa/ntfa_mds.c
--- a/osaf/libs/agents/saf/ntfa/ntfa_mds.c
+++ b/osaf/libs/agents/saf/ntfa/ntfa_mds.c
@@ -1177,7 +1177,16 @@ uint32_t ntfa_mds_msg_sync_send(ntfa_cb_
mds_info.info.svc_send.i_msg = (NCSCONTEXT)i_msg;
mds_info.info.svc_send.i_to_svc = NCSMDS_SVC_ID_NTFS;
mds_info.info.svc_send.i_sendtype = MDS_SENDTYPE_SNDRSP;
-   mds_info.info.svc_send.i_priority = MDS_SEND_PRIORITY_HIGH;
/* fixme? */
+
+   /* Lower priority of initialize_req msg so that the other existing
+* life cycle msg can be completed, for multiple handles usage.
+*/
+   if (i_msg->info.api_info.type == NTFSV_INITIALIZE_REQ) {
+   mds_info.info.svc_send.i_priority =
MDS_SEND_PRIORITY_MEDIUM;
+   } else {
+   mds_info.info.svc_send.i_priority =
MDS_SEND_PRIORITY_HIGH;/* fixme? */
+   }
+
/* fill the sub send rsp strcuture */
mds_info.info.svc_send.info.sndrsp.i_time_to_wait = timeout;
/* timeto wait in 10ms FIX!!! */
mds_info.info.svc_send.info.sndrsp.i_to_dest = cb->ntfs_mds_dest;
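
The ordering effect described in the patch description can be sketched with
a toy priority mailbox. This is only a simplified model under the assumption
that the mailbox always serves higher-priority queues first; Mailbox,
Priority and the message strings are hypothetical and are not the real NCS
IPC or MDS interfaces.

```cpp
#include <deque>
#include <map>
#include <string>
#include <vector>

// Hypothetical priorities: a lower value is served first.
enum Priority { HIGH = 0, NORMAL = 1, LOW = 2 };

// Toy mailbox: one FIFO per priority, drained strictly by priority.
class Mailbox {
 public:
  void post(Priority p, const std::string& msg) { queues_[p].push_back(msg); }

  // Returns all pending messages in the order they would be processed.
  std::vector<std::string> drain() {
    std::vector<std::string> out;
    for (auto& kv : queues_) {  // std::map iterates in key order: HIGH first
      for (const auto& m : kv.second) out.push_back(m);
      kv.second.clear();
    }
    return out;
  }

 private:
  std::map<Priority, std::deque<std::string>> queues_;
};
```

In this model, a DOWN event posted with NORMAL priority is overtaken by an
initialize request posted later with HIGH priority, while posting both with
the same priority preserves the arrival order.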


ntfa: Lower intialize req message [#1818] V3

When running life cycle APIs from multiple handles in multiple threads, ntfd
processes the previous NCSMDS_DOWN event from the last finalize after it
processes the following initialize. Because NCSMDS_DOWN is processed late,
this unexpectedly deletes all clients that are running.

The problem is seen because sometimes (1) an NCSMDS_DOWN from the last
finalize arrives after the next initialize req message at the mds callback.
Also, (2) there is another problem in ntfd, which sends
NTFSV_NTFS_EVT_NTFA_DOWN with lower priority than NTFSV_NTFS_NTFSV_MSG.
These different priorities also cause ntfd to process NCSMDS_DOWN after the
next initialize, even when NCSMDS_DOWN arrives before the initialize req
message at the mds callback.

At this stage, for problem (1), it is not certain whether this is an mds
issue, since all APIs have been sent with high priority. This patch lowers
the send priority of the initialize request msg, which gives all messages
following the last finalize response message a chance to reach ntfd. For
problem (2), given that NCSMDS_DOWN and the initialize req message arrive at
ntfd in the correct order at the mds callback, those events will now be sent
to ntfd's mailbox with the same priority (MDS_SEND_PRIORITY_MEDIUM =
NCS_IPC_PRIORITY_NORMAL). The unexpected client deletion described above
should no longer be seen. After this patch, if the problem is seen again, it
most likely comes from mds, which does not ensure that NCSMDS_DOWN and the
initialize req are sent from the Agent and received at NTFD in the right
order.


diff --git a/osaf/libs/agents/sa

Re: [devel] [PATCH 1 of 1] AMFD: Update RTA saAmfSUHostedByNode after headless [#1720] V2

2016-06-08 Thread minh chau
Hi Praveen,

saAmfSUHostedByNode should be cold synced after headless. 
avd_process_state_info_queue() is called before the active amfd creates the 
Opensaf-2N assignment for the SCs. The standby assignment for Opensaf-2N on 
the standby controller includes the RDE csi, which then sets the node role to 
standby; the main thread in the standby amfd then wakes up and requests cold 
sync later.

Thanks,
Minh

On 08/06/16 16:18, praveen malviya wrote:
> Hi Minh,
>
> Please find one query below.
>
> Thanks,
> Praveen
>
> On 07-Jun-16 6:17 AM, Minh Hon Chau wrote:
>>  osaf/services/saf/amf/amfd/siass.cc |  4 +++-
>>  1 files changed, 3 insertions(+), 1 deletions(-)
>>
>>
>> After being headless, the RTA saAmfSUHostedByNode of the SU has not been 
>> updated. That causes messages for the SU to be sent to the wrong node.
>>
>> While performing recovery from headless, the saAmfSUHostedByNode of the 
>> SU can be updated from the node that the susi_state_info msg comes from.
>>
>> diff --git a/osaf/services/saf/amf/amfd/siass.cc 
>> b/osaf/services/saf/amf/amfd/siass.cc
>> --- a/osaf/services/saf/amf/amfd/siass.cc
>> +++ b/osaf/services/saf/amf/amfd/siass.cc
>> @@ -888,7 +888,9 @@ SaAisErrorT avd_susi_recreate(AVSV_N2D_N
>>  susi->su->inc_curr_act_si();
>>  susi->si->inc_curr_act_ass();
>>  }
>> -
>> +su->saAmfSUHostedByNode = node->name;
>> +avd_saImmOiRtObjectUpdate(&su->name, 
>> "saAmfSUHostedByNode",
>> +SA_IMM_ATTR_SANAMET, &su->saAmfSUHostedByNode);
> Is it not required to checkpoint SU_CONFIG state here as active AMFD 
> is updating SU states with the information from Node directors?  Only 
> possibility of standby getting updated is cold sync. Will the cold 
> sync always happen after all the node directors have sent the 
> sisu_recreate messages?
>
>
>> m_AVSV_SEND_CKPT_UPDT_ASYNC_ADD(avd_cb, susi, AVSV_CKPT_AVD_SI_ASS);
>>  }
>>
>>
>




Re: [devel] [PATCH 1 of 1] amfd: avoid resetting alarm for duplicate node ups [#1893]

2016-06-26 Thread minh chau
Hi Nagu,

Patch looks good.
Can we think of an alternative where the call to 
avd_process_state_info_queue() is conditioned on node_state being 
AVD_AVND_STATE_ABSENT, so that we don't have to introduce a new static var?

Thanks,
Minh

On 23/06/16 20:45, nagendr...@oracle.com wrote:
>   osaf/services/saf/amf/amfd/ndfsm.cc |  12 +++-
>   1 files changed, 11 insertions(+), 1 deletions(-)
>
>
> When amfd receives duplicate node up messages from the active
> amfnd, it tries to reset alarm_sent for each SI.
> This happens when the cluster is recovering from headless state.
> If that happens, then when those SIs get assigned,
> the alarms are not reset.
> This patch fixes the issue. It avoids resetting alarm_sent
> when duplicate node ups are received.
>
> diff --git a/osaf/services/saf/amf/amfd/ndfsm.cc 
> b/osaf/services/saf/amf/amfd/ndfsm.cc
> --- a/osaf/services/saf/amf/amfd/ndfsm.cc
> +++ b/osaf/services/saf/amf/amfd/ndfsm.cc
> @@ -51,11 +51,13 @@ void avd_process_state_info_queue(AVD_CL
>   uint32_t i;
>   const auto queue_size = cb->evt_queue.size();
>   AVD_EVT_QUEUE *queue_evt = nullptr;
> + /* Counter for Act Amfnd node up message.*/
> + static int act_amfnd_node_up_count = 0;
>   
>   TRACE_ENTER();
>   
>   TRACE("queue_size before processing: %lu", (unsigned long) queue_size);
> -
> + act_amfnd_node_up_count ++;
>   // recover assignments from state info
>   for(i=0 ; i<queue_size ; i++) {
>   queue_evt = cb->evt_queue.front();
> @@ -91,6 +93,13 @@ void avd_process_state_info_queue(AVD_CL
>   }
>   }
>   
> + /* Alarms shouldn't be reset in next subsequent node up message.
> +Because in the previous node up messages queue_size might have
> +been zero. In the subsequent node up messages, this might cause
> +alarm_sent to get reset and this may cause unassigned alarm to
> +exist even those SIs are assigned after some time.*/
> + if (act_amfnd_node_up_count > 1) goto done;
> +
>   // Once active amfd looks up the state info from queue, that means node 
> sync
>   // finishes. Therefore, if the queue is empty, this active amfd is 
> coming
>   // from a cluster restart, the alarm state should be reset.
> @@ -115,6 +124,7 @@ void avd_process_state_info_queue(AVD_CL
>   }
>   }
>   }
> +done:
>   TRACE("queue_size after processing: %lu", (unsigned long) 
> cb->evt_queue.size());
>   TRACE_LEAVE();
>   }
>
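
The approach taken by this patch (count the calls and skip the alarm reset
on every call after the first) can be sketched as follows. This is a
simplified illustration, not the amfd code: Si, StateInfoProcessor and the
queue_empty parameter are hypothetical, and a member counter stands in for
the function-local static variable used in the patch.

```cpp
#include <vector>

// Hypothetical, simplified SI record.
struct Si { bool alarm_sent = false; };

class StateInfoProcessor {
 public:
  // Called once per node-up message from the active node director.
  void process(std::vector<Si>& sis, bool queue_empty) {
    ++node_up_count_;
    // Duplicate node up: leave the alarm state alone.
    if (node_up_count_ > 1) return;
    // First node up with an empty queue is treated as a cluster restart,
    // so the alarm state is reset.
    if (queue_empty) {
      for (auto& si : sis) si.alarm_sent = false;
    }
  }

 private:
  int node_up_count_ = 0;  // stands in for the static counter in the patch
};
```

The alternative raised in the review, checking the node state instead of
counting calls, would replace the counter with a condition on data that amfd
already tracks.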




Re: [devel] [PATCH 1 of 1] amfd: avoid resetting alarm for duplicate node ups [#1893]

2016-06-27 Thread minh . chau
Hi Nagu,

So no comment and ack from me.

Thanks
Minh

> Hi Minh,
>
> Thanks for your review time.
>
> I had tried something like you suggested, my observation was that I was
> getting into some other problem(don't remember exactly), so I thought this
> could be safest.
>
> Thanks
> -Nagu
>> -Original Message-
>> From: minh chau [mailto:minh.c...@dektech.com.au]
>> Sent: 27 June 2016 07:42
>> To: Nagendra Kumar; hans.nordeb...@ericsson.com; Praveen Malviya;
>> gary@dektech.com.au
>> Cc: opensaf-devel@lists.sourceforge.net
>> Subject: Re: [PATCH 1 of 1] amfd: avoid resetting alarm for duplicate
>> node
>> ups [#1893]
>>
>> Hi Nagu,
>>
>> Patch looks good.
>> Can we think an alternative that condition of calling
>> avd_process_state_info_queue() by checking node_state as
>> AVD_AVND_STATE_ABSENT, so that we don't have to introduce new static
>> var?
>>
>> Thanks,
>> Minh
>>
>> On 23/06/16 20:45, nagendr...@oracle.com wrote:
>> >   osaf/services/saf/amf/amfd/ndfsm.cc |  12 +++-
>> >   1 files changed, 11 insertions(+), 1 deletions(-)
>> >
>> >
>> > When Amfd receives duplicate node up messages from Act amfnd, then it
>> > tries to reset alarm_sent for SI.
>> > This happens when cluster is recovering from headless state.
>> > And if that happens then when those SIs gets assigned, then alarms are
>> > not reset.
>> > This patch fixes this issue. It avoids resetting alarm_sent when
>> > duplicate node ups are received.
>> >
>> > diff --git a/osaf/services/saf/amf/amfd/ndfsm.cc
>> > b/osaf/services/saf/amf/amfd/ndfsm.cc
>> > --- a/osaf/services/saf/amf/amfd/ndfsm.cc
>> > +++ b/osaf/services/saf/amf/amfd/ndfsm.cc
>> > @@ -51,11 +51,13 @@ void avd_process_state_info_queue(AVD_CL
>> >uint32_t i;
>> >const auto queue_size = cb->evt_queue.size();
>> >AVD_EVT_QUEUE *queue_evt = nullptr;
>> > +  /* Counter for Act Amfnd node up message.*/
>> > +  static int act_amfnd_node_up_count = 0;
>> >
>> >TRACE_ENTER();
>> >
>> >TRACE("queue_size before processing: %lu", (unsigned long)
>> > queue_size);
>> > -
>> > +  act_amfnd_node_up_count ++;
>> >// recover assignments from state info
>> >for(i=0 ; i<queue_size ; i++) {
>> >queue_evt = cb->evt_queue.front();
>> > @@ -91,6 +93,13 @@ void avd_process_state_info_queue(AVD_CL
>> >}
>> >}
>> >
>> > +  /* Alarms shouldn't be reset in next subsequent node up message.
>> > + Because in the previous node up messages queue_size might have
>> > + been zero. In the subsequent node up messages, this might cause
>> > + alarm_sent to get reset and this may cause unassigned alarm to
>> > + exist even those SIs are assigned after some time.*/
>> > +  if (act_amfnd_node_up_count > 1) goto done;
>> > +
>> >// Once active amfd looks up the state info from queue, that means
>> node sync
>> >// finishes. Therefore, if the queue is empty, this active amfd is
>> coming
>> >// from a cluster restart, the alarm state should be reset.
>> > @@ -115,6 +124,7 @@ void avd_process_state_info_queue(AVD_CL
>> >}
>> >}
>> >}
>> > +done:
>> >TRACE("queue_size after processing: %lu", (unsigned long) cb->evt_queue.size());
>> >TRACE_LEAVE();
>> >   }
>> >
>>
>





Re: [devel] [PATCH 1 of 1] amfd: allow lock and unlock operation on NoRed MW SI. [#1834]

2016-07-07 Thread minh chau
Hi Praveen,

NoRed MW allows locking of the SU only, not of the SI, which to me means 
that if NoRed MW SUs are unlocked they must provide their services (SMFND, 
IMMND, CPND, CLMNA).
I'm not sure whether this is a must-have requirement or not, since this 
behavior has existed for a long time.

Thanks,
Minh

On 05/07/16 15:20, praveen malviya wrote:
> Hi,
>
> Please provide your feedback on this patch.
> I would like to push this patch by tomorrow.
>
> Thanks,
> Praveen
>
> On 24-Jun-16 11:06 AM, praveen.malv...@oracle.com wrote:
>>  osaf/services/saf/amf/amfd/si.cc |  20 +++-
>>  1 files changed, 19 insertions(+), 1 deletions(-)
>>
>>
>> In the reported issue, amfd crashes during deletion of MW NoRed SI while
>> standby SC is coming up.
>>
>> Here the requirement is to bring down a payload node and delete its
>> related configuration like node, MW SI, MW SU etc. As per AMF PR doc
>> section 7.1.4, an SI must be locked before deleting it. AMF also allows
>> deletion of an SI in unlocked state if it is unassigned, but that is not
>> the recommended way. Since the lock operation on a NoRed MW SI is not
>> allowed, the only way to delete it is when it is unassigned. This imposes
>> another requirement: bring down the node or lock the NoRed SU so that its
>> MW SI gets unassigned and its deletion can proceed. Even in this case, AMF
>> can pick the same SI and assign it to some other unassigned node or to any
>> node joining the cluster at that time. Thus there is no guarantee that the
>> SI will remain unassigned.
>>
>> The patch allows lock and unlock admin ops on MW NoRed SIs, except the one
>> assigned on the active SC.
>>
>> diff --git a/osaf/services/saf/amf/amfd/si.cc 
>> b/osaf/services/saf/amf/amfd/si.cc
>> --- a/osaf/services/saf/amf/amfd/si.cc
>> +++ b/osaf/services/saf/amf/amfd/si.cc
>> @@ -800,11 +800,29 @@ static void si_admin_op_cb(SaImmOiHandle
>>
>>  si = avd_si_get(objectName);
>>
>> -if ((operationId != SA_AMF_ADMIN_SI_SWAP) && 
>> (si->sg_of_si->sg_ncs_spec == true)) {
>> +if ((operationId != SA_AMF_ADMIN_SI_SWAP) && (operationId != 
>> SA_AMF_ADMIN_LOCK) &&
>> +(operationId != SA_AMF_ADMIN_UNLOCK) && 
>> (si->sg_of_si->sg_ncs_spec == true)) {
>>  report_admin_op_error(immOiHandle, invocation, 
>> SA_AIS_ERR_NOT_SUPPORTED, nullptr,
>>  "Admin operation %llu on MW SI is not allowed", 
>> operationId);
>>  goto done;
>>  }
>> +if (((operationId == SA_AMF_ADMIN_LOCK) || (operationId == 
>> SA_AMF_ADMIN_UNLOCK)) &&
>> +(si->sg_of_si->sg_ncs_spec == true)) {
>> +if (si->sg_of_si->sg_redundancy_model == 
>> SA_AMF_2N_REDUNDANCY_MODEL) {
>> +report_admin_op_error(immOiHandle, invocation,
>> +SA_AIS_ERR_NOT_SUPPORTED, nullptr,
>> +"Admin operation %llu on MW 2N SI is not 
>> allowed", operationId);
>> +goto done;
>> +} else if ((si->sg_of_si->sg_redundancy_model == 
>> SA_AMF_NO_REDUNDANCY_MODEL) &&
>> +(si->list_of_sisu != nullptr) && (operationId == 
>> SA_AMF_ADMIN_LOCK) &&
>> +(avd_cb->node_id_avd == 
>> si->list_of_sisu->su->su_on_node->node_info.nodeId)) {
>> +//No specific reason, but conforming to existing notions 
>> for active SC.
>> +report_admin_op_error(immOiHandle, invocation,
>> +SA_AIS_ERR_NOT_SUPPORTED, nullptr,
>> +"Admin lock of MW SI assigned on Active SC is not 
>> allowed");
>> +goto done;
>> +}
>> +}
>>  /* if Tolerance timer is running for any SI's withing this SG, 
>> then return SA_AIS_ERR_TRY_AGAIN */
>>  if (sg_is_tolerance_timer_running_for_any_si(si->sg_of_si)) {
>>  report_admin_op_error(immOiHandle, invocation, 
>> SA_AIS_ERR_TRY_AGAIN, nullptr,
>>
>>
>
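
The checks that the quoted diff adds to si_admin_op_cb() can be restated as
a small decision function. This is a sketch under simplifying assumptions:
the enums, SiInfo and check_mw_si_admin_op() are hypothetical names, and
PASSES_CHECKS only means that these particular checks do not reject the
operation (the real callback performs further validation afterwards).

```cpp
// Hypothetical, simplified types for the sketch.
enum AdminOp { OP_LOCK, OP_UNLOCK, OP_SI_SWAP, OP_OTHER };
enum RedModel { MODEL_2N, MODEL_NO_RED };
enum Verdict { PASSES_CHECKS, NOT_SUPPORTED };

struct SiInfo {
  bool is_mw;                  // SI belongs to a middleware SG
  RedModel model;              // redundancy model of the SG
  bool assigned_on_active_sc;  // SI currently assigned on the active SC
};

Verdict check_mw_si_admin_op(const SiInfo& si, AdminOp op) {
  if (!si.is_mw) return PASSES_CHECKS;
  const bool lock_or_unlock = (op == OP_LOCK || op == OP_UNLOCK);
  // Only SI-SWAP, LOCK and UNLOCK get past the first check on a MW SI.
  if (op != OP_SI_SWAP && !lock_or_unlock) return NOT_SUPPORTED;
  if (lock_or_unlock) {
    // LOCK and UNLOCK stay forbidden on 2N middleware SIs.
    if (si.model == MODEL_2N) return NOT_SUPPORTED;
    // LOCK is rejected for a NoRed MW SI assigned on the active SC.
    if (si.model == MODEL_NO_RED && op == OP_LOCK && si.assigned_on_active_sc)
      return NOT_SUPPORTED;
  }
  return PASSES_CHECKS;
}
```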




Re: [devel] [PATCH 1 of 1] amfd: allow lock and unlock operation on NoRed MW SI. [#1834]

2016-07-08 Thread minh chau


On 08/07/16 16:05, praveen malviya wrote:
>
>
> On 07-Jul-16 6:31 PM, minh chau wrote:
>> Hi Praveen,
>>
>> NoRed MW allows locking SU only, and not allow locking SI, which means
>> to me if NoRed MW SUs are unlocked it must provide its services (SMFND,
>> IMMND, CPND, CLMNA)
>> I'm not sure whether this is a must-have requirement or not, since this
>> behavior has existed for long.
>
> Lock on NoRed MW SU is supported for upgrade purpose.
> AMF allows deletion of SI if it is locked (recommended way) and when 
> it is unassigned(non-recommended way).
> So this patch supports lock on NoRed MW SI for deletion.
> Lock on MW NoRed SU may not be helpful because AMFD can still assign 
> that SI to some other node.
>
I'm OK with this approach, but my point is whether there was any historical 
reason for this restriction (not allowing lock of a NoRed MW SI), since 
this behavior has been there for a long time.
I did 'hg annotate' but could not find anything linked to this restriction. 
So maybe other AMF maintainers can also help to check this patch; it 
looks more like an enhancement than a defect.

Thanks,
Minh
> Thanks,
> Praveen
>>
>> Thanks,
>> Minh
>>
>> On 05/07/16 15:20, praveen malviya wrote:
>>> Hi,
>>>
>>> Please provide your feedback on this patch.
>>> I would like to push this patch by tomorrow.
>>>
>>> Thanks,
>>> Praveen
>>>
>>> On 24-Jun-16 11:06 AM, praveen.malv...@oracle.com wrote:
>>>>  osaf/services/saf/amf/amfd/si.cc | 20 +++-
>>>>  1 files changed, 19 insertions(+), 1 deletions(-)
>>>>
>>>>
>>>> In the reported issue, amfd crashes during deletion of MW NoRed SI 
>>>> while
>>>> standby SC is coming up.
>>>>
>>>> Here requirement is to bring down a payload node and delete its
>>>> related configuration
>>>> like node, MW SI, MW SU etc. As per AMF PR doc section 7.1.4 , SI
>>>> must be locked before
>>>> deleting it. Also AMF allows deletion of SI in unlocked state if it
>>>> is unassigned, but
>>>> it is not the recommended way. Since lock operation on NoRed MW SI is
>>>> not allowed,
>>>> the only way to delete is when it is unassigned. This imposes another
>>>> criterion to bring
>>>> down the node or lock NoRed SU so that its MW SI gets unassigned and
>>>> its deletion can proceed.
>>>> Even in this case also, AMF can pick the same SI and assign it to
>>>> some other unassigned node
>>>> or any node joining the cluster that time. Thus there is no guarantee
>>>> that SI will remain
>>>> unassigned.
>>>>
>>>> Patch allows lock and unlock admin op on MW NoRed SI except the one
>>>> assigned on
>>>> active SC.
>>>>
>>>> diff --git a/osaf/services/saf/amf/amfd/si.cc
>>>> b/osaf/services/saf/amf/amfd/si.cc
>>>> --- a/osaf/services/saf/amf/amfd/si.cc
>>>> +++ b/osaf/services/saf/amf/amfd/si.cc
>>>> @@ -800,11 +800,29 @@ static void si_admin_op_cb(SaImmOiHandle
>>>>
>>>>  si = avd_si_get(objectName);
>>>>
>>>> -if ((operationId != SA_AMF_ADMIN_SI_SWAP) &&
>>>> (si->sg_of_si->sg_ncs_spec == true)) {
>>>> +if ((operationId != SA_AMF_ADMIN_SI_SWAP) && (operationId !=
>>>> SA_AMF_ADMIN_LOCK) &&
>>>> +(operationId != SA_AMF_ADMIN_UNLOCK) &&
>>>> (si->sg_of_si->sg_ncs_spec == true)) {
>>>>  report_admin_op_error(immOiHandle, invocation,
>>>> SA_AIS_ERR_NOT_SUPPORTED, nullptr,
>>>>  "Admin operation %llu on MW SI is not allowed",
>>>> operationId);
>>>>  goto done;
>>>>  }
>>>> +if (((operationId == SA_AMF_ADMIN_LOCK) || (operationId ==
>>>> SA_AMF_ADMIN_UNLOCK)) &&
>>>> +(si->sg_of_si->sg_ncs_spec == true)) {
>>>> +if (si->sg_of_si->sg_redundancy_model ==
>>>> SA_AMF_2N_REDUNDANCY_MODEL) {
>>>> +report_admin_op_error(immOiHandle, invocation,
>>>> +SA_AIS_ERR_NOT_SUPPORTED, nullptr,
>>>> +"Admin operation %llu on MW 2N SI is not
>>>> allowed", operationId);
>>>> +goto done;
>>>> +} else if ((si->sg_of_si->sg_redundancy_model 

Re: [devel] [PATCH 1 of 1] AMFD: Initialize CLM, NTF handle in thread [#1828]

2016-07-22 Thread minh chau
Hi,

I think Praveen's comment on #1812 was about amfd hanging when it 
initializes with CLM. This patch does not change the position of the CLM 
initialization, and it is also done in a thread, so will it be OK?
Regarding Anders' comment: I did a quick test. I locked clm on the standby 
controller and rebooted it; when it came up it initialized with CLM 
successfully, so it seems we won't get ERR_UNAVAILABLE on a configured 
non-member node.
I thought that the handling of ERR_UNAVAILABLE should be removed from CLM 
init in the scope of this ticket, but would it be useful in the case where 
amfd re-initializes CLM upon receiving BAD_HANDLE?

Thanks,
Minh
On 20/07/16 22:06, Anders Widell wrote:
> Regarding ticket [#1781], I think that one requires some more thought. 
> First of all, do we want to assign the STANDBY role to the OpenSAF 
> directors running on a CLM locked node? If we do want a CLM locked 
> node to become standby,  then CLM ought to provide service to 
> middleware clients running on a locked node! It must differentiate 
> between middleware- and non-middleware clients.
>
> By the way, saClmInitialize_4() and saClmSelectionObjectGet() should 
> not return ERR_UNAVAILABLE on configured non-member nodes - they shall 
> only return ERR_UNAVAILABLE on unconfigured nodes.
>
> regards,
> Anders Widell
>
> On 07/20/2016 12:55 PM, praveen malviya wrote:
>> Hi Minh,
>>
>> For the ticket #1812 I had given one comment.
>> It was:
>> "In the fix of ticket #1781, it has been suggested that spare controllers
>> init with CLM before becoming AMF role aware. The same problem will come
>> here also. If a spare is running on a CLM locked node then it will never
>> come out of avd_clm_init(), as ERR_UNAVAILABLE is handled there for
>> reinit. Also, the admin will not be able to unlock it until one of the
>> controllers joins. In this particular case AMFD must exit instead of
>> indefinitely calling the CLM init API.
>> Also, this fix will cause re-init with CLM in the non-headless case as
>> well. I think in the non-headless case it will be good to initialize with
>> CLM in a separate thread"
>>
>> I think it is valid here also or is it handled.
>> Thanks,
>> Praveen
>> On 11-Jul-16 1:01 PM, Minh Hon Chau wrote:
>>> osaf/services/saf/amf/amfd/clm.cc|  103 
>>> ++
>>>   osaf/services/saf/amf/amfd/include/cb.h  |7 +-
>>>   osaf/services/saf/amf/amfd/include/clm.h |6 +-
>>>   osaf/services/saf/amf/amfd/include/ntf.h |3 +
>>>   osaf/services/saf/amf/amfd/main.cc   |   20 -
>>>   osaf/services/saf/amf/amfd/ntf.cc|   85 
>>> +
>>>   osaf/services/saf/amf/amfd/role.cc   |   15 +---
>>>   7 files changed, 204 insertions(+), 35 deletions(-)
>>>
>>>
>>> In the new controller reallocation scenario with the roaming sc feature,
>>> if immnd dies on the node becoming active, circular dependencies among
>>> Opensaf services appear, which eventually leads to a reboot.
>>>
>>> The dependencies are:
>>> .clmd can not use IMM services since immnd dies
>>> .immnd needs to be restarted by amfnd
>>> .amfnd is hanging since amfnd is calling CLM services
>>> .amfd is also hanging since amfd is calling CLM and NTF services
>>> .ntfd is hanging due to logd's dependencies on IMM
>>>
>>> The problem could be solved if:
>>> . amfd initializes NTF, CLM handle in thread in initialization phase
>>> . amfnd initializes CLM in thread if amfnd receives clm bad handle
>>>
>>> amfnd already initializes CLM in a thread upon receiving clm bad
>>> handle. This patch initializes CLM and NTF in a thread on the amfd side.
>>> Also, the threaded initialization in this patch can be refactored later
>>> by utilizing the support of #1609
>>>
>>> diff --git a/osaf/services/saf/amf/amfd/clm.cc 
>>> b/osaf/services/saf/amf/amfd/clm.cc
>>> --- a/osaf/services/saf/amf/amfd/clm.cc
>>> +++ b/osaf/services/saf/amf/amfd/clm.cc
>>> @@ -386,14 +386,26 @@ static const SaClmCallbacksT_4 clm_callb
>>>   /*.saClmClusterTrackCallback =*/ clm_track_cb
>>>   };
>>>
>>> -SaAisErrorT avd_clm_init(void)
>>> +SaAisErrorT avd_clm_init(AVD_CL_CB* cb)
>>>   {
>>> -SaAisErrorT error = SA_AIS_OK;
>>> +SaAisErrorT error = SA_AIS_OK;
>>> +SaClmHandleT clm_handle = 0;
>>> +SaSelectionObjectT sel_obj = 0;
>>>
>>> +cb->clmHandle = 0;
>>> +cb->clm_sel_obj = 0;
>>>   TRACE_ENTER();
>>> +/*
>>> + * TODO: This CLM initialization thread can be re-factored
>>> + * after having osaf dedicated thread, so that all APIs calls
>>> + * to external service can be automatically retried with result
>>> + * code (TRY_AGAIN, TIMEOUT, UNAVAILABLE), or reinitialized within
>>> + * BAD_HANDLE. Also, duplicated codes in initialization thread
>>> + * will be moved to osaf dedicated thread
>>> + */
>>>   for (;;) {
>>>   SaVersionT Version = { 'B', 4, 1 };
>>> -error = saClmInitialize_4(&avd_cb->clmHandle, 
>>> &clm_callbacks, &Version);
>>> +error = saClmInitialize_4(&clm_han

Re: [devel] [PATCH 1 of 1] AMFD: Initialize CLM, NTF handle in thread [#1828]

2016-07-24 Thread minh chau
Hi,

I have tried to reproduce the problem in #1781: when amfd on a non-member 
node gets assigned a role, amfd initializes CLM successfully. Only sfmd and 
cpkt got UNAVAILABLE from saClmClusterNodeGet().
Anders' explanation sounds reasonable: amfd should initialize CLM 
after amfd gets assigned a role. If there's no comment on the patch and no 
object to Anders' suggestion till Wednesday this week, I would like to 
float a V2 patch that incorporates Anders' suggestion.
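
A minimal sketch of the deferred initialization discussed here, assuming
hypothetical stand-ins (Status, Role, DeferredClmInit) rather than the real
SAF types or the amfd code. The background thread is left out to keep it
short; InitWithRetry() is roughly what such a thread would run.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical stand-ins for SaAisErrorT values; not the real SAF headers.
enum class Status { kOk, kTryAgain, kTimeout, kUnavailable, kBadHandle };
enum class Role { kSpare, kStandby, kActive };

// True for results that the caller is expected to retry.
inline bool IsTransient(Status rc) {
  return rc == Status::kTryAgain || rc == Status::kTimeout ||
         rc == Status::kUnavailable;
}

// Calls `init` until it returns a non-transient result or the attempt
// budget is used up. Returns the last status seen.
inline Status InitWithRetry(const std::function<Status()>& init,
                            int max_attempts) {
  assert(max_attempts > 0);
  Status rc = Status::kTryAgain;
  for (int attempt = 0; attempt < max_attempts && IsTransient(rc); ++attempt) {
    rc = init();  // a real caller would sleep between attempts
  }
  return rc;
}

// Defers handle initialization until a STANDBY or ACTIVE role is assigned.
class DeferredClmInit {
 public:
  explicit DeferredClmInit(std::function<Status()> init)
      : init_(std::move(init)) {}

  // Returns true if a usable handle exists after the call.
  bool OnRoleAssigned(Role role) {
    if (role == Role::kSpare) return false;  // spares do not need a handle
    if (!initialized_) {
      initialized_ = (InitWithRetry(init_, 10) == Status::kOk);
    }
    return initialized_;
  }

  // Called when an API reports BAD_HANDLE: forget the handle and start over.
  bool OnBadHandle() {
    initialized_ = false;
    initialized_ = (InitWithRetry(init_, 10) == Status::kOk);
    return initialized_;
  }

  bool initialized() const { return initialized_; }

 private:
  std::function<Status()> init_;
  bool initialized_ = false;
};
```

The point of the design is that a spare never holds a long-lived handle, so
the handle in use is always one created at or after role assignment.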

Thanks,
Minh
On 22/07/16 22:31, Anders Widell wrote:
> Yes, that's what I meant when I said that if we really do wish to 
> assign the STANDBY role to the directors running on a CLM locked node, 
> then we need to make sure CLM can differentiate between middleware- 
> and non-middleware clients. Middleware clients should then not be 
> affected by the CLM node lock.
>
> regards,
> Anders Widell
>
> On 07/22/2016 02:16 PM, praveen malviya wrote:
>> Such a configured non-member node will be notified only for node 
>> local changes in track callback and not cluster wide changes. This is 
>> one reason clm lock of active controller node is not allowed by CLM 
>> and AMF also rejects this in validation step.
>>
>>
>> Thanks,
>> Praveen
>>
>>
>> On 22-Jul-16 5:11 PM, Anders Widell wrote:
>>> Ok good, so then we there is no problem calling saClmInitialize_4() and
>>> saClmSelectionObjectGet() on a locked node. I checked
>>> saClmClusterTrack_4() and it should also be safe according to the spec.
>>>
>>> I would prefer if you don't initialize the CLM and NTF handles on spare
>>> nodes, until they actually get a STANDBY or ACTIVE assignment. One
>>> reason is performance - initializing a handle presumably consumes
>>> resources on both the spare controller node as well as in the CLM/NTF
>>> server running on the active node. We (currently) don't need the 
>>> handles
>>> as long as we are running as spares, so this is a waste of resources.
>>> But more importantly, I think it is safer to defer the initialization
>>> until the handles are needed. Let's suppose the spare starts up and
>>> keeps running for a very long time as a spare. Then - much later - you
>>> become STANDBY or ACTIVE. Now you want to start using those handles 
>>> that
>>> you initialized way back in history. Who knows if the handles are still
>>> working? A handle is essentially connection to a server running on
>>> another node. A lot of things may have happened since we initialized 
>>> it;
>>> controller switch-overs, fail-overs, software upgrades, and headless
>>> situations. Yes the handles ought to work - and if not, they don't they
>>> ought to return BAD_HANDLE so that we can re-initialize them. But there
>>> is also a small possibility that a bug causes the handle to simply not
>>> work. A "fresh", newly initialized handle would be safer than a 
>>> one-year
>>> old handle.
>>>
>>> Do you think it would work if you defer creating these background
>>> threads until we get an ACTIVE/STANDBY assignment?
>>>
>>> regards,
>>> Anders Widell
>>>
>>> On 07/22/2016 12:20 PM, minh chau wrote:
>>>> Hi,
>>>>
>>>> I think Praveen's comment on #1812 was worrying about amfd hanging
>>>> when init with CLM, this patch does not change position of CLM
>>>> initialization and also it's done in thread so it will be ok?
>>>> Regarding Anders' comment: I did quick test, lock clm on standby
>>>> controller and reboot it, when it comes up it initializes with CLM
>>>> successfully, so it seems we won't get ERR_UNAVAILABLE on configured
>>>> non-member node
>>>> Thought that handling ERR_UNAVAILABLE should be removed in CLM init in
>>>> the scope of this ticket, but would it be useful in case that amfd
>>>> re-init CLM up on receiving BAD_HANDLE?
>>>>
>>>> Thanks,
>>>> Minh
>>>> On 20/07/16 22:06, Anders Widell wrote:
>>>>> Regarding ticket [#1781], I think that one requires some more
>>>>> thought. First of all, do we want to assign the STANDBY role to the
>>>>> OpenSAF directors running on a CLM locked node? If we do want a CLM
>>>>> locked node to become standby, then CLM ought to provide service to
>>>>> middleware clients running on a locked node! It must differentiate
>>>>> between middleware- and non-middleware clients.
>>>>>

Re: [devel] [PATCH 1 of 1] ntfsv: refactor logging long dn notification [#1585]

2016-07-24 Thread minh chau
Hi Vu,

The patch looks good. Can I test this patch after #1315 is pushed? I run 
into osaf_abort() for now.
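
For reference, the result handling in the patch quoted below can be
sketched as follows. The enum names are hypothetical stand-ins for the
SaAisErrorT codes; the sketch only illustrates that any result other than
OK, TRY_AGAIN or TIMEOUT now ends in an abort, which may be why an
unsupported long DN leads to osaf_abort() here.

```cpp
// Hypothetical stand-ins for the SaAisErrorT codes named in the patch.
enum class LogResult { kOk, kTryAgain, kTimeout, kInvalidParam, kBadHandle };

enum class Action { kDone, kRetryLater, kAbort };

// Mirrors the switch in the quoted patch: transient errors are retried
// later, every other failure is treated as fatal.
constexpr Action ClassifyLogResult(LogResult rc) {
  switch (rc) {
    case LogResult::kOk:
      return Action::kDone;
    case LogResult::kTryAgain:
    case LogResult::kTimeout:
      return Action::kRetryLater;
    default:
      return Action::kAbort;
  }
}
```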

Thanks,
Minh

On 22/07/16 21:16, Vu Minh Nguyen wrote:
>   osaf/services/saf/ntfsv/ntfs/NtfLogger.cc |  51 
> +++---
>   1 files changed, 13 insertions(+), 38 deletions(-)
>
>
> Remove the part of code that truncates the long DN.
>
> diff --git a/osaf/services/saf/ntfsv/ntfs/NtfLogger.cc 
> b/osaf/services/saf/ntfsv/ntfs/NtfLogger.cc
> --- a/osaf/services/saf/ntfsv/ntfs/NtfLogger.cc
> +++ b/osaf/services/saf/ntfsv/ntfs/NtfLogger.cc
> @@ -21,6 +21,7 @@
>*/
>   #include 
>   
> +#include "osaf_utility.h"
>   #include "saAis.h"
>   #include "saLog.h"
>   #include "NtfAdmin.hh"
> @@ -232,48 +233,22 @@ SaAisErrorT NtfLogger::logNotification(N
>  notif->getNotificationId(),
>  SA_LOG_RECORD_WRITE_ACK,
>  &logRecord);
> -if (SA_AIS_OK != errorCode) {
> -  LOG_NO("Failed to log an alarm or security alarm notification (%d)", 
> errorCode);
> -  if (errorCode == SA_AIS_ERR_LIBRARY || errorCode == 
> SA_AIS_ERR_BAD_HANDLE) {
> -LOG_ER("Fatal error SA_AIS_ERR_LIBRARY or SA_AIS_ERR_BAD_HANDLE; 
> exiting (%d)...", errorCode);
> -exit(EXIT_FAILURE);
> -  } else if (errorCode == SA_AIS_ERR_INVALID_PARAM) {
> -/* Retry to log truncated notificationObject/notifyingObject because
> - * LOG Service has not supported long dn in Opensaf 4.5
> - */
> -char short_dn[SA_MAX_UNEXTENDED_NAME_LENGTH];
> -memset(&short_dn, 0, SA_MAX_UNEXTENDED_NAME_LENGTH);
> -SaNameT shortdn_notificationObject, shortdn_notifyingObject;
> -if (osaf_is_an_extended_name(ntfHeader->notificationObject)) {
> -  strncpy(short_dn, 
> osaf_extended_name_borrow(ntfHeader->notificationObject)
> -  , SA_MAX_UNEXTENDED_NAME_LENGTH - 1);
> -  osaf_extended_name_lend(short_dn, &shortdn_notificationObject);
> -  logRecord.logHeader.ntfHdr.notificationObject = 
> &shortdn_notificationObject;
> -}
> -if (osaf_is_an_extended_name(ntfHeader->notifyingObject)) {
> -  strncpy(short_dn, 
> osaf_extended_name_borrow(ntfHeader->notifyingObject)
> -  , SA_MAX_UNEXTENDED_NAME_LENGTH - 1);
> -  osaf_extended_name_lend(short_dn, &shortdn_notifyingObject);
> -  logRecord.logHeader.ntfHdr.notifyingObject = 
> &shortdn_notifyingObject;
> -}
> -if (short_dn[0] != '\0') {
> -  LOG_NO("Retry to log the truncated 
> notificationObject/notifyingObject");
> -  if ((errorCode = saLogWriteLogAsync(alarmStreamHandle,
> -  notif->getNotificationId(),
> -  SA_LOG_RECORD_WRITE_ACK,
> -  &logRecord)) != SA_AIS_OK) {
> -LOG_ER("Failed to log the truncated 
> notificationObject/notifyingObject (%d)"
> -   , errorCode);
> -  }
> -}
> -  }
> -  goto end;
> +switch (errorCode) {
> +case SA_AIS_OK:
> + break;
> +
> +/* LOGsv is busy. Put the notification to queue and re-send next time */
> +case SA_AIS_ERR_TRY_AGAIN:
> +case SA_AIS_ERR_TIMEOUT:
> + TRACE("Failed to log notification (ret: %d). Try next time.", 
> errorCode);
> + break;
> +
> +default:
> + osaf_abort(errorCode);
>   }
> }
>   
> -end:
> TRACE_LEAVE();
> -
> return errorCode;
>   }
>   
>


--
___
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel


Re: [devel] [PATCH 1 of 1] AMFD: Initialize CLM, NTF handle in thread [#1828]

2016-07-24 Thread minh chau
I mean "... no objection ..."

Thanks,
Minh

On 25/07/16 10:53, minh chau wrote:
> Hi,
>
> I have tried to reproduce the problem in #1781, when amfd in 
> non-member node gets assigned role, amfd initializes CLM successfully. 
> Only sfmd, cpkt got UNAVAILABLE from saClmClusterNodeGet().
> Ander's explanation sounds reasonably that amfd should initialize CLM 
> after amfd get assigned role. If there's no comment on the patch and 
> no object to Anders' suggestion till Wednesday this week, I would like 
> to float V2 patch that incorporates Anders' suggestion.
>
> Thanks,
> Minh
> On 22/07/16 22:31, Anders Widell wrote:
>> Yes, that's what I meant when I said that if we really do wish to 
>> assign the STANDBY role to the directors running on a CLM locked 
>> node, then we need to make sure CLM can differentiate between 
>> middleware- and non-middleware clients. Middleware clients should 
>> then not be affected by the CLM node lock.
>>
>> regards,
>> Anders Widell
>>
>> On 07/22/2016 02:16 PM, praveen malviya wrote:
>>> Such a configured non-member node will be notified only for node 
>>> local changes in track callback and not cluster wide changes. This 
>>> is one reason clm lock of active controller node is not allowed by 
>>> CLM and AMF also rejects this in validation step.
>>>
>>>
>>> Thanks,
>>> Praveen
>>>
>>>
>>> On 22-Jul-16 5:11 PM, Anders Widell wrote:
>>>> Ok good, so then we there is no problem calling saClmInitialize_4() 
>>>> and
>>>> saClmSelectionObjectGet() on a locked node. I checked
>>>> saClmClusterTrack_4() and it should also be safe according to the 
>>>> spec.
>>>>
>>>> I would prefer if you don't initialize the CLM and NTF handles on 
>>>> spare
>>>> nodes, until they actually get a STANDBY or ACTIVE assignment. One
>>>> reason is performance - initializing a handle presumably consumes
>>>> resources on both the spare controller node as well as in the CLM/NTF
>>>> server running on the active node. We (currently) don't need the 
>>>> handles
>>>> as long as we are running as spares, so this is a waste of resources.
>>>> But more importantly, I think it is safer to defer the initialization
>>>> until the handles are needed. Let's suppose the spare starts up and
>>>> keeps running for a very long time as a spare. Then - much later - you
>>>> become STANDBY or ACTIVE. Now you want to start using those handles 
>>>> that
>>>> you initialized way back in history. Who knows if the handles are 
>>>> still
>>>> working? A handle is essentially connection to a server running on
>>>> another node. A lot of things may have happened since we 
>>>> initialized it;
>>>> controller switch-overs, fail-overs, software upgrades, and headless
>>>> situations. Yes the handles ought to work - and if not, they don't 
>>>> they
>>>> ought to return BAD_HANDLE so that we can re-initialize them. But 
>>>> there
>>>> is also a small possibility that a bug causes the handle to simply not
>>>> work. A "fresh", newly initialized handle would be safer than a 
>>>> one-year
>>>> old handle.
>>>>
>>>> Do you think it would work if you defer creating these background
>>>> threads until we get an ACTIVE/STANDBY assignment?
>>>>
>>>> regards,
>>>> Anders Widell
>>>>
>>>> On 07/22/2016 12:20 PM, minh chau wrote:
>>>>> Hi,
>>>>>
>>>>> I think Praveen's comment on #1812 was worrying about amfd hanging
>>>>> when init with CLM, this patch does not change position of CLM
>>>>> initialization and also it's done in thread so it will be ok?
>>>>> Regarding Anders' comment: I did quick test, lock clm on standby
>>>>> controller and reboot it, when it comes up it initializes with CLM
>>>>> successfully, so it seems we won't get ERR_UNAVAILABLE on configured
>>>>> non-member node
>>>>> Thought that handling ERR_UNAVAILABLE should be removed in CLM 
>>>>> init in
>>>>> the scope of this ticket, but would it be useful in case that amfd
>>>>> re-init CLM up on receiving BAD_HANDLE?
>>>>>
>>>>> Thanks,
>>>>> Minh
>>>>>

Re: [devel] [PATCH 1 of 1] AMFD: Initialize CLM, NTF handle in thread [#1828]

2016-07-26 Thread minh chau
Hi Mathi,

I noticed that #1781 has moved the CLM init to before amfd gets assigned a 
role. Would you be happy if I change amfd so that it does the CLM init when 
it actually gets an active/standby role?

Thanks,
Minh

On 25/07/16 11:25, minh chau wrote:
> I mean "... no objection ..."
>
> Thanks,
> Minh
>
> On 25/07/16 10:53, minh chau wrote:
>> Hi,
>>
>> I have tried to reproduce the problem in #1781, when amfd in 
>> non-member node gets assigned role, amfd initializes CLM 
>> successfully. Only sfmd, cpkt got UNAVAILABLE from 
>> saClmClusterNodeGet().
>> Ander's explanation sounds reasonably that amfd should initialize CLM 
>> after amfd get assigned role. If there's no comment on the patch and 
>> no object to Anders' suggestion till Wednesday this week, I would 
>> like to float V2 patch that incorporates Anders' suggestion.
>>
>> Thanks,
>> Minh
>> On 22/07/16 22:31, Anders Widell wrote:
>>> Yes, that's what I meant when I said that if we really do wish to 
>>> assign the STANDBY role to the directors running on a CLM locked 
>>> node, then we need to make sure CLM can differentiate between 
>>> middleware- and non-middleware clients. Middleware clients should 
>>> then not be affected by the CLM node lock.
>>>
>>> regards,
>>> Anders Widell
>>>
>>> On 07/22/2016 02:16 PM, praveen malviya wrote:
>>>> Such a configured non-member node will be notified only for node 
>>>> local changes in track callback and not cluster wide changes. This 
>>>> is one reason clm lock of active controller node is not allowed by 
>>>> CLM and AMF also rejects this in validation step.
>>>>
>>>>
>>>> Thanks,
>>>> Praveen
>>>>
>>>>
>>>> On 22-Jul-16 5:11 PM, Anders Widell wrote:
>>>>> Ok good, so then we there is no problem calling 
>>>>> saClmInitialize_4() and
>>>>> saClmSelectionObjectGet() on a locked node. I checked
>>>>> saClmClusterTrack_4() and it should also be safe according to the 
>>>>> spec.
>>>>>
>>>>> I would prefer if you don't initialize the CLM and NTF handles on 
>>>>> spare
>>>>> nodes, until they actually get a STANDBY or ACTIVE assignment. One
>>>>> reason is performance - initializing a handle presumably consumes
>>>>> resources on both the spare controller node as well as in the CLM/NTF
>>>>> server running on the active node. We (currently) don't need the 
>>>>> handles
>>>>> as long as we are running as spares, so this is a waste of resources.
>>>>> But more importantly, I think it is safer to defer the initialization
>>>>> until the handles are needed. Let's suppose the spare starts up and
>>>>> keeps running for a very long time as a spare. Then - much later - 
>>>>> you
>>>>> become STANDBY or ACTIVE. Now you want to start using those 
>>>>> handles that
>>>>> you initialized way back in history. Who knows if the handles are 
>>>>> still
>>>>> working? A handle is essentially connection to a server running on
>>>>> another node. A lot of things may have happened since we 
>>>>> initialized it;
>>>>> controller switch-overs, fail-overs, software upgrades, and headless
>>>>> situations. Yes the handles ought to work - and if not, they don't 
>>>>> they
>>>>> ought to return BAD_HANDLE so that we can re-initialize them. But 
>>>>> there
>>>>> is also a small possibility that a bug causes the handle to simply 
>>>>> not
>>>>> work. A "fresh", newly initialized handle would be safer than a 
>>>>> one-year
>>>>> old handle.
>>>>>
>>>>> Do you think it would work if you defer creating these background
>>>>> threads until we get an ACTIVE/STANDBY assignment?
>>>>>
>>>>> regards,
>>>>> Anders Widell
>>>>>
>>>>> On 07/22/2016 12:20 PM, minh chau wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I think Praveen's comment on #1812 was worrying about amfd hanging
>>>>>> when init with CLM, this patch does not change position of CLM
>>>>>> initialization and also it's done in thread so it will be ok?
>>>>>

Re: [devel] [PATCH 1 of 5] amfd: replace SaNameT with string in include dir [#1642]

2016-08-01 Thread minh chau
Hi Praveen,

One comment with [Minh] in line.

Thanks,
Minh

On 01/08/16 17:22, Gary Lee wrote:
> Hi Praveen
>
> On 1/08/2016 4:29 PM, praveen malviya wrote:
>> Hi Gary, Long,
>>
>> Some comments/observations:
>> -In AMFD saAisNameBorrow() is used in logging and AMFND uses 
>> osaf_extended_name_borrow().
>> For osaf_extended_name_borrow() note in osaf_extended_name.h says it 
>> is intended for mainly agent libraries. But middle-ware services 
>> always use core libs. At the same time saAisNameBorrow(), I think, is 
>> for application.
>> any reason of using them this way and what is the recommended way?
> I think I used both styles in amfd. I think we can change saAisNameXX 
> to osaf_extended_name_XX just before pushing, to make it consistent 
> with the rest of the OpenSAF services.
>> -I think, one case may arrive from upgrade perspective.
>> Suppose any application (say amf_demo app) is running without 
>> enabling long dn and a csi, with its RDn greater than 256, is added 
>> dynamically (long dn enabled in IMM). In this case AMFD will assign 
>> this csi to the running component. Component will not be able to read 
>> the CSI and may crash.
>> This is related to invocation of CSI_SET callback but same will be 
>> valid for PG tracking also. There may be other cases also.
>> Even truncation will not work in this case.
[Minh] I think the agent patch that Gary submitted currently returns 
SA_AIS_ERR_NAME_TOO_LONG in saAmfDispatch() if a long DN callback comes to a 
legacy application (an app not adapted for long DN). The real callback won't 
be issued, but the application may crash if it calls exit() on a 
non-SA_AIS_OK return from Dispatch(). I guess you have seen this with #1553? 
Do you think it would be a good approach if the amf agent drops the long DN 
callback, Dispatch() returns OK to the legacy app, and an error is printed 
in syslog?
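
A minimal sketch of the two options discussed above, assuming hypothetical
names (Status, Policy, DispatchOne) and an assumed legacy limit of 255
characters. It is not the agent code, only the decision it would make for
one pending callback.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins; the real agent uses SaAisErrorT and SaNameT.
enum class Status { kOk, kNameTooLong };
enum class Policy { kReturnError, kDropAndLog };

// Longest DN that fits in a legacy (unextended) name, by assumption.
constexpr std::size_t kMaxLegacyDnLength = 255;

struct DispatchResult {
  Status status;       // what Dispatch() would return to the application
  bool callback_sent;  // whether the application callback was invoked
};

// Decides what to do with one pending callback that carries `dn`.
inline DispatchResult DispatchOne(const std::string& dn,
                                  bool app_supports_long_dn, Policy policy,
                                  std::vector<std::string>* syslog) {
  assert(syslog != nullptr);
  if (app_supports_long_dn || dn.size() <= kMaxLegacyDnLength) {
    return {Status::kOk, true};  // normal delivery
  }
  if (policy == Policy::kReturnError) {
    // Behaviour described for the current agent patch.
    return {Status::kNameTooLong, false};
  }
  // Proposed alternative: drop the callback, log it, report success.
  syslog->push_back("dropped long DN callback for legacy application");
  return {Status::kOk, false};
}
```

With kDropAndLog a legacy application that exits on any non-OK Dispatch()
result keeps running, at the cost of silently missing that callback.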
>> - While running some tests observed crashes in amfnd and amfd.
>> I will update #1642 with bt information.
> Minh will answer this bit.
>
> Thanks
> Gary
>




Re: [devel] [PATCH 2 of 2] AMFND: Admin operation continuation if csi callback completes during headless [#1725 part 1] V1

2016-08-04 Thread minh chau
Hi Praveen,

There's a bug in V1 at AMFD side, so I floated V2.
The change is commented in V2:
"

This V2 avoid AMFD crash if scAbsence is not configured.
V2's diff (from V1) is at avd_process_state_info_queue()

"

So latest patches are below:

[PATCH 0 of 2] Review Request for AMF: Support admin operation 
continuation after headless [#1725 Part 1] V2
[PATCH 1 of 2] AMFD: Introduce new RTA states for admin operation 
continuation after headless [#1725 part 1] V2
[PATCH 2 of 2] AMFND: Admin operation continuation if csi callback 
completes during headless [#1725 part 1] V1

I also uploaded the V2 patches to the ticket.
Sorry for the inconvenience.

Thanks,
Minh

On 05/08/16 15:40, praveen malviya wrote:
> Hi Minh,
>
> Not all the patches are received. Also in the two received patches 
> contents are same but commit messages are different.
> If this because of size of the patches, please upload in the ticket.
>
>
> Thanks,
> Praveen
>
> On 05-Aug-16 2:50 AM, Minh Hon Chau wrote:
>>  osaf/services/saf/amf/amfnd/di.cc |  199 +
>>  osaf/services/saf/amf/amfnd/include/avnd_di.h |1 +
>>  2 files changed, 134 insertions(+), 66 deletions(-)
>>
>>
>> The patch buffers susi_resp_msg during headless stage and resend it 
>> to AMFD after
>> headless.
>>
>> diff --git a/osaf/services/saf/amf/amfnd/di.cc 
>> b/osaf/services/saf/amf/amfnd/di.cc
>> --- a/osaf/services/saf/amf/amfnd/di.cc
>> +++ b/osaf/services/saf/amf/amfnd/di.cc
>> @@ -804,11 +804,6 @@ uint32_t avnd_di_susi_resp_send(AVND_CB
>>  if (cb->term_state == AVND_TERM_STATE_OPENSAF_SHUTDOWN_STARTED)
>>  return rc;
>>
>> -if (cb->is_avd_down == true) {
>> -m_AVND_SU_ALL_SI_RESET(su);
>> -return rc;
>> -}
>> -
>>  // should be in assignment pending state to be here
>>  osafassert(m_AVND_SU_IS_ASSIGN_PEND(su));
>>
>> @@ -819,64 +814,76 @@ uint32_t avnd_di_susi_resp_send(AVND_CB
>>  TRACE_ENTER2("Sending Resp su=%s, si=%s, curr_state=%u, 
>> prv_state=%u", su->name.value, 
>> curr_si->name.value,curr_si->curr_state,curr_si->prv_state);
>>  /* populate the susi resp msg */
>>  msg.info.avd = new AVSV_DND_MSG();
>> -msg.type = AVND_MSG_AVD;
>> -msg.info.avd->msg_type = AVSV_N2D_INFO_SU_SI_ASSIGN_MSG;
>> -msg.info.avd->msg_info.n2d_su_si_assign.msg_id = 
>> ++(cb->snd_msg_id);
>> -msg.info.avd->msg_info.n2d_su_si_assign.node_id = 
>> cb->node_info.nodeId;
>> -if (si) {
>> - msg.info.avd->msg_info.n2d_su_si_assign.single_csi =
>> -((si->single_csi_add_rem_in_si == 
>> AVSV_SUSI_ACT_BASE) ? false : true);
>> -}
>> -TRACE("curr_assign_state '%u'", curr_si->curr_assign_state);
>> -msg.info.avd->msg_info.n2d_su_si_assign.msg_act =
>> - (m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_ASSIGNED(curr_si) ||
>> - m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_ASSIGNING(curr_si)) ?
>> -((!curr_si->prv_state) ? AVSV_SUSI_ACT_ASGN : 
>> AVSV_SUSI_ACT_MOD) : AVSV_SUSI_ACT_DEL;
>> -msg.info.avd->msg_info.n2d_su_si_assign.su_name = su->name;
>> -if (si) {
>> - msg.info.avd->msg_info.n2d_su_si_assign.si_name = si->name;
>> -if (AVSV_SUSI_ACT_ASGN == 
>> si->single_csi_add_rem_in_si) {
>> -TRACE("si->curr_assign_state '%u'", 
>> curr_si->curr_assign_state);
>> - msg.info.avd->msg_info.n2d_su_si_assign.msg_act =
>> - (m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_ASSIGNED(curr_si) ||
>> - m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_ASSIGNING(curr_si)) ?
>> -AVSV_SUSI_ACT_ASGN : AVSV_SUSI_ACT_DEL;
>> -}
>> -}
>> -msg.info.avd->msg_info.n2d_su_si_assign.ha_state =
>> -(SA_AMF_HA_QUIESCING == curr_si->curr_state) ? 
>> SA_AMF_HA_QUIESCED : curr_si->curr_state;
>> -msg.info.avd->msg_info.n2d_su_si_assign.error =
>> - (m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_ASSIGNED(curr_si) ||
>> - m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_REMOVED(curr_si)) ? 
>> NCSCC_RC_SUCCESS : NCSCC_RC_FAILURE;
>> +msg.type = AVND_MSG_AVD;
>> +msg.info.avd->msg_type = AVSV_N2D_INFO_SU_SI_ASSIGN_MSG;
>> +msg.info.avd->msg_info.n2d_su_si_assign.node_id = 
>> cb->node_info.nodeId;
>> +if (si) {
>> +msg.info.avd->msg_info.n2d_su_si_assign.single_csi =
>> +((si->single_csi_add_rem_in_si == 
>> AVSV_SUSI_ACT_BASE) ? false : true);
>> +}
>> +TRACE("curr_assign_state '%u'", curr_si->curr_assign_state);
>> +msg.info.avd->msg_info.n2d_su_si_assign.msg_act =
>> + (m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_ASSIGNED(curr_si) ||
>> + m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_ASSIGNING(curr_si)) ?
>> +((!curr_si->prv_state) ? AVSV_SUSI_ACT_ASGN : 
>> AVSV_SUSI_ACT_MOD) : AVSV_SUSI_ACT_DEL;
>> +msg.info.avd->msg_info.n2d_su_si_assign.su_name = su->name;
>> +if (si) {
>> +msg.info.avd->msg_info.n2d_su_si_assign.si_name = si->name;
>> +if (AVSV_SUSI_ACT_ASGN == si->single_csi_add_rem_in_si) {
>> +   

Re: [devel] [PATCH 1 of 1] amfd: do not send duplicate removal of assignment, 2N model [#1772]

2016-08-08 Thread minh chau
Hi Praveen,

This patch has also fixed the coredump in the other tests that are failing 
in the test report of #1725 part 1, namely 14, 64, 68, 84, 124 and 128.
In the above test cases, I still get "ER avd_sg_su_oper_list_del: su not 
found".
Can we change ER to WA?
Ack from me with this minor comment.
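
The guard this patch adds can be sketched as below. The Su model and
function names are hypothetical simplifications, not the amfd data
structures; the sketch only shows how checking the assignment state first
makes the "remove all" message go out at most once.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified model of an SU and its SI assignments.
enum class SusiState { kAssigned, kUnassigning };

struct Su {
  std::vector<SusiState> susis;
  int removals_sent = 0;  // counts "remove all" messages sent to the node
};

// True when every assignment is already being removed (or none exist).
inline bool AllUnassigned(const Su& su) {
  for (SusiState s : su.susis) {
    if (s != SusiState::kUnassigning) return false;
  }
  return true;
}

// Sends "remove all" only if it has not been sent already.
inline void RemoveAssignmentsOnce(Su* su) {
  assert(su != nullptr);
  if (AllUnassigned(*su)) return;  // removal already in progress
  for (SusiState& s : su->susis) s = SusiState::kUnassigning;
  ++su->removals_sent;
}
```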

Thanks,
Minh

On 11/05/16 02:26, praveen.malv...@oracle.com wrote:
>   osaf/services/saf/amf/amfd/sg_2n_fsm.cc |  24 +++-
>   1 files changed, 19 insertions(+), 5 deletions(-)
>
>
> In the reported problem, AMFND asserted when SU was unlocked.
>
> For complete analysis, please refer ticket. In short, when AMFND was removing
> the assignments, it gets a duplicate removal of assignment for the same SU 
> because
> of reboot of node hosting the active su. This duplicate message gets buffered 
> and is picked
> up when ongoing removal completes. After completion of ongoing removal of 
> assignment, AMFND picks
> buffered assignment and sets assignment related flags. Since SUSIs were 
> deleted during previos
> removal, no callbacks processing and response to AMFD is done for it. During 
> response to AMFD,
> AMFND resets all assignment related flags and it remained undone for buffered 
> assignments.
> Later on when SU was unlocked and fresh assignments were given to it. After 
> completion of callback
> when AMFND tries to respond to AMFND expects valid SI pointer for fresh 
> assignment and checks it through
> a assert statement. Here AMFND asserts because of side effects of assignment 
> related flags being set.
>
> Patch fixes the problem by avoiding sending duplicate removal of assignments 
> to AMFND.
>
> diff --git a/osaf/services/saf/amf/amfd/sg_2n_fsm.cc 
> b/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
> --- a/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
> +++ b/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
> @@ -3339,9 +3339,7 @@ void SG_2N::node_fail(AVD_CL_CB *cb, AVD
>   
>   if ((avd_su_state_determine(su) != SA_AMF_HA_STANDBY) &&
>   !((avd_su_state_determine(su) == SA_AMF_HA_QUIESCED) &&
> -   (avd_su_fsm_state_determine(su) == AVD_SU_SI_STATE_UNASGN)
> - )
> - ) {
> +   (avd_su_fsm_state_determine(su) == 
> AVD_SU_SI_STATE_UNASGN))) {
>   /* SU is not standby */
>   a_susi = avd_sg_2n_act_susi(cb, su->sg_of_su, &s_susi);
>   
> @@ -3388,11 +3386,27 @@ void SG_2N::node_fail(AVD_CL_CB *cb, AVD
>   } else {
>   /* the other SU has quiesced or standby 
> assigned and is in the
>* operation list and is out of service.
> -  * Send a D2N-INFO_SU_SI_ASSIGN with 
> remove all to that SU.
> +  * Send a D2N-INFO_SU_SI_ASSIGN with 
> remove all to that SU
> +  * if not sent already.
>* Remove this SU from operation list. 
> Free the
>* SU SI relationships of this SU.
>*/
> - avd_sg_su_si_del_snd(cb, o_su);
> +
> +
> + /*
> +As mentioned above other su (o_su) 
> is OOS for quiesced or
> +standby state, it means some admin 
> operation is going on it or
> +it has faulted (su level) which led 
> to OOS.
> +In this function, we are processing 
> node_fail of active/quiesced
> +su. These active/quiesced 
> assignments will be deleted because of
> +node fault and also other su cannot 
> be made active as it is OOS.
> +So AMF will have to remove 
> assignments of other su (o_su) also.
> +Since o_su is OOS, there is a 
> possibility that AMF would have
> +sent deletion of assignment to it 
> because of admin op or fault.
> +If not sent then send it now.
> +  */
> + if (all_unassigned(o_su) == false)
> + avd_sg_su_si_del_snd(cb, o_su);
>   su->delete_all_susis();
>   avd_sg_su_oper_list_del(cb, su, false);
>   m_AVD_CHK_OPLIST(o_su, flag);
>



Re: [devel] [PATCH 2 of 2] AMFND: Admin operation continuation if csi callback completes during headless [#1725 part 1] V1

2016-08-10 Thread minh chau
Hi Nagu,

Have you enabled IMM schema changes and added these new attributes 
before the upgrade?
This could be tested in the same way as when previous additional 
attributes were introduced.

Thanks,
Minh

On 10/08/16 21:10, Nagendra Kumar wrote:
> Hi Minh,
>
> I was doing upgrade and downgrade, and I faced the following issue:
>
> Steps:
> #0 Both controllers are up(SC-1 Act, SC-2 Std) without the patch.
> #1 Stop standby controller(SC-2) and upgrade with the patch and start SC-2 as 
> Standby.
> #2 Perform SI switchover (2N SI), so that SC-2 is Act (with the patch) and 
> SC-1 is Standby(without the patch). Stop Standby controller(SC-1) and upgrade 
> with the patch and start SC-1 as Standby.
>
> Got error in syslog in SC-1 (Act):
> Aug 10 16:37:45 PM_SC-2 osafamfd[18234]: ER exec: create FAILED 12
> Aug 10 16:37:45 PM_SC-2 osafrded[18164]: NO Peer up on node 0x2010f
> Aug 10 16:37:45 PM_SC-2 osaffmd[18173]: NO clm init OK
> Aug 10 16:37:45 PM_SC-2 osaffmd[18173]: NO Peer clm node name: SC-1
> Aug 10 16:37:45 PM_SC-2 osafrded[18164]: NO Got peer info request from node 
> 0x2010f with role STANDBY
> Aug 10 16:37:45 PM_SC-2 osafrded[18164]: NO Got peer info response from node 
> 0x2010f with role STANDBY
> Aug 10 16:37:45 PM_SC-2 osafamfd[18234]: ER exec: create FAILED 12
>
>
> And there is no SUSI for SC-1:
> PM_SC-2:/home/nagu/views/staging-1725 # /etc/init.d/opensafd  status  
>   safSISU=safSu=SC-2\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed2,safApp=OpenSAF
>  saAmfSISUHAState=ACTIVE(1)
> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>  saAmfSISUHAState=ACTIVE(1)
>
>
> Thanks
> -Nagu
>
>> -Original Message-
>> From: Minh Hon Chau [mailto:minh.c...@dektech.com.au]
>> Sent: 05 August 2016 02:50
>> To: hans.nordeb...@ericsson.com; Nagendra Kumar; Praveen Malviya;
>> gary@dektech.com.au; long.hb.ngu...@dektech.com.au;
>> minh.c...@dektech.com.au
>> Cc: opensaf-devel@lists.sourceforge.net
>> Subject: [PATCH 2 of 2] AMFND: Admin operation continuation if csi callback
>> completes during headless [#1725 part 1] V1
>>
>>   osaf/services/saf/amf/amfnd/di.cc |  199 
>> +
>>   osaf/services/saf/amf/amfnd/include/avnd_di.h |1 +
>>   2 files changed, 134 insertions(+), 66 deletions(-)
>>
>>
>> The patch buffers susi_resp_msg during headless stage and resend it to
>> AMFD after headless.
>>
>> diff --git a/osaf/services/saf/amf/amfnd/di.cc
>> b/osaf/services/saf/amf/amfnd/di.cc
>> --- a/osaf/services/saf/amf/amfnd/di.cc
>> +++ b/osaf/services/saf/amf/amfnd/di.cc
>> @@ -804,11 +804,6 @@ uint32_t avnd_di_susi_resp_send(AVND_CB
>>  if (cb->term_state ==
>> AVND_TERM_STATE_OPENSAF_SHUTDOWN_STARTED)
>>  return rc;
>>
>> -if (cb->is_avd_down == true) {
>> -m_AVND_SU_ALL_SI_RESET(su);
>> -return rc;
>> -}
>> -
>>  // should be in assignment pending state to be here
>>  osafassert(m_AVND_SU_IS_ASSIGN_PEND(su));
>>
>> @@ -819,64 +814,76 @@ uint32_t avnd_di_susi_resp_send(AVND_CB
>>  TRACE_ENTER2("Sending Resp su=%s, si=%s, curr_state=%u,
>> prv_state=%u", su->name.value, curr_si->name.value,curr_si-
>>> curr_state,curr_si->prv_state);
>>  /* populate the susi resp msg */
>>  msg.info.avd = new AVSV_DND_MSG();
>> -msg.type = AVND_MSG_AVD;
>> -msg.info.avd->msg_type = AVSV_N2D_INFO_SU_SI_ASSIGN_MSG;
>> -msg.info.avd->msg_info.n2d_su_si_assign.msg_id = ++(cb-
>>> snd_msg_id);
>> -msg.info.avd->msg_info.n2d_su_si_assign.node_id = cb-
>>> node_info.nodeId;
>> -if (si) {
>> -msg.info.avd->msg_info.n2d_su_si_assign.single_csi =
>> -((si->single_csi_add_rem_in_si == 
>> AVSV_SUSI_ACT_BASE) ?
>> false : true);
>> -}
>> -TRACE("curr_assign_state '%u'", curr_si->curr_assign_state);
>> -msg.info.avd->msg_info.n2d_su_si_assign.msg_act =
>> -(m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_ASSIGNED(curr_si) ||
>> - m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_ASSIGNING(curr_si)) ?
>> -((!curr_si->prv_state) ? AVSV_SUSI_ACT_ASGN :
>> AVSV_SUSI_ACT_MOD) : AVSV_SUSI_ACT_DEL;
>> -msg.info.avd->msg_info.n2d_su_si_assign.su_name = su->name;
>> -if (si) {
>> -msg.info.avd->msg_info.n2d_su_si_assign.si_name = si->name;
>> -if (AVSV_SUSI_ACT_ASGN == si->single_csi_add_rem_in_si) {
>> -TRACE("si->curr_assign_state '%u'", curr_si->curr_assign_state);
>> -msg.info.avd->msg_info.n2d_su_si_assign.msg_act =
>> -
>> (m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_ASSIGNED(curr_si) ||
>> -
>> m_AVND_SU_SI_CURR_ASSIGN_STATE_IS_ASSIGNING(curr_si)) ?
>> -AVSV_SUSI_ACT_ASGN : AVSV_SUSI_ACT_DEL;
>> -}
>> -}
>> -msg.info.avd->msg_info.n2d_su_si_assign.ha_state =
>> -(SA_AMF_HA_QUIESCING == curr_si->curr_state)

Re: [devel] [PATCH 0 of 2] Review Request for AMF: Support admin operation continuation after headless [#1725 Part 1] V2

2016-08-11 Thread minh chau
Hi Nagu, Praveen,

Could you please send me any comments you have so far? That would let me
start revising the code while you continue reviewing.
There are some changes in the SG code; I hope they don't break the SG's
existing logic.

Thanks,
Minh

On 05/08/16 07:20, Minh Hon Chau wrote:
> Summary: AMF: Support admin operation continuation after headless [#1725 Part 
> 1] V2
> Review request for Trac Ticket(s): 1725
> Peer Reviewer(s): AMF devs
> Pull request to: <>
> Affected branch(es): default
> Development branch: default
>
> 
> Impacted area           Impact y/n
> 
>   Docs                    n
>   Build system            n
>   RPM/packaging           n
>   Configuration files     n
>   Startup scripts         n
>   SAF services            y
>   OpenSAF services        n
>   Core libraries          n
>   Samples                 n
>   Tests                   n
>   Other                   n
>
>
> Comments (indicate scope for each "y" above):
> -
> This V2 avoids an AMFD crash if scAbsence is not configured.
> V2's diff (from V1) is in avd_process_state_info_queue()
>
> changeset 2215120caf950daa78927142aadebc27fda9d8b4
> Author:   Minh Hon Chau 
> Date: Fri, 05 Aug 2016 07:13:09 +1000
>
>   AMFD: Introduce new RTA states for admin operation continuation after
>   headless [#1725 part 1] V2
>
>   If an admin operation is running when the cluster goes into the headless
>   stage, the normal admin operation sequence is interrupted. Since both SCs
>   are down, the SI assignments at AMFND could be ongoing or completed during
>   the headless period. After headless, this admin operation should be
>   continued. This patch series supports admin operation continuation after
>   headless.
>
>   To resume the admin operation after headless, the states that need to be
>   restored are: SUSI fsm states, SG fsm states, SI Dependency states (not
>   supported in this patch), and the SU operation list in the SG at the time
>   the cluster goes headless.
>
>   At the moment, the SG fsm states are set differently in each specific SG
>   model. Also, the rule for adding a SU to the SG's operation list is not
>   consistent. In most places a SU is added to the operation list after AMFD
>   sends a su_si_assign event on this SU. However, there are some scenarios
>   in which a SU is added to the list for other purposes. These difficulties
>   make the state logic deduction hard to implement.
>
>   This patch introduces new RTA states (saAmfSGSuOperationList,
>   saAmfSGFsmState, saAmfSISUFsmState) to capture the SG's operation list, SG
>   fsm state, and SUSI fsm state from AMFD's memory to IMM during AMFD's
>   lifetime. If the cluster comes back from headless, these RTAs are read from
>   IMM to restore the states in AMFD's memory. After this patch, if an admin
>   operation is interrupted by the headless stage and the csi callback is
>   responded to after headless, the admin operation can continue. The other
>   patch in this series helps admin operation continuation if a csi callback
>   completes during headless.
>
> changeset 7a016215ab72d6a8a6e66c2cbd55c8cd3d15c3f9
> Author:   Minh Hon Chau 
> Date: Fri, 05 Aug 2016 07:13:09 +1000
>
>   AMFND: Admin operation continuation if csi callback completes during
>   headless [#1725 part 1] V1
>
>   The patch buffers susi_resp_msg during the headless stage and resends it to
>   AMFD after headless.
>
>
> Complete diffstat:
> --
>   osaf/services/saf/amf/amfd/cluster.cc |4 +
>   osaf/services/saf/amf/amfd/csi.cc |   38 
>   osaf/services/saf/amf/amfd/imm.cc |5 +-
>   osaf/services/saf/amf/amfd/include/csi.h  |1 -
>   osaf/services/saf/amf/amfd/include/imm.h  |5 +-
>   osaf/services/saf/amf/amfd/include/sg.h   |6 +-
>   osaf/services/saf/amf/amfd/include/su.h   |3 +-
>   osaf/services/saf/amf/amfd/include/susi.h |6 +-
>   osaf/services/saf/amf/amfd/include/util.h |2 +
>   osaf/services/saf/amf/amfd/ndfsm.cc   |   11 ++-
>   osaf/services/saf/amf/amfd/role.cc|6 -
>   osaf/services/saf/amf/amfd/sg.cc  |  110 
> +-
>   osaf/services/saf/amf/amfd/sg_2n_fsm.cc   |8 +-
>   osaf/services/saf/amf/amfd/sg_npm_fsm.cc  |2 +-
>   osaf/services/saf/amf/amfd/sg_nwayact_fsm.cc  |2 +-
>   osaf/services/saf/amf/amfd/sgproc.cc  |  140 
> +--
>   osaf/services/saf/amf/amfd/siass.cc   |  204 
> --
>   osaf/services/saf/amf/amfd/su.cc  |   46 +-
>   osaf/services/saf/amf/amfnd/di.cc |  199 
> ++---
>   osaf/services/saf/amf/

Re: [devel] [PATCH 1 of 1] amfd: mark stby_sync_state out of sync if peer amfd is absent [#1850]

2016-08-16 Thread minh chau
Hi Nagu,

I got this.

   CXX  osafamfd-sg_2n_fsm.o
   CXX  osafamfd-sg_nored_fsm.o
sg_2n_fsm.cc: In member function ‘virtual SaAisErrorT 
SG_2N::si_swap(AVD_SI*, SaInvocationT)’:
sg_2n_fsm.cc:775:6: error: ‘cb’ was not declared in this scope
 ((cb->node_id_avd_other != 0) && (cb->other_avd_adest != 0))) {
   ^
Makefile:1176: recipe for target 'osafamfd-sg_2n_fsm.o' failed
make[7]: *** [osafamfd-sg_2n_fsm.o] Error 1

Thanks,
Minh

On 16/08/16 19:32, Nagendra Kumar wrote:
> Please review it by this weekend.
>
> Thanks
> -Nagu
>
>> -Original Message-
>> From: Nagendra Kumar
>> Sent: 02 August 2016 17:34
>> To: hans.nordeb...@ericsson.com; Praveen Malviya;
>> minh.c...@dektech.com.au; gary@dektech.com.au
>> Cc: opensaf-devel@lists.sourceforge.net
>> Subject: [devel] [PATCH 1 of 1] amfd: mark stby_sync_state out of sync if
>> peer amfd is absent [#1850]
>>
>>   osaf/services/saf/amf/amfd/main.cc  |  2 +-
>>   osaf/services/saf/amf/amfd/sg_2n_fsm.cc |  3 ++-
>>   osaf/services/saf/amf/amfd/sgproc.cc|  8 +---
>>   3 files changed, 4 insertions(+), 9 deletions(-)
>>
>>
>> If standby amfd is not available then stby_sync_state should be in out of 
>> sync
>> state.
>> Else, Amfd should be in out of sync state.
>> This is to avoid issues like 1841
>>
>> diff --git a/osaf/services/saf/amf/amfd/main.cc
>> b/osaf/services/saf/amf/amfd/main.cc
>> --- a/osaf/services/saf/amf/amfd/main.cc
>> +++ b/osaf/services/saf/amf/amfd/main.cc
>> @@ -542,7 +542,7 @@ static uint32_t initialize(void)
>>  cb->fully_initialized = false;
>>  cb->swap_switch = false;
>>  cb->active_services_exist = true;
>> -cb->stby_sync_state = AVD_STBY_IN_SYNC;
>> +cb->stby_sync_state = AVD_STBY_OUT_OF_SYNC;
>>  cb->sync_required = true;
>>
>>  cb->heartbeat_tmr.is_active = false;
>> diff --git a/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
>> b/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
>> --- a/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
>> +++ b/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
>> @@ -771,7 +771,8 @@ SaAisErrorT SG_2N::si_swap(AVD_SI *si, S
>>  goto done;
>>  }
>>
>> -if (si->sg_of_si->sg_ncs_spec) {
>> +if ((si->sg_of_si->sg_ncs_spec) &&
>> +((cb->node_id_avd_other != 0) && (cb->other_avd_adest != 0))) {
>>  if (avd_cb->stby_sync_state == AVD_STBY_OUT_OF_SYNC) {
>>  LOG_ER("%s SWAP failed - Cold sync in progress", si->name.value);
>>  rc = SA_AIS_ERR_TRY_AGAIN;
>> diff --git a/osaf/services/saf/amf/amfd/sgproc.cc
>> b/osaf/services/saf/amf/amfd/sgproc.cc
>> --- a/osaf/services/saf/amf/amfd/sgproc.cc
>> +++ b/osaf/services/saf/amf/amfd/sgproc.cc
>> @@ -1997,14 +1997,8 @@ void avd_node_down_mw_susi_failover(AVD_
>>  if ((i_su->sg_of_su->sg_redundancy_model ==
>> SA_AMF_2N_REDUNDANCY_MODEL) &&
>>  (i_su->sg_of_su->sg_fsm_state ==
>> AVD_SG_FSM_STABLE))
>>  (void) avd_clm_track_start();
>> -/* If Std ctlr went down in middle of Cold sync, then we need
>> -   to reset the sync state to IN_SYNC. */
>> -if ((i_su->sg_of_su->sg_redundancy_model ==
>> SA_AMF_2N_REDUNDANCY_MODEL) &&
>> -(cb->stby_sync_state ==
>> AVD_STBY_OUT_OF_SYNC)) {
>> -TRACE("Marking sync_state as in_sync");
>> -cb->stby_sync_state = AVD_STBY_IN_SYNC;
>> -}
>>  /* Free all the SU SI assignments*/
>> +
>>  i_su->delete_all_susis();
>>
>>  }   /* for (const auto& i_su : avnd->list_of_su) */
>>
>> --
>> ___
>> Opensaf-devel mailing list
>> Opensaf-devel@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel




Re: [devel] [PATCH 1 of 1] AMFD: Correct the size of synchronizing node after headless [#1984]

2016-08-16 Thread minh chau
Thanks Hans, I will correct it before push.

On 10/08/16 22:15, Hans Nordebäck wrote:
> Ack, code review only. The ticket number in the subject is incorrect; it
> should be #1894. /Thanks HansN
>
> -Original Message-
> From: Minh Hon Chau [mailto:minh.c...@dektech.com.au]
> Sent: den 29 juni 2016 02:48
> To: Hans Nordebäck ; nagendr...@oracle.com; 
> praveen.malv...@oracle.com; Gary Lee 
> Cc: opensaf-devel@lists.sourceforge.net
> Subject: [PATCH 1 of 1] AMFD: Correct the size of synchronizing node after 
> headless [#1984]
>
>   osaf/services/saf/amf/amfd/ndfsm.cc |  2 +-
>   1 files changed, 1 insertions(+), 1 deletions(-)
>
>
> If more than 2 payloads are joining from headless, amfd will think all nodes
> have already been synced even though one is still in the headless sync period.
>
> The patch corrects the conditional statement that checks the number of nodes
> being synchronized.
>
> diff --git a/osaf/services/saf/amf/amfd/ndfsm.cc 
> b/osaf/services/saf/amf/amfd/ndfsm.cc
> --- a/osaf/services/saf/amf/amfd/ndfsm.cc
> +++ b/osaf/services/saf/amf/amfd/ndfsm.cc
> @@ -298,7 +298,7 @@ void avd_node_up_evh(AVD_CL_CB *cb, AVD_
>   uint32_t rc_node_up;
>   avnd->node_up_msg_count++;
>   rc_node_up = avd_count_node_up(cb);
> - if (rc_node_up == sync_nd_size-1) {
> + if (rc_node_up == sync_nd_size) {
>   if (cb->node_sync_tmr.is_active) {
>   avd_stop_tmr(cb, &cb->node_sync_tmr);
>   TRACE("stop NodeSync timer");
>




Re: [devel] [PATCH 1 of 1] amfd: mark stby_sync_state out of sync if peer amfd is absent [#1850]

2016-08-16 Thread minh chau
Hi Nagu,

I think I can just replace "cb" with "avd_cb" and test the patch. You can
correct it later.

Thanks,
Minh

On 17/08/16 12:50, minh chau wrote:
> Hi Nagu,
>
> I got this.
>
>   CXX  osafamfd-sg_2n_fsm.o
>   CXX  osafamfd-sg_nored_fsm.o
> sg_2n_fsm.cc: In member function ‘virtual SaAisErrorT 
> SG_2N::si_swap(AVD_SI*, SaInvocationT)’:
> sg_2n_fsm.cc:775:6: error: ‘cb’ was not declared in this scope
> ((cb->node_id_avd_other != 0) && (cb->other_avd_adest != 0))) {
>   ^
> Makefile:1176: recipe for target 'osafamfd-sg_2n_fsm.o' failed
> make[7]: *** [osafamfd-sg_2n_fsm.o] Error 1
>
> Thanks,
> Minh
>
> On 16/08/16 19:32, Nagendra Kumar wrote:
>> Please review it by this weekend.
>>
>> Thanks
>> -Nagu
>>
>>> -Original Message-
>>> From: Nagendra Kumar
>>> Sent: 02 August 2016 17:34
>>> To: hans.nordeb...@ericsson.com; Praveen Malviya;
>>> minh.c...@dektech.com.au; gary@dektech.com.au
>>> Cc: opensaf-devel@lists.sourceforge.net
>>> Subject: [devel] [PATCH 1 of 1] amfd: mark stby_sync_state out of 
>>> sync if
>>> peer amfd is absent [#1850]
>>>
>>>   osaf/services/saf/amf/amfd/main.cc  |  2 +-
>>>   osaf/services/saf/amf/amfd/sg_2n_fsm.cc |  3 ++-
>>>   osaf/services/saf/amf/amfd/sgproc.cc|  8 +---
>>>   3 files changed, 4 insertions(+), 9 deletions(-)
>>>
>>>
>>> If standby amfd is not available then stby_sync_state should be in 
>>> out of sync
>>> state.
>>> Else, Amfd should be in out of sync state.
>>> This is to avoid issues like 1841
>>>
>>> diff --git a/osaf/services/saf/amf/amfd/main.cc
>>> b/osaf/services/saf/amf/amfd/main.cc
>>> --- a/osaf/services/saf/amf/amfd/main.cc
>>> +++ b/osaf/services/saf/amf/amfd/main.cc
>>> @@ -542,7 +542,7 @@ static uint32_t initialize(void)
>>>   cb->fully_initialized = false;
>>>   cb->swap_switch = false;
>>>   cb->active_services_exist = true;
>>> -cb->stby_sync_state = AVD_STBY_IN_SYNC;
>>> +cb->stby_sync_state = AVD_STBY_OUT_OF_SYNC;
>>>   cb->sync_required = true;
>>>
>>>   cb->heartbeat_tmr.is_active = false;
>>> diff --git a/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
>>> b/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
>>> --- a/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
>>> +++ b/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
>>> @@ -771,7 +771,8 @@ SaAisErrorT SG_2N::si_swap(AVD_SI *si, S
>>>   goto done;
>>>   }
>>>
>>> -if (si->sg_of_si->sg_ncs_spec) {
>>> +if ((si->sg_of_si->sg_ncs_spec) &&
>>> +((cb->node_id_avd_other != 0) && (cb->other_avd_adest != 0))) {
>>>   if (avd_cb->stby_sync_state == AVD_STBY_OUT_OF_SYNC) {
>>>   LOG_ER("%s SWAP failed - Cold sync in progress", si->name.value);
>>>   rc = SA_AIS_ERR_TRY_AGAIN;
>>> diff --git a/osaf/services/saf/amf/amfd/sgproc.cc
>>> b/osaf/services/saf/amf/amfd/sgproc.cc
>>> --- a/osaf/services/saf/amf/amfd/sgproc.cc
>>> +++ b/osaf/services/saf/amf/amfd/sgproc.cc
>>> @@ -1997,14 +1997,8 @@ void avd_node_down_mw_susi_failover(AVD_
>>>   if ((i_su->sg_of_su->sg_redundancy_model ==
>>> SA_AMF_2N_REDUNDANCY_MODEL) &&
>>>   (i_su->sg_of_su->sg_fsm_state ==
>>> AVD_SG_FSM_STABLE))
>>>   (void) avd_clm_track_start();
>>> -/* If Std ctlr went down in middle of Cold sync, then we need
>>> -   to reset the sync state to IN_SYNC. */
>>> -if ((i_su->sg_of_su->sg_redundancy_model ==
>>> SA_AMF_2N_REDUNDANCY_MODEL) &&
>>> -(cb->stby_sync_state ==
>>> AVD_STBY_OUT_OF_SYNC)) {
>>> -TRACE("Marking sync_state as in_sync");
>>> -cb->stby_sync_state = AVD_STBY_IN_SYNC;
>>> -}
>>>   /* Free all the SU SI assignments*/
>>> +
>>>   i_su->delete_all_susis();
>>>
>>>   }/* for (const auto& i_su : avnd->list_of_su) */
>>>
>>> --
>>>  
>>>
>>> ___
>>> Opensaf-devel mailing list
>>> Opensaf-devel@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>




Re: [devel] [PATCH 1 of 1] amfd: mark stby_sync_state out of sync if peer amfd is absent [#1850]

2016-08-17 Thread minh chau
Ack from me (code review only)

Thanks,
Minh

On 17/08/16 16:39, Nagendra Kumar wrote:
> Hi Minh,
>   Please change cb to avd_cb and then test. Sorry for the typo.
>
> Thanks
> -Nagu
>
>> -Original Message-
>> From: minh chau [mailto:minh.c...@dektech.com.au]
>> Sent: 17 August 2016 08:20
>> To: Nagendra Kumar; hans.nordeb...@ericsson.com; Praveen Malviya;
>> gary@dektech.com.au
>> Cc: opensaf-devel@lists.sourceforge.net
>> Subject: Re: [devel] [PATCH 1 of 1] amfd: mark stby_sync_state out of sync if
>> peer amfd is absent [#1850]
>>
>> Hi Nagu,
>>
>> I got this.
>>
>> CXX  osafamfd-sg_2n_fsm.o
>> CXX  osafamfd-sg_nored_fsm.o
>> sg_2n_fsm.cc: In member function 'virtual SaAisErrorT
>> SG_2N::si_swap(AVD_SI*, SaInvocationT)':
>> sg_2n_fsm.cc:775:6: error: 'cb' was not declared in this scope
>>   ((cb->node_id_avd_other != 0) && (cb->other_avd_adest != 0))) {
>> ^
>> Makefile:1176: recipe for target 'osafamfd-sg_2n_fsm.o' failed
>> make[7]: *** [osafamfd-sg_2n_fsm.o] Error 1
>>
>> Thanks,
>> Minh
>>
>> On 16/08/16 19:32, Nagendra Kumar wrote:
>>> Please review it by this weekend.
>>>
>>> Thanks
>>> -Nagu
>>>
>>>> -Original Message-
>>>> From: Nagendra Kumar
>>>> Sent: 02 August 2016 17:34
>>>> To: hans.nordeb...@ericsson.com; Praveen Malviya;
>>>> minh.c...@dektech.com.au; gary@dektech.com.au
>>>> Cc: opensaf-devel@lists.sourceforge.net
>>>> Subject: [devel] [PATCH 1 of 1] amfd: mark stby_sync_state out of
>>>> sync if peer amfd is absent [#1850]
>>>>
>>>>osaf/services/saf/amf/amfd/main.cc  |  2 +-
>>>>osaf/services/saf/amf/amfd/sg_2n_fsm.cc |  3 ++-
>>>>osaf/services/saf/amf/amfd/sgproc.cc|  8 +---
>>>>3 files changed, 4 insertions(+), 9 deletions(-)
>>>>
>>>>
>>>> If standby amfd is not available then stby_sync_state should be in
>>>> out of sync state.
>>>> Else, Amfd should be in out of sync state.
>>>> This is to avoid issues like 1841
>>>>
>>>> diff --git a/osaf/services/saf/amf/amfd/main.cc
>>>> b/osaf/services/saf/amf/amfd/main.cc
>>>> --- a/osaf/services/saf/amf/amfd/main.cc
>>>> +++ b/osaf/services/saf/amf/amfd/main.cc
>>>> @@ -542,7 +542,7 @@ static uint32_t initialize(void)
>>>>cb->fully_initialized = false;
>>>>cb->swap_switch = false;
>>>>cb->active_services_exist = true;
>>>> -  cb->stby_sync_state = AVD_STBY_IN_SYNC;
>>>> +  cb->stby_sync_state = AVD_STBY_OUT_OF_SYNC;
>>>>cb->sync_required = true;
>>>>
>>>>cb->heartbeat_tmr.is_active = false; diff --git
>>>> a/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
>>>> b/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
>>>> --- a/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
>>>> +++ b/osaf/services/saf/amf/amfd/sg_2n_fsm.cc
>>>> @@ -771,7 +771,8 @@ SaAisErrorT SG_2N::si_swap(AVD_SI *si, S
>>>>goto done;
>>>>}
>>>>
>>>> -  if (si->sg_of_si->sg_ncs_spec) {
>>>> +  if ((si->sg_of_si->sg_ncs_spec) &&
>>>> +  ((cb->node_id_avd_other != 0) && (cb->other_avd_adest != 0))) {
>>>>if (avd_cb->stby_sync_state == AVD_STBY_OUT_OF_SYNC) {
>>>>LOG_ER("%s SWAP failed - Cold sync in progress", si->name.value);
>>>>rc = SA_AIS_ERR_TRY_AGAIN;
>>>> diff --git a/osaf/services/saf/amf/amfd/sgproc.cc
>>>> b/osaf/services/saf/amf/amfd/sgproc.cc
>>>> --- a/osaf/services/saf/amf/amfd/sgproc.cc
>>>> +++ b/osaf/services/saf/amf/amfd/sgproc.cc
>>>> @@ -1997,14 +1997,8 @@ void
>> avd_node_down_mw_susi_failover(AVD_
>>>>if ((i_su->sg_of_su->sg_redundancy_model ==
>>>> SA_AMF_2N_REDUNDANCY_MODEL) &&
>>>>(i_su->sg_of_su->sg_fsm_state ==
>>>> AVD_SG_FSM_STABLE))
>>>>(void) avd_clm_track_start();
>>>> -  /* If Std ctlr went down in m

Re: [devel] [PATCH 4 of 4] AMFD: Validate headless cached RTA read from IMM [#1725]

2016-08-19 Thread minh chau
Hi Praveen,

I attached them to the ticket.

Thanks,
Minh

On 19/08/16 21:08, praveen malviya wrote:
> Hi Minh,
> I have not received all the patches.
> Please attach them to the ticket.
>
> Thanks,
> Praveen
>
> On 18-Aug-16 5:45 AM, Minh Hon Chau wrote:
>>  osaf/services/saf/amf/amfd/include/sg.h |   4 +-
>>  osaf/services/saf/amf/amfd/include/susi.h |   2 +
>>  osaf/services/saf/amf/amfd/ndfsm.cc   |  15 ++-
>>  osaf/services/saf/amf/amfd/sg.cc  |  37 ++-
>>  osaf/services/saf/amf/amfd/siass.cc   |  59 
>> +-
>>  osaf/services/saf/amf/amfd/su.cc  |  12 ++
>>  6 files changed, 121 insertions(+), 8 deletions(-)
>>
>>
>> Headless interruption is an unplanned action, and writing RTAs to IMM
>> is currently queued up in the AMFD implementation. That can result in
>> inappropriate states for the SG fsm state, SUSI fsm state, ha state,
>> SUOperationList, etc. Eventually, AMFD can run into an unstable SG, a false
>> assertion, or even SUSIs that become permanently PARTIALLY, which is hard
>> to debug (even harder without trace).
>>
>> This patch adds a validation routine to check the headless cached RTAs read
>> from IMM; more validation rules are to be added. Also, a TODO is left for
>> discussion about what action should be taken if validation fails.
>>
>> diff --git a/osaf/services/saf/amf/amfd/include/sg.h 
>> b/osaf/services/saf/amf/amfd/include/sg.h
>> --- a/osaf/services/saf/amf/amfd/include/sg.h
>> +++ b/osaf/services/saf/amf/amfd/include/sg.h
>> @@ -418,7 +418,7 @@ public:
>>  bool any_assignment_absent();
>>  void failover_absent_assignment();
>>  bool ng_using_saAmfSGAdminState;
>> -
>> +bool headless_validation;
>>  uint32_t term_su_list_in_reverse();
>> //Runtime calculates value of saAmfSGNumCurrAssignedSUs;
>>  uint32_t curr_assigned_sus() const;
>> @@ -579,7 +579,7 @@ private:
>>  #define m_AVD_CHK_OPLIST(i_su,flag) (flag) = 
>> (i_su)->sg_of_su->in_su_oper_list(i_su)
>>
>>  void avd_sg_read_headless_cached_rta(AVD_CL_CB *cb);
>> -
>> +bool avd_sg_validate_headless_cached_rta(AVD_CL_CB *cb);
>>  extern void avd_sg_delete(AVD_SG *sg);
>>  extern void avd_sg_db_add(AVD_SG *sg);
>>  extern void avd_sg_db_remove(AVD_SG *sg);
>> diff --git a/osaf/services/saf/amf/amfd/include/susi.h 
>> b/osaf/services/saf/amf/amfd/include/susi.h
>> --- a/osaf/services/saf/amf/amfd/include/susi.h
>> +++ b/osaf/services/saf/amf/amfd/include/susi.h
>> @@ -143,6 +143,8 @@ AVD_SU_SI_REL *avd_susi_create(AVD_CL_CB
>>  AVD_SU_SI_STATE default_fsm = 
>> AVD_SU_SI_STATE_ABSENT);
>>  AVD_SU_SI_REL *avd_susi_find(AVD_CL_CB *cb, const SaNameT *su_name, 
>> const SaNameT *si_name);
>>  void avd_susi_update_fsm(AVD_SU_SI_REL *susi, AVD_SU_SI_STATE 
>> new_fsm_state);
>> +bool avd_susi_validate_headless_cached_rta(AVD_SU_SI_REL *present_susi,
>> +SaAmfHAStateT ha_fr_imm, AVD_SU_SI_STATE fsm_fr_imm);
>>  void avd_susi_read_headless_cached_rta(AVD_CL_CB *cb);
>>  extern void avd_susi_update(AVD_SU_SI_REL *susi, SaAmfHAStateT 
>> ha_state);
>>
>> diff --git a/osaf/services/saf/amf/amfd/ndfsm.cc 
>> b/osaf/services/saf/amf/amfd/ndfsm.cc
>> --- a/osaf/services/saf/amf/amfd/ndfsm.cc
>> +++ b/osaf/services/saf/amf/amfd/ndfsm.cc
>> @@ -127,13 +127,22 @@ void avd_process_state_info_queue(AVD_CL
>>
>>  // Read cached rta from Imm, the order of calling
>>  // below functions is IMPORTANT.
>> -// Reading sg must be after reading susi
>> -// Cleanup compcsi must be after reading sg
>>  if (found_state_info == true) {
>> +LOG_NO("Enter restore headless cached RTAs from IMM");
>> +// Read all cached susi, includes ABSENT SUSI with IMM fsm 
>> state
>>  avd_susi_read_headless_cached_rta(cb);
>> +// Read SUSwitch of SU, validate toggle depends on SUSI fsm 
>> state
>> +avd_su_read_headless_cached_rta(cb);
>> +// Read SUOperationList, set ABSENT fsm state for ABSENT SUSI
>>  avd_sg_read_headless_cached_rta(cb);
>> +// Clean compcsi object of ABSENT SUSI
>>  avd_compcsi_cleanup_imm_object(cb);
>> -avd_su_read_headless_cached_rta(cb);
>> +// Last, validate all
>> +bool valid = avd_sg_validate_headless_cached_rta(cb);
>> +if (valid)
>> +LOG_NO("Leave reading headless cached RTAs from IMM: 
>> SUCCESS");
>> +else
>> +LOG_ER("Leave reading headless cached RTAs from IMM: 
>> FAILED");
>>  }
>>  done:
>>  TRACE("queue_size after processing: %lu", (unsigned long) 
>> cb->evt_queue.size());
>> diff --git a/osaf/services/saf/amf/amfd/sg.cc 
>> b/osaf/services/saf/amf/amfd/sg.cc
>> --- a/osaf/services/saf/amf/amfd/sg.cc
>> +++ b/osaf/services/saf/amf/amfd/sg.cc
>> @@ -124,7 +124,8 @@ AVD_SG::AVD_SG():
>>  max_assigned_su(nullptr),
>>  min_assigned_su(nullptr),
>>  si_tobe_redistributed(nullptr),
>> -try_inst_counter(0)
>> +try_inst_counter(0

Re: [devel] [PATCH 1 of 1] amfa: fixed freeing notification buff [#1642]

2016-08-20 Thread minh chau
Hi Long, Praveen,

Regarding this TODO
+  if(notification) {
+// TODO (minhchau): memleak if notification is an array
+osaf_extended_name_free(&notification->member.compName);
  free(notification);
+  }

The client currently calls saAmfProtectionGroupNotificationFree_4(handle,
buff->notification) to free the notification in the buffer.
If @buff->notification is a list of shortDn only, that should work as
before, since the agent runs this inside
saAmfProtectionGroupNotificationFree_4:

 /* free memory */
 if(notification)
 free(notification);

It will cause a memory leak if @buff->notification contains a list of
longDn notifications.
What leaks is the longDn of compName in each notification after the first
one in the array @buff->notification.

The agent could add a sentinel element when it allocates
@buff->notification and mark this last element as NULL.
In the Free() API, the agent could then iterate and free the longDn in each
element of the array @buff->notification until it reaches the NULL element.
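
A minimal sketch of the sentinel idea, using simplified stand-in types rather
than the real SaAmfProtectionGroupNotificationT_4 and osaf_extended_name_*
API. All names below are hypothetical. Since an array element itself cannot
be NULL, the sketch assumes the sentinel is an extra trailing element flagged
by an is_sentinel field:

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Stand-in for one notification; the real struct holds an SaNameT compName.
struct Notification {
  char *long_dn;     // heap-allocated long DN, or nullptr for a short DN
  bool is_sentinel;  // hypothetical marker for the terminating element
};

// Helper for the example: heap-copy a DN string.
char *dup_dn(const char *s) {
  size_t len = std::strlen(s) + 1;
  char *copy = static_cast<char *>(std::malloc(len));
  if (copy != nullptr) std::memcpy(copy, s, len);
  return copy;
}

// Allocate `count` usable elements plus one trailing sentinel element.
Notification *alloc_notifications(size_t count) {
  auto *arr = static_cast<Notification *>(
      std::malloc((count + 1) * sizeof(Notification)));
  if (arr == nullptr) return nullptr;
  for (size_t i = 0; i <= count; ++i) {
    arr[i].long_dn = nullptr;
    arr[i].is_sentinel = (i == count);
  }
  return arr;
}

// Free every long DN up to the sentinel, then the array itself.
// Returns the number of long DNs released (for illustration only).
size_t free_notifications(Notification *arr) {
  size_t freed = 0;
  if (arr == nullptr) return freed;
  for (Notification *n = arr; !n->is_sentinel; ++n) {
    if (n->long_dn != nullptr) {
      std::free(n->long_dn);
      ++freed;
    }
  }
  std::free(arr);
  return freed;
}
```

With this layout the Free() call needs only the pointer, at the cost of one
extra element per allocation.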

Do you think it could work?

Thanks,
Minh

On 19/08/16 21:13, Long Nguyen wrote:
> Hi Praveen,
>
> Please see my answers marked with [Long].
>
> Best regards,
> Long Nguyen.
>
> On 8/19/2016 6:01 PM, praveen malviya wrote:
>> Hi Long,
>>
>> I see one problem if B.01.01 application frees the memory in pg 
>> tracking callback.
>> Please see inline.
>>
>> Thanks,
>> Praveen
>> On 19-Aug-16 12:00 PM, Long HB Nguyen wrote:
>>>  osaf/libs/agents/saf/amfa/amf_agent.cc | 1 +
>>>  osaf/libs/agents/saf/amfa/ava_hdl.cc   |  2 --
>>>  2 files changed, 1 insertions(+), 2 deletions(-)
>>>
>>>
>>> diff --git a/osaf/libs/agents/saf/amfa/amf_agent.cc 
>>> b/osaf/libs/agents/saf/amfa/amf_agent.cc
>>> --- a/osaf/libs/agents/saf/amfa/amf_agent.cc
>>> +++ b/osaf/libs/agents/saf/amfa/amf_agent.cc
>>> @@ -2450,6 +2450,7 @@ SaAisErrorT AmfAgent::ProtectionGroupTra
>>>ava_cpy_protection_group_ntf(buf->notification, 
>>> rsp_buf->notification,
>>> buf->numberOfItems, 
>>> SA_AMF_HARS_READY_FOR_ASSIGNMENT);
>>>rc = SA_AIS_ERR_NO_SPACE;
>>> +  buf->numberOfItems = rsp_buf->numberOfItems;
>>>  }
>>>} else {/* if(create_memory == false) */
>>>
>>> diff --git a/osaf/libs/agents/saf/amfa/ava_hdl.cc 
>>> b/osaf/libs/agents/saf/amfa/ava_hdl.cc
>>> --- a/osaf/libs/agents/saf/amfa/ava_hdl.cc
>>> +++ b/osaf/libs/agents/saf/amfa/ava_hdl.cc
>>> @@ -697,7 +697,6 @@ uint32_t ava_hdl_cbk_rec_prc(AVSV_AMF_CB
>>> ((SaAmfCallbacksT_4*)reg_cbk)->saAmfProtectionGroupTrackCallback(&pg_track->csi_name,
>>>  
>>>
>>>  &buf,
>>> pg_track->mem_num, pg_track->err);
>>> -free(buf.notification);
>>>  } else {
>>>  pg_track->err = SA_AIS_ERR_NO_MEMORY;
>>>  LOG_CR("Notification is NULL: Invoking 
>>> PGTrack Callback with error SA_AIS_ERR_NO_MEMORY");
>>> @@ -740,7 +739,6 @@ uint32_t ava_hdl_cbk_rec_prc(AVSV_AMF_CB
>>>  ((SaAmfCallbacksT 
>>> *)reg_cbk)->saAmfProtectionGroupTrackCallback(&pg_track->csi_name,
>>> &buf,
>>> pg_track->mem_num, pg_track->err);
>>> -free(buf.notification);
>> For the B.04.01 API, saAmfProtectionGroupNotificationFree_4() takes
>> care of freeing any extended name. For
>> saAmfProtectionGroupNotificationFree(), it is the application's
>> responsibility to free the memory. But how will it free any extended
>> name?
>> I think there is no API equivalent to osaf_extended_name_free() for
>> applications. Is there any way?
>> [Long] I think we can do something like this in applications:
>> if (strlen(saAisNameBorrow(buff.notification[i].member.comp_name)) > 
>> SA_MAX_UNEXTENDED_NAME_LENGTH)
>> free(saAisNameBorrow(buff.notification[i].member.comp_name));
>>
>> Otherwise we can document that:
>> - if compName is not a long DN in the notification buffer, then the
>> application has to free the memory. This will provide backward
>> compatibility and spec compliance.
>>
>> - if compName is a long DN in the notification, then the application should
>> not free the memory. The agent will free the memory after the callback has
>> completed. So any B.01.01 application adapting to long DN will take
>> care of this when modifying the application. In that case we need to
>> do something like this:
>> for i in notificationBuffer->notification[i]
>>   if 
>> (osaf_is_an_extended_name(notificationBuffer->notification[i].member.compName)){
>> longdn_found = true;
>> osaf_extended_name_free();
>> }
>> }
>> if(longdn_found)
>> free(buf.notification)
>>
>>
>> Also there is a Todo in amf_agent.cc " // TODO (minhchau): memleak if 
>> notification is an array".
>> [Long] Thanks, I will check it.
>>
>> Thanks,
>> Praveen
>>
>>>  } else {
>>>  pg_track->err = SA_AIS_ERR_NO_MEMORY;
>>>

Re: [devel] [PATCH 1 of 1] amfa: fixed freeing notification buff [#1642]

2016-08-21 Thread minh chau
Hi Praveen,

The problem with B.04.01 is that the API
saAmfProtectionGroupNotificationFree_4(SaAmfHandleT hdl,
SaAmfProtectionGroupNotificationT_4 *notification) does not take
numberOfItems.
The agent does not know how many elements are in *notification, and each
element can hide a longDn inside it.

Thanks,
Minh


On 22/08/16 15:04, praveen malviya wrote:
> Hi Minh,
>
> SaAmfProtectionGroupNotificationBufferT_4 contains numberOfItems to
> iterate over. In the case of B.04.01, it should be simple, as the agent can
> call osaf_extended_name_free() directly during iteration inside
> saAmfProtectionGroupNotificationFree_4(). So I think only a for loop
> that iterates over numberOfItems is required.
>
> Problem was in B.01.01 case, where application will have to iterate 
> and free the memory. For this, Long has already suggested and that 
> needs to be documented.
>
>
> Thanks,
> Praveen
>
>
> On 20-Aug-16 2:22 PM, minh chau wrote:
>> Hi Long, Praveen,
>>
>> Regarding this TODO
>> +  if(notification) {
>> +// TODO (minhchau): memleak if notification is an array
>> + osaf_extended_name_free(&notification->member.compName);
>>  free(notification);
>> +  }
>>
>> Client currently uses saAmfProtectionGroupNotificationFree_4(handle,
>> buff->notification) to free the notification in buffer.
>> If @buff->notification is a list of shortDn only, that should work as
>> before, as agent will call this inside
>> saAmfProtectionGroupNotificationFree_4
>>
>> /* free memory */
>> if(notification)
>> free(notification);
>>
>> It will cause memory leak if @buff->notification contains a list of
>> longDN notifications.
>> The leak is longDn of compName in each notification after the first
>> one in the array @buff->notification.
>>
>> Agent can add a sentinel element when agent allocates
>> @buff->notification, set this last element as NULL
>> In Free() API, agent could iterate and free longDn in each element of
>> array @buff->notification until agent reaches NULL element.
>>
>> Do you think it could work?
>
>>
>> Thanks,
>> Minh
>>
>> On 19/08/16 21:13, Long Nguyen wrote:
>>> Hi Praveen,
>>>
>>> Please see my answers marked with [Long].
>>>
>>> Best regards,
>>> Long Nguyen.
>>>
>>> On 8/19/2016 6:01 PM, praveen malviya wrote:
>>>> Hi Long,
>>>>
>>>> I see one problem if B.01.01 application frees the memory in pg
>>>> tracking callback.
>>>> Please see inline.
>>>>
>>>> Thanks,
>>>> Praveen
>>>> On 19-Aug-16 12:00 PM, Long HB Nguyen wrote:
>>>>>  osaf/libs/agents/saf/amfa/amf_agent.cc | 1 +
>>>>>  osaf/libs/agents/saf/amfa/ava_hdl.cc   |  2 --
>>>>>  2 files changed, 1 insertions(+), 2 deletions(-)
>>>>>
>>>>>
>>>>> diff --git a/osaf/libs/agents/saf/amfa/amf_agent.cc
>>>>> b/osaf/libs/agents/saf/amfa/amf_agent.cc
>>>>> --- a/osaf/libs/agents/saf/amfa/amf_agent.cc
>>>>> +++ b/osaf/libs/agents/saf/amfa/amf_agent.cc
>>>>> @@ -2450,6 +2450,7 @@ SaAisErrorT AmfAgent::ProtectionGroupTra
>>>>> ava_cpy_protection_group_ntf(buf->notification,
>>>>> rsp_buf->notification,
>>>>> buf->numberOfItems,
>>>>> SA_AMF_HARS_READY_FOR_ASSIGNMENT);
>>>>>rc = SA_AIS_ERR_NO_SPACE;
>>>>> +  buf->numberOfItems = rsp_buf->numberOfItems;
>>>>>  }
>>>>>} else {/* if(create_memory == false) */
>>>>>
>>>>> diff --git a/osaf/libs/agents/saf/amfa/ava_hdl.cc
>>>>> b/osaf/libs/agents/saf/amfa/ava_hdl.cc
>>>>> --- a/osaf/libs/agents/saf/amfa/ava_hdl.cc
>>>>> +++ b/osaf/libs/agents/saf/amfa/ava_hdl.cc
>>>>> @@ -697,7 +697,6 @@ uint32_t ava_hdl_cbk_rec_prc(AVSV_AMF_CB
>>>>> ((SaAmfCallbacksT_4*)reg_cbk)->saAmfProtectionGroupTrackCallback(&pg_track->csi_name,
>>>>>  
>>>>>
>>>>>
>>>>> &buf,
>>>>> pg_track->mem_num, pg_track->err);
>>>>> -free(buf.notification);
>>>>>  } else {
>>>>>  pg_track->err = SA_AIS_ERR_NO_MEMORY;
>>>>>  LOG_CR("Notification is NULL: Invoking
>>>>> PGTrack Callback with er

Re: [devel] [PATCH 1 of 1] amfa: fixed freeing notification buff [#1642]

2016-08-21 Thread minh chau
Hi Praveen,

The case you just mentioned is still in the callback context, so the agent can
help the application release the allocated notification. But there is still
another case:

+SaAmfProtectionGroupNotificationBufferT buff;
+buff.notification = NULL;
+rc = saAmfProtectionGroupTrack_4(my_amf_hdl, &track_csi, 
SA_TRACK_CURRENT, &buff);
+if (rc != SA_AIS_OK) {
+syslog(LOG_ERR, "saAmfProtectionGroupTrack FAILED - %u", rc);
+goto done;
+}

In this case the agent has to allocate the notification, but the free 
does not happen in the agent's context.
The application has to call the API Free_4(buff.notification) to free 
the notification.
In order to iterate and free the longDn(s) inside Free_4(), the agent 
either has to remember numberOfItems for every Track_4() call like the 
one above, or it can add a sentinel element to the allocated notification.
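
A minimal sketch of the sentinel option, assuming a simplified stand-in 
for the notification type. The names pg_notification_t, pg_alloc() and 
pg_free() are hypothetical and only illustrate the idea; they are not the 
agent's actual code:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one protection group notification: only the
 * heap-allocated long DN that can leak is modelled here. */
typedef struct {
    char *comp_name;   /* NULL marks the sentinel (last) element */
} pg_notification_t;

/* Free every name up to the sentinel, then the array itself.
 * Returns the number of names that were released. */
size_t pg_free(pg_notification_t *arr)
{
    size_t freed = 0;
    if (arr == NULL)
        return 0;
    for (pg_notification_t *p = arr; p->comp_name != NULL; p++) {
        free(p->comp_name);
        freed++;
    }
    free(arr);
    return freed;
}

/* Allocate n notifications plus one zeroed sentinel element at the end. */
pg_notification_t *pg_alloc(const char *const *names, size_t n)
{
    assert(names != NULL || n == 0);
    pg_notification_t *arr = calloc(n + 1, sizeof(*arr));
    if (arr == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(names[i]) + 1;
        arr[i].comp_name = malloc(len);
        if (arr[i].comp_name == NULL) {
            /* Element i is still NULL, so it acts as the sentinel. */
            pg_free(arr);
            return NULL;
        }
        memcpy(arr[i].comp_name, names[i], len);
    }
    return arr;   /* arr[n].comp_name == NULL thanks to calloc */
}
```

With this layout the free function needs no item count, but the sentinel 
must be a value that a real element can never have, and the array costs 
one extra element per allocation.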

Thanks,
Minh

On 22/08/16 15:34, praveen malviya wrote:
> Hi,
> The callback looks like this:
> typedef void
> (*SaAmfProtectionGroupTrackCallbackT_4)(
> const SaNameT *csiName,
> SaAmfProtectionGroupNotificationBufferT_4 *notificationBuffer,
> SaUint32T numberOfMembers,
> SaAisErrorT error);
>
> Inside this callback, application is supposed to call 
> saAmfProtectionGroupNotificationFree_4(). So agent must be able to 
> deduce this information as SaAmfProtectionGroupNotificationBufferT_4 
> contains numberOfItems and also numberOfMembers is available from 
> callback.
> Since B.04.01 APIs are not fully implemented, agent copies from old 
> type of structure to new type in ava_cpy_protection_group_ntf().
>
>
> Thanks,
> Praveen
>
> On 22-Aug-16 10:51 AM, minh chau wrote:
>> Hi Praveen,
>>
>> The problem with B.04.01 is the API:
>> saAmfProtectionGroupNotificationFree_4(SaAmfHandleT hdl,
>> SaAmfProtectionGroupNotificationT_4 *notification) does not have
>> numberOfItems.
>> Agent does not know how many elements are in *notification, and each
>> element can hide a longDn inside it.
>>
>> Thanks,
>> Minh
>>
>>
>> On 22/08/16 15:04, praveen malviya wrote:
>>> Hi Minh,
>>>
>>> SaAmfProtectionGroupNotificationBufferT_4() contains numberOfItems to
>>> iterate over. In case of B.04.01, it should be simple as agent can
>>> call osaf_extended_name_free() directly during iteration inside
>>> saAmfProtectionGroupNotificationFree_4(). So I think, only a for loop
>>> which will iterate over numberOfItems is required.
>>>
>>> Problem was in B.01.01 case, where application will have to iterate
>>> and free the memory. For this, Long has already suggested and that
>>> needs to be documented.
>>>
>>>
>>> Thanks,
>>> Praveen
>>>
>>>
>>> On 20-Aug-16 2:22 PM, minh chau wrote:
>>>> Hi Long, Praveen,
>>>>
>>>> Regarding this TODO
>>>> +  if(notification) {
>>>> +// TODO (minhchau): memleak if notification is an array
>>>> + osaf_extended_name_free(&notification->member.compName);
>>>>  free(notification);
>>>> +  }
>>>>
>>>> Client currently uses saAmfProtectionGroupNotificationFree_4(handle,
>>>> buff->notification) to free the notification in buffer.
>>>> If @buff->notification is a list of shortDn only, that should work as
>>>> before, as agent will call this inside
>>>> saAmfProtectionGroupNotificationFree_4
>>>>
>>>> /* free memory */
>>>> if(notification)
>>>> free(notification);
>>>>
>>>> It will cause memory leak if @buff->notification contains a list of
>>>> longDN notifications.
>>>> The leak is the longDn of compName in each notification after the
>>>> first one in the array @buff->notification.
>>>>
>>>> Agent can add a sentinel element when agent allocates
>>>> @buff->notification, set this last element as NULL
>>>> In Free() API, agent could iterate and free longDn in each element of
>>>> array @buff->notification until agent reaches NULL element.
>>>>
>>>> Do you think it could work?
>>>
>>>>
>>>> Thanks,
>>>> Minh
>>>>
>>>> On 19/08/16 21:13, Long Nguyen wrote:
>>>>> Hi Praveen,
>>>>>
>>>>> Please see my answers marked with [Long].
>>>>>
>>>>> Best regards,
>>>>> Long Nguyen.
>>>>>
>>>>> On 8/19/2016 6:01 PM, praveen malviya wrote:
>>

Re: [devel] [PATCH 1 of 1] ntfsv: refactor logging long dn notification [#1585]

2016-08-22 Thread minh chau
Hi Vu,

Ack from me (tested)

Thanks,
Minh

On 16/08/16 13:30, Vu Minh Nguyen wrote:
> Hi all,
>
> Do you have any comments on the updated patch (update test code)? Thanks.
>
> Regards, Vu
>
>> -Original Message-
>> From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au]
>> Sent: Thursday, August 11, 2016 3:33 PM
>> To: praveen malviya ; Lennart Lund
>> ; Minh Hon Chau
>> 
>> Cc: opensaf-devel@lists.sourceforge.net
>> Subject: [devel] [PATCH 1 of 1] ntfsv: refactor logging long dn
> notification
>> [#1585]
>>
>>   osaf/services/saf/ntfsv/ntfs/NtfLogger.cc   |   51 +-
>>   tests/ntfsv/tet_longDnObject_notification.c |  188
>> +++-
>>   2 files changed, 196 insertions(+), 43 deletions(-)
>>
>>
>> Remove the part of code that truncates the long DN.
>> And update the long DN test suite (#36) to make sure full record logged.
>>
>> diff --git a/osaf/services/saf/ntfsv/ntfs/NtfLogger.cc
>> b/osaf/services/saf/ntfsv/ntfs/NtfLogger.cc
>> --- a/osaf/services/saf/ntfsv/ntfs/NtfLogger.cc
>> +++ b/osaf/services/saf/ntfsv/ntfs/NtfLogger.cc
>> @@ -21,6 +21,7 @@
>>*/
>>   #include 
>>
>> +#include "osaf_utility.h"
>>   #include "saAis.h"
>>   #include "saLog.h"
>>   #include "NtfAdmin.hh"
>> @@ -232,48 +233,22 @@ SaAisErrorT NtfLogger::logNotification(N
>>  notif->getNotificationId(),
>>  SA_LOG_RECORD_WRITE_ACK,
>>  &logRecord);
>> -if (SA_AIS_OK != errorCode) {
>> -  LOG_NO("Failed to log an alarm or security alarm notification
> (%d)",
>> errorCode);
>> -  if (errorCode == SA_AIS_ERR_LIBRARY || errorCode ==
>> SA_AIS_ERR_BAD_HANDLE) {
>> -LOG_ER("Fatal error SA_AIS_ERR_LIBRARY or
>> SA_AIS_ERR_BAD_HANDLE; exiting (%d)...", errorCode);
>> -exit(EXIT_FAILURE);
>> -  } else if (errorCode == SA_AIS_ERR_INVALID_PARAM) {
>> -/* Retry to log truncated notificationObject/notifyingObject
> because
>> - * LOG Service has not supported long dn in Opensaf 4.5
>> - */
>> -char short_dn[SA_MAX_UNEXTENDED_NAME_LENGTH];
>> -memset(&short_dn, 0, SA_MAX_UNEXTENDED_NAME_LENGTH);
>> -SaNameT shortdn_notificationObject, shortdn_notifyingObject;
>> -if (osaf_is_an_extended_name(ntfHeader->notificationObject)) {
>> -  strncpy(short_dn, osaf_extended_name_borrow(ntfHeader-
>>> notificationObject)
>> -  , SA_MAX_UNEXTENDED_NAME_LENGTH - 1);
>> -  osaf_extended_name_lend(short_dn, &shortdn_notificationObject);
>> -  logRecord.logHeader.ntfHdr.notificationObject =
>> &shortdn_notificationObject;
>> -}
>> -if (osaf_is_an_extended_name(ntfHeader->notifyingObject)) {
>> -  strncpy(short_dn, osaf_extended_name_borrow(ntfHeader-
>>> notifyingObject)
>> -  , SA_MAX_UNEXTENDED_NAME_LENGTH - 1);
>> -  osaf_extended_name_lend(short_dn, &shortdn_notifyingObject);
>> -  logRecord.logHeader.ntfHdr.notifyingObject =
>> &shortdn_notifyingObject;
>> -}
>> -if (short_dn[0] != '\0') {
>> -  LOG_NO("Retry to log the truncated
>> notificationObject/notifyingObject");
>> -  if ((errorCode = saLogWriteLogAsync(alarmStreamHandle,
>> -  notif->getNotificationId(),
>> -  SA_LOG_RECORD_WRITE_ACK,
>> -  &logRecord)) != SA_AIS_OK)
> {
>> -LOG_ER("Failed to log the truncated
>> notificationObject/notifyingObject (%d)"
>> -   , errorCode);
>> -  }
>> -}
>> -  }
>> -  goto end;
>> +switch (errorCode) {
>> +case SA_AIS_OK:
>> +break;
>> +
>> +/* LOGsv is busy. Put the notification to queue and re-send next time
> */
>> +case SA_AIS_ERR_TRY_AGAIN:
>> +case SA_AIS_ERR_TIMEOUT:
>> +TRACE("Failed to log notification (ret: %d). Try next time.",
>> errorCode);
>> +break;
>> +
>> +default:
>> +osaf_abort(errorCode);
>>   }
>> }
>>
>> -end:
>> TRACE_LEAVE();
>> -
>> return errorCode;
>>   }
>>
>> diff --git a/tests/ntfsv/tet_longDnObject_notification.c
>> b/tests/ntfsv/tet_longDnObject_notification.c
>> --- a/tests/ntfsv/tet_longDnObject_notification.c
>> +++ b/tests/ntfsv/tet_longDnObject_notification.c
>> @@ -19,6 +19,7 @@
>>*/
>>   #include 
>>   #include 
>> +#include 
>>   #include "tet_ntf.h"
>>   #include "tet_ntf_common.h"
>>   //#include "util.h"
>> @@ -57,6 +58,166 @@ static SaNtfSecurityAlarmNotificationT m
>>   extern void saAisNameLend(SaConstStringT value, SaNameT* name);
>>   extern SaConstStringT saAisNameBorrow(const SaNameT* name);
>>
>> +//>
>> +// For backup and restore IMM attribute values.
>> +//<
>> +
>> +#define MAX_DATA 256
>> +typedef struct {
>> +char name[MAX_DATA];
>> +char val[MAX_DATA];
>> +int val_is_num;
>> +} attrinfo_t;
>> +

Re: [devel] [PATCH 1 of 1] amfd: support NplusM model for supported admin ops on NG [#1454]

2016-08-22 Thread minh chau
Hi Praveen,

One comment in line with [Minh]

Thanks
Minh

On 20/07/16 18:57, praveen.malv...@oracle.com wrote:
>   osaf/services/saf/amf/amfd/include/sg.h  |   1 +
>   osaf/services/saf/amf/amfd/nodegroup.cc  |   4 +-
>   osaf/services/saf/amf/amfd/sg_npm_fsm.cc |  62 
> ++-
>   3 files changed, 62 insertions(+), 5 deletions(-)
>
>
> Currently 2N, N-Way Active and NoRed models are supported for lock, shutdown,
> lock-in and unlock-in admin operations on NGs.
>
> This patch supports NplusM model also.
>
> diff --git a/osaf/services/saf/amf/amfd/include/sg.h 
> b/osaf/services/saf/amf/amfd/include/sg.h
> --- a/osaf/services/saf/amf/amfd/include/sg.h
> +++ b/osaf/services/saf/amf/amfd/include/sg.h
> @@ -507,6 +507,7 @@ public:
>   uint32_t susi_failed(AVD_CL_CB *cb, AVD_SU *su,
>   struct avd_su_si_rel_tag *susi, AVSV_SUSI_ACT act, 
> SaAmfHAStateT state);
>   void node_fail_si_oper(AVD_CL_CB *cb, AVD_SU *su);
> + void ng_admin(AVD_SU *su, AVD_AMF_NG *ng);
>   
>   private:
>   uint32_t su_fault_su_oper(AVD_CL_CB *cb, AVD_SU *su);
> diff --git a/osaf/services/saf/amf/amfd/nodegroup.cc 
> b/osaf/services/saf/amf/amfd/nodegroup.cc
> --- a/osaf/services/saf/amf/amfd/nodegroup.cc
> +++ b/osaf/services/saf/amf/amfd/nodegroup.cc
> @@ -687,6 +687,7 @@ void avd_ng_admin_state_set(AVD_AMF_NG*
>   avd_send_admin_state_chg_ntf(&ng->name,
>   (SaAmfNotificationMinorIdT)SA_AMF_NTFID_NG_ADMIN_STATE,
>   old_state, ng->saAmfNGAdminState);
> + TRACE_LEAVE();
>   }
>   /**
>* @brief  Verify if Node is stable for admin operation on Nodegroup etc.
> @@ -749,8 +750,7 @@ static SaAisErrorT check_red_model_servi
>   LOG_NO("service outage for '%s' because of 
> shutdown/lock "
>   "on 
> '%s'",sg->name.value,ng->name.value);
>   
> - if ((sg->sg_redundancy_model == SA_AMF_N_WAY_REDUNDANCY_MODEL) 
> ||
> - (sg->sg_redundancy_model == 
> SA_AMF_NPM_REDUNDANCY_MODEL)) {
> + if (sg->sg_redundancy_model == SA_AMF_N_WAY_REDUNDANCY_MODEL) {
>   LOG_NO("Admin op on '%s'  hosting SUs of '%s' with 
> redundancy '%u' "
>   "is not supported",ng->name.value, 
> sg->name.value,
>   sg->sg_redundancy_model);
> diff --git a/osaf/services/saf/amf/amfd/sg_npm_fsm.cc 
> b/osaf/services/saf/amf/amfd/sg_npm_fsm.cc
> --- a/osaf/services/saf/amf/amfd/sg_npm_fsm.cc
> +++ b/osaf/services/saf/amf/amfd/sg_npm_fsm.cc
> @@ -120,16 +120,16 @@ static AVD_SU_SI_REL *avd_sg_npm_su_othr
>   if (i_susi->si->list_of_sisu != i_susi) {
>   o_susi = i_susi->si->list_of_sisu;
>   if (o_susi->fsm != AVD_SU_SI_STATE_UNASGN)
> - return o_susi;
> + break;
>   } else if (i_susi->si->list_of_sisu->si_next != 
> AVD_SU_SI_REL_NULL) {
>   o_susi = i_susi->si->list_of_sisu->si_next;
>   if (o_susi->fsm != AVD_SU_SI_STATE_UNASGN)
> - return o_susi;
> + break;
>   }
>   
>   i_susi = i_susi->su_next;
>   }
> -
> + TRACE_LEAVE2("o_susi:'%p'",o_susi);
>   return o_susi;
>   }
>   
> @@ -4452,6 +4452,62 @@ uint32_t SG_NPM::sg_admin_down(AVD_CL_CB
>   return NCSCC_RC_SUCCESS;
>   }
>   
> +/*
> + * @brief  Handles modification of assignments in SU of NpM SG
> + * because of lock or shutdown operation on Node group.
> + * If SU does not have any SIs assigned to it, AMF will try
> + * to instantiate new SUs in the SG. If SU has assignments,
> + * then depending upon lock or shutdown operation, quiesced
> + * or quiescing state will be sent for active SIs in SU.
> + *  If SU has only standby assignments then remove the assignments.
> + *
> + * @param[in]  ptr to SU
> + * @param[in]  ptr to nodegroup AVD_AMF_NG.
> + */
> +void SG_NPM::ng_admin(AVD_SU *su, AVD_AMF_NG *ng)
> +{
> +  SaAmfHAStateT ha_state;
> +
> +  TRACE_ENTER2("'%s', sg_fsm_state:%u",su->name.value,
> +su->sg_of_su->sg_fsm_state);
> +
> +  if (su->list_of_susi == nullptr) {
> +avd_sg_app_su_inst_func(avd_cb, su->sg_of_su);
> +return;
> +  }
> +
> +  if (ng->saAmfNGAdminState == SA_AMF_ADMIN_SHUTTING_DOWN)
> +ha_state = SA_AMF_HA_QUIESCING;
> +  else
> +ha_state = SA_AMF_HA_QUIESCED;
> +
> +  if (su->list_of_susi->state == SA_AMF_HA_ACTIVE) {
> +if (avd_sg_su_si_mod_snd(avd_cb, su, ha_state) == NCSCC_RC_FAILURE) {
> +  LOG_ER("quiescing/quiesced state transtion failed for 
> '%s'",su->name.value);
> +  goto done;
> +}
> +  } else {
> +if (avd_sg_su_si_del_snd(avd_cb, su) == NCSCC_RC_FAILURE) {
> +  LOG_ER("removal of standby assignment fai

Re: [devel] [PATCH 1 of 1] amfd: support NplusM model for supported admin ops on NG [#1454]

2016-08-22 Thread minh chau
Hi Praveen,

Since AMF longDn has been pushed, can you please attach a longDn rebased 
version to the ticket (both #1454 + #1608) so we can do some testing?

Thanks,
Minh

On 23/08/16 15:56, praveen malviya wrote:
> Hi Minh,
>
> Thanks for reviewing the patch.
> Please see inline with [Praveen]
>
> Thanks,
> Praveen
>
>
>
> On 23-Aug-16 5:53 AM, minh chau wrote:
>> Hi Praveen,
>>
>> One comment in line with [Minh]
>>
>> Thanks
>> Minh
>>
>> On 20/07/16 18:57, praveen.malv...@oracle.com wrote:
>>> osaf/services/saf/amf/amfd/include/sg.h  |   1 +
>>>   osaf/services/saf/amf/amfd/nodegroup.cc  |   4 +-
>>>   osaf/services/saf/amf/amfd/sg_npm_fsm.cc |  62
>>> ++-
>>>   3 files changed, 62 insertions(+), 5 deletions(-)
>>>
>>>
>>> Currently 2N, N-Way Active and NoRed models are supported for lock,
>>> shutdown,
>>> lock-in and unlock-in admin operations on NGs.
>>>
>>> This patch supports NplusM model also.
>>>
>>> diff --git a/osaf/services/saf/amf/amfd/include/sg.h
>>> b/osaf/services/saf/amf/amfd/include/sg.h
>>> --- a/osaf/services/saf/amf/amfd/include/sg.h
>>> +++ b/osaf/services/saf/amf/amfd/include/sg.h
>>> @@ -507,6 +507,7 @@ public:
>>>   uint32_t susi_failed(AVD_CL_CB *cb, AVD_SU *su,
>>>   struct avd_su_si_rel_tag *susi, AVSV_SUSI_ACT act,
>>> SaAmfHAStateT state);
>>>   void node_fail_si_oper(AVD_CL_CB *cb, AVD_SU *su);
>>> +void ng_admin(AVD_SU *su, AVD_AMF_NG *ng);
>>> private:
>>>   uint32_t su_fault_su_oper(AVD_CL_CB *cb, AVD_SU *su);
>>> diff --git a/osaf/services/saf/amf/amfd/nodegroup.cc
>>> b/osaf/services/saf/amf/amfd/nodegroup.cc
>>> --- a/osaf/services/saf/amf/amfd/nodegroup.cc
>>> +++ b/osaf/services/saf/amf/amfd/nodegroup.cc
>>> @@ -687,6 +687,7 @@ void avd_ng_admin_state_set(AVD_AMF_NG*
>>>   avd_send_admin_state_chg_ntf(&ng->name,
>>> (SaAmfNotificationMinorIdT)SA_AMF_NTFID_NG_ADMIN_STATE,
>>>   old_state, ng->saAmfNGAdminState);
>>> +TRACE_LEAVE();
>>>   }
>>>   /**
>>>* @brief  Verify if Node is stable for admin operation on Nodegroup
>>> etc.
>>> @@ -749,8 +750,7 @@ static SaAisErrorT check_red_model_servi
>>>   LOG_NO("service outage for '%s' because of 
>>> shutdown/lock "
>>>   "on '%s'",sg->name.value,ng->name.value);
>>>   -if ((sg->sg_redundancy_model ==
>>> SA_AMF_N_WAY_REDUNDANCY_MODEL) ||
>>> -(sg->sg_redundancy_model ==
>>> SA_AMF_NPM_REDUNDANCY_MODEL)) {
>>> +if (sg->sg_redundancy_model == 
>>> SA_AMF_N_WAY_REDUNDANCY_MODEL) {
>>>   LOG_NO("Admin op on '%s'  hosting SUs of '%s' with
>>> redundancy '%u' "
>>>   "is not supported",ng->name.value, 
>>> sg->name.value,
>>>   sg->sg_redundancy_model);
>>> diff --git a/osaf/services/saf/amf/amfd/sg_npm_fsm.cc
>>> b/osaf/services/saf/amf/amfd/sg_npm_fsm.cc
>>> --- a/osaf/services/saf/amf/amfd/sg_npm_fsm.cc
>>> +++ b/osaf/services/saf/amf/amfd/sg_npm_fsm.cc
>>> @@ -120,16 +120,16 @@ static AVD_SU_SI_REL *avd_sg_npm_su_othr
>>>   if (i_susi->si->list_of_sisu != i_susi) {
>>>   o_susi = i_susi->si->list_of_sisu;
>>>   if (o_susi->fsm != AVD_SU_SI_STATE_UNASGN)
>>> -return o_susi;
>>> +break;
>>>   } else if (i_susi->si->list_of_sisu->si_next !=
>>> AVD_SU_SI_REL_NULL) {
>>>   o_susi = i_susi->si->list_of_sisu->si_next;
>>>   if (o_susi->fsm != AVD_SU_SI_STATE_UNASGN)
>>> -return o_susi;
>>> +break;
>>>   }
>>> i_susi = i_susi->su_next;
>>>   }
>>> -
>>> +TRACE_LEAVE2("o_susi:'%p'",o_susi);
>>>   return o_susi;
>>>   }
>>>   @@ -4452,6 +4452,62 @@ uint32_t SG_NPM::sg_admin_down(AVD_CL_CB
>>>   return NCSCC_RC_SUCCESS;
>>>   }
>>>   +/*
>>> + * @brief  Handles modification of assignments in SU of NpM SG
>>> + * because of lock or shutdown operation on Node group.

Re: [devel] [PATCH 1 of 1] amfa: fixed freeing notification buff [#1642]

2016-08-23 Thread minh chau
Hi Praveen,

Agreed, this discussion should be continued; currently it causes a leak.
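
For reference, the other option raised in this thread (the agent 
remembering the item count for each allocated notification array) could 
look roughly like the sketch below. All names (tracked_ntf_t, 
tracked_alloc(), tracked_free()) are hypothetical, the notification type 
is reduced to the leaking field, and locking is left out:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for one notification; comp_name models a
 * heap-allocated long DN. */
typedef struct {
    char *comp_name;
} tracked_ntf_t;

/* One bookkeeping record per array handed out to the application. */
typedef struct tracked_rec {
    tracked_ntf_t *arr;
    size_t count;
    struct tracked_rec *next;
} tracked_rec_t;

static tracked_rec_t *g_tracked = NULL;

/* Allocate an array of n zeroed notifications and remember its size. */
tracked_ntf_t *tracked_alloc(size_t n)
{
    tracked_ntf_t *arr = calloc(n ? n : 1, sizeof(*arr));
    tracked_rec_t *rec = malloc(sizeof(*rec));
    if (arr == NULL || rec == NULL) {
        free(arr);
        free(rec);
        return NULL;
    }
    rec->arr = arr;
    rec->count = n;
    rec->next = g_tracked;
    g_tracked = rec;
    return arr;
}

/* Free all names and the array. Returns the number of items that were
 * tracked for arr, or -1 if the pointer was not allocated here. */
long tracked_free(tracked_ntf_t *arr)
{
    for (tracked_rec_t **pp = &g_tracked; *pp != NULL; pp = &(*pp)->next) {
        tracked_rec_t *rec = *pp;
        if (rec->arr != arr)
            continue;
        for (size_t i = 0; i < rec->count; i++)
            free(rec->arr[i].comp_name);   /* free(NULL) is a no-op */
        long n = (long)rec->count;
        *pp = rec->next;                   /* unlink the record */
        free(rec->arr);
        free(rec);
        return n;
    }
    return -1;
}
```

Compared with a sentinel element, this keeps the array layout unchanged 
but adds a lookup on every free, and a real agent would also have to 
protect the list against concurrent API calls.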

Thanks,
Minh

On 23/08/16 16:57, praveen malviya wrote:
> Hi Minh,
>
> I am going through the agent code to see if something can be done. I 
> think, since B.04.01 APIs are not implemented this problem is coming.
> But still all longdn patches can be pushed and this discussion can 
> continue.
>
> What do you think?
>
>
> Thanks,
> praveen
>
>
>
> On 22-Aug-16 12:08 PM, minh chau wrote:
>> Hi Praveen,
>>
>> The case you just mentioned is still in callback context, so Agent can
>> help application to release the allocated notification. But still
>> another case:
>>
>> +SaAmfProtectionGroupNotificationBufferT buff;
>> +buff.notification = NULL;
>> +rc = saAmfProtectionGroupTrack_4(my_amf_hdl, &track_csi,
>> SA_TRACK_CURRENT, &buff);
>> +if (rc != SA_AIS_OK) {
>> +syslog(LOG_ERR, "saAmfProtectionGroupTrack FAILED - %u", rc);
>> +goto done;
>> +}
>>
>> In this case Agent has to allocate notification but it's not in Agent's
>> context.
>> Application has to call API Free_4(buff.notification) to free up
>> notification.
>> In order to iterate to free longDn(s) inside Free_4(), Agent has to
>> memorize a list numberOfItems for every single call as above Track_4(),
>> or Agent can add sentinel element to the allocated notification.
>>
>> Thanks,
>> Minh
>>
>> On 22/08/16 15:34, praveen malviya wrote:
>>> Hi,
>>> The callback looks like this:
>>> typedef void
>>> (*SaAmfProtectionGroupTrackCallbackT_4)(
>>> const SaNameT *csiName,
>>> SaAmfProtectionGroupNotificationBufferT_4 *notificationBuffer,
>>> SaUint32T numberOfMembers,
>>> SaAisErrorT error);
>>>
>>> Inside this callback, application is supposed to call
>>> saAmfProtectionGroupNotificationFree_4(). So agent must be able to
>>> deduce this information as SaAmfProtectionGroupNotificationBufferT_4
>>> contains numberOfItems and also numberOfMembers is available from
>>> callback.
>>> Since B.04.01 APIs are not fully implemented, agent copies from old
>>> type of structure to new type in ava_cpy_protection_group_ntf().
>>>
>>>
>>> Thanks,
>>> Praveen
>>>
>>> On 22-Aug-16 10:51 AM, minh chau wrote:
>>>> Hi Praveen,
>>>>
>>>> The problem with B.04.01 is the API:
>>>> saAmfProtectionGroupNotificationFree_4(SaAmfHandleT hdl,
>>>> SaAmfProtectionGroupNotificationT_4 *notification) does not have
>>>> numberOfItems.
>>>> Agent does not know how many elements are in *notification, and each
>>>> element can hide a longDn inside it.
>>>>
>>>> Thanks,
>>>> Minh
>>>>
>>>>
>>>> On 22/08/16 15:04, praveen malviya wrote:
>>>>> Hi Minh,
>>>>>
>>>>> SaAmfProtectionGroupNotificationBufferT_4() contains numberOfItems to
>>>>> iterate over. In case of B.04.01, it should be simple as agent can
>>>>> call osaf_extended_name_free() directly during iteration inside
>>>>> saAmfProtectionGroupNotificationFree_4(). So I think, only a for loop
>>>>> which will iterate over numberOfItems is required.
>>>>>
>>>>> Problem was in B.01.01 case, where application will have to iterate
>>>>> and free the memory. For this, Long has already suggested and that
>>>>> needs to be documented.
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Praveen
>>>>>
>>>>>
>>>>> On 20-Aug-16 2:22 PM, minh chau wrote:
>>>>>> Hi Long, Praveen,
>>>>>>
>>>>>> Regarding this TODO
>>>>>> +  if(notification) {
>>>>>> +// TODO (minhchau): memleak if notification is an array
>>>>>> + osaf_extended_name_free(&notification->member.compName);
>>>>>>  free(notification);
>>>>>> +  }
>>>>>>
>>>>>> Client currently uses saAmfProtectionGroupNotificationFree_4(handle,
>>>>>> buff->notification) to free the notification in buffer.
>>>>>> If @buff->notification is a list of shortDn only, that should 
>>>>>> work as
>>>>>> before, as agent will call this inside
>>>>>> saAmfProt

Re: [devel] [PATCH 2 of 2] AMFND: Admin operation continuation if csi callback completes during headless [#1725 part 1] V1

2016-08-23 Thread minh chau
Hi Nagu,

Please use the patches attached to the ticket for testing; there are many 
changes compared with the version you are testing.
https://sourceforge.net/p/opensaf/tickets/_discuss/thread/7b203666/ad7f/attachment/1725_phase_1.tgz
It's a rebased longDN version, so you will not have to test again.

I just tested the rebased longDN version, it works for me.

2016-08-23 20:19:25 PL-3 osafimmnd[400]: NO Implementer connected: 20 
(safSmfService) <0, 2010f>
*2016-08-23 20:19:26 PL-3 osafamfnd[421]: NO Found and resend buffered 
su_si_assign msg for SU:'safSu=SU1,safSg=AmfDemo_2N,safApp=AmfDemo1', 
SI:'', ha_state:'3', msg_act:'5', single_csi:'0', error:'1', msg_id:'2'*
2016-08-23 20:19:26 PL-3 osafamfnd[421]: NO Removing 
'safSi=AmfDemo1,safApp=AmfDemo1' from 
'safSu=SU1,safSg=AmfDemo_2N,safApp=AmfDemo1'

2016-08-23 20:18:53 PL-4 osafamfnd[421]: NO Assigning 
'safSi=AmfDemo1,safApp=AmfDemo1' STANDBY to 
'safSu=SU2,safSg=AmfDemo_2N,safApp=AmfDemo1'
2016-08-23 20:18:53 PL-4 amf_demo[568]: CSI Set - add 
'safCsi=AmfDemo,safSi=AmfDemo1,safApp=AmfDemo1' HAState Standby
2016-08-23 20:18:53 PL-4 osafamfnd[421]: NO Assigned 
'safSi=AmfDemo1,safApp=AmfDemo1' STANDBY to 
'safSu=SU2,safSg=AmfDemo_2N,safApp=AmfDemo1'
...
2016-08-23 20:19:24 PL-4 osafamfnd[421]: NO Sending node up due to 
NCSMDS_NEW_ACTIVE
2016-08-23 20:19:24 PL-4 osafamfnd[421]: NO 2 SISU states sent
2016-08-23 20:19:24 PL-4 osafamfnd[421]: NO 3 SU states sent
2016-08-23 20:19:24 PL-4 osafamfnd[421]: NO 6 CSICOMP states synced
2016-08-23 20:19:24 PL-4 osafamfnd[421]: NO 7 SU states sent
2016-08-23 20:19:24 PL-4 osafckptnd[441]: NO CLM selection object was 
updated. (11)
2016-08-23 20:19:24 PL-4 osafimmnd[397]: NO Implementer connected: 17 
(safAmfService) <0, 2010f>
2016-08-23 20:19:25 PL-4 osafimmnd[397]: NO Implementer connected: 18 
(safCheckPointService) <0, 2010f>
2016-08-23 20:19:25 PL-4 osafimmnd[397]: NO Implementer disconnected 18 
<0, 2010f> (safCheckPointService)
2016-08-23 20:19:25 PL-4 osafimmnd[397]: NO Implementer connected: 19 
(safCheckPointService) <0, 2010f>
2016-08-23 20:19:25 PL-4 osafimmnd[397]: NO Implementer connected: 20 
(safSmfService) <0, 2010f>
2016-08-23 20:19:26 PL-4 osafamfnd[421]: NO Assigning 
'safSi=AmfDemo1,safApp=AmfDemo1' ACTIVE to 
'safSu=SU2,safSg=AmfDemo_2N,safApp=AmfDemo1'
2016-08-23 20:19:26 PL-4 amf_demo[568]: CSI Set - HAState Active for all 
assigned CSIs
2016-08-23 20:19:26 PL-4 osafamfnd[421]: NO Assigned 
'safSi=AmfDemo1,safApp=AmfDemo1' ACTIVE to 
'safSu=SU2,safSg=AmfDemo_2N,safApp=AmfDemo1'
2016-08-23 20:19:26 PL-4 osafamfnd[421]: NO Assigning 
'safSi=AmfDemo1,safApp=AmfDemo1' STANDBY to 
'safSu=SU3,safSg=AmfDemo_2N,safApp=AmfDemo1'
2016-08-23 20:19:26 PL-4 amf_demo[579]: CSI Set - add 
'safCsi=AmfDemo,safSi=AmfDemo1,safApp=AmfDemo1' HAState Standby
2016-08-23 20:19:26 PL-4 osafamfnd[421]: NO Assigned 
'safSi=AmfDemo1,safApp=AmfDemo1' STANDBY to 
'safSu=SU3,safSg=AmfDemo_2N,safApp=AmfDemo1'

Thanks,
Minh

On 23/08/16 19:48, Nagendra Kumar wrote:
> Please note that it is on change set 7846:31417997c82f  and I have applied 
> patch of ticket #1894.
>
> Thanks
> -Nagu
>> -Original Message-
>> From: Nagendra Kumar
>> Sent: 23 August 2016 15:15
>> To: Minh Hon Chau; hans.nordeb...@ericsson.com; Praveen Malviya;
>> gary@dektech.com.au; long.hb.ngu...@dektech.com.au
>> Cc: opensaf-devel@lists.sourceforge.net
>> Subject: RE: [PATCH 2 of 2] AMFND: Admin operation continuation if csi
>> callback completes during headless [#1725 part 1] V1
>>
>> Hi Minh,
>>  The following SU lock case is not working. This issue will exist for all
>> the flows, so please check.
>>
>> Configuration and traces attached in the ticket.
>>
>> Steps:
>> 1. Start SC-1, SC-2, PL-3 and PL-4. Run the following command:
>> immcfg -f  /tmp/AppConfig-2N-1725.xml
>> amf-adm unlock-in safSu=SU1,safSg=AmfDemo_2N,safApp=AmfDemo1
>> amf-adm unlock-in safSu=SU2,safSg=AmfDemo_2N,safApp=AmfDemo1
>> amf-adm unlock-in safSu=SU3,safSg=AmfDemo_2N,safApp=AmfDemo1
>> amf-adm unlock safSu=SU1,safSg=AmfDemo_2N,safApp=AmfDemo1
>> amf-adm unlock safSu=SU2,safSg=AmfDemo_2N,safApp=AmfDemo1
>> amf-adm unlock safSu=SU3,safSg=AmfDemo_2N,safApp=AmfDemo1
>>
>> Assignments are:
>> PM_SC-1:/home/nagu/views/staging-1725 # /etc/init.d/opensafd  status
>> safSISU=safSu=SC-
>> 1\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed1,safApp=OpenSAF
>>  saAmfSISUHAState=ACTIVE(1)
>> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-
>> 2N,safApp=OpenSAF
>>  saAmfSISUHAState=ACTIVE(1)
>> safSISU=safSu=SC-
>> 2\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed2,safApp=OpenSAF
>>  saAmfSISUHAState=ACTIVE(1)
>> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-
>> 2N,safApp=OpenSAF
>>  saAmfSISUHAState=STANDBY(2)
>> safSISU=safSu=PL-
>> 4\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed3,safApp=OpenSAF
>>  saAmfSISUHAState=ACTIVE(1)
>> safSISU=safSu=PL-
>> 3\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed4,safApp=OpenSAF
>>  saAmfSISUHAState=ACTIVE(1)
>> safS

Re: [devel] [PATCH 1 of 1] amfd: support NplusM model for supported admin ops on NG [#1454]

2016-08-23 Thread minh chau
Hi Praveen,

I will use them for test.

Thanks,
Minh

On 23/08/16 16:58, praveen malviya wrote:
> Hi Minh,
>
> I have attached patches for #1454 and #1608 in the ticket #1454.
> Please apply them in order.
>
> Thanks,
> Praveen
>
> On 23-Aug-16 11:56 AM, minh chau wrote:
>> Hi Praveen,
>>
>> Since AMF longDn has been pushed, can you please attach a longDn rebased
>> version to ticket (both #1454 + #1608) so we can do some test?
>>
>> Thanks,
>> Minh
>>
>> On 23/08/16 15:56, praveen malviya wrote:
>>> Hi Minh,
>>>
>>> Thanks for reviewing the patch.
>>> Please see inline with [Praveen]
>>>
>>> Thanks,
>>> Praveen
>>>
>>>
>>>
>>> On 23-Aug-16 5:53 AM, minh chau wrote:
>>>> Hi Praveen,
>>>>
>>>> One comment in line with [Minh]
>>>>
>>>> Thanks
>>>> Minh
>>>>
>>>> On 20/07/16 18:57, praveen.malv...@oracle.com wrote:
>>>>> osaf/services/saf/amf/amfd/include/sg.h |   1 +
>>>>>   osaf/services/saf/amf/amfd/nodegroup.cc  |   4 +-
>>>>>   osaf/services/saf/amf/amfd/sg_npm_fsm.cc |  62
>>>>> ++-
>>>>>   3 files changed, 62 insertions(+), 5 deletions(-)
>>>>>
>>>>>
>>>>> Currently 2N, N-Way Active and NoRed models are supported for lock,
>>>>> shutdown,
>>>>> lock-in and unlock-in admin operations on NGs.
>>>>>
>>>>> This patch supports NplusM model also.
>>>>>
>>>>> diff --git a/osaf/services/saf/amf/amfd/include/sg.h
>>>>> b/osaf/services/saf/amf/amfd/include/sg.h
>>>>> --- a/osaf/services/saf/amf/amfd/include/sg.h
>>>>> +++ b/osaf/services/saf/amf/amfd/include/sg.h
>>>>> @@ -507,6 +507,7 @@ public:
>>>>>   uint32_t susi_failed(AVD_CL_CB *cb, AVD_SU *su,
>>>>>   struct avd_su_si_rel_tag *susi, AVSV_SUSI_ACT act,
>>>>> SaAmfHAStateT state);
>>>>>   void node_fail_si_oper(AVD_CL_CB *cb, AVD_SU *su);
>>>>> +void ng_admin(AVD_SU *su, AVD_AMF_NG *ng);
>>>>> private:
>>>>>   uint32_t su_fault_su_oper(AVD_CL_CB *cb, AVD_SU *su);
>>>>> diff --git a/osaf/services/saf/amf/amfd/nodegroup.cc
>>>>> b/osaf/services/saf/amf/amfd/nodegroup.cc
>>>>> --- a/osaf/services/saf/amf/amfd/nodegroup.cc
>>>>> +++ b/osaf/services/saf/amf/amfd/nodegroup.cc
>>>>> @@ -687,6 +687,7 @@ void avd_ng_admin_state_set(AVD_AMF_NG*
>>>>>   avd_send_admin_state_chg_ntf(&ng->name,
>>>>> (SaAmfNotificationMinorIdT)SA_AMF_NTFID_NG_ADMIN_STATE,
>>>>>   old_state, ng->saAmfNGAdminState);
>>>>> +TRACE_LEAVE();
>>>>>   }
>>>>>   /**
>>>>>* @brief  Verify if Node is stable for admin operation on 
>>>>> Nodegroup
>>>>> etc.
>>>>> @@ -749,8 +750,7 @@ static SaAisErrorT check_red_model_servi
>>>>>   LOG_NO("service outage for '%s' because of
>>>>> shutdown/lock "
>>>>>   "on '%s'",sg->name.value,ng->name.value);
>>>>>   -if ((sg->sg_redundancy_model ==
>>>>> SA_AMF_N_WAY_REDUNDANCY_MODEL) ||
>>>>> -(sg->sg_redundancy_model ==
>>>>> SA_AMF_NPM_REDUNDANCY_MODEL)) {
>>>>> +if (sg->sg_redundancy_model ==
>>>>> SA_AMF_N_WAY_REDUNDANCY_MODEL) {
>>>>>   LOG_NO("Admin op on '%s'  hosting SUs of '%s' with
>>>>> redundancy '%u' "
>>>>>   "is not supported",ng->name.value,
>>>>> sg->name.value,
>>>>>   sg->sg_redundancy_model);
>>>>> diff --git a/osaf/services/saf/amf/amfd/sg_npm_fsm.cc
>>>>> b/osaf/services/saf/amf/amfd/sg_npm_fsm.cc
>>>>> --- a/osaf/services/saf/amf/amfd/sg_npm_fsm.cc
>>>>> +++ b/osaf/services/saf/amf/amfd/sg_npm_fsm.cc
>>>>> @@ -120,16 +120,16 @@ static AVD_SU_SI_REL *avd_sg_npm_su_othr
>>>>>   if (i_susi->si->list_of_sisu != i_susi) {
>>>>>   o_susi = i_susi->si->list_of_sisu;
>>>>>   if (o

Re: [devel] [PATCH 2 of 2] AMFND: Admin operation continuation if csi callback completes during headless [#1725 part 1] V1

2016-08-23 Thread minh chau
Hi Nagu,

In the trace you provided I see that SU2/SU3 become IN_SERVICE late. If 
there is a delay in PL-4 joining the cluster after headless in your test, 
then you could also see it with the latest patches (longDN rebased version).
I'm looking into this issue.

Thanks.
Minh

On 23/08/16 20:24, Nagendra Kumar wrote:
> Please ignore TC #2, my mistake.
>
> Thanks
> -Nagu
>
>> -Original Message-
>> From: Nagendra Kumar
>> Sent: 23 August 2016 15:49
>> To: Minh Hon Chau; hans.nordeb...@ericsson.com; Praveen Malviya;
>> gary@dektech.com.au; long.hb.ngu...@dektech.com.au
>> Cc: opensaf-devel@lists.sourceforge.net
>> Subject: RE: [PATCH 2 of 2] AMFND: Admin operation continuation if csi
>> callback completes during headless [#1725 part 1] V1
>>
>> Please consider previous TC as TC #1
>>
>> TC #2: Same configuration as TC #1. Logs attached in the ticket TC #2.
>>
>> Steps:
>> 1. Same as step #1 of TC #1.
>> 2. After locking SU1, keep delay in avnd_evt_avd_info_su_si_assign_evh and
>> stop SC-1 and SC-2.
>> 3. Start SC-1 and SC-2. SU1 is still in quiesced state. Ideally, it should
>> have no
>> assignment and SU3 should have got assignment.
>>
>> safSISU=safSu=SU3\,safSg=AmfDemo_2N\,safApp=AmfDemo1,safSi=AmfDe
>> mo1,safApp=AmfDemo1
>>  saAmfSISUHAState=STANDBY(2)
>> safSISU=safSu=SU2\,safSg=AmfDemo_2N\,safApp=AmfDemo1,safSi=AmfDe
>> mo1,safApp=AmfDemo1
>>  saAmfSISUHAState=ACTIVE(1)
>> safSISU=safSu=PL-
>> 4\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed4,safApp=OpenSAF
>>  saAmfSISUHAState=ACTIVE(1)
>> safSISU=safSu=SC-
>> 1\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed1,safApp=OpenSAF
>>  saAmfSISUHAState=ACTIVE(1)
>> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-
>> 2N,safApp=OpenSAF
>>  saAmfSISUHAState=ACTIVE(1)
>> safSISU=safSu=SC-
>> 2\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed3,safApp=OpenSAF
>>  saAmfSISUHAState=ACTIVE(1)
>> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-
>> 2N,safApp=OpenSAF
>>  saAmfSISUHAState=STANDBY(2)
>> safSISU=safSu=PL-
>> 3\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed2,safApp=OpenSAF
>>  saAmfSISUHAState=ACTIVE(1)
>>
>> After that PL-3 rebooted by the following logs:
>> Aug 23 15:31:52 PM_PL-3 osafamfwd[18056]: TIMEOUT receiving AMF
>> health check request, generating core for amfnd Aug 23 15:31:52 PM_PL-3
>> osafamfwd[18056]: Last received healthcheck cnt=82 at Tue Aug 23 15:30:52
>> 2016 Aug 23 15:31:52 PM_PL-3 osafamfwd[18056]: Rebooting OpenSAF
>> NodeId = 0 EE Name = No EE Mapped, Reason: AMFND unresponsive,
>> AMFWDOG initiated system reboot, OwnNodeId = 131855, SupervisionTime
>> = 60 Aug 23 15:31:52 PM_PL-3 opensaf_reboot: Rebooting local node;
>> timeout=60
>>
>> Thanks
>> -Nagu
>>
>>> -Original Message-
>>> From: Nagendra Kumar
>>> Sent: 23 August 2016 15:19
>>> To: Minh Hon Chau; hans.nordeb...@ericsson.com; Praveen Malviya;
>>> gary@dektech.com.au; long.hb.ngu...@dektech.com.au
>>> Cc: opensaf-devel@lists.sourceforge.net
>>> Subject: RE: [PATCH 2 of 2] AMFND: Admin operation continuation if csi
>>> callback completes during headless [#1725 part 1] V1
>>>
>>> Please note that it is on change set 7846:31417997c82f  and I have
>>> applied patch of ticket #1894.
>>>
>>> Thanks
>>> -Nagu
>>>> -Original Message-
>>>> From: Nagendra Kumar
>>>> Sent: 23 August 2016 15:15
>>>> To: Minh Hon Chau; hans.nordeb...@ericsson.com; Praveen Malviya;
>>>> gary@dektech.com.au; long.hb.ngu...@dektech.com.au
>>>> Cc: opensaf-devel@lists.sourceforge.net
>>>> Subject: RE: [PATCH 2 of 2] AMFND: Admin operation continuation if
>>>> csi callback completes during headless [#1725 part 1] V1
>>>>
>>>> Hi Minh,
>>>> The following SU lock case is not working. This issue will exist
>>>> for all the flows, so please check.
>>>>
>>>> Configuration and traces attached in the ticket.
>>>>
>>>> Steps:
>>>> 1. Start SC-1, SC-2, PL-3 and PL-4. Run the following command:
>>>> immcfg -f  /tmp/AppConfig-2N-1725.xml
>>>> amf-adm unlock-in safSu=SU1,safSg=AmfDemo_2N,safApp=AmfDemo1
>>>> amf-adm unlock-in safSu=SU2,safSg=AmfDemo_2N,safApp=AmfDemo1
>>>> amf-adm unlock-in safSu=SU3,safSg=AmfDemo_2N,safApp=AmfDemo1
>>>> amf-adm unlock safSu=SU1,safSg=AmfDemo_2N,safApp=AmfDemo1
>>>> amf-adm unlock safSu=SU2,safSg=AmfDemo_2N,safApp=AmfDemo1
>>>> amf-adm unlock safSu=SU3,safSg=AmfDemo_2N,safApp=AmfDemo1
>>>>
>>>> Assignments are:
>>>> PM_SC-1:/home/nagu/views/staging-1725 # /etc/init.d/opensafd  status
>>>> safSISU=safSu=SC-
>>>> 1\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed1,safApp=OpenSAF
>>>>  saAmfSISUHAState=ACTIVE(1)
>>>> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-
>>>> 2N,safApp=OpenSAF
>>>>  saAmfSISUHAState=ACTIVE(1)
>>>> safSISU=safSu=SC-
>>>> 2\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed2,safApp=OpenSAF
>>>>  saAmfSISUHAState=ACTIVE(1)
>>>> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-
>>>> 2N,safApp=OpenSAF
>>>>  saAmfSISUHAState=STANDBY(2)
>>>> safSISU=safSu=PL-
>>>> 4

Re: [devel] [PATCH 2 of 2] AMFND: Admin operation continuation if csi callback completes during headless [#1725 part 1] V1

2016-08-23 Thread minh chau
orms switchover. Why SUs of a node
that is not synced is becoming IN_SERVICE.

Also the AMFND where locked SU is hosted, should send buffered message
only when its NCS SUs are assigned. Such a check can be included in
avnd_diq_rec_send_buffered_msg().


I think I missed headless state here, NCS sus are already assigned.

What we need is for AMFND to send buffered assignments only after both the 
cluster timer and the node sync timer have expired. In the headless case, 
only valid nodes will be present in the system after the node sync timer 
expires, so failover/switchover to them will not be an issue. At the same 
time, cluster timer expiry needs to be honored because the AMFD code for 
application assignments works in the APP state, which is linked to 
cluster timer expiry.


-Could we hold any assignment message received before cluster timer 
expiry and node sync timer expiry and process them after expiry?


-Another way is for AMFND to send buffered assignments only after both 
of these timers have expired. What if AMFND starts a timer (set to the 
larger of the cluster timer and the node sync timer) on receiving AMFD up, 
and sends the buffered assignment message when this timer expires?
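
A minimal sketch of the second option, assuming a single hold-off equal to the larger of the two timer values. BufferedSender, on_director_up() and flush() are hypothetical names for illustration only; they are not taken from the amfnd code, and the real change would live around avnd_diq_rec_send_buffered_msg().

```cpp
#include <algorithm>
#include <chrono>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for AMFND's buffered-assignment handling.
// None of these names exist in the OpenSAF tree.
class BufferedSender {
 public:
  using Clock = std::chrono::steady_clock;

  BufferedSender(std::chrono::seconds cluster_timer,
                 std::chrono::seconds node_sync_timer)
      : hold_off_(std::max(cluster_timer, node_sync_timer)) {}

  // Called when the director (AMFD) is detected up after headless.
  void on_director_up(Clock::time_point now) {
    release_at_ = now + hold_off_;
    armed_ = true;
  }

  // Queue an assignment message that could not be sent while headless.
  void buffer(std::string msg) { pending_.push_back(std::move(msg)); }

  // Returns the messages that may be sent at `now`. The result is empty
  // until the director is up and the hold-off has elapsed.
  std::vector<std::string> flush(Clock::time_point now) {
    if (!armed_ || now < release_at_) return {};
    std::vector<std::string> out;
    out.swap(pending_);
    return out;
  }

 private:
  std::chrono::seconds hold_off_;
  Clock::time_point release_at_{};
  bool armed_ = false;
  std::vector<std::string> pending_;
};
```

With a cluster timer of 10s and a node sync timer of 30s, nothing is released until 30s after AMFD up, which covers both expiries without AMFND needing to observe either timer directly.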


Thanks,
Praveen


Thanks,
Praveen

On 23-Aug-16 4:33 PM, minh chau wrote:

Hi Nagu,

I see in the trace you provided, the SU2/SU3 become IN_SERVICE late. If
there's a delay in PL4 joining cluster after headless in your test then
you could also see it in the latest patches (longDN rebased version)
I'm looking in to this issue.

Thanks.
Minh

On 23/08/16 20:24, Nagendra Kumar wrote:

Please ignore TC #2, my mistake.

Thanks
-Nagu


-Original Message-
From: Nagendra Kumar
Sent: 23 August 2016 15:49
To: Minh Hon Chau; hans.nordeb...@ericsson.com; Praveen Malviya;
gary@dektech.com.au; long.hb.ngu...@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net
Subject: RE: [PATCH 2 of 2] AMFND: Admin operation continuation if csi
callback completes during headless [#1725 part 1] V1

Please consider previous TC as TC #1

TC #2: Same configuration as TC #1. Logs attached in the ticket TC #2.


Steps:
1. Same as step #1 of TC #1.
2. After locking SU1, keep a delay in avnd_evt_avd_info_su_si_assign_evh
and stop SC-1 and SC-2.
3. Start SC-1 and SC-2. SU1 is still in quiesced state. Ideally, it should
have no assignment and SU3 should have got the assignment.

safSISU=safSu=SU3\,safSg=AmfDemo_2N\,safApp=AmfDemo1,safSi=AmfDemo1,safApp=AmfDemo1
 saAmfSISUHAState=STANDBY(2)
safSISU=safSu=SU2\,safSg=AmfDemo_2N\,safApp=AmfDemo1,safSi=AmfDemo1,safApp=AmfDemo1
 saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=PL-4\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed4,safApp=OpenSAF
 saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=SC-1\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed1,safApp=OpenSAF
 saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
 saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=SC-2\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed3,safApp=OpenSAF
 saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
 saAmfSISUHAState=STANDBY(2)
safSISU=safSu=PL-3\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed2,safApp=OpenSAF
 saAmfSISUHAState=ACTIVE(1)

After that PL-3 rebooted by the following logs:
Aug 23 15:31:52 PM_PL-3 osafamfwd[18056]: TIMEOUT receiving AMF health check request, generating core for amfnd
Aug 23 15:31:52 PM_PL-3 osafamfwd[18056]: Last received healthcheck cnt=82 at Tue Aug 23 15:30:52 2016
Aug 23 15:31:52 PM_PL-3 osafamfwd[18056]: Rebooting OpenSAF NodeId = 0 EE Name = No EE Mapped, Reason: AMFND unresponsive, AMFWDOG initiated system reboot, OwnNodeId = 131855, SupervisionTime = 60
Aug 23 15:31:52 PM_PL-3 opensaf_reboot: Rebooting local node; timeout=60

Thanks
-Nagu


-Original Message-
From: Nagendra Kumar
Sent: 23 August 2016 15:19
To: Minh Hon Chau; hans.nordeb...@ericsson.com; Praveen Malviya;
gary@dektech.com.au; long.hb.ngu...@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net
Subject: RE: [PATCH 2 of 2] AMFND: Admin operation continuation if csi
callback completes during headless [#1725 part 1] V1

Please note that it is on change set 7846:31417997c82f and I have
applied patch of ticket #1894.

Thanks
-Nagu

-Original Message-
From: Nagendra Kumar
Sent: 23 August 2016 15:15
To: Minh Hon Chau; hans.nordeb...@ericsson.com; Praveen Malviya;
gary@dektech.com.au; long.hb.ngu...@dektech.com.au
Cc: opensaf-devel@lists.sourceforge.net
Subject: RE: [PATCH 2 of 2] AMFND: Admin operation continuation if
csi callback completes during headless [#1725 part 1] V1

Hi Minh,
The following SU lock case is not working. This issue will exist
for all the flows, so please check.

Configuration and traces attached in the ticket.

Steps:
1. Start SC-1, SC-2, PL-3 and PL-4. Run the following command:
immcfg -f  /tmp/AppConfig-2N-1725.xml amf-adm unlock-in
safSu=SU1,safSg=AmfDemo_2N,safApp=AmfDemo1
amf-adm unlock-in safSu=SU2

Re: [devel] [PATCH 2 of 2] AMFND: Admin operation continuation if csi callback completes during headless [#1725 part 1] V1

2016-08-23 Thread minh chau
Hi Praveen,

Let me copy your questions and answer them here by email, so that it is 
easier to add comments inline. Please see [Minh].

Thanks,
Minh

-

Hi Minh,
I am going through the patches 1725_phase1.tgz. Some initial comments:
1) In patch 2, avnd_diq_rec_send_buffered_msg() sends the buffered message 
to AMFD only if the SUSI is present. If removal of assignments completes 
during headless, AMFND deletes the SUSIs in su_si_oper_done(). So AMFND 
will never send the assignment message and the admin operation will not 
continue.

[Minh]: If AMFND deletes all SUSIs during headless, then there will not 
be any assignment to send to AMFD in the state_info message after 
headless. However, in all the 2N admin operations I have been testing, 
the assignment removal sequence is the last step of admin LOCK/SHUTDOWN. 
If AMFND deletes the SUSI while headless, that also means the prior steps 
of the admin sequence were done before headless. In this case, that is 
equivalent to completion of the admin operation.

2) In patch 1, I think after headless we will not get any invocation id 
for the admin operation that was in progress before headless. Since AMF 
is continuing the admin operation, we should somehow prevent other admin 
operations from starting, by setting some magic number for the invocation 
id or in some other way.

[Minh]: If AMF is continuing the admin operation after headless, the SG 
FSM state will not be STABLE. I think checking (sg_fsm_state == 
AVD_SG_FSM_STABLE) should be enough to reject a new admin operation?
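
A minimal sketch of that check, assuming the SG FSM state alone is used as the gate. The enum members, ServiceGroup and check_admin_op_allowed() are simplified, hypothetical stand-ins; the real amfd enum has different spellings and the real admin-op handlers return SaAisErrorT.

```cpp
#include <string>

// Hypothetical, simplified mirror of the SG FSM states.
enum class SgFsmState { Stable, SgRealign, SuOper, SiOper, SgAdmin };

// Hypothetical result type; the real code would return an SaAisErrorT.
enum class AdminResult { Ok, TryAgain };

struct ServiceGroup {
  std::string name;
  SgFsmState fsm_state;
};

// A new admin operation is accepted only when the SG is stable. An
// operation that is being continued after headless keeps the SG out of
// Stable, so no invocation id is needed to detect it.
AdminResult check_admin_op_allowed(const ServiceGroup& sg) {
  return sg.fsm_state == SgFsmState::Stable ? AdminResult::Ok
                                            : AdminResult::TryAgain;
}
```

This only holds if every continued operation really does leave the SG in a non-stable state after the headless sync, which is the assumption stated above.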


3) If suswitch is in TOGGLED state, then I think we should crosscheck that 
there are at least two SUs having assignments. The reason is that if this 
flag remains TOGGLED and the admin op does not continue, there is very 
little chance that it will get reset, as it is used only in the si-swap flow.

[Minh]: Yes, I don't particularly like osafAmfSUSwitch being written 
to IMM. Only test case 144 failed for me (test list attached to the ticket).
Test 144 is: Swap SI, delay csi STANDBY cbk in SU4, stop SCs, restart 
SCs, reboot PL5. There I ran into this code, which requires suswitch:

void SG_2N::node_fail_su_oper(AVD_SU *su) {


 /* the SU has standby SI assignments. if the other SUs switch field
  * is true, it is in service, having quiesced assigning state.
  * Send D2N-INFO_SU_SI_ASSIGN modify active all to the other SU.
  * Change switch field to false. Change state to SG_realign.
  * Free all the SI assignments to this SU.
  */
 if ((su_oper_list_front()->su_switch == AVSV_SI_TOGGLE_SWITCH)
 && (su_oper_list_front()->saAmfSuReadinessState == SA_AMF_READINESS_IN_SERVICE)) {

I think the *crosscheck* is actually a deduction of @su_switch from 
whatever states AMFD receives after headless. If the *crosscheck* is 
possible, then su_switch does not need to be checkpointed to the 
standby AMFD either.
In non-headless, the standby AMFD must always be kept up to date with all 
states by checkpoint, so that if the active AMFD goes away the standby 
AMFD can take over using these checkpointed states.
In headless, we also have to write these states somewhere (here, IMM) 
so that the new active AMFD can use them.
It would be best if su_switch could be recovered from a set of states, but 
it is not easy to prove that it can be recovered in all 2N si-swap scenarios.
If you think removing osafAmfSUSwitch is really needed, then I think this 
needs to be looked at more thoroughly later?
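
For illustration, the crosscheck discussed above could look roughly like the sketch below. It assumes AMFD has already rebuilt a per-SU view from the state_info messages; SuView, toggle_is_plausible() and reset_stale_toggles() are hypothetical names, and the rule "at least two SUs with assignments" is the only condition taken from this thread.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical per-SU view rebuilt by AMFD after headless.
struct SuView {
  bool toggle_switch = false;  // stands in for su_switch == AVSV_SI_TOGGLE_SWITCH
  int assignment_count = 0;    // number of SUSIs reported for this SU
};

// A restored toggle flag is only plausible while a 2N si-swap is in
// progress, which needs at least two SUs that still have assignments.
bool toggle_is_plausible(const std::vector<SuView>& sus) {
  const auto assigned =
      std::count_if(sus.begin(), sus.end(),
                    [](const SuView& su) { return su.assignment_count > 0; });
  return assigned >= 2;
}

// Clears the toggle flag on every SU when it is not plausible.
// Returns how many flags were reset.
int reset_stale_toggles(std::vector<SuView>& sus) {
  if (toggle_is_plausible(sus)) return 0;
  int reset = 0;
  for (auto& su : sus) {
    if (su.toggle_switch) {
      su.toggle_switch = false;
      ++reset;
    }
  }
  return reset;
}
```

This does not show that su_switch can be deduced in every si-swap scenario; it only covers the stale-flag case raised in comment 3.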

4) Assignments may be in progress because of an admin operation or 
because of faults. AMFD should call one function here, like 
log_admin_op(). This function will search for the entity that is under 
admin operation and log details like:
-"After headless state admin op on '%s' is continuing" in syslog.
-Traces for SUSI states which are not assigned.

[Minh]: Agreed, some logging like this is a good idea. I think it is 
best to introduce it in the patch: [PATCH 4 of 4] AMFD: 
Validate headless cached RTA read from IMM [#1725]
I may need more details on what you would like to log.
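
As a starting point for that discussion, here is a rough sketch of what such a helper could produce. The function name describe_admin_op() and the SusiView type are hypothetical; only the syslog text is taken from comment 4, and real code would use LOG_NO()/TRACE() instead of returning strings.

```cpp
#include <string>
#include <vector>

// Hypothetical flattened view of one SU/SI assignment.
struct SusiView {
  std::string su;
  std::string si;
  bool assigned;  // false while the assignment is still in progress
};

// Builds the lines a log_admin_op()-style helper might emit: one syslog
// line for the entity under admin operation, then one trace line per
// SUSI that is not yet assigned.
std::vector<std::string> describe_admin_op(
    const std::string& entity, const std::vector<SusiView>& susis) {
  std::vector<std::string> lines;
  lines.push_back("After headless state admin op on '" + entity +
                  "' is continuing");
  for (const auto& s : susis) {
    if (!s.assigned) {
      lines.push_back("SUSI not assigned: su='" + s.su + "' si='" + s.si +
                      "'");
    }
  }
  return lines;
}
```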

Thanks,
Praveen

-

On 23/08/16 21:03, minh chau wrote:
> Hi Nagu,
>
> I see in the trace you provided, the SU2/SU3 become IN_SERVICE late. 
> If there's a delay in PL4 joining cluster after headless in your test 
> then you could also see it in the latest patches (longDN rebased version)
> I'm looking in to this issue.
>
> Thanks.
> Minh
>
> On 23/08/16 20:24, Nagendra Kumar wrote:
>> Please ignore TC #2, my mistake.
>>
>> Thanks
>> -Nagu
>>
>>> -Original Message-
>>> From: Nagendra Kumar
>>> Sent: 23 August 2016 15:49
>>> To: Minh Hon Chau; hans.nordeb...@ericsson.com; Praveen Malviya;
>>> gary@dektech.com.au; long.h

Re: [devel] [PATCH 2 of 2] AMFND: Admin operation continuation if csi callback completes during headless [#1725 part 1] V1

2016-08-23 Thread minh chau
Hi Praveen,

I have just attached #1725 part 2 to the ticket. It supports the fault of 
node restart/poweroff while headless.
Could you please review it? I will update the patch to add the logging 
you suggested.

Thanks for reviewing,
Minh


On 24/08/16 16:07, praveen malviya wrote:
> Hi Minh,
>
> Please see responses with [Praveen].
>
>
> Thanks,
> Praveen
>
> On 23-Aug-16 7:18 PM, minh chau wrote:
>> Hi Praveen,
>>
>> Please let me copy your questions and answer here in email, so it's
>> easier we can add comment in line, please see [Minh].
>>
>> Thanks,
>> Minh
>>
>> -
>>
>> Hi Minh,
>> I am going through the patches 1725_phase1.tgz. Some initial comments:
>> 1) In patch 2 avnd_diq_rec_send_buffered_msg() checks presence of SUSI
>> then only it sends buffered message to AMFD. In case removal of
>> assignments completes during headless , AMFND deletes the SUSIs in
>> su_si_oper_done(). So AMFND will never send the assignment message and
>> admin operation will not continue.
>>
>> [Minh]: If this is the case AMFND deletes all SUSIs during headless,
>> then there will not be any assignment to be sent in state_info message
>> to AMFD after headless. However, in all admin operations of 2N I have
>> been testing,
>> the removal assignment sequence is the last step of admin LOCK/SHUTDOWN.
>> If AMFND deletes SUSI while headless, that also means the prior steps of
>> admin sequence had been done before headless. In this case, that is
>> equivalent to a completion of admin operation.
>>
> [Praveen] Yes, in this case it is not needed because by this time the 
> standby SU has become active.
> But in some cases AMFD performs failover/switchover based on the status 
> of assignment removal, particularly when a fault happens during an admin 
> op. As of now I do not know how to reproduce this scenario without 
> faults, but with faults it is possible. Since the patch is not for 
> admin op + faults, it can be left for the future.
>> 2) In patch1, I think after headless we will not get any invocation id
>> for the admin operation that
>> was going on before headless. Since AMF is continuing the admin
>> operation we should somehow
>> restrict other admin operation to start by setting some magic no for
>> invocationid or any other way.
>>
>> [Minh]: If AMF is continuing the admin operation after headless, the sg
>> fsm state should not be STABLE, I think (sg_fsm_state ==
>> AVD_SG_FSM_STABLE) should be enough to reject new admin operation?
>>
>>
>> 3)If suswitch is in TOGGLED state then I think we should crosscheck that
>> there are atleast two SUs
>> having assignment. The reason is if this flag remains TOGGOLED and admin
>> op does not continue then there is very less probability that if will
>> get reset as it is used only in si-swap flow.
>>
>> [Minh]: Yes I don't particularly like this osafAmfSUSwitch to be written
>> to IMM. I had the only test case 144 failed (test list attached to 
>> ticket)
>> Test 144 is: Swap SI, delay csi STANDBY cbk in SU4, stop SCs, restart
>> SCs, reboot PL5. And I ran into the code line which requires suswitch
>>
>> void SG_2N::node_fail_su_oper(AVD_SU *su) {
>>
>> 
>> /* the SU has standby SI assignments. if the other SUs
>> switch field
>>  * is true, it is in service, having quiesced assigning 
>> state.
>>  * Send D2N-INFO_SU_SI_ASSIGN modify active all to the other
>> SU.
>>  * Change switch field to false. Change state to SG_realign.
>>  * Free all the SI assignments to this SU.
>>  */
>> if ((su_oper_list_front()->su_switch == 
>> AVSV_SI_TOGGLE_SWITCH)
>> && (su_oper_list_front()->saAmfSuReadinessState ==
>> SA_AMF_READINESS_IN_SERVICE)) {
>>
>> I think the *crosscheck* is actually a deduction of @su_switch from
>> whatever states that AMFD receives after headless. If *crosscheck* is
>> possible thing, then su_switch does not need to be checkpointed at
>> standby AMFD also.
>> In non-headless, we always need standby AMFD up-to-date all states by
>> checkpoint so that if active AMFD has gone, the standby AMFD can take
>> over by using these checkpointed states.
>> Now in headless, we also have to write these states somewhere (here is
>> IMM) so that the new active AMFD can use it.
>> It's the best that su_switch is revertible from a set of states, but
>> it's not easy to prove it's revertible from all scenario

Re: [devel] [PATCH 1 of 1] amfd: support NplusM model for supported admin ops on NG [#1454]

2016-08-24 Thread minh chau
Hi Praveen,

I have tested the patches for the case where all SUs of NpM/Nway belong 
to one nodegroup, and it works for me. I will try the case where SUs are 
partially in/out of the nodegroup tomorrow.
But please, if you have time, can you try to make NpM's ng_admin run under 
FSM_SG_ADMIN? In addition to my previous comment, and as you know, in 
#1725 it is hard to deduce the states (the SG FSM state is one of those) 
because different SGs use the FSM differently in some cases. This 
inconsistency causes difficulty: instead of applying one generic solution 
for all SGs, each SG has to be treated in a different way. I think we 
could also see this kind of problem in the future.

Thanks,
Minh

On 23/08/16 16:58, praveen malviya wrote:
> Hi Minh,
>
> I have attached patches for #1454 and #1608 in the ticket #1454.
> Please apply them in order.
>
> Thanks,
> Praveen
>
> On 23-Aug-16 11:56 AM, minh chau wrote:
>> Hi Praveen,
>>
>> Since AMF longDn has been pushed, can you please attach a longDn rebased
>> version to ticket (both #1454 + #1608) so we can do some test?
>>
>> Thanks,
>> Minh
>>
>> On 23/08/16 15:56, praveen malviya wrote:
>>> Hi Minh,
>>>
>>> Thanks for reviewing the patch.
>>> Please see inline with [Praveen]
>>>
>>> Thanks,
>>> Praveen
>>>
>>>
>>>
>>> On 23-Aug-16 5:53 AM, minh chau wrote:
>>>> Hi Praveen,
>>>>
>>>> One comment in line with [Minh]
>>>>
>>>> Thanks
>>>> Minh
>>>>
>>>> On 20/07/16 18:57, praveen.malv...@oracle.com wrote:
>>>>> osaf/services/saf/amf/amfd/include/sg.h |   1 +
>>>>>   osaf/services/saf/amf/amfd/nodegroup.cc  |   4 +-
>>>>>   osaf/services/saf/amf/amfd/sg_npm_fsm.cc |  62
>>>>> ++-
>>>>>   3 files changed, 62 insertions(+), 5 deletions(-)
>>>>>
>>>>>
>>>>> Currently 2N, N-Way Active and NoRed models are supported for lock,
>>>>> shutdown,
>>>>> lock-in and unlock-in admin operations on NGs.
>>>>>
>>>>> This patch supports NplusM model also.
>>>>>
>>>>> diff --git a/osaf/services/saf/amf/amfd/include/sg.h
>>>>> b/osaf/services/saf/amf/amfd/include/sg.h
>>>>> --- a/osaf/services/saf/amf/amfd/include/sg.h
>>>>> +++ b/osaf/services/saf/amf/amfd/include/sg.h
>>>>> @@ -507,6 +507,7 @@ public:
>>>>>   uint32_t susi_failed(AVD_CL_CB *cb, AVD_SU *su,
>>>>>   struct avd_su_si_rel_tag *susi, AVSV_SUSI_ACT act,
>>>>> SaAmfHAStateT state);
>>>>>   void node_fail_si_oper(AVD_CL_CB *cb, AVD_SU *su);
>>>>> +void ng_admin(AVD_SU *su, AVD_AMF_NG *ng);
>>>>> private:
>>>>>   uint32_t su_fault_su_oper(AVD_CL_CB *cb, AVD_SU *su);
>>>>> diff --git a/osaf/services/saf/amf/amfd/nodegroup.cc
>>>>> b/osaf/services/saf/amf/amfd/nodegroup.cc
>>>>> --- a/osaf/services/saf/amf/amfd/nodegroup.cc
>>>>> +++ b/osaf/services/saf/amf/amfd/nodegroup.cc
>>>>> @@ -687,6 +687,7 @@ void avd_ng_admin_state_set(AVD_AMF_NG*
>>>>>   avd_send_admin_state_chg_ntf(&ng->name,
>>>>> (SaAmfNotificationMinorIdT)SA_AMF_NTFID_NG_ADMIN_STATE,
>>>>>   old_state, ng->saAmfNGAdminState);
>>>>> +TRACE_LEAVE();
>>>>>   }
>>>>>   /**
>>>>>* @brief  Verify if Node is stable for admin operation on 
>>>>> Nodegroup
>>>>> etc.
>>>>> @@ -749,8 +750,7 @@ static SaAisErrorT check_red_model_servi
>>>>>   LOG_NO("service outage for '%s' because of
>>>>> shutdown/lock "
>>>>>   "on '%s'",sg->name.value,ng->name.value);
>>>>>   -if ((sg->sg_redundancy_model ==
>>>>> SA_AMF_N_WAY_REDUNDANCY_MODEL) ||
>>>>> -(sg->sg_redundancy_model ==
>>>>> SA_AMF_NPM_REDUNDANCY_MODEL)) {
>>>>> +if (sg->sg_redundancy_model ==
>>>>> SA_AMF_N_WAY_REDUNDANCY_MODEL) {
>>>>>   LOG_NO("Admin op on '%s'  hosting SUs of '%s' with
>>>>> redundancy '%u' "
>>>>>   "is not supported",ng->name.value,
>>>>> sg->name.value,
>>

Re: [devel] [PATCH 2 of 2] AMFND: Admin operation continuation if csi callback completes during headless [#1725 part 1] V1

2016-08-24 Thread minh chau
Hi Nagu,

Can you please apply the patch below on top of 
1725_02_V2_bugfix_resend_buffer_in_set_leds.diff?
In your test, PL-3 gets set_leds but PL-4 does not, so SU2 cannot respond 
to the su_si msg.

@Praveen: Thanks for your help. I guess SU2 needs to be ready to send 
the su_si msg at the same time as SU1.

diff --git a/osaf/services/saf/amf/amfd/ndfsm.cc b/osaf/services/saf/amf/amfd/ndfsm.cc
--- a/osaf/services/saf/amf/amfd/ndfsm.cc
+++ b/osaf/services/saf/amf/amfd/ndfsm.cc
@@ -315,6 +315,9 @@ void avd_node_up_evh(AVD_CL_CB *cb, AVD_
 cb->all_nodes_synced = true;
 LOG_NO("Received node_up_msg from all nodes");
 } else {
+   if (n2d_msg->msg_info.n2d_node_up.leds_set == true)
+   avnd->veteran = true;
+
 if (avnd->node_up_msg_count == 1 &&
 (act_nd || n2d_msg->msg_info.n2d_node_up.leds_set)) {

@@ -415,7 +418,6 @@ void avd_node_up_evh(AVD_CL_CB *cb, AVD_
 // this node is already up
 avd_node_state_set(avnd, AVD_AVND_STATE_PRESENT);
 avd_node_oper_state_set(avnd, SA_AMF_OPERATIONAL_ENABLED);
-   avnd->veteran = true;
 // Update readiness state of all SUs which are waiting for node
 // oper state
 for (const auto& su : avnd->list_of_ncs_su) {

On 24/08/16 22:05, praveen malviya wrote:
> Hi Minh,
>
> Any assignment message should be processed after cluster timer expiry 
> and node sync timer expiry. The bug fix patch 
> 1725_02_V2_bugfix_resend_buffer_in_set_leds.diff honors cluster timer 
> expiry but not node sync timer.
> After node sync timer expiry, delayed payloads will be rebooted and if 
> these payloads host any SU/SUSIs, they will be deleted. So admin op 
> will finish gracefully.
> I think for loop can be added in both timers' expiry events with a 
> check on exipry of other timer:
>
> diff --git a/osaf/services/saf/amf/amfd/cluster.cc 
> b/osaf/services/saf/amf/amfd/cluster.cc
> --- a/osaf/services/saf/amf/amfd/cluster.cc
> +++ b/osaf/services/saf/amf/amfd/cluster.cc
> @@ -74,12 +74,13 @@ void avd_cluster_tmr_init_evh(AVD_CL_CB
> m_AVSV_SEND_CKPT_UPDT_ASYNC_UPDT(cb, cb, 
> AVSV_CKPT_AVD_CB_CONFIG);
>
> // Resend set_leds to veteran node
> -
> -   for (std::map::const_iterator it = 
> node_name_db->begin();
> -   it != node_name_db->end(); it++) {
> -   node = it->second;
> -   if (node->veteran)
> -   avd_snd_set_leds_msg(cb, node);
> +   if (cb->node_sync_tmr.is_active == false) {
> +   for (std::map::const_iterator 
> it = node_name_db->begin();
> +   it != node_name_db->end(); it++) {
> +   node = it->second;
> +   if (node->veteran)
> +   avd_snd_set_leds_msg(cb, node);
> +   }
> }
>
> /* call the realignment routine for each of the SGs in the
> @@ -143,6 +144,17 @@ void avd_node_sync_tmr_evh(AVD_CL_CB *cb
> // Setting true here to indicate the node sync window has closed
> // Further node up message will be treated specially
> cb->node_sync_window_closed = true;
> +// Resend set_leds to veteran node
> +if (cb->amf_init_tmr.is_active == false) {
> +   AVD_AVND *node = nullptr;
> +for (std::map *>::const_iterator it = node_name_db->begin();
> +it != node_name_db->end(); it++) {
> +node = it->second;
> +if (node->veteran)
> +avd_snd_set_leds_msg(cb, node);
> +}
> +}
> +
>
> TRACE_LEAVE();
>  }
>
>
>
> Thanks,
> Praveen
> On 24-Aug-16 4:58 PM, Nagendra Kumar wrote:
>> The below is the assignments after the test case (SU2 has standby 
>> assignment):
>>
>> PM_SC-1:/home/nagu/views/staging-1725 # /etc/init.d/opensafd status
>> safSISU=safSu=SU2\,safSg=AmfDemo_2N\,safApp=AmfDemo1,safSi=AmfDemo1,safApp=AmfDemo1
>>  
>>
>> saAmfSISUHAState=STANDBY(2)
>> safSISU=safSu=PL-3\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed3,safApp=OpenSAF 
>>
>> saAmfSISUHAState=ACTIVE(1)
>> safSISU=safSu=PL-4\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed2,safApp=OpenSAF 
>>
>> saAmfSISUHAState=ACTIVE(1)
>> safSISU=safSu=SC-2\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed4,safApp=OpenSAF 
>>
>> saAmfSISUHAState=ACTIVE(1)
>> safSISU=safSu=SC-1\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed1,safApp=OpenSAF 
>>
>> saAmfSISUHAState=ACTIVE(1)
>> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>> saAmfSISUHAState=STANDBY(2)
>> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
>> saAmfSISUHAState=ACTIVE(1)
>>

Re: [devel] [PATCH 2 of 2] AMFND: Admin operation continuation if csi callback completes during headless [#1725 part 1] V1

2016-08-24 Thread minh chau
Hi Nagu,

On second thought, I think Praveen's patch is the better one. Please try with his patch.

Thanks
Minh

On 24/08/16 22:21, minh chau wrote:
> Hi Nagu,
>
> Can you please apply the below patch on top of 
> 1725_02_V2_bugfix_resend_buffer_in_set_leds.diff?
> In your test, PL3 get set_leds, but PL-4 has not, so SU2 can not 
> respond su_si msg.
>
> @Praveen: Thanks for you help, I guess SU2 need to be ready to send 
> su_si msg at the same as SU1.
>
> diff --git a/osaf/services/saf/amf/amfd/ndfsm.cc 
> b/osaf/services/saf/amf/amfd/ndfsm.cc
> --- a/osaf/services/saf/amf/amfd/ndfsm.cc
> +++ b/osaf/services/saf/amf/amfd/ndfsm.cc
> @@ -315,6 +315,9 @@ void avd_node_up_evh(AVD_CL_CB *cb, AVD_
> cb->all_nodes_synced = true;
> LOG_NO("Received node_up_msg from all nodes");
> } else {
> +   if (n2d_msg->msg_info.n2d_node_up.leds_set == 
> true)
> +   avnd->veteran = true;
> +
> if (avnd->node_up_msg_count == 1 &&
> (act_nd || 
> n2d_msg->msg_info.n2d_node_up.leds_set)) {
>
> @@ -415,7 +418,6 @@ void avd_node_up_evh(AVD_CL_CB *cb, AVD_
> // this node is already up
> avd_node_state_set(avnd, AVD_AVND_STATE_PRESENT);
> avd_node_oper_state_set(avnd, 
> SA_AMF_OPERATIONAL_ENABLED);
> -   avnd->veteran = true;
> // Update readiness state of all SUs which are 
> waiting for node
> // oper state
> for (const auto& su : avnd->list_of_ncs_su) {
>
> On 24/08/16 22:05, praveen malviya wrote:
>> Hi Minh,
>>
>> Any assignment message should be processed after cluster timer expiry 
>> and node sync timer expiry. The bug fix patch 
>> 1725_02_V2_bugfix_resend_buffer_in_set_leds.diff honors cluster timer 
>> expiry but not node sync timer.
>> After node sync timer expiry, delayed payloads will be rebooted and 
>> if these payloads host any SU/SUSIs, they will be deleted. So admin 
>> op will finish gracefully.
>> I think for loop can be added in both timers' expiry events with a 
>> check on exipry of other timer:
>>
>> diff --git a/osaf/services/saf/amf/amfd/cluster.cc 
>> b/osaf/services/saf/amf/amfd/cluster.cc
>> --- a/osaf/services/saf/amf/amfd/cluster.cc
>> +++ b/osaf/services/saf/amf/amfd/cluster.cc
>> @@ -74,12 +74,13 @@ void avd_cluster_tmr_init_evh(AVD_CL_CB
>> m_AVSV_SEND_CKPT_UPDT_ASYNC_UPDT(cb, cb, 
>> AVSV_CKPT_AVD_CB_CONFIG);
>>
>> // Resend set_leds to veteran node
>> -
>> -   for (std::map::const_iterator it = 
>> node_name_db->begin();
>> -   it != node_name_db->end(); it++) {
>> -   node = it->second;
>> -   if (node->veteran)
>> -   avd_snd_set_leds_msg(cb, node);
>> +   if (cb->node_sync_tmr.is_active == false) {
>> +   for (std::map> *>::const_iterator it = node_name_db->begin();
>> +   it != node_name_db->end(); it++) {
>> +   node = it->second;
>> +   if (node->veteran)
>> +   avd_snd_set_leds_msg(cb, node);
>> +   }
>> }
>>
>> /* call the realignment routine for each of the SGs in the
>> @@ -143,6 +144,17 @@ void avd_node_sync_tmr_evh(AVD_CL_CB *cb
>> // Setting true here to indicate the node sync window has closed
>> // Further node up message will be treated specially
>> cb->node_sync_window_closed = true;
>> +// Resend set_leds to veteran node
>> +if (cb->amf_init_tmr.is_active == false) {
>> +   AVD_AVND *node = nullptr;
>> +for (std::map> *>::const_iterator it = node_name_db->begin();
>> +it != node_name_db->end(); it++) {
>> +node = it->second;
>> +if (node->veteran)
>> +avd_snd_set_leds_msg(cb, node);
>> +}
>> +}
>> +
>>
>> TRACE_LEAVE();
>>  }
>>
>>
>>
>> Thanks,
>> Praveen
>> On 24-Aug-16 4:58 PM, Nagendra Kumar wrote:
>>> The below is the assignments after the test case (SU2 has standby 
>>> assignment):
>>>
>>> PM_SC-1:/home/nag

Re: [devel] Review Request for ntf: update PR documentation [#1952]

2016-08-25 Thread minh chau
Hi Vu,

Ack from me for PR doc.

I see this limitation is still documented in the README, but I think it 
is fine for it to stay there (under 4.5). You could also add another 
paragraph saying it is removed in 5.1.
I'm OK with either, since there is nothing else of interest to say about 5.1.

Thanks,
Minh

On 25/08/16 12:39, Vu Minh Nguyen wrote:
>
> Hi,
>
> Any comments on the PR?
>
> Regards, Vu
>
> *From:* Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au]
> *Sent:* Monday, August 15, 2016 6:46 PM
> *To:* Lennart Lund ; 'Minh Hon Chau' 
> ; 'praveen malviya' 
> *Cc:* opensaf-devel@lists.sourceforge.net
> *Subject:* [devel] Review Request for ntf: update PR documentation [#1952]
>
> Summary: ntf: update PR documentation [#1952]
>
> Review request for Trac Ticket(s): #1952
>
> Peer Reviewer(s): NTF maintainers
>
> Pull request to: <>
>
> Affected branch(es): default
>
> Development branch: default
>
> 
>
> Impacted area        Impact y/n
>
> Docs                 y
> Build system         n
> RPM/packaging        n
> Configuration files  n
> Startup scripts      n
> SAF services         n
> OpenSAF services     n
> Core libraries       n
> Samples              n
> Tests                n
> Other                n
>
> Comments (indicate scope for each "y" above):
>
> -
>
> <>
>
> Conditions of Submission:
>
> -
>
> Ack from reviewers
>
> Arch       Built  Started  Linux distro
> ---
> mips       n      n
> mips64     n      n
> x86        n      n
> x86_64     n      n
> powerpc    n      n
> powerpc64  n      n
>
> Reviewer Checklist:
>
> ---
>
> [Submitters: make sure that your review doesn't trigger any checkmarks!]
>
> Your checkin has not passed review because (see checked entries):
>
> ___ Your RR template is generally incomplete; it has too many blank 
> entries
>
> that need proper data filled in.
>
> ___ You have failed to nominate the proper persons for review and push.
>
> ___ Your patches do not have proper short+long header
>
> ___ You have grammar/spelling in your header that is unacceptable.
>
> ___ You have exceeded a sensible line length in your 
> headers/comments/text.
>
> ___ You have failed to put in a proper Trac Ticket # into your commits.
>
> ___ You have incorrectly put/left internal data in your comments/files
>
> (i.e. internal bug tracking tool IDs, product names etc)
>
> ___ You have not given any evidence of testing beyond basic build tests.
>
> Demonstrate some level of runtime or other sanity testing.
>
> ___ You have ^M present in some of your files. These have to be removed.
>
> ___ You have needlessly changed whitespace or added whitespace crimes
>
> like trailing spaces, or spaces before tabs.
>
> ___ You have mixed real technical changes with whitespace and other
>
> cosmetic code cleanup changes. These have to be separate commits.
>
> ___ You need to refactor your submission into logical chunks; there is
>
> too much content into a single commit.
>
> ___ You have extraneous garbage in your review (merge commits etc)
>
> ___ You have giant attachments which should never have been sent;
>
> Instead you should place your content in a public tree to be pulled.
>
> ___ You have too many commits attached to an e-mail; resend as threaded
>
> commits, or place in a public tree for a pull.
>
> ___ You have resent this content multiple times without a clear indication
>
> of what has changed between each re-send.
>
> ___ You have failed to adequately and individually address all of the
>
> comments and change requests that were proposed in the initial review.
>
> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
>
> ___ Your computer have a badly configured date and time; confusing the
>
> the threaded patch review.
>
> ___ Your changes affect IPC mechanism, and you don't present any results
>
> for in-service upgradability test.
>
> ___ Your changes affect user manual and documentation, your patch series
>
> do not contain the patch that updates the Doxygen manual.
>

--
___
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel


Re: [devel] [PATCH 2 of 2] AMFND: Admin operation continuation if csi callback completes during headless [#1725 part 1] V1

2016-08-25 Thread minh chau
Hi Praveen,

I think we need to step back a bit to the non-headless behavior.
Cluster init timer expiry ensures that all nodes have their MW SUs 
assigned and that the node state is PRESENT. It is the single entry point 
to the non-NCS SU assignment phase.
We also need to keep this principle in headless for #1725.
...
    else {
        // this node is already up
        avd_node_state_set(avnd, AVD_AVND_STATE_PRESENT);
        avd_node_oper_state_set(avnd, SA_AMF_OPERATIONAL_ENABLED);
        avnd->veteran = true;
        // Update readiness state of all SUs which are waiting for node
        // oper state
        ...
        // At this point, one node has become PRESENT, its MW SUs
        // should be synced
        // We can do:
        m_AVD_CLINIT_TMR_START(cb);
        /* Check if all SUs are in 'in-service' cluster-wide, if so
           start assignments */
        if ((cb->amf_init_tmr.is_active == true) &&
            (cluster_su_instantiation_done(cb, nullptr) == true)) {
            avd_stop_tmr(cb, &cb->amf_init_tmr);
            cluster_startup_expiry_event_generate(cb);
        }
        goto node_joined;
    }
...
}

Also, we need to send the led set message only when the cluster init timer expires.
Do you think this would work?
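
A minimal sketch of the principle above, assuming a simplified model rather than the real AMFD code: ClusterModel, on_veteran_node_up() and on_cluster_init_expiry() are hypothetical names, and "all nodes present" stands in for the cluster-wide SU check. It only shows that the expiry path stays the single entry point into the assignment phase, and that it can be triggered early once every node is ready.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for illustration; these are not the real AMFD types.
struct ClusterModel {
  std::vector<bool> node_present;    // one flag per configured node
  bool init_timer_active = false;    // models cb->amf_init_tmr.is_active
  bool assignments_started = false;  // set only by the expiry path below
};

inline bool all_nodes_present(const ClusterModel& c) {
  for (bool present : c.node_present) {
    if (!present) return false;
  }
  return true;
}

// The single entry point into the application assignment phase.
inline void on_cluster_init_expiry(ClusterModel& c) {
  c.init_timer_active = false;
  c.assignments_started = true;
}

// A node reports that it is already up (a "veteran" node after headless).
inline void on_veteran_node_up(ClusterModel& c, std::size_t node_index) {
  assert(node_index < c.node_present.size());
  c.node_present[node_index] = true;
  // Start the timer unless the assignment phase has already begun.
  if (!c.assignments_started) c.init_timer_active = true;
  // Short-circuit the timer once every node is ready.
  if (c.init_timer_active && all_nodes_present(c)) on_cluster_init_expiry(c);
}
```

With two configured nodes, the first on_veteran_node_up() call only starts the timer; the second one stops it and begins assignments through the same expiry path.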

Thanks,
Minh

On 25/08/16 17:03, praveen malviya wrote:
> Hi Minh,
>
> One minor correction is still needed.
> The node_up event comes very early. If at least one node_up event has
> come from every amfnd, AMFD stops the node sync timer very early, even
> before the cluster timer has started:
>     if (rc_node_up == sync_nd_size) {
>         if (cb->node_sync_tmr.is_active) {
>             avd_stop_tmr(cb, &cb->node_sync_tmr);
>             TRACE("stop NodeSync timer");
>         }
>         cb->all_nodes_synced = true;
>
> But AMFD does not process all of these node_up events, because it is not
> yet in INIT state at this point:
>     if ((n2d_msg->msg_info.n2d_node_up.node_id != cb->node_id_avd) &&
>         (cb->init_state < AVD_INIT_DONE)) {
>         TRACE("invalid init state (%u), node %x",
>               cb->init_state, n2d_msg->msg_info.n2d_node_up.node_id);
>         goto done;
>     }
>
> When AMFD moves to INIT state, it starts the cluster startup timer. If
> the timer value is very low, the timer expires, AMFD sees that the node
> sync timer is not running, and it sends the led set messages. At that
> point some nodes may still be syncing. So the amfnds that have already
> synced will send their assignment messages while other amfnds, which may
> host some SUs, are still syncing.
>
> We need to bring two things to the same level:
> 1) When the led set message is sent to the amfnds, all amfnds should be in
> PRESENT state. This means their SUs are enabled and the amfnd can process
> assignments.
> 2) Any amfnd that has not joined by this time will be rebooted.
>
> Problem: how to hold the amfnds back from sending assignment events from
> the buffer until all of them are in PRESENT state and any non-PRESENT
> amfnd can never join the cluster. Neither the node sync timer nor the
> cluster startup timer ensures that all amfnds are synced and that
> non-synced ones will be rebooted.
>
> One idea: what if we do not stop the node sync timer when "if
> (rc_node_up == sync_nd_size)" is hit, and instead mark
> node_sync_window_closed true only when the timer really expires in
> avd_node_sync_tmr_evh()? What are the implications of that? AMFD will
> still have to make sure that nodes sending a fresh node_up event are
> rebooted. Then, if the cluster timer expires early, it will see the node
> sync timer running and will not send led set. But will all the payloads
> really move to PRESENT state in 10 seconds? I am relying on the chosen
> value of 10 seconds.
>
> Thanks,
> Praveen
>
>
>
>
> On 24-Aug-16 5:35 PM, praveen malviya wrote:
>> Hi Minh,
>>
>> Any assignment message should be processed after cluster timer expiry
>> and node sync timer expiry. The bug fix patch
>> 1725_02_V2_bugfix_resend_buffer_in_set_leds.diff honors cluster timer
>> expiry but not node sync timer.
>> After node sync timer expiry, delayed payloads will be rebooted and if
>> these payloads host any SU/SUSIs, they will be deleted. So admin op will
>> finish gracefully.
>> I think a for loop can be added in both timers' expiry events, with a check
>> on the expiry of the other timer:
>>
>> diff --git a/osaf/services/saf/amf/amfd/cluster.cc
>> b/osaf/services/saf/amf/amfd/cluster.cc
>> --- a/osaf/services/saf/amf/amfd/cluster.cc
>> +++ b/osaf/services/saf/amf/amfd/cluster.cc
>> @@ -74,12 +74,13 @@ void avd_cluster_tmr_init_evh(AVD_CL_CB
>> m_AVSV_SEND_CKPT_UPDT_ASYNC_UPDT(cb, cb, 
>> AVSV_CKPT_AVD_CB_CONFIG);
>>
>> // Resend set_leds to veteran node
>> -
>> -   for (std::map::const_iterator it =
>> node_name_db->begin();
>> -   it != node_name_db->end(); it++) {
>> -   node = it->second;
>>

Re: [devel] [PATCH 2 of 2] AMFND: Admin operation continuation if csi callback completes during headless [#1725 part 1] V1

2016-08-25 Thread minh chau
Hi,

The test failed for two reasons:
1. There are two places where the nodegroup operation borrows the 2N SG
FSM, but the AdminState of the SG is not stored to IMM:
 saAmfSGAdminState = ng->saAmfNGAdminState;
 ...
 su->sg_of_su->saAmfSGAdminState = SA_AMF_ADMIN_UNLOCKED;

This assignment needs to go through AVD_SG::set_admin_state().

2. When the su_si assignment response is received after headless,
@admin_ng and @ng_using_saAmfSGAdminState have not been restored.
They need to be restored somehow. A nodegroup can be created in any
adminState, so a nodegroup may be created as LOCKED while the SUs
belonging to it still have assignments; the adminState of the nodegroup
therefore can't be used to deduce them.
It seems that admin_ng and ng_using_saAmfSGAdminState need to be stored to IMM?
@Praveen: any suggestions?

 } else if ((su->sg_of_su->sg_ncs_spec == false) && 
((su->su_on_node->admin_ng != nullptr) ||
 (su->sg_of_su->ng_using_saAmfSGAdminState == true))) {
 AVD_AMF_NG *ng = su->su_on_node->admin_ng;
 //Got response from AMFND for assignments decrement 
su_cnt_admin_oper.
 if ((ng != nullptr) &&
 (ng->admin_ng_pend_cbk.admin_oper == 
SA_AMF_ADMIN_SHUTDOWN) ||
 (ng->admin_ng_pend_cbk.admin_oper == SA_AMF_ADMIN_LOCK)) &&
 (su->saAmfSUNumCurrActiveSIs == 0) && 
(su->saAmfSUNumCurrStandbySIs == 0) &&
 (AVSV_SUSI_ACT_DEL == 
n2d_msg->msg_info.n2d_su_si_assign.msg_act))) ||
 (ng->admin_ng_pend_cbk.admin_oper == 
SA_AMF_ADMIN_UNLOCK))) {
 su->su_on_node->su_cnt_admin_oper--;
 TRACE("node:'%s', su_cnt_admin_oper:%u",
su->su_on_node->name.c_str(),su->su_on_node->su_cnt_admin_oper);
 }
 process_su_si_response_for_ng(su, SA_AIS_OK);
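
To illustrate point 1 above, here is a minimal sketch of why the admin state should change only through a setter. It assumes a toy persistent store in place of IMM; ServiceGroup, PersistentStore and restore_admin_state() are hypothetical names and do not reflect the real AVD_SG interface.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

enum class AdminState { Unlocked, Locked, LockedInstantiation, ShuttingDown };

// Hypothetical stand-in for the persistent store (IMM in this thread).
using PersistentStore = std::map<std::string, AdminState>;

class ServiceGroup {
 public:
  ServiceGroup(std::string name, PersistentStore& store)
      : name_(std::move(name)), store_(store) {}

  // Every state change goes through here, so memory and store never diverge.
  void set_admin_state(AdminState state) {
    admin_state_ = state;
    store_[name_] = state;
  }

  AdminState admin_state() const { return admin_state_; }

  // Rebuilds the in-memory value after a restart.
  // Returns false if nothing was stored for this SG.
  bool restore_admin_state() {
    auto it = store_.find(name_);
    if (it == store_.end()) return false;
    admin_state_ = it->second;
    return true;
  }

 private:
  std::string name_;
  PersistentStore& store_;
  AdminState admin_state_ = AdminState::Unlocked;
};
```

A direct write to the member would update only the in-memory copy, and that copy is exactly what is lost when the controllers go down.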

On 25/08/16 21:36, Nagendra Kumar wrote:
> Further testing results:
> Node group lock has resulted in SG unstable. Logs and configuration file 
> attached.
>
> Configuration : SC-1, PL-3 and PL-4.
>
> Steps:
>
> 1. Unlock SU1(on PL-3), SU2 and SU3 (Both on PL-4).
> 2. Create node group of PL-3 and PL-4:
> 3. Lock the node group.
> amf-adm lock  safAmfNodeGroup=nagu,safAmfCluster=myAmfCluster
> 4. Hold the csi set callback in gdb, stop SC-1, respond OK from the csi set
> callback, and then start SC-1.
>
> SG becomes unstable if you try to unlock the Node group:
> Aug 25 16:57:06 PM_SC-1 osafamfd[2166]: NO 'safSg=AmfDemo_2N,safApp=AmfDemo1' 
> is in unstable/transition state
>
>
> Thanks
> -Nagu
>
>> -Original Message-
>> From: Nagendra Kumar
>> Sent: 24 August 2016 16:58
>> To: Minh Hon Chau; hans.nordeb...@ericsson.com; Praveen Malviya;
>> gary@dektech.com.au; long.hb.ngu...@dektech.com.au
>> Cc: opensaf-devel@lists.sourceforge.net
>> Subject: Re: [devel] [PATCH 2 of 2] AMFND: Admin operation continuation if
>> csi callback completes during headless [#1725 part 1] V1
>>
>> The below is the assignments after the test case (SU2 has standby
>> assignment):
>>
>> PM_SC-1:/home/nagu/views/staging-1725 # /etc/init.d/opensafd  status
>> safSISU=safSu=SU2\,safSg=AmfDemo_2N\,safApp=AmfDemo1,safSi=AmfDe
>> mo1,safApp=AmfDemo1
>>  saAmfSISUHAState=STANDBY(2)
>> safSISU=safSu=PL-
>> 3\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed3,safApp=OpenSAF
>>  saAmfSISUHAState=ACTIVE(1)
>> safSISU=safSu=PL-
>> 4\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed2,safApp=OpenSAF
>>  saAmfSISUHAState=ACTIVE(1)
>> safSISU=safSu=SC-
>> 2\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed4,safApp=OpenSAF
>>  saAmfSISUHAState=ACTIVE(1)
>> safSISU=safSu=SC-
>> 1\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed1,safApp=OpenSAF
>>  saAmfSISUHAState=ACTIVE(1)
>> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-
>> 2N,safApp=OpenSAF
>>  saAmfSISUHAState=STANDBY(2)
>> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-
>> 2N,safApp=OpenSAF
>>  saAmfSISUHAState=ACTIVE(1)
>>
>> Thanks
>> -Nagu
>>
>>> -Original Message-
>>> From: Nagendra Kumar
>>> Sent: 24 August 2016 16:55
>>> To: Minh Hon Chau; hans.nordeb...@ericsson.com; Praveen Malviya;
>>> gary@dektech.com.au; long.hb.ngu...@dektech.com.au
>>> Cc: opensaf-devel@lists.sourceforge.net
>>> Subject: Re: [devel] [PATCH 2 of 2] AMFND: Admin operation
>>> continuation if csi callback completes during headless [#1725 part 1]
>>> V1
>>>
>>> Hi Minh,
>>> With 1725_phase_1_V2.tgz, the below email TC has failed. Please
>> find
>>> the traces attached along with the configuration in the ticket.
>>>
>>> Thanks
>>> -Nagu
>>>
 -Original Message-
 From: Nagendra Kumar
 Sent: 23 August 2016 15:15
 To: Minh Hon Chau; hans.nordeb...@ericsson.com; Praveen Malviya;
 gary@dektech.com.au; long.hb.ngu...@dektech.com.au
 Cc: opensaf-devel@lists.sourceforge.net
 Subject: Re: [devel] [PATCH 2 of 2] AMFND: Admin operation
 continuation if csi callback completes during headless [#1725 part
>

Re: [devel] [PATCH 1 of 1] amfa: fixed freeing notification buff [#1642]

2016-08-25 Thread minh chau
Hi Praveen,

Just to confirm that I understand the problem you mentioned in
saAmfFinalize() correctly: the specification says the application should
call free()/Free_4() to release the allocated memory, so if the application
does not release it, the application is most likely misusing the API. Or do
you mean saAmfFinalize() should do the same as saNtfFinalize()? If that is
the case, it looks like just an enhancement to guard AMFA?
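
For reference, a minimal sketch of what "remembering memory by handle" could look like, so that a finalize call can release whatever the application did not free. This is an assumption about one possible design, not the AMFA implementation; AllocationTracker and its methods are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <set>

using Handle = unsigned long long;

// Hypothetical per-handle bookkeeping of buffers handed out to the application.
class AllocationTracker {
 public:
  // Allocates a buffer and records it under the given handle.
  void* allocate(Handle handle, std::size_t bytes) {
    void* p = std::malloc(bytes);
    if (p != nullptr) owned_[handle].insert(p);
    return p;
  }

  // Explicit free by the application; false if the buffer is not owned by handle.
  bool release(Handle handle, void* p) {
    auto it = owned_.find(handle);
    if (it == owned_.end() || it->second.erase(p) == 0) return false;
    std::free(p);
    if (it->second.empty()) owned_.erase(it);
    return true;
  }

  // Finalize: frees everything still outstanding and returns how many buffers that was.
  std::size_t finalize(Handle handle) {
    auto it = owned_.find(handle);
    if (it == owned_.end()) return 0;
    std::size_t count = it->second.size();
    for (void* p : it->second) std::free(p);
    owned_.erase(it);
    return count;
  }

 private:
  std::map<Handle, std::set<void*>> owned_;
};
```

The cost is one lookup per allocation and free, which is why it reads as a guard for misbehaving applications rather than something the specification requires.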

Thanks
Minh

On 25/08/16 20:04, praveen malviya wrote:
> Hi Minh,
>
> AMFA currently does not remember the allocated memory. It relies on 
> the application always to free the memory. In saAmfFinalize() also, it 
> does not free the memory. I think AMFA should remember memory by 
> associating it with handle because process which is starting PG 
> tracking may not be a component and may not call free()/Free_4() and 
> just relies on saAmfFinalize() call to release all the resources.
>
> I think as of now please go ahead with your suggested solution. From 
> finalize perspective this is a defect and applicable to all the 
> branches. So please raise a ticket for that.
>
> Thanks,
> Praveen
>
>
> On 22-Aug-16 12:08 PM, minh chau wrote:
>> Hi Praveen,
>>
>> The case you just mentioned is still in callback context, so Agent can
>> help application to release the allocated notification. But still
>> another case:
>>
>> +SaAmfProtectionGroupNotificationBufferT buff;
>> +buff.notification = NULL;
>> +rc = saAmfProtectionGroupTrack_4(my_amf_hdl, &track_csi,
>> SA_TRACK_CURRENT, &buff);
>> +if (rc != SA_AIS_OK) {
>> +syslog(LOG_ERR, "saAmfProtectionGroupTrack FAILED - %u", rc);
>> +goto done;
>> +}
>>
>> In this case Agent has to allocate notification but it's not in Agent's
>> context.
>> Application has to call API Free_4(buff.notification) to free up
>> notification.
>> In order to iterate to free longDn(s) inside Free_4(), Agent has to
>> memorize a list numberOfItems for every single call as above Track_4(),
>> or Agent can add sentinel element to the allocated notification.
>>
>> Thanks,
>> Minh
>>
>> On 22/08/16 15:34, praveen malviya wrote:
>>> Hi,
>>> The callback looks like this:
>>> typedef void
>>> (*SaAmfProtectionGroupTrackCallbackT_4)(
>>> const SaNameT *csiName,
>>> SaAmfProtectionGroupNotificationBufferT_4 *notificationBuffer,
>>> SaUint32T numberOfMembers,
>>> SaAisErrorT error);
>>>
>>> Inside this callback, application is supposed to call
>>> saAmfProtectionGroupNotificationFree_4(). So agent must be able to
>>> deduce this information as SaAmfProtectionGroupNotificationBufferT_4
>>> contains numberOfItems and also numberOfMembers is available from
>>> callback.
>>> Since B.04.01 APIs are not fully implemented, agent copies from old
>>> type of structure to new type in ava_cpy_protection_group_ntf().
>>>
>>>
>>> Thanks,
>>> Praveen
>>>
>>> On 22-Aug-16 10:51 AM, minh chau wrote:
>>>> Hi Praveen,
>>>>
>>>> The problem with B.04.01 is the API:
>>>> saAmfProtectionGroupNotificationFree_4(SaAmfHandleT hdl,
>>>> SaAmfProtectionGroupNotificationT_4 *notification) does not have
>>>> numberOfItems.
>>>> Agent does not know how many elements are in *notification, and each
>>>> element can hide a longDn inside it.
>>>>
>>>> Thanks,
>>>> Minh
>>>>
>>>>
>>>> On 22/08/16 15:04, praveen malviya wrote:
>>>>> Hi Minh,
>>>>>
>>>>> SaAmfProtectionGroupNotificationBufferT_4() contains numberOfItems to
>>>>> iterate over. In case of B.04.01, it should be simple as agent can
>>>>> directly call osaf_extended_name_free() during iteration inside
>>>>> saAmfProtectionGroupNotificationFree_4(). So I think, only a for loop
>>>>> which will iterate over numberOfItems is required.
>>>>>
>>>>> Problem was in B.01.01 case, where application will have to iterate
>>>>> and free the memory. For this, Long has already suggested and that
>>>>> needs to be documented.
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Praveen
>>>>>
>>>>>
>>>>> On 20-Aug-16 2:22 PM, minh chau wrote:
>>>>>> Hi Long, Praveen,
>>>>>>
>>>>>> Regarding this TODO
>>>>>>

Re: [devel] [PATCH 1 of 1] amfa: fixed freeing notification buff [#1642]

2016-08-25 Thread minh . chau
Hi Praveen,

Long is preparing the patch that adds a sentinel element.
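
As a rough sketch of the sentinel idea (not the patch itself): the array is allocated with one extra trailing element marked as the sentinel, so a free function that receives only the pointer can still walk the elements and release the name hidden in each one. Item, is_sentinel, alloc_items() and free_items() are hypothetical names, and a plain char* stands in for a longDn.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical element type; long_name stands in for a heap-allocated longDn.
struct Item {
  char* long_name;
  bool is_sentinel;
};

// Allocates n items plus one trailing sentinel and copies each name.
inline Item* alloc_items(const char* const* names, std::size_t n) {
  Item* items = static_cast<Item*>(std::calloc(n + 1, sizeof(Item)));
  if (items == nullptr) return nullptr;
  for (std::size_t i = 0; i < n; ++i) {
    std::size_t len = std::strlen(names[i]) + 1;
    items[i].long_name = static_cast<char*>(std::malloc(len));
    if (items[i].long_name != nullptr) std::memcpy(items[i].long_name, names[i], len);
  }
  items[n].is_sentinel = true;  // marks the end, no count needed later
  return items;
}

// Frees every name up to the sentinel, then the array; returns the item count.
inline std::size_t free_items(Item* items) {
  if (items == nullptr) return 0;
  std::size_t freed = 0;
  for (Item* it = items; !it->is_sentinel; ++it) {
    std::free(it->long_name);
    ++freed;
  }
  std::free(items);
  return freed;
}
```

The alternative discussed in this thread, remembering numberOfItems per Track_4() call inside the Agent, avoids the extra element but needs bookkeeping keyed by the returned pointer.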

Thanks,
Minh

> Hi Minh,
>
> I think this is subject to interpretation. After finalize(), handle
> becomes invalid. So an application cannot call Free_4() to free any
> memory. In such a case, freeing any resources associated with this
> handle in finalize() seems to be ok.
> Anyway, freeing in finalize() can be postponed until a real use case
> comes up. In that case please go ahead with adding the sentinel element.
> I think there are no other pending things in #1642 other than this.
>
> Thanks,
> Praveen
>
>
>
> On 26-Aug-16 7:35 AM, minh chau wrote:
>> Hi Praveen,
>>
>> Just to confirm if I understand correctly the problem you mentioned in
>> saAmfFinalize(). As the specification says application should call
>> free()/Free_4() to release the allocated memory, if application does not
>> release memory then it's most likely application misuses API. Or do you
>> mean saAmfFinalize() should do the same as saNtfFinalize()? if this is
>> the case it looks just an enhancement to guard AMFA?
>>
>> Thanks
>> Minh
>>
>> On 25/08/16 20:04, praveen malviya wrote:
>>> Hi Minh,
>>>
>>> AMFA currently does not remember the allocated memory. It relies on
>>> the application always to free the memory. In saAmfFinalize() also, it
>>> does not free the memory. I think AMFA should remember memory by
>>> associating it with handle because process which is starting PG
>>> tracking may not be a component and may not call free()/Free_4() and
>>> just relies on saAmfFinalize() call to release all the resources.
>>>
>>> I think as of now please go ahead with your suggested solution. From
>>> finalize perspective this is a defect and applicable to all the
>>> branches. So please raise a ticket for that.
>>>
>>> Thanks,
>>> Praveen
>>>
>>>
>>> On 22-Aug-16 12:08 PM, minh chau wrote:
>>>> Hi Praveen,
>>>>
>>>> The case you just mentioned is still in callback context, so Agent can
>>>> help application to release the allocated notification. But still
>>>> another case:
>>>>
>>>> +SaAmfProtectionGroupNotificationBufferT buff;
>>>> +buff.notification = NULL;
>>>> +rc = saAmfProtectionGroupTrack_4(my_amf_hdl, &track_csi,
>>>> SA_TRACK_CURRENT, &buff);
>>>> +if (rc != SA_AIS_OK) {
>>>> +syslog(LOG_ERR, "saAmfProtectionGroupTrack FAILED - %u", rc);
>>>> +goto done;
>>>> +}
>>>>
>>>> In this case Agent has to allocate notification but it's not in
>>>> Agent's
>>>> context.
>>>> Application has to call API Free_4(buff.notification) to free up
>>>> notification.
>>>> In order to iterate to free longDn(s) inside Free_4(), Agent has to
>>>> memorize a list numberOfItems for every single call as above
>>>> Track_4(),
>>>> or Agent can add sentinel element to the allocated notification.
>>>>
>>>> Thanks,
>>>> Minh
>>>>
>>>> On 22/08/16 15:34, praveen malviya wrote:
>>>>> Hi,
>>>>> The callback looks like this:
>>>>> typedef void
>>>>> (*SaAmfProtectionGroupTrackCallbackT_4)(
>>>>> const SaNameT *csiName,
>>>>> SaAmfProtectionGroupNotificationBufferT_4 *notificationBuffer,
>>>>> SaUint32T numberOfMembers,
>>>>> SaAisErrorT error);
>>>>>
>>>>> Inside this callback, application is supposed to call
>>>>> saAmfProtectionGroupNotificationFree_4(). So agent must be able to
>>>>> deduce this information as SaAmfProtectionGroupNotificationBufferT_4
>>>>> contains numberOfItems and also numberOfMembers is available from
>>>>> callback.
>>>>> Since B.04.01 APIs are not fully implemented, agent copies from old
>>>>> type of structure to new type in ava_cpy_protection_group_ntf().
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Praveen
>>>>>
>>>>> On 22-Aug-16 10:51 AM, minh chau wrote:
>>>>>> Hi Praveen,
>>>>>>
>>>>>> The problem with B.04.01 is the API:
>>>>>> saAmfProtectionGroupNotificationFree_4(SaAmfHandleT hdl,
>>>>>> SaAmfProtectionGroupNotificationT_4 *notification) does not h

Re: [devel] [PATCH 2 of 2] AMFND: Admin operation continuation if csi callback completes during headless [#1725 part 1] V1

2016-08-28 Thread minh chau
Hi Praveen,

Thanks for looking through the patch.
The potential problem with restoring the nodegroup is that a nodegroup can
be created in LOCKED while its SUs still have assignments, which can be
ambiguous for AMFD after headless. For example:
Suppose SU4 is hosted on PL4 and SU5 on PL5; SU4 has the active
assignment and SU5 has the standby assignment.
case 1: Create nodegroup (PL4 + PL5) with LOCKED, lock PL5, lock PL4,
delay quiesced csi cbk, stop SC, restart SC.
case 2: Create nodegroup (PL4 + PL5) with LOCKED, lock nodegroup, delay
quiesced csi cbk, stop SC, restart SC.

If case 2 is what actually happened before headless, then @admin_ng and
@ng_using_saAmfSGAdminState need to be restored; otherwise
process_su_si_response_for_ng() won't be called, saAmfSGAdminState
remains LOCKED, and the SG never reaches the STABLE state.

But in both cases, after headless, AMFD sees that all PLs are LOCKED, the
nodegroup is LOCKED, and SU4 has a pending quiesced csi cbk, so both run
into the same code flow. In case 1, @admin_ng and
@ng_using_saAmfSGAdminState should not be set, since case 1 was not a
nodegroup operation before headless.
I have tested both cases and they work with the patch attached to the
ticket, but it still looks like a potential problem: the two cases are
not distinguishable to AMFD after headless, so @admin_ng and
@ng_using_saAmfSGAdminState may get hit at some point in case 1.

If case 1 looks ok to you from nodegroup point of view, then I will 
float the patch for review.
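
The ambiguity between case 1 and case 2 can be shown with a toy model. It assumes, as described above, that everything AMFD can observe after headless is the admin state of the nodegroup and nodes plus the pending callback; the Cluster struct and the helper names are hypothetical.

```cpp
#include <cassert>

// Hypothetical model of the example above; field names are illustrative.
struct Cluster {
  bool ng_locked = false;
  bool pl4_locked = false;
  bool pl5_locked = false;
  bool su4_quiesced_pending = false;
  bool ng_operation = false;  // in-memory only, lost when the SC restarts
};

inline Cluster create_nodegroup_locked() {
  Cluster c;
  c.ng_locked = true;
  return c;
}

// SU5 holds the standby assignment, so no quiesced callback is involved.
inline void lock_pl5(Cluster& c) { c.pl5_locked = true; }

// SU4 holds the active assignment, so a quiesced csi cbk becomes pending.
inline void lock_pl4(Cluster& c) {
  c.pl4_locked = true;
  c.su4_quiesced_pending = true;
}

// A nodegroup lock has the same visible effects, plus the in-memory marker.
inline void lock_nodegroup(Cluster& c) {
  lock_pl5(c);
  lock_pl4(c);
  c.ng_operation = true;
}

// What AMFD can still see after headless: everything except ng_operation.
inline Cluster observe_after_restart(Cluster c) {
  c.ng_operation = false;
  return c;
}

inline bool same_observation(const Cluster& a, const Cluster& b) {
  return a.ng_locked == b.ng_locked && a.pl4_locked == b.pl4_locked &&
         a.pl5_locked == b.pl5_locked &&
         a.su4_quiesced_pending == b.su4_quiesced_pending &&
         a.ng_operation == b.ng_operation;
}
```

Before the restart the two histories differ only in ng_operation; after it they compare equal, so any restore logic based on the observed state alone has to treat both cases the same way.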

Thanks,
Minh


On 26/08/16 16:08, praveen malviya wrote:
> Hi,
>
> I have gone through the amfd traces. The patch for NG also seems to be
> ok, but some minor improvements can be made.
>
> As pointed out by Minh, when the whole SG is mapped in the NG (say case
> a), AMFD uses the SG_ADMIN flow and the SG admin state, without exposing
> it to the user through IMM, for the 2N model. In the other case, when
> only one SU is assigned in the NG (say case b), there should not be any
> problem because the operation fully depends on the NG admin state: case
> b) does not use the SG admin state or ng_using_saAmfSGAdminState, so it
> should work fine.
>
> I think we can take the help of following facts and functions to 
> improve the patch and with that restoring ng_using_saAmfSGAdminState 
> from IMM may not be required:
> 1) In a normal cluster, if a controller switchover/failover happens while
> an NG operation is going on, the standby controller continues the admin
> operation with the information it gets through CKPT updates in
> dec_sg_admin_state() and dec_ng_admin_state(). The active controller never
> checkpoints ng_using_saAmfSGAdminState; it is deduced in these
> functions. The situation after headless is almost like that.
> I think that in case a, a shutdown operation that is still going on
> (admin state of NG still SHUTTING_DOWN) when the system becomes headless
> requires more params, while the lock operation does not. In a shutdown
> operation, AMFD has to ensure the transition of the NG and nodes to LOCKED state.
>
> 2)Like controller fail-over/switch-over after headless also, we are 
> not bound to reply to IMM for admin operation completion. So we need 
> to analyse if we require to restore node->admin_ng. Half of the code 
> in process_su_si_response_for_ng() is for tracking the state of admin 
> operation so that AMFD replies to IMM for admin operation and this is 
> not required after headless state.
>
> I think the problem is not that complex, as it applies only to 2N
> models and only in case a).
>
> Thanks,
> Praveen
>
> On 25-Aug-16 6:38 PM, minh chau wrote:
>> Hi,
>>
>> The test failed because two reasons:
>> 1. There are two places that nodegroup operation borrows 2N SG FSM, but
>> the AdminState of SG is not stored to IMM
>> saAmfSGAdminState = ng->saAmfNGAdminState;
>> ...
>> su->sg_of_su->saAmfSGAdminState = SA_AMF_ADMIN_UNLOCKED;
>>
>> This setting needs to be called by AVD_SG::set_admin_state()
>>
>> 2. After receives su_si assignment response after headless, @admin_ng,
>> @ng_using_saAmfSGAdminState have not been restored.
>> They need to be restored by somehow. Since nodegroup allows to be
>> created at any adminState. So there should be the case nodegroup's
>> AdminState is created with LOCKED but the belonging SUs are still having
>> assignment, so adminState of nodegroup can't be used.
>> The admin_ng, ng_using_saAmfSGAdminState seem need to be stored to IMM?
>> @Praveen: any suggestions?
>>
>> } else if ((su->sg_of_su->sg_ncs_spec == false) &&
>> ((su->su_on_node->admin_ng != nullptr) ||
>> (su->sg_of_su->ng_using_saAmfSGAdminState == true))) {
>> AVD_AMF_NG *ng = su->su_on_node->admin_ng;
>> //Got response 

Re: [devel] [PATCH 2 of 2] AMFND: Admin operation continuation if csi callback completes during headless [#1725 part 1] V1

2016-08-28 Thread minh chau
Hi Nagu,

Thanks for your time to verify the patches.
Yes, the versioning changes for AMFD-AMFD are not required; I will remove
them before pushing.

Thanks,
Minh

On 26/08/16 22:13, Nagendra Kumar wrote:
> Hi Minh,
>   Ack for patches (1725_01_V5_intro_new_rta_states_longDn.diff, 
> 1725_02_V2_resend_su_si_assign_msg_longDn.diff, 
> 1725_02_V2_bugfix_resend_buffer_in_set_leds.diff and 
> 1725_02_V2_bugfix_1_honor_cluster_sync_timer.diff) for phase 1 with the 
> following comments:
>
> 1. Let us document that the nodegroup is not supported. We will remove the 
> excerpts  if we can fix it after FC and before RC.
> 2. Reported issue of sg unstable, you can work after FC. But this is 
> mandatory to fix before RC/GA.
> 3. I think, the versioning changes are not required (at least between 
> Amfd-Amfd), can you please confirm.
>
> The following are the tests done:
>
> Tested the following operations by responding respective callback or 
> script(mentioned below) in two cases:
> 1. After headless and before recovery.
> 2. After headless recovery.
>
> SU:
> 
> lock: Quiesced, Act, Standby(for SU3)
> unlock: Act
> shutdown: Quiescing.
> lock-in : terminate
> unlock-in: instantiate
> 
>
> Comp:
> 
> Restart: Instantiate
> 
>
> SI:
> 
> Unlock: Act
> Lock: Quiesced, Remove
> Shutdown: Quiescing, Remove
> SI-Swap: Quiesced, Act, Standby
> 
>
> SG:
> 
> Lock: Quiesced
> Unlock: Act
> Lock-in: Terminate
> Unlock-in: Instantiate
> 
>
> Node:
> 
> Lock: Quiesced
> Unlock: Act
> 
>
> Node Group:
> 
> Unlock: Act
> Lock: this test case failed.
> 
>
> Thanks
> -Nagu
>
>> -Original Message-
>> From: Nagendra Kumar
>> Sent: 25 August 2016 17:07
>> To: Minh Hon Chau; hans.nordeb...@ericsson.com; Praveen Malviya;
>> gary@dektech.com.au; long.hb.ngu...@dektech.com.au
>> Cc: opensaf-devel@lists.sourceforge.net
>> Subject: Re: [devel] [PATCH 2 of 2] AMFND: Admin operation continuation if
>> csi callback completes during headless [#1725 part 1] V1
>>
>> Further testing results:
>> Node group lock has resulted in SG unstable. Logs and configuration file
>> attached.
>>
>> Configuration : SC-1, PL-3 and PL-4.
>>
>> Steps:
>>
>> 1. Unlock SU1(on PL-3), SU2 and SU3 (Both on PL-4).
>> 2. Create node group of PL-3 and PL-4:
>> 3. Lock the node group.
>> amf-adm lock  safAmfNodeGroup=nagu,safAmfCluster=myAmfCluster
>> 4. Hold the csi set callback in gdb, stop SC-1, respond OK from the csi set
>> callback, and then start SC-1.
>>
>> SG becomes unstable if you try to unlock the Node group:
>> Aug 25 16:57:06 PM_SC-1 osafamfd[2166]: NO
>> 'safSg=AmfDemo_2N,safApp=AmfDemo1' is in unstable/transition state
>>
>>
>> Thanks
>> -Nagu
>>
>>> -Original Message-
>>> From: Nagendra Kumar
>>> Sent: 24 August 2016 16:58
>>> To: Minh Hon Chau; hans.nordeb...@ericsson.com; Praveen Malviya;
>>> gary@dektech.com.au; long.hb.ngu...@dektech.com.au
>>> Cc: opensaf-devel@lists.sourceforge.net
>>> Subject: Re: [devel] [PATCH 2 of 2] AMFND: Admin operation
>>> continuation if csi callback completes during headless [#1725 part 1]
>>> V1
>>>
>>> The below is the assignments after the test case (SU2 has standby
>>> assignment):
>>>
>>> PM_SC-1:/home/nagu/views/staging-1725 # /etc/init.d/opensafd  status
>>>
>> safSISU=safSu=SU2\,safSg=AmfDemo_2N\,safApp=AmfDemo1,safSi=AmfDe
>>> mo1,safApp=AmfDemo1
>>>  saAmfSISUHAState=STANDBY(2)
>>> safSISU=safSu=PL-
>>> 3\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed3,safApp=OpenSAF
>>>  saAmfSISUHAState=ACTIVE(1)
>>> safSISU=safSu=PL-
>>> 4\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed2,safApp=OpenSAF
>>>  saAmfSISUHAState=ACTIVE(1)
>>> safSISU=safSu=SC-
>>> 2\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed4,safApp=OpenSAF
>>>  saAmfSISUHAState=ACTIVE(1)
>>> safSISU=safSu=SC-
>>> 1\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed1,safApp=OpenSAF
>>>  saAmfSISUHAState=ACTIVE(1)
>>> safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-
>>> 2N,safApp=OpenSAF
>>>  saAmfSISUHAState=STANDBY(2)
>>> safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-
>>> 2N,safApp=OpenSAF
>>>  saAmfSISUHAState=ACTIVE(1)
>>>
>>> Thanks
>>> -Nagu
>>>
 -Original Message-
 From: Nagendra Kumar
 Sent: 24 August 2016 16:55
 To: Minh Hon Chau; hans.nordeb...@ericsson.com; Praveen Malviya;
 gary@dektech.com.au; long.hb.ngu...@dektech.com.au
 Cc: opensaf-devel@lists.sourceforge.net
 Subject: Re: [devel] [PATCH 2 of 2] AMFND: Admin operation
 continuation if csi callback completes during headless [#1725 part
 1]
 V1

 Hi Minh,
With 1725_phase_1_V2.tgz, the below email TC has failed. Please
>>> find
 the traces attached along with the configuration in the ticket.


Re: [devel] [PATCH 1 of 1] amfd: support NplusM model for supported admin ops on NG [#1454]

2016-08-28 Thread minh chau
Hi Praveen,

Ack (normal test only); I think the abnormal failover is supposed to work
in REALIGN in NpM and Nway.
Maybe you can add a note in the README to revisit the consistency of using
the SG FSM for nodegroup in all SGs?

Thanks,
Minh

On 24/08/16 19:02, praveen malviya wrote:
> Hi Minh,
>
> Please see responses inline, marked [Praveen].
>
> Thanks,
> Praveen
>
> On 24-Aug-16 2:01 PM, minh chau wrote:
>> Hi Praveen,
>>
>> I have tested the patches in case all SUs of NpM/Nway belong to one
>> nodegroup, it works for me. Will try the case that SUs are partially
>> in/out of nodegroup tomorrow.
>> But please, if you have time, can you try to put NpM's ng_admin under
>> FSM_SG_ADMIN? In addition to my previous comment, and as you know, in
>> #1725 it's hard to deduce the states (sg fsm state is one of those)
>> because different SGs use the FSM differently in some cases. This
>> inconsistency causes difficulty: instead of applying a generic solution
>> for all SGs, each SG has to be treated in a different way. I think we
>> could also see this kind of problem in the future.
> [Praveen] In 5.1, #1725 is targeting only the 2N model. In future, when
> #1725 supports other red models (upcoming releases), I will make
> any changes that are required in other red models for NG operations.
>
>
>> Thanks,
>> Minh
>>
>> On 23/08/16 16:58, praveen malviya wrote:
>>> Hi Minh,
>>>
>>> I have attached patches for #1454 and #1608 in the ticket #1454.
>>> Please apply them in order.
>>>
>>> Thanks,
>>> Praveen
>>>
>>> On 23-Aug-16 11:56 AM, minh chau wrote:
>>>> Hi Praveen,
>>>>
>>>> Since AMF longDn has been pushed, can you please attach a longDn 
>>>> rebased
>>>> version to ticket (both #1454 + #1608) so we can do some test?
>>>>
>>>> Thanks,
>>>> Minh
>>>>
>>>> On 23/08/16 15:56, praveen malviya wrote:
>>>>> Hi Minh,
>>>>>
>>>>> Thanks for reviewing the patch.
>>>>> Please see inline with [Praveen]
>>>>>
>>>>> Thanks,
>>>>> Praveen
>>>>>
>>>>>
>>>>>
>>>>> On 23-Aug-16 5:53 AM, minh chau wrote:
>>>>>> Hi Praveen,
>>>>>>
>>>>>> One comment in line with [Minh]
>>>>>>
>>>>>> Thanks
>>>>>> Minh
>>>>>>
>>>>>> On 20/07/16 18:57, praveen.malv...@oracle.com wrote:
>>>>>>> osaf/services/saf/amf/amfd/include/sg.h |   1 +
>>>>>>>   osaf/services/saf/amf/amfd/nodegroup.cc  |   4 +-
>>>>>>>   osaf/services/saf/amf/amfd/sg_npm_fsm.cc |  62
>>>>>>> ++-
>>>>>>>   3 files changed, 62 insertions(+), 5 deletions(-)
>>>>>>>
>>>>>>>
>>>>>>> Currently 2N, N-Way Active and NoRed models are supported for lock,
>>>>>>> shutdown,
>>>>>>> lock-in and unlock-in admin operations on NGs.
>>>>>>>
>>>>>>> This patch supports NplusM model also.
>>>>>>>
>>>>>>> diff --git a/osaf/services/saf/amf/amfd/include/sg.h
>>>>>>> b/osaf/services/saf/amf/amfd/include/sg.h
>>>>>>> --- a/osaf/services/saf/amf/amfd/include/sg.h
>>>>>>> +++ b/osaf/services/saf/amf/amfd/include/sg.h
>>>>>>> @@ -507,6 +507,7 @@ public:
>>>>>>>   uint32_t susi_failed(AVD_CL_CB *cb, AVD_SU *su,
>>>>>>>   struct avd_su_si_rel_tag *susi, AVSV_SUSI_ACT act,
>>>>>>> SaAmfHAStateT state);
>>>>>>>   void node_fail_si_oper(AVD_CL_CB *cb, AVD_SU *su);
>>>>>>> +void ng_admin(AVD_SU *su, AVD_AMF_NG *ng);
>>>>>>> private:
>>>>>>>   uint32_t su_fault_su_oper(AVD_CL_CB *cb, AVD_SU *su);
>>>>>>> diff --git a/osaf/services/saf/amf/amfd/nodegroup.cc
>>>>>>> b/osaf/services/saf/amf/amfd/nodegroup.cc
>>>>>>> --- a/osaf/services/saf/amf/amfd/nodegroup.cc
>>>>>>> +++ b/osaf/services/saf/amf/amfd/nodegroup.cc
>>>>>>> @@ -687,6 +687,7 @@ void avd_ng_admin_state_set(AVD_AMF_NG*
>>>>>>>   avd_send_admin_state_chg_ntf(&ng->name,
>>>>>>> (SaA

Re: [devel] [PATCH 1 of 1] amfd: support NplusM model for supported admin ops on NG [#1454]

2016-08-29 Thread minh chau
Hi Praveen,

Yes it's for #1454 and #1608

Thanks,
Minh

On 29/08/16 17:08, praveen malviya wrote:
> Hi Minh,
>
> Thanks for reviewing and testing.
> I guess the ack is for both #1454 and #1608.
> I will add a note.
>
>
> Thanks,
> Praveen
>
> On 29-Aug-16 12:21 PM, minh chau wrote:
>> Hi Praveen,
>>
>> Ack (normal test only), the abnormal failover is supposed to work in
>> REALIGN in NpM and Nway I think
>> Maybe you can add a note in README to revisit the consistency of using
>> SG FSM for nodegroup in all SGs?
>>
>> Thanks,
>> Minh
>>
>> On 24/08/16 19:02, praveen malviya wrote:
>>> Hi Minh,
>>>
>>> Please see responses inline, marked [Praveen].
>>>
>>> Thanks,
>>> Praveen
>>>
>>> On 24-Aug-16 2:01 PM, minh chau wrote:
>>>> Hi Praveen,
>>>>
>>>> I have tested the patches in case all SUs of NpM/Nway belong to one
>>>> nodegroup, it works for me. Will try the case that SUs are partially
>>>> in/out of nodegroup tomorrow.
>>>> But please, if you have time, can you try to put NpM's ng_admin under
>>>> FSM_SG_ADMIN? In addition to my previous comment, and as you know, in
>>>> #1725 it's hard to deduce the states (sg fsm state is one of those)
>>>> because different SGs use the FSM differently in some cases. This
>>>> inconsistency causes difficulty: instead of applying a generic solution
>>>> for all SGs, each SG has to be treated in a different way. I think we
>>>> could also see this kind of problem in the future.
>>> [Praveen] In 5.1, #1725 is targeting only the 2N model. In future, when
>>> #1725 supports other red models (upcoming releases), I will make
>>> any changes that are required in other red models for NG operations.
>>>
>>>
>>>> Thanks,
>>>> Minh
>>>>
>>>> On 23/08/16 16:58, praveen malviya wrote:
>>>>> Hi Minh,
>>>>>
>>>>> I have attached patches for #1454 and #1608 in the ticket #1454.
>>>>> Please apply them in order.
>>>>>
>>>>> Thanks,
>>>>> Praveen
>>>>>
>>>>> On 23-Aug-16 11:56 AM, minh chau wrote:
>>>>>> Hi Praveen,
>>>>>>
>>>>>> Since AMF longDn has been pushed, can you please attach a longDn
>>>>>> rebased version to the ticket (both #1454 + #1608) so we can do some
>>>>>> testing?
>>>>>>
>>>>>> Thanks,
>>>>>> Minh
>>>>>>
>>>>>> On 23/08/16 15:56, praveen malviya wrote:
>>>>>>> Hi Minh,
>>>>>>>
>>>>>>> Thanks for reviewing the patch.
>>>>>>> Please see inline with [Praveen]
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Praveen
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 23-Aug-16 5:53 AM, minh chau wrote:
>>>>>>>> Hi Praveen,
>>>>>>>>
>>>>>>>> One comment in line with [Minh]
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Minh
>>>>>>>>
>>>>>>>> On 20/07/16 18:57, praveen.malv...@oracle.com wrote:
>>>>>>>>> osaf/services/saf/amf/amfd/include/sg.h |   1 +
>>>>>>>>>   osaf/services/saf/amf/amfd/nodegroup.cc  |   4 +-
>>>>>>>>>   osaf/services/saf/amf/amfd/sg_npm_fsm.cc |  62
>>>>>>>>> ++-
>>>>>>>>>   3 files changed, 62 insertions(+), 5 deletions(-)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Currently 2N, N-Way Active and NoRed models are supported for
>>>>>>>>> lock, shutdown, lock-in and unlock-in admin operations on NGs.
>>>>>>>>>
>>>>>>>>> This patch supports NplusM model also.
>>>>>>>>>
>>>>>>>>> diff --git a/osaf/services/saf/amf/amfd/include/sg.h
>>>>>>>>> b/osaf/services/saf/amf/amfd/include/sg.h
>>>>>>>>> --- a/osaf/services/saf/amf/amfd/include/sg.h
>>&g

Re: [devel] [PATCH 1 of 1] amfd: support NplusM model for supported admin ops on NG [#1454]

2016-08-29 Thread minh chau
Yes from me, but how about the AMF maintainers?
Thanks,
Minh

On 29/08/16 21:15, praveen malviya wrote:
> Hi Minh,
>
> I am pushing this README_NODEGROUP into the amf directory.
> Any comments will be addressed post FC.
> Do you agree?
>
> Thanks,
> Praveen
>
>
> On 29-Aug-16 2:46 PM, praveen malviya wrote:
>> Hi Minh,
>>
>> Following content will go in Readme:
>>
>> High Level Implementation Notes for LOCK and SHUTDOWN operation on NG:
>> ===
>>
>> In the 2N model, there are broadly two cases for LOCK and SHUTDOWN
>> operations on NG:
>> A) The whole 2N SG is mapped in NG, which means all assigned SUs are
>> hosted on the nodes of NG. Or
>> B) The SG is partially mapped in NG, i.e. either the active SU or the
>> standby SU is hosted on one of the nodes of NG.
>>
>> Currently the 2N model supports SI dep within SU. So in LOCK and SHUTDOWN
>> operations, the quiesced and quiescing HA states, respectively, should be
>> given honoring SI deps.
>> In case A), AVD_SG_FSM_SG_ADMIN is used, as it honors SI dep
>> while giving quiesced or quiescing assignments.
>> Case B) becomes the case of a lock of either the standby Node/SU or the
>> active Node/SU, which still needs to be handled honoring SI dep. Here the
>> AVD_SG_FSM_SU_OPER fsm state is used via the internal function
>> su_admin_down(), as it handles SI dep.
>> So in the 2N model, AMFD always uses internal FSM functions by calling
>> them inside the wrapper function ng_admin().
>>
>> Other red models do not support SI deps within SU as of now. So in these
>> models, AMFD uses AVD_SG_FSM_SG_REALIGN by keeping multiple SUs in the
>> operation list while performing LOCK and SHUTDOWN operations on NG. Once
>> SI dep is fully supported in these models, AMFD can use an internal SG FSM
>> state like 2N.
>>
>> However, there is one case where AMFD could still use the
>> AVD_SG_FSM_SG_ADMIN state for these red models: when the whole SG is
>> mapped in NG (all assigned SUs are hosted on the nodes of NG). But such a
>> case is more likely in the 2N model, where only two SUs can be assigned at
>> any time. In other red models there can be many assigned SUs, so the
>> whole SG being mapped in NG is much less likely. So, as of now, in other
>> models the AVD_SG_FSM_SG_REALIGN state is used, keeping multiple SUs in
>> the oper list, for this case also.
>> But when SI deps are completely supported within SU in these models,
>> AMFD cannot use the realign state and will have to use the internal FSM
>> code, as it will be enhanced for that.
>>
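The README notes above boil down to a small decision table. A minimal sketch of that table, assuming hypothetical enum and function names (RedModel, SgFsm and fsm_for_ng_admin are illustrative only and are not taken from the amfd sources; the enumerators stand for the AVD_SG_FSM_* states named in the notes):

```cpp
// Hypothetical names for illustration; not the real amfd declarations.
enum class RedModel { TwoN, NplusM, NWay, NWayActive, NoRed };

// Stand-ins for AVD_SG_FSM_SG_ADMIN, AVD_SG_FSM_SU_OPER, AVD_SG_FSM_SG_REALIGN.
enum class SgFsm { SG_ADMIN, SU_OPER, SG_REALIGN };

// Which SG FSM state the notes say is used for LOCK/SHUTDOWN on a nodegroup.
inline SgFsm fsm_for_ng_admin(RedModel model, bool whole_sg_in_ng) {
  if (model == RedModel::TwoN) {
    // 2N honors SI deps: case A uses SG_ADMIN, case B uses SU_OPER.
    return whole_sg_in_ng ? SgFsm::SG_ADMIN : SgFsm::SU_OPER;
  }
  // Other models: REALIGN with several SUs in the operation list,
  // for now even when the whole SG is mapped in the nodegroup.
  return SgFsm::SG_REALIGN;
}
```

This only captures the current state described in the notes; once SI deps are supported within SU in the other models, the last branch would be expected to change.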
>>
>> Thanks,
>> Praveen
>>
>>
>>
>> On 29-Aug-16 12:44 PM, minh chau wrote:
>>> Hi Praveen,
>>>
>>> Yes it's for #1454 and #1608
>>>
>>> Thanks,
>>> Minh
>>>
>>> On 29/08/16 17:08, praveen malviya wrote:
>>>> Hi Minh,
>>>>
>>>> Thanks for reviewing and testing.
>>>> I guess the ack is for both #1454 and #1608.
>>>> I will add a note.
>>>>
>>>>
>>>> Thanks,
>>>> Praveen
>>>>
>>>> On 29-Aug-16 12:21 PM, minh chau wrote:
>>>>> Hi Praveen,
>>>>>
>>>>> Ack (normal test only), the abnormal failover is supposed to work in
>>>>> REALIGN in NpM and Nway I think
>>>>> Maybe you can add a note in README to revisit the consistency of 
>>>>> using
>>>>> SG FSM for nodegroup in all SGs?
>>>>>
>>>>> Thanks,
>>>>> Minh
>>>>>
>>>>> On 24/08/16 19:02, praveen malviya wrote:
>>>>>> Hi Minh,
>>>>>>
>>>>>> Please see responses inline with [Praveen]
>>>>>>
>>>>>> Thanks,
>>>>>> Praveen
>>>>>>
>>>>>> On 24-Aug-16 2:01 PM, minh chau wrote:
>>>>>>> Hi Praveen,
>>>>>>>
>>>>>>> I have tested the patches in case all SUs of NpM/Nway belong to one
>>>>>>> nodegroup, it works for me. Will try the case that SUs are 
>>>>>>> partially
>>>>>>> in/out of nodegroup tomorrow.
>>>>>>> But please, if you have time, can you try to make NpM's ng_admin 
>>>>>>> under
>>>>>>> FSM_SG_ADMIN. Addition to my previous comment, and also as you
>>>>>>> knew, in
>>>>>>> #1725 it's ha

Re: [devel] [PATCH 1 of 1] amfnd: cppcheck warnings with severity error [#1642]

2016-08-29 Thread minh chau
Ack,
Thanks, Minh

On 29/08/16 16:25, Long HB Nguyen wrote:
>   osaf/services/saf/amf/amfnd/cbq.cc |  3 +--
>   1 files changed, 1 insertions(+), 2 deletions(-)
>
>
> diff --git a/osaf/services/saf/amf/amfnd/cbq.cc 
> b/osaf/services/saf/amf/amfnd/cbq.cc
> --- a/osaf/services/saf/amf/amfnd/cbq.cc
> +++ b/osaf/services/saf/amf/amfnd/cbq.cc
> @@ -599,8 +599,7 @@ uint32_t avnd_evt_tmr_cbk_resp_evh(AVND_
>   }
>   /* treat it as comp failure (determine the recommended 
> recovery) */
>   if (AVSV_AMF_HC == rec->cbk_info->type) {
> - AVND_COMP_HC_REC tmp_hc_rec;
> - memset(&tmp_hc_rec, '\0', sizeof(AVND_COMP_HC_REC));
> + AVND_COMP_HC_REC tmp_hc_rec = {};
>   tmp_hc_rec.key = rec->cbk_info->param.hc.hc_key;
>   tmp_hc_rec.req_hdl = rec->cbk_info->hdl;
>   hc_rec = m_AVND_COMPDB_REC_HC_GET(*(rec->comp), 
> tmp_hc_rec);
>
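The change above swaps a memset() for an empty-brace initializer. A minimal sketch of why both leave a plain aggregate zeroed, assuming a hypothetical HcRec struct as a stand-in for AVND_COMP_HC_REC (whose real layout is not shown in this thread):

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for AVND_COMP_HC_REC; the real struct is not shown here.
struct HcRec {
  char key[32];
  std::uint64_t req_hdl;
  void* comp;
};

// Old style: declare first, then zero the bytes by hand.
inline HcRec make_with_memset() {
  HcRec rec;
  std::memset(&rec, 0, sizeof(rec));
  return rec;
}

// New style: empty braces value-initialize every member to zero/null.
inline HcRec make_with_braces() {
  HcRec rec = {};
  return rec;
}

// True when every member of the record is zero/null.
inline bool is_zeroed(const HcRec& rec) {
  for (char c : rec.key) {
    if (c != '\0') return false;
  }
  return rec.req_hdl == 0 && rec.comp == nullptr;
}
```

The brace form also stays valid if a member later becomes a non-trivial type such as std::string, whereas memset() over such an object is not valid.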


--
___
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel


Re: [devel] [PATCH 1 of 1] amfd: fix cppcheck errors [#1642]

2016-08-29 Thread minh chau
Ack
Thanks, Minh

On 29/08/16 16:13, Gary Lee wrote:
>   osaf/services/saf/amf/amfd/ckpt_enc.cc |  2 +-
>   osaf/services/saf/amf/amfd/ndproc.cc   |  8 
>   osaf/services/saf/amf/amfd/role.cc |  3 +--
>   3 files changed, 6 insertions(+), 7 deletions(-)
>
>
> fix cppcheck errors introduced by changing various members of structs to 
> std::string
>
> diff --git a/osaf/services/saf/amf/amfd/ckpt_enc.cc 
> b/osaf/services/saf/amf/amfd/ckpt_enc.cc
> --- a/osaf/services/saf/amf/amfd/ckpt_enc.cc
> +++ b/osaf/services/saf/amf/amfd/ckpt_enc.cc
> @@ -2214,7 +2214,7 @@ static uint32_t enc_cs_siass(AVD_CL_CB *
>   su = it->second;
>   
>   for (rel = su->list_of_susi; rel != nullptr; rel = 
> rel->su_next) {
> - memcpy(&copy, rel, sizeof(AVD_SU_SI_REL));
> + copy = *rel;
>   copy.csi_add_rem = SA_FALSE;
>   encode_siass(&enc->io_uba, &copy, enc->i_peer_version);
>   (*num_of_obj)++;
> diff --git a/osaf/services/saf/amf/amfd/ndproc.cc 
> b/osaf/services/saf/amf/amfd/ndproc.cc
> --- a/osaf/services/saf/amf/amfd/ndproc.cc
> +++ b/osaf/services/saf/amf/amfd/ndproc.cc
> @@ -310,8 +310,8 @@ void avd_nd_sisu_state_info_evh(AVD_CL_C
>   
>   if (cb->node_sync_window_closed == false) {
>   state_info_evt = new AVD_EVT_QUEUE();
> - state_info_evt->evt = new AVD_EVT();
> - memcpy(state_info_evt->evt, evt, sizeof(AVD_EVT));
> + state_info_evt->evt = new AVD_EVT{};
> + state_info_evt->evt->rcv_evt = evt->rcv_evt;
>   state_info_evt->evt->info.avnd_msg = n2d_msg;
>   cb->evt_queue.push(state_info_evt);
>   }
> @@ -354,8 +354,8 @@ void avd_nd_compcsi_state_info_evh(AVD_C
>   
>   if (cb->node_sync_window_closed == false) {
>   state_info_evt = new AVD_EVT_QUEUE();
> - state_info_evt->evt = new AVD_EVT();
> - memcpy(state_info_evt->evt, evt, sizeof(AVD_EVT));
> + state_info_evt->evt = new AVD_EVT{};
> + state_info_evt->evt->rcv_evt = evt->rcv_evt;
>   state_info_evt->evt->info.avnd_msg = n2d_msg;
>   cb->evt_queue.push(state_info_evt);
>   }
> diff --git a/osaf/services/saf/amf/amfd/role.cc 
> b/osaf/services/saf/amf/amfd/role.cc
> --- a/osaf/services/saf/amf/amfd/role.cc
> +++ b/osaf/services/saf/amf/amfd/role.cc
> @@ -556,8 +556,7 @@ static uint32_t avd_role_failover_qsd_ac
>* Post an evt on mailbox to set active role to all NCS 
> SU
>*
>*/
> - AVD_EVT evt;
> - memset(&evt, '\0', sizeof(AVD_EVT));
> + AVD_EVT evt = {};
>   evt.rcv_evt = AVD_EVT_SWITCH_NCS_SU;
>   
>   /* set cb state to active */
>
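The patch description says the errors came from struct members being changed to std::string. A minimal sketch of the replacement pattern, assuming a hypothetical Rel struct as a stand-in for a struct such as AVD_SU_SI_REL (the real definition is not shown in this thread):

```cpp
#include <string>

// Hypothetical stand-in for a struct that gained a std::string member.
struct Rel {
  std::string name;
  int state;
};

// Copy through the compiler-generated assignment operator, then adjust the
// copy without touching the original (same shape as `copy = *rel;` above).
inline Rel copy_and_reset(const Rel* rel) {
  Rel copy;
  copy = *rel;     // copies the string contents; a byte-wise memcpy() would
                   // duplicate std::string's internal state instead
  copy.state = 0;  // only the copy is modified
  return copy;
}
```

With assignment, the copy owns its own string, so changing or destroying one object does not affect the other.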




Re: [devel] [PATCH 2 of 2] AMFND: Admin operation continuation if csi callback completes during headless [#1725 part 1] V1

2016-08-30 Thread minh chau
Hi Praveen

I think @admin_ng needs to be restored as well as
@ng_using_saAmfSGAdminState, since I just found that the @admin_ng check is
required in avd_node_down_appl_susi_failover(), which happens if a node
with a pending csi callback reboots.
I will use SG_FSM_ADMIN to differentiate cases 1 and 2.
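
A rough sketch of that idea, assuming hypothetical types (Snapshot, SgFsm, AdminState and ng_admin_op_was_running are illustrative names, not the amfd ones): after headless, treat the situation as a nodegroup operation only if the SG FSM is in SG_ADMIN and the nodegroup is LOCKED or SHUTTING_DOWN.

```cpp
// All names below are hypothetical stand-ins, not the real AMFD types.
enum class SgFsm { STABLE, SG_ADMIN, SU_OPER, SG_REALIGN };
enum class AdminState { UNLOCKED, LOCKED, SHUTTING_DOWN };

// What AMFD can observe about one SG and its nodegroup after headless.
struct Snapshot {
  SgFsm sg_fsm;
  AdminState ng_admin;
};

// Case 2 (lock on the nodegroup) leaves the SG in SG_ADMIN; case 1 (nodes
// locked one by one) does not, even though the nodegroup is LOCKED in both.
inline bool ng_admin_op_was_running(const Snapshot& s) {
  const bool ng_going_down = s.ng_admin == AdminState::LOCKED ||
                             s.ng_admin == AdminState::SHUTTING_DOWN;
  return s.sg_fsm == SgFsm::SG_ADMIN && ng_going_down;
}
```

Only when this returns true would @admin_ng and @ng_using_saAmfSGAdminState be restored; whether this check is sufficient for every case is an open point in this thread.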

Thanks,
Minh
On 29/08/16 16:25, praveen malviya wrote:
> Hi Minh,
>
> Please see inline with [Praveen]
>
> Thanks,
> Praveen
>
> On 29-Aug-16 5:57 AM, minh chau wrote:
>> Hi Praveen,
>>
>> Thanks for looking through the patch.
>> The potential problem with restoring nodegroup is that a nodegroup is
>> allowed to be created in LOCKED while the SUs have assignments; this
>> could cause an ambiguity for AMFD after headless. For example:
>> Suppose having SU4 hosted on PL4, SU5 hosted on PL5, SU4 has active
>> assignment, SU5 has standby assignment.
>> case 1: Create nodegroup (PL4 + PL5) with LOCKED, lock PL5, lock PL4,
>> delay quiesced csi cbk, stop SC, restart SC.
> [Praveen] In this case, after headless state the SG fsm will not be in
> SG_ADMIN state because the payloads are being locked one by one. So in this
> case it can be distinguished as not being an NG operation,
> as the SG is not in SG_ADMIN state even though the SG is fully assigned in NG.
>> case 2: Create nodegroup (PL4 + PL5) with LOCKED, lock nodegroup, delay
>> quiesced csi cbk, stop SC, restart SC.
> [Praveen] In this case we have following information after headless 
> state:
> -SG is in SG_ADMIN state.
> -NG is in SHUTTING_DOWN or LOCKED state.
> -Nodes in SHUTTING_DOWN or LOCKED state.
> -SG FSM remains in SG_ADMIN state only in case of admin operation 
> on SG. But after headless SG is not found in UNLOCKED state and one NG 
> is found in LOCKED/SHUTTING down state and its nodes.
> I think with above information, AMFD can set 
> @ng_using_saAmfSGAdminState and set SG admin state to SHUTTING_DOWN or 
> LOCKED. Restoring admin_ng is not required as in su_si_assign(), there 
> is an OR condition between @ng_using_saAmfSGAdminState and @admin_ng 
> for calling process_su_si_response_for_ng(). Also, the checks on admin_ng
> are used only for updating counters related to completion of admin
> operations, which is not required after headless.
>
>>
>> If case 2 actually happened before headless, then @admin_ng and
>> @ng_using_saAmfSGAdminState need to be restored, otherwise
>> process_su_si_response_for_ng() won't be called, saAmfSGAdminState
>> remains LOCKED and the SG still does not reach STABLE state.
>>
>> But in both cases, after headless, AMFD sees all PLs are LOCKED,
>> nodegroup is LOCKED, SU4 has pending quiesced csi cbk, thus they are
>> running into the same code flow. In case 1, @admin_ng and
>> @ng_using_saAmfSGAdminState should not be set, since case 1 was not a
>> nodegroup operation before headless.
>> I have run a test of both cases, and they work with the patch
>> attached in the ticket, but it still looks like a potential problem since
>> not all cases are transparent to AMFD after headless; @admin_ng and
>> @ng_using_saAmfSGAdminState may get hit at some points in case 1.
> [Praveen] Besides the above cases, there remains only one case: when the
> operation was initiated on NG and the SG is partially mapped in NG. In
> this case, after headless state we can get only two states of the SG,
> either SU_OPER or SG_REALIGN. In both cases I think we do not
> need to restore @ng_using_saAmfSGAdminState and @admin_ng, because
> we do not need to enter process_su_si_response_for_ng(). In
> sg_2n_fsm, the Node is moved from SHUTTING_DOWN to LOCKED in
> susi_success_su_oper().
>
> Have I missed any other case?
>
>>
>> If case 1 looks ok to you from nodegroup point of view, then I will
>> float the patch for review.
>>
>> Thanks,
>> Minh
>>
>>
>> On 26/08/16 16:08, praveen malviya wrote:
>>> Hi,
>>>
>>> I have gone through the amfd traces. The patch for NG seems to be ok, but
>>> some minor improvements can be made.
>>>
>>> As pointed out by Minh, when the whole SG is mapped in NG (say case a), AMFD
>>> uses the SG_ADMIN flow and SG admin state without exposing it to the user
>>> through IMM for the 2N model. In the other case, when only one SU is
>>> assigned in NG (say case b), there should not be any problem because the
>>> operation fully depends on NG admin state. Since case b) does
>>> not use SG admin state and ng_using_saAmfSGAdminState, it should work
>>> fine.
>>>
>>> I think we can take the help of following facts and functions to
>>> improve the patch and with that restoring ng_using_saAmfSGAdminState
>&g

Re: [devel] [PATCH 2 of 2] AMFND: Admin operation continuation if csi callback completes during headless [#1725 part 1] V1

2016-09-04 Thread minh chau
Hi Praveen,

I think in all cases we need to restore @admin_ng, which tells whether
an admin operation on ng was executing. Also, we need to restore
ng_using_saAmfSGAdminState in case 2N borrows SaAmfSGAdminState for
nodegroup. If we can restore exactly what was happening before headless,
then the operation continuation should work. However, there are still
cases like:
case 1: Create nodegroup (PL4 + PL5) with LOCKED, lock PL5, lock PL4, 
delay quiesced csi cbk, stop SC, restart SC.
case 2: Create nodegroup (PL4 + PL5) with LOCKED, lock nodegroup, delay 
quiesced csi cbk, stop SC, restart SC.
case 3: Create nodegroup (PL4 + PL5) with LOCKED, lock PL5, lock 
nodegroup, delay quiesced csi cbk, stop SC, restart SC.

Cases 1 and 2 have been cleared up previously. Cases 1 and 3 leave all
states the same: SG_SU_OPER, node: LOCKED, ng: LOCKED. But if AMF views
case 3 in the shape of case 1, then nodegroup operation continuation
also works fine. I have sent out the patch for review, please check.

Thanks,
Minh

On 01/09/16 18:11, praveen malviya wrote:
> Hi Minh,
> Please see response inline with [Praveen]
>
> Thanks,
> Praveen
>
>
> On 31-Aug-16 11:19 AM, minh chau wrote:
>> Hi Praveen
>>
>> I think @admin_ng needs to be restore as well as
>> @ng_using_saAmfSGAdminState, since just found that @admin_ng check is
>> required in avd_node_down_appl_susi_failover(), which happens if node
>> having pending csi callback reboot.
> [Praveen] Inside avd_node_down_appl_susi_failover(), reason of calling 
> process_su_si_response_for_ng() in non-headless case is :
> -If this happens to be the last node in the nodegroup where the nodegroup
> operation was going on, then migrate node/NG from SHUTTING_DOWN to
> LOCKED, clear ng_using_saAmfSGAdminState and respond to IMM, because we
> will not get any su_si_assign() event to trigger it again.
> -Even if this is not the last node, this node should still move
> from SHUTTING_DOWN to LOCKED state, because there will not be any
> further trigger (when the whole SG is mapped in NG).
>
> Can we think of a possibility of making it OR with 
> @ng_using_saAmfSGAdminState here also in 
> avd_node_down_appl_susi_failover() ?
>
> A)With this small fix in headless case: After headless if again some 
> node faults then AMFD will still be calling 
> process_su_si_response_for_ng() for whole SG mapped case.
>
> But what about the other case, when the whole SG is not mapped? In this case
> node_fail_su_oper() would have marked at least the node from SHUTTING_DOWN
> to LOCKED state. Then how do we mark the NG from SHUTTING_DOWN to LOCKED,
> given that admin_ng is NULL and ng_using_saAmfSGAdminState is not set for
> this case? I think this can be done deductively, using the headless state
> variable and other facts like node->admin_node_pend_cbk.invocation and
> ng->admin_ng_pend_cbk.admin_oper after headless to cross-verify
> whether this is really the context of an admin operation before headless
> or after headless. Based on this deduction, we need to mark the node from
> SHUTTING_DOWN to LOCKED.
>
> b)With this small fix in non-headless case: There should not be any 
> impact because admin_ng is not NULL and AMFD was already making a call.
>
>
> Thanks,
> Praveen
>> I will use SG_FSM_ADMIN to differentiate cases 1 and 2.
>>
>> Thanks,
>> Minh
>> On 29/08/16 16:25, praveen malviya wrote:
>>> Hi Minh,
>>>
>>> Please see inline with [Praveen]
>>>
>>> Thanks,
>>> Praveen
>>>
>>> On 29-Aug-16 5:57 AM, minh chau wrote:
>>>> Hi Praveen,
>>>>
>>>> Thanks for looking through the patch.
>>>> The potential problem with restoring nodegroup is that a nodegroup is
>>>> allowed to be created in LOCKED while the SUs have assignments; this
>>>> could cause an ambiguity for AMFD after headless. For example:
>>>> Suppose having SU4 hosted on PL4, SU5 hosted on PL5, SU4 has active
>>>> assignment, SU5 has standby assignment.
>>>> case 1: Create nodegroup (PL4 + PL5) with LOCKED, lock PL5, lock PL4,
>>>> delay quiesced csi cbk, stop SC, restart SC.
>>> [Praveen] In this case, after headless state the SG fsm will not be in
>>> SG_ADMIN state because the payloads are being locked one by one. So in this
>>> case it can be distinguished as not being an NG operation,
>>> as the SG is not in SG_ADMIN state even though the SG is fully assigned in NG.
>>>> case 2: Create nodegroup (PL4 + PL5) with LOCKED, lock nodegroup, 
>>>> delay
>>>> quiesced csi cbk, stop SC, restart SC.
>>> [Praveen] In this case we have following information after headless
>>> sta

Re: [devel] [PATCH 2 of 2] AMFND: Admin operation continuation if csi callback completes during headless [#1725 part 1] V1

2016-09-07 Thread minh chau
Hi Praveen,

I have checked my test cases: in cases 2 and 3 the nodegroup is created in
UNLOCKED state, but cases 1 and 3 still leave all states looking the same
after headless.

Thanks,
Minh

On 07/09/16 22:46, praveen malviya wrote:
> Hi Minh,
>
> Please find one query below.
>
> Thanks,
> Praveen
>
> On 05-Sep-16 5:07 AM, minh chau wrote:
>> Hi Praveen,
>>
>> I think in all cases we need to restore @admin_ng, that can tell whether
>> an admin operation on ng was executing. Also, we need to restore
>> ng_using_saAmfSGAdminState in case 2N borrows SaAmfSGAdminState for
>> nodegroup. If we can restore exactly what was happening before headless,
>> then the operation continuation should work. However, there's still the
>> cases like:
>> case 1: Create nodegroup (PL4 + PL5) with LOCKED, lock PL5, lock PL4,
>> delay quiesced csi cbk, stop SC, restart SC.
> [Praveen] I think this case is already discussed.
>> case 2: Create nodegroup (PL4 + PL5) with LOCKED, lock nodegroup, delay
>> quiesced csi cbk, stop SC, restart SC.
> [Praveen] Here NG is created in locked state already then lock of 
> nodegroup can performed.
>> case 3: Create nodegroup (PL4 + PL5) with LOCKED, lock PL5, lock
>> nodegroup, delay quiesced csi cbk, stop SC, restart SC.
> [Praveen] Here also NG is being created in locked state then lock of 
> NG cannot be performed.
>>
>> Case 1 and 2 has been cleared out previously, case 1 and 3 leave all
>> states the same: SG_SU_OPER, node: LOCKED, ng: LOCKED. But if AMF views
>> case 3 in the shape of case 1, then nodegroup operation continuation
>> also works fine. I have sent out the patch for review, please check.
>>
>> Thanks,
>> Minh
>>
>> On 01/09/16 18:11, praveen malviya wrote:
>>> Hi Minh,
>>> Please see response inline with [Praveen]
>>>
>>> Thanks,
>>> Praveen
>>>
>>>
>>> On 31-Aug-16 11:19 AM, minh chau wrote:
>>>> Hi Praveen
>>>>
>>>> I think @admin_ng needs to be restore as well as
>>>> @ng_using_saAmfSGAdminState, since just found that @admin_ng check is
>>>> required in avd_node_down_appl_susi_failover(), which happens if node
>>>> having pending csi callback reboot.
>>> [Praveen] Inside avd_node_down_appl_susi_failover(), reason of calling
>>> process_su_si_response_for_ng() in non-headless case is :
>>> -If this happens to be the last node in the nodegroup where the nodegroup
>>> operation was going on, then migrate node/NG from SHUTTING_DOWN to
>>> LOCKED, clear ng_using_saAmfSGAdminState and respond to IMM, because we
>>> will not get any su_si_assign() event to trigger it again.
>>> -Even if this is not the last node, this node should still move
>>> from SHUTTING_DOWN to LOCKED state, because there will not be any
>>> further trigger (when the whole SG is mapped in NG).
>>>
>>> Can we think of a possibility of making it OR with
>>> @ng_using_saAmfSGAdminState here also in
>>> avd_node_down_appl_susi_failover() ?
>>>
>>> A)With this small fix in headless case: After headless if again some
>>> node faults then AMFD will still be calling
>>> process_su_si_response_for_ng() for whole SG mapped case.
>>>
>>> But what about the other case, when the whole SG is not mapped? In this case
>>> node_fail_su_oper() would have marked at least the node from SHUTTING_DOWN
>>> to LOCKED state. Then how do we mark the NG from SHUTTING_DOWN to LOCKED,
>>> given that admin_ng is NULL and ng_using_saAmfSGAdminState is not set for
>>> this case? I think this can be done deductively, using the headless state
>>> variable and other facts like node->admin_node_pend_cbk.invocation and
>>> ng->admin_ng_pend_cbk.admin_oper after headless to cross-verify
>>> whether this is really the context of an admin operation before headless
>>> or after headless. Based on this deduction, we need to mark the node from
>>> SHUTTING_DOWN to LOCKED.
>>>
>>> b)With this small fix in non-headless case: There should not be any
>>> impact because admin_ng is not NULL and AMFD was already making a call.
>>>
>>>
>>> Thanks,
>>> Praveen
>>>> I will use SG_FSM_ADMIN to differentiate cases 1 and 2.
>>>>
>>>> Thanks,
>>>> Minh
>>>> On 29/08/16 16:25, praveen malviya wrote:
>>>>> Hi Minh,
>>>>>
>>>>> Please see inline with [Praveen]
>>>>>
>>>>> Thanks,
>>>>> Praveen
>>>>>
>&
