Hi Hoang , Can you please help me understating the issue , based on Reproduce step, I am not able to relate the fix (patch) with the original problem.
On 1/20/2017 10:04 AM, Vo Minh Hoang wrote: > Reproduce steps: > - in SC1 open a checkpoint (non col) > - set retention value to big: 99999 > - reboot SC1 by killing osafamfd > - reboot SC2 by killing osafamfd > > Check retention value and it is default value (60) not set one. 1) Having a 2 controller setup and open a checkpoint (non col) only on SC1 with big retention value to : 99999, then unlinked that ckpt but not close or finalized by application and hold the application. This means unlinked state is MBCSV check-pointed to SC2 by SC1 . 2) At that moment rebooted SC1 by killing osafamfd , this means cpa down ( agent down) and this is not being detected by new active SC2 CPD, so the unlinked replica/ckpt on SC2 is not being deleted even NO cpa ( user count zero for this unlinked ckpt), and the retention timer being run with that big value. 4) If we do any number of fail-overs the data of this ckpt will not change and this ckpt will not deleted until the retention time expires 3) so let us target this root cause , when SC1 is down CPD on SC2 becoming new active we should have some logic to detect the unliked no-collocated ckpts with NO user and be able to deleted/close that ckpt related data before SC1 re-joining as new Standby CPD. 4) Fixing the issue as when error occur delete the existing checkpoint IMM object will not clean up the data exist in cpsv database. So let us target the root cause fix. -AVM On 1/20/2017 10:04 AM, Vo Minh Hoang wrote: > Dear Mahesh, > > I remove trace log only, logical code is similar. > My mistake made V4 has no different to V3 :( > > Reproduce steps: > - in SC1 open a checkpoint (non col) > - set retention value to big: 99999 > - reboot SC1 by killing osafamfd > - reboot SC2 by killing osafamfd > > Check retention value and it is default value (60) not set one. > > Thank you and best regards, > Hoang > > -----Original Message----- > From: A V Mahesh [mailto:[email protected]] > Sent: Friday, January 20, 2017 11:08 AM > To: Vo Minh Hoang <[email protected]>; [email protected] > Subject: Re: [devel] [PATCH 1 of 1] cpd: to correct failover behavior of > cpsv [#1765] > > Hi Hoang, > > Is this attached one different form published V4 ? > > And to reproduce the problem i am following below steps to verify the > patch > > please let me know if i am missing any thing. > > >>Create checkpoint and set retention to big value Failover by >>killing > osafamfd multiple times Check checkpoint information > > -AVM > > On 1/19/2017 2:04 PM, Vo Minh Hoang wrote: >> Dear Mahesh, >> >> Please review by this attached patch. >> The submitted one is not refreshed so it contains old source code. >> >> I am sorry for any inconvenient. >> >> Sincerely, >> Hoang >> >> -----Original Message----- >> From: Hoang Vo [mailto:[email protected]] >> Sent: Thursday, January 19, 2017 3:21 PM >> To: [email protected]; [email protected] >> Cc: [email protected] >> Subject: [devel] [PATCH 1 of 1] cpd: to correct failover behavior of >> cpsv [#1765] >> >> src/ckpt/ckptd/cpd_db.c | 15 +++++++++++++++ >> src/ckpt/ckptd/cpd_evt.c | 12 ++++++++---- >> src/ckpt/ckptd/cpd_proc.c | 18 ++++++++++++++++-- >> src/ckpt/ckptnd/cpnd_evt.c | 3 ++- >> src/ckpt/ckptnd/cpnd_proc.c | 12 +++++++++--- >> 5 files changed, 50 insertions(+), 10 deletions(-) >> >> >> problem: >> In case a failover happens while a checkpoint is being unlinked, it >> might causes an unfinished unlink operation (i.e the checkpoint IMM >> object is not deleted). Later on, when the checkpoint is created >> again, it will not succeed because the CPD detects that the checkpoint IMM > object existing. >> Fix: >> - When error occur delete the existing checkpoint IMM object and >> re-create new one. >> - Stop timer of removed node. >> - Update data in patricia trees. >> >> diff --git a/src/ckpt/ckptd/cpd_db.c b/src/ckpt/ckptd/cpd_db.c >> --- a/src/ckpt/ckptd/cpd_db.c >> +++ b/src/ckpt/ckptd/cpd_db.c >> @@ -104,6 +104,21 @@ uint32_t cpd_ckpt_node_add(NCS_PATRICIA_ >> /*create the imm runtime object */ >> if (ha_state == SA_AMF_HA_ACTIVE) { >> err = create_runtime_ckpt_object(ckpt_node, immOiHandle); >> + >> + /* The Checkpoint IMM object exist due to unfinished >> previous opernation (e.g unlink) >> + * The action is to delete the old object and create a new >> one */ >> + >> + if (err == SA_AIS_ERR_EXIST) { >> + LOG_WA("cpd ckpt node add - the IMM object exits >> %s", >> +ckpt_node->ckpt_name); >> + >> + if (delete_runtime_ckpt_object(ckpt_node, >> immOiHandle) != SA_AIS_OK) { >> + LOG_ER("Deleting run time object %s FAILED", >> ckpt_node->ckpt_name); >> + return NCSCC_RC_FAILURE; >> + } >> + >> + err = create_runtime_ckpt_object(ckpt_node, >> immOiHandle); >> + } >> + >> if (err != SA_AIS_OK) { >> LOG_ER("create runtime ckpt object failed with >> error: %u",err); >> if (err == SA_AIS_ERR_INVALID_PARAM) { diff --git >> a/src/ckpt/ckptd/cpd_evt.c b/src/ckpt/ckptd/cpd_evt.c >> --- a/src/ckpt/ckptd/cpd_evt.c >> +++ b/src/ckpt/ckptd/cpd_evt.c >> @@ -310,13 +310,17 @@ static uint32_t cpd_evt_proc_ckpt_create >> if (send_evt.info.cpnd.info.ckpt_info.is_active_exists) >> send_evt.info.cpnd.info.ckpt_info.active_dest = >> ckpt_node->active_dest; >> >> - if (map_info->attributes.creationFlags & >> SA_CKPT_CHECKPOINT_COLLOCATED) >> + if (map_info->attributes.creationFlags & >> SA_CKPT_CHECKPOINT_COLLOCATED) { >> + TRACE("======HOANG====== ckpt_rep_create = true"); >> send_evt.info.cpnd.info.ckpt_info.ckpt_rep_create = > true; >> - else { >> - if (is_first_rep) >> + } else { >> + if (is_first_rep) { >> + TRACE("======HOANG====== ckpt_rep_create = >> true"); >> >> send_evt.info.cpnd.info.ckpt_info.ckpt_rep_create = true; >> - else >> + } else { >> + TRACE("======HOANG====== ckpt_rep_create = >> false"); >> >> send_evt.info.cpnd.info.ckpt_info.ckpt_rep_create = false; >> + } >> } >> >> if (ckpt_node->dest_cnt) { >> diff --git a/src/ckpt/ckptd/cpd_proc.c b/src/ckpt/ckptd/cpd_proc.c >> --- a/src/ckpt/ckptd/cpd_proc.c >> +++ b/src/ckpt/ckptd/cpd_proc.c >> @@ -348,7 +348,8 @@ uint32_t cpd_ckpt_db_entry_update(CPD_CB >> proc_rc = >> cpd_ckpt_reploc_node_add(&cb->ckpt_reploc_tree, reploc_info, >> cb->ha_state, >> cb->immOiHandle); >> if (proc_rc != NCSCC_RC_SUCCESS) { >> /* goto reploc_node_add_fail; */ >> - TRACE_4("cpd db add failed "); >> + LOG_ER("cpd db replica add failed "); >> + goto replica_node_add_fail; >> } >> } >> >> @@ -367,6 +368,10 @@ uint32_t cpd_ckpt_db_entry_update(CPD_CB >> TRACE_LEAVE(); >> return NCSCC_RC_SUCCESS; >> >> + replica_node_add_fail: >> + cpd_ckpt_node_delete(cb, ckpt_node); >> + ckpt_node = NULL; >> + >> ckpt_node_add_fail: >> cpd_ckpt_map_node_delete(cb, map_info); >> map_info = NULL; >> @@ -679,7 +684,8 @@ uint32_t cpd_process_cpnd_down(CPD_CB *c >> cpd_cpnd_info_node_find_add(&cb->cpnd_tree, cpnd_dest, &cpnd_info, >> &add_flag); >> if (!cpnd_info) >> return NCSCC_RC_SUCCESS; >> - >> + /* Stop timer before processing down */ >> + cpd_tmr_stop(&cpnd_info->cpnd_ret_timer); >> cref_info = cpnd_info->ckpt_ref_list; >> >> while (cref_info) { >> @@ -989,6 +995,14 @@ uint32_t cpd_proc_retention_set(CPD_CB * >> >> /* Update the retention Time */ >> (*ckpt_node)->ret_time = reten_time; >> + (*ckpt_node)->attributes.retentionDuration = reten_time; >> + >> + /* Update the related patricia tree */ >> + CPD_CKPT_MAP_INFO *map_info = NULL; >> + cpd_ckpt_map_node_get(&cb->ckpt_map_tree, (*ckpt_node)->ckpt_name, >> &map_info); >> + if (map_info) { >> + map_info->attributes.retentionDuration = reten_time; >> + } >> return rc; >> } >> >> diff --git a/src/ckpt/ckptnd/cpnd_evt.c b/src/ckpt/ckptnd/cpnd_evt.c >> --- a/src/ckpt/ckptnd/cpnd_evt.c >> +++ b/src/ckpt/ckptnd/cpnd_evt.c >> @@ -4066,10 +4066,11 @@ static uint32_t cpnd_evt_proc_ckpt_destr >> >> if (cb->num_rep) { >> cb->num_rep--; >> + TRACE("------------NUMREP----------- %d", >> cb->num_rep); >> } >> >> >> m_MMGR_FREE_CPND_DEFAULT(cp_node->replica_info.open.info.open.i_name); >> - /* freeing the sec_mapping memory */ >> + /* freeing the sec_mapping mem-ory */ >> >> m_MMGR_FREE_CPND_DEFAULT(cp_node->replica_info.shm_sec_mapping); >> >> } >> diff --git a/src/ckpt/ckptnd/cpnd_proc.c b/src/ckpt/ckptnd/cpnd_proc.c >> --- a/src/ckpt/ckptnd/cpnd_proc.c >> +++ b/src/ckpt/ckptnd/cpnd_proc.c >> @@ -377,8 +377,10 @@ uint32_t cpnd_ckpt_replica_destroy(CPND_ >> return rc; >> } >> >> - if (cb->num_rep) >> + if (cb->num_rep) { >> cb->num_rep--; >> + TRACE("------------NUMREP----------- %d", >> cb->num_rep); >> + } >> >> >> m_MMGR_FREE_CPND_DEFAULT(cp_node->replica_info.open.info.open.i_name); >> >> @@ -502,8 +504,10 @@ uint32_t cpnd_ckpt_replica_create(CPND_C >> } >> } >> >> - if (cp_node->replica_info.open.info.open.i_flags & O_CREAT) >> + if (cp_node->replica_info.open.info.open.i_flags & O_CREAT) { >> cb->num_rep++; >> + TRACE("++++++++++++++NUMREP+++++++++++++ %d", cb->num_rep); >> + } >> >> cp_node->replica_info.shm_sec_mapping = >> (uint32_t *)m_MMGR_ALLOC_CPND_DEFAULT(sizeof(uint32_t) * >> cp_node->create_attrib.maxSections); >> @@ -2659,8 +2663,10 @@ void cpnd_ckpt_replica_delete(CPND_CB *c >> return; >> } >> >> - if (cb->num_rep) >> + if (cb->num_rep) { >> cb->num_rep--; >> + TRACE("------------NUMREP----------- %d", >> cb->num_rep); >> + } >> >> >> m_MMGR_FREE_CPND_DEFAULT(ckpt_node->replica_info.open.info.open.i_name >> ); >> >> >> ---------------------------------------------------------------------- >> ------ >> -- >> Check out the vibrant tech community on one of the world's most >> engaging tech sites, SlashDot.org! http://sdm.link/slashdot >> _______________________________________________ >> Opensaf-devel mailing list >> [email protected] >> https://lists.sourceforge.net/lists/listinfo/opensaf-devel > ------------------------------------------------------------------------------ Check out the vibrant tech community on one of the world's most engaging tech sites, SlashDot.org! http://sdm.link/slashdot _______________________________________________ Opensaf-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/opensaf-devel
