[tickets] [opensaf:tickets] #2415 CKPT node director failed to execute ckpt create request

2017-04-06 Thread A V Mahesh (AVM)
- **Milestone**: 5.2.0 --> next



---

** [tickets:#2415] CKPT node director failed to execute ckpt create request**

**Status:** assigned
**Milestone:** next
**Created:** Fri Apr 07, 2017 01:30 AM UTC by David Byrne
**Last Updated:** Fri Apr 07, 2017 03:54 AM UTC
**Owner:** A V Mahesh (AVM)


After the following two patches were removed from a build based on OpenSAF 
CS8701, the CKPT node director fails to execute ckpt create requests 
(Collocated Checkpoints, Asynchronous Update):
- ph4_01_headless_escalation_for_osaftest.diff
- mds_log_level.diff

CPND_MAX_REPLICAS = 1000
retention_time is set to 30s

Test procedure
1. Send 34 ckpt requests per second
34*30 = 1020, which is > CPND_MAX_REPLICAS
Failed, which is expected
2. Send 32 ckpt requests per second
32*30 = 960, which is < CPND_MAX_REPLICAS
This used to pass, but it now fails since the above two patches were removed.
syslog:
Apr  5 01:42:46 SC-2-1 osafckptnd[4958]: ncs_sel_obj_create: socketpair failed - Too many open files
Apr  5 01:42:46 SC-2-1 osafckptnd[4958]: ER cpnd has exceeded the maximum number of allowed replicas (CPND_MAX_REPLICAS)
Test debug info:
Apr 5, 2017 1:46:08 AM INFO ANSWER type: report 
start-time: 1491349366.360 
stop-time: 1491349567.269 
total: send=6428 recv=6407 fail=6407
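
The arithmetic in the test procedure can be checked with a short sketch. It assumes, as the report does, that every create request keeps one replica alive for the whole retention time, so the number of replicas alive at once is roughly rate * retention_time. The function names are illustrative only and are not part of OpenSAF.

```python
CPND_MAX_REPLICAS = 1000   # limit quoted in the report
RETENTION_TIME_S = 30      # retention_time used in the test


def steady_state_replicas(rate_per_s, retention_s=RETENTION_TIME_S):
    """Approximate number of replicas alive at once.

    Assumption: one replica per create request, released only when the
    retention time expires.
    """
    return rate_per_s * retention_s


def expected_to_pass(rate_per_s):
    """True if the estimated replica count stays below the limit."""
    return steady_state_replicas(rate_per_s) < CPND_MAX_REPLICAS


for rate in (32, 33, 34):
    print(rate, steady_state_replicas(rate), expected_to_pass(rate))
```

By this estimate only 34 ckpt/s (1020) crosses the limit, while 32 ckpt/s (960) and 33 ckpt/s (990) stay below it.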

Changed test procedure for investigation purposes
1. Start the test at 32 ckpt/s
32*30 = 960, which is < CPND_MAX_REPLICAS
Passed
Apr 6, 2017 2:56:27 AM INFO ANSWER type: report 
start-time: 1491439975.068 
stop-time: 1491440187.347 
total: send=6792 send-failed=0 recv=6780  
2. Then test 34 ckpt/s
Failed
3. Then test 33 ckpt/s
Failed
4. Then go back to 32 ckpt/s again
Failed

From this experiment, we can see that once CPND_MAX_REPLICAS has been 
exceeded, the ckpt service does not recover. 
Note: the problem only occurs for Collocated Checkpoints with Asynchronous 
Update. The same test with Non-Collocated Checkpoints and Synchronous Update 
passes.
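
The `socketpair failed - Too many open files` line in the syslog suggests that osafckptnd may have hit its file-descriptor limit. One way to check whether descriptors are being leaked is to sample the open-descriptor count of the process before and after the overload. The sketch below is a generic Linux diagnostic, not an OpenSAF tool; it assumes `/proc` is mounted, and the helper name is hypothetical.

```python
import os
import resource


def open_fd_count(pid="self"):
    """Count the entries in /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


# Soft and hard RLIMIT_NOFILE of the *current* process, for comparison.
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open fds:", open_fd_count(), "soft limit:", soft_limit)
```

To inspect the node director, pass its PID (4958 in the syslog above) and run as a user that is allowed to read `/proc/<pid>/fd`. A count that keeps growing after the retention time has expired would point to a descriptor leak.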

Test Contact: Li Suo


---

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is 
subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at 
https://sourceforge.net/p/opensaf/admin/tickets/options.  Or, if this is a 
mailing list, you can unsubscribe from the mailing list.--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot___
Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets


[tickets] [opensaf:tickets] #2395 CKPT: Performance degradation ~100% (Time taken is almost double than previous)

2017-04-06 Thread A V Mahesh (AVM)
On 4/6/2017 5:29 PM, Chani Srivastava wrote:
> With this patch the performance figures show great improvement over before 
> and the results are comparable to the 5.1 results


Thanks for the testing. 

This patch provides a rollback option, a way to configure CKPT to get the old 
5.1 behavior, so the statistics will match 5.1.


---

** [tickets:#2395] CKPT: Performance degradation ~100% (Time taken is almost 
double than previous)**

**Status:** review
**Milestone:** 5.2.0
**Created:** Thu Mar 23, 2017 10:26 AM UTC by Chani Srivastava
**Last Updated:** Thu Apr 06, 2017 11:26 AM UTC
**Owner:** A V Mahesh (AVM)


Environment details

OS : Suse 11, 64bit Physical machine 
Changeset : 8634 ( 5.2.FC)
Setup : 4 nodes

There is considerable degradation in CKPT performance in 5.2 compared to 5.1. 
Timestamps are taken just before and just after each API call, and the 
difference between them is reported.

-> For write operations, the checkpoint write API takes 2x the time it took in 
the earlier release 5.1. The issue is observed in both synchronous and 
asynchronous mode.
( synchronous -- Checkpoint create flags used : SA_CKPT_WR_ALL_REPLICAS
asynchronous -- Checkpoint create flags used : SA_CKPT_WR_ACTIVE_REPLICA | 
SA_CKPT_CHECKPOINT_COLLOCATED ) Both local and remote replicas

-> For section create operations in asynchronous mode on the local replica, the 
checkpoint section create API takes more than 70% longer than in 5.1.

-> For read operations in asynchronous mode on the local replica, the 
checkpoint read API takes twice as long as in 5.1.

Please check the tickets pushed as part of 4.7 to 5.0 that affected API 
performance.
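
The measurement method described above (a timestamp just before and just after each API call) can be sketched as follows. This is only an illustration in Python, not the actual test harness; `timed_call` and `degradation_percent` are hypothetical helpers, and the API under test is represented by any callable.

```python
import time


def timed_call(api, *args, **kwargs):
    """Call api(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = api(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def degradation_percent(new_seconds, old_seconds):
    """Extra time relative to the old release, in percent.

    A call that takes twice as long gives 100.0, matching the "~100%"
    in the ticket title.
    """
    return (new_seconds - old_seconds) / old_seconds * 100.0
```

Averaging `elapsed` over many calls per release and feeding the two averages to `degradation_percent` gives figures comparable to the ones quoted in the ticket.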





[tickets] [opensaf:tickets] #2415 CKPT node director failed to execute ckpt create request

2017-04-06 Thread David Byrne



---

** [tickets:#2415] CKPT node director failed to execute ckpt create request**

**Status:** unassigned
**Milestone:** 5.2.0
**Created:** Fri Apr 07, 2017 01:30 AM UTC by David Byrne
**Last Updated:** Fri Apr 07, 2017 01:30 AM UTC
**Owner:** nobody


After the following two patches were removed from a build based on OpenSAF 
CS8701, the CKPT node director fails to execute ckpt create requests 
(Collocated Checkpoints, Asynchronous Update):
- ph4_01_headless_escalation_for_osaftest.diff
- mds_log_level.diff

CPND_MAX_REPLICAS = 1000
retention_time is set to 30s

Test procedure
1. Send 34 ckpt requests per second
34*30 = 1020, which is > CPND_MAX_REPLICAS
Failed, which is expected
2. Send 32 ckpt requests per second
32*30 = 960, which is < CPND_MAX_REPLICAS
This used to pass, but it now fails since the above two patches were removed.
syslog:
Apr  5 01:42:46 SC-2-1 osafckptnd[4958]: ncs_sel_obj_create: socketpair failed - Too many open files
Apr  5 01:42:46 SC-2-1 osafckptnd[4958]: ER cpnd has exceeded the maximum number of allowed replicas (CPND_MAX_REPLICAS)
Test debug info:
Apr 5, 2017 1:46:08 AM INFO ANSWER type: report 
start-time: 1491349366.360 
stop-time: 1491349567.269 
total: send=6428 recv=6407 fail=6407

Changed test procedure for investigation purposes
1. Start the test at 32 ckpt/s
32*30 = 960, which is < CPND_MAX_REPLICAS
Passed
Apr 6, 2017 2:56:27 AM INFO ANSWER type: report 
start-time: 1491439975.068 
stop-time: 1491440187.347 
total: send=6792 send-failed=0 recv=6780  
2. Then test 34 ckpt/s
Failed
3. Then test 33 ckpt/s
Failed
4. Then go back to 32 ckpt/s again
Failed

From this experiment, we can see that once CPND_MAX_REPLICAS has been 
exceeded, the ckpt service does not recover. 
Note: the problem only occurs for Collocated Checkpoints with Asynchronous 
Update. The same test with Non-Collocated Checkpoints and Synchronous Update 
passes.

Test Contact: Li Suo




[tickets] [opensaf:tickets] #2413 smf: coredump, suspend is issued at completed state

2017-04-06 Thread Alex Jones
- **status**: accepted --> review
- **Comment**:

A suspend is actually not being issued here. The state machine code is 
implemented such that the suspend is only done in states Executing, Suspending, 
or RollingBack.

After getting some more logs from Rafael, it is clear this is a race condition 
between an async failure in AMF and the campaign commit being executed. Here is 
what is happening:

Campaign commit is performed. Before smfd clears the suMaintenanceCampaign 
attribute for the SU, a component in that SU fails. This sends an NTF event 
with the maintenance name. At the same time, the poll routine in smfd processes 
the TERMINATE upgrade thread event. When it returns, the upgrade campaign 
thread has been deleted and m_running has been set to false, but the NTF file 
descriptor has not been processed yet. The poll routine then processes the NTF 
event, which tries to deliver the asyncFailure event through the upgrade 
thread, which is gone. Hence the crash.

The solution should be to always have "processEvt" last in the poll routine, so 
that if m_running is set to false, no other processing will be done, and the 
poll loop will finish.
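
The ordering problem described above can be illustrated with a small simulation. This is a simplified sketch and not the smfd code: the class and method names are hypothetical, and a single poll wake-up with both descriptors ready is modelled as two handlers run in a chosen order.

```python
class UpgradeThread:
    """Stand-in for the campaign upgrade thread (hypothetical)."""

    def __init__(self):
        self.delivered = []

    def send(self, event):
        self.delivered.append(event)


class Smfd:
    def __init__(self):
        self.running = True
        self.upgrade_thread = UpgradeThread()

    def process_ntf(self):
        # A component-failure notification becomes an asyncFailure event
        # for the upgrade thread.
        if self.upgrade_thread is None:
            raise RuntimeError("upgrade thread already deleted")
        self.upgrade_thread.send("asyncFailure")

    def process_evt(self):
        # TERMINATE: the upgrade thread is deleted and the loop is stopped.
        self.upgrade_thread = None
        self.running = False

    def poll_once(self, evt_last):
        """Handle one wake-up in which both descriptors are ready."""
        handlers = [self.process_ntf, self.process_evt]
        if not evt_last:
            handlers.reverse()
        for handler in handlers:
            handler()


fixed = Smfd()
fixed.poll_once(evt_last=True)       # NTF handled while the thread exists

racy = Smfd()
try:
    racy.poll_once(evt_last=False)   # TERMINATE first, then NTF
except RuntimeError as err:
    print("crash analogue:", err)
```

With the event handler last, the NTF notification is handled while the thread object is still valid, and the loop then stops because `running` is false.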



---

** [tickets:#2413] smf: coredump, suspend is issued at completed state**

**Status:** review
**Milestone:** 5.2.0
**Created:** Wed Apr 05, 2017 12:39 PM UTC by Rafael
**Last Updated:** Thu Apr 06, 2017 03:33 PM UTC
**Owner:** Alex Jones
**Attachments:**

- [osafsmfd.9276.SC-2.core.txt](https://sourceforge.net/p/opensaf/tickets/2413/attachment/osafsmfd.9276.SC-2.core.txt) (15.4 kB; text/plain)


Ticket #2145 looks to be causing this issue. 

A coredump printout is attached.

Steps to reproduce: run a campaign and have an AMF component fail at the 
campaign completed state. This triggers an event in SMF which tries to suspend 
a completed campaign.

Function handleAmfObjectStateChangeNotification will try to call 
asyncFailure(), which is the same as suspend(). Because the campaign is 
completed and committed, this is not a valid transition. The campaign state 
instance has most likely been deleted, therefore we get a coredump.

For reference, see figures 5, 6 and 7 in the SMF AIS, starting from section 
5.1.3.




[tickets] [opensaf:tickets] #2413 smf: coredump, suspend is issued at completed state

2017-04-06 Thread Alex Jones
- **status**: unassigned --> accepted
- **assigned_to**: Alex Jones
- **Comment**:

I was able to reproduce the problem. It is a race condition between an async 
failure in AMF on the upgraded SU and the commit being processed in smfd.




---

** [tickets:#2413] smf: coredump, suspend is issued at completed state**

**Status:** accepted
**Milestone:** 5.2.0
**Created:** Wed Apr 05, 2017 12:39 PM UTC by Rafael
**Last Updated:** Thu Apr 06, 2017 11:58 AM UTC
**Owner:** Alex Jones
**Attachments:**

- [osafsmfd.9276.SC-2.core.txt](https://sourceforge.net/p/opensaf/tickets/2413/attachment/osafsmfd.9276.SC-2.core.txt) (15.4 kB; text/plain)


Ticket #2145 looks to be causing this issue. 

A coredump printout is attached.

Steps to reproduce: run a campaign and have an AMF component fail at the 
campaign completed state. This triggers an event in SMF which tries to suspend 
a completed campaign.

Function handleAmfObjectStateChangeNotification will try to call 
asyncFailure(), which is the same as suspend(). Because the campaign is 
completed and committed, this is not a valid transition. The campaign state 
instance has most likely been deleted, therefore we get a coredump.

For reference, see figures 5, 6 and 7 in the SMF AIS, starting from section 
5.1.3.




[tickets] [opensaf:tickets] #2414 amf: Support NoRed model for OpenSAF directors

2017-04-06 Thread Anders Widell



---

** [tickets:#2414] amf: Support NoRed model for OpenSAF directors**

**Status:** assigned
**Milestone:** next
**Created:** Thu Apr 06, 2017 01:32 PM UTC by Anders Widell
**Last Updated:** Thu Apr 06, 2017 01:32 PM UTC
**Owner:** Anders Widell
**Attachments:**

- [nored.diff.gz](https://sourceforge.net/p/opensaf/tickets/2414/attachment/nored.diff.gz) (2.0 kB; application/gzip)


Currently, the OpenSAF directors can only be configured with the 2N redundancy 
model. The proposal is to also make it possible to configure them with the 
No-Redundancy model.

The attached patch is a simple prototype that makes it possible to use the 
No-Redundancy model.




[tickets] [opensaf:tickets] Re: #2395 CKPT: Performance degradation ~100% (Time taken is almost double than previous)

2017-04-06 Thread Chani Srivastava
With this patch the performance figures show great improvement over before and 
the results are comparable to the 5.1 results.


---

** [tickets:#2395] CKPT: Performance degradation ~100% (Time taken is almost 
double than previous)**

**Status:** review
**Milestone:** 5.2.0
**Created:** Thu Mar 23, 2017 10:26 AM UTC by Chani Srivastava
**Last Updated:** Thu Apr 06, 2017 11:26 AM UTC
**Owner:** A V Mahesh (AVM)


Environment details

OS : Suse 11, 64bit Physical machine 
Changeset : 8634 ( 5.2.FC)
Setup : 4 nodes

There is considerable degradation in CKPT performance in 5.2 compared to 5.1. 
Timestamps are taken just before and just after each API call, and the 
difference between them is reported.

-> For write operations, the checkpoint write API takes 2x the time it took in 
the earlier release 5.1. The issue is observed in both synchronous and 
asynchronous mode.
( synchronous -- Checkpoint create flags used : SA_CKPT_WR_ALL_REPLICAS
asynchronous -- Checkpoint create flags used : SA_CKPT_WR_ACTIVE_REPLICA | 
SA_CKPT_CHECKPOINT_COLLOCATED ) Both local and remote replicas

-> For section create operations in asynchronous mode on the local replica, the 
checkpoint section create API takes more than 70% longer than in 5.1.

-> For read operations in asynchronous mode on the local replica, the 
checkpoint read API takes twice as long as in 5.1.

Please check the tickets pushed as part of 4.7 to 5.0 that affected API 
performance.





[tickets] [opensaf:tickets] #2413 smf: coredump, suspend is issued at completed state

2017-04-06 Thread Neelakanta Reddy
- Description has changed:

Diff:



--- old
+++ new
@@ -1,4 +1,4 @@
-ticket [#2145] looks to be causing this issue. 
+ticket #2145 looks to be causing this issue. 
 
 coredump printout is attached.
 



- **Comment**:

The following is the analysis:
1. According to FIGURE 7 of the SMF AIS spec, async-failure is supported in the 
following campaign states:
SA_SMF_CMPG_EXECUTING
SA_SMF_CMPG_SUSPENDING_EXECUTION
SA_SMF_CMPG_ROLLING_BACK

From the campaign perspective, mark the campaign as 
SA_SMF_CMPG_SUSPENDED_BY_ERROR_DETECTED only when the present campaign state 
is one of the above. This will avoid the smfd segmentation fault.

2. However, saAmfSUMaintenanceCampaign is reset (cleared) at the time the 
campaign is committed, as stated in section 4.2.1.3 of the SMF AIS:

"When an upgrade campaign is committed, the Software Management Framework
must reset all the maintenance status attributes that refer to the campaign 
being committed.
Beyond this point, it cannot determine whether a failed entity was upgraded
by the campaign or not."

When a component fails in a state other than the ones above (such as 
SA_SMF_CMPG_EXECUTION_COMPLETED), amfnd will not restart it, since 
saAmfSUMaintenanceCampaign has not yet been reset. Ideally the failed 
component has to be reset, because the campaign will not be moved to an error 
state.
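
The guard proposed in point 1 can be sketched as below. This is an illustration only, not the smfd state machine: the state names come from the analysis above, the numeric enum values are arbitrary rather than the ones in the SMF headers, and `on_async_failure` is a hypothetical helper.

```python
from enum import Enum, auto


class CampaignState(Enum):
    # Names follow the SMF AIS; the values here are arbitrary.
    SA_SMF_CMPG_EXECUTING = auto()
    SA_SMF_CMPG_SUSPENDING_EXECUTION = auto()
    SA_SMF_CMPG_ROLLING_BACK = auto()
    SA_SMF_CMPG_EXECUTION_COMPLETED = auto()
    SA_SMF_CMPG_SUSPENDED_BY_ERROR_DETECTED = auto()


# States in which an async failure may suspend the campaign.
ASYNC_FAILURE_STATES = frozenset({
    CampaignState.SA_SMF_CMPG_EXECUTING,
    CampaignState.SA_SMF_CMPG_SUSPENDING_EXECUTION,
    CampaignState.SA_SMF_CMPG_ROLLING_BACK,
})


def on_async_failure(state):
    """Return the campaign state after an async failure is reported."""
    if state in ASYNC_FAILURE_STATES:
        return CampaignState.SA_SMF_CMPG_SUSPENDED_BY_ERROR_DETECTED
    # Any other state: ignore the failure instead of forcing a transition.
    return state
```

With this guard a failure reported in SA_SMF_CMPG_EXECUTION_COMPLETED leaves the campaign state unchanged, which is the case that currently leads to the crash.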



---

** [tickets:#2413] smf: coredump, suspend is issued at completed state**

**Status:** unassigned
**Milestone:** 5.2.0
**Created:** Wed Apr 05, 2017 12:39 PM UTC by Rafael
**Last Updated:** Thu Apr 06, 2017 10:35 AM UTC
**Owner:** nobody
**Attachments:**

- [osafsmfd.9276.SC-2.core.txt](https://sourceforge.net/p/opensaf/tickets/2413/attachment/osafsmfd.9276.SC-2.core.txt) (15.4 kB; text/plain)


Ticket #2145 looks to be causing this issue. 

A coredump printout is attached.

Steps to reproduce: run a campaign and have an AMF component fail at the 
campaign completed state. This triggers an event in SMF which tries to suspend 
a completed campaign.

Function handleAmfObjectStateChangeNotification will try to call 
asyncFailure(), which is the same as suspend(). Because the campaign is 
completed and committed, this is not a valid transition. The campaign state 
instance has most likely been deleted, therefore we get a coredump.

For reference, see figures 5, 6 and 7 in the SMF AIS, starting from section 
5.1.3.




[tickets] [opensaf:tickets] #2395 CKPT: Performance degradation ~100% (Time taken is almost double than previous)

2017-04-06 Thread Anders Widell
- **status**: assigned --> review



---

** [tickets:#2395] CKPT: Performance degradation ~100% (Time taken is almost 
double than previous)**

**Status:** review
**Milestone:** 5.2.0
**Created:** Thu Mar 23, 2017 10:26 AM UTC by Chani Srivastava
**Last Updated:** Thu Apr 06, 2017 10:33 AM UTC
**Owner:** A V Mahesh (AVM)


Environment details

OS : Suse 11, 64bit Physical machine 
Changeset : 8634 ( 5.2.FC)
Setup : 4 nodes

There is considerable degradation in CKPT performance in 5.2 compared to 5.1. 
Timestamps are taken just before and just after each API call, and the 
difference between them is reported.

-> For write operations, the checkpoint write API takes 2x the time it took in 
the earlier release 5.1. The issue is observed in both synchronous and 
asynchronous mode.
( synchronous -- Checkpoint create flags used : SA_CKPT_WR_ALL_REPLICAS
asynchronous -- Checkpoint create flags used : SA_CKPT_WR_ACTIVE_REPLICA | 
SA_CKPT_CHECKPOINT_COLLOCATED ) Both local and remote replicas

-> For section create operations in asynchronous mode on the local replica, the 
checkpoint section create API takes more than 70% longer than in 5.1.

-> For read operations in asynchronous mode on the local replica, the 
checkpoint read API takes twice as long as in 5.1.

Please check the tickets pushed as part of 4.7 to 5.0 that affected API 
performance.





[tickets] [opensaf:tickets] #2413 smf: coredump, suspend is issued at completed state

2017-04-06 Thread elunlen
- Description has changed:

Diff:



--- old
+++ new
@@ -1,4 +1,4 @@
-ticket #2145 looks to be causing this issue. 
+ticket [#2145] looks to be causing this issue. 
 
 coredump printout is attached.
 






---

** [tickets:#2413] smf: coredump, suspend is issued at completed state**

**Status:** unassigned
**Milestone:** 5.2.0
**Created:** Wed Apr 05, 2017 12:39 PM UTC by Rafael
**Last Updated:** Wed Apr 05, 2017 12:39 PM UTC
**Owner:** nobody
**Attachments:**

- [osafsmfd.9276.SC-2.core.txt](https://sourceforge.net/p/opensaf/tickets/2413/attachment/osafsmfd.9276.SC-2.core.txt) (15.4 kB; text/plain)


Ticket [#2145] looks to be causing this issue. 

A coredump printout is attached.

Steps to reproduce: run a campaign and have an AMF component fail at the 
campaign completed state. This triggers an event in SMF which tries to suspend 
a completed campaign.

Function handleAmfObjectStateChangeNotification will try to call 
asyncFailure(), which is the same as suspend(). Because the campaign is 
completed and committed, this is not a valid transition. The campaign state 
instance has most likely been deleted, therefore we get a coredump.

For reference, see figures 5, 6 and 7 in the SMF AIS, starting from section 
5.1.3.

