Under normal conditions we are not able to reproduce the problem by running
`/etc/init.d/opensafd restart`,
so could you please provide the following information to help us reproduce it:
1) Can you please share or elaborate on what the "./opensaf nodestop" and
"./opensaf nodestart" scripts do apart from `/etc/init.d/opensafd stop` and
`/etc/init.d/opensafd restart`?
2) Is there any other non-OpenSAF application using the MDS/TCP library?
If so, is it stopped cleanly before `/etc/init.d/opensafd stop`?
---
** [tickets:#2030] dtm: "Node already exit in the cluster with smiler
configuration"**
**Status:** assigned
**Milestone:** 5.0.2
**Created:** Tue Sep 13, 2016 12:10 PM UTC by Anders Widell
**Last Updated:** Mon Sep 26, 2016 02:26 PM UTC
**Owner:** A V Mahesh (AVM)
osafdtm does not handle rapid consecutive node reboots properly. I got the
following errors in syslog:
~~~
Sep 13 14:00:52 SC-2 local0.err osafdtmd[378]: ER DTM: Node already exit in
the cluster with smiler configuration , correct the other joining Node
configuration
Sep 13 14:01:02 SC-2 local0.err osafdtmd[378]: ER DTM: dtm_node_add failed
.node_ip: 192.168.0.1, node_id: 0
Sep 13 14:01:06 SC-2 local0.err osafdtmd[378]: ER DTM: dtm_node_add failed
.node_ip: 192.168.0.1, node_id: 0
~~~
Here are the steps to reproduce this problem in UML:
./opensaf start
(wait until the cluster comes up)
./opensaf nodestop 2
(wait a few seconds)
./opensaf nodestart 2
./opensaf nodestart 2
The last two commands should be executed in quick succession, perhaps with a
one-second delay between them.
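
The steps above could be wrapped in a small helper so that the timing is repeatable between runs. This is only a sketch: the function name `repro_duplicate_node`, the variables `OPENSAF_CMD`, `STOP_WAIT` and `START_GAP`, and the default delays are assumptions made for illustration, not part of the UML tooling. It expects the cluster to be up already (`./opensaf start`) and to be run from the directory containing the `opensaf` script.

```shell
#!/bin/sh
# Hypothetical helper, not part of OpenSAF: replays the reported sequence.
# OPENSAF_CMD, STOP_WAIT and START_GAP are assumed names with assumed defaults.
OPENSAF_CMD="${OPENSAF_CMD:-./opensaf}"

repro_duplicate_node() {
    node="${1:-2}"                              # node number, 2 in the report
    "$OPENSAF_CMD" nodestop "$node" || return 1
    sleep "${STOP_WAIT:-5}"                     # "wait a few seconds"
    "$OPENSAF_CMD" nodestart "$node" || return 1
    sleep "${START_GAP:-1}"                     # roughly one second apart
    "$OPENSAF_CMD" nodestart "$node"            # second start, no nodestop in between
}
```

Calling `repro_duplicate_node 2` corresponds to the manual steps; adjusting `START_GAP` may help to find out how sensitive the problem is to the delay.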
It seems that osafdtmd asserts and dies when this happens. Here is the result
from a second run of the above test:
~~~
Sep 13 14:25:58 SC-2 local0.err osafdtmd[378]: ER DTM: Node already exit in
the cluster with smiler configuration , correct the other joining Node
configuration
Sep 13 14:25:58 SC-2 local0.err osafdtmd[378]: dtm_node.c:109:
dtm_process_node_info: Assertion '0' failed.
Sep 13 14:25:58 SC-2 local0.err osafamfd[478]: MDTM:SOCKET recd_bytes :0, conn
lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.err osafclmna[468]: MDTM:SOCKET recd_bytes :0, conn
lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.err osafclmd[458]: MDTM:SOCKET recd_bytes :0, conn
lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.err osafntfd[448]: MDTM:SOCKET recd_bytes :0, conn
lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.err osaflogd[437]: MDTM:SOCKET recd_bytes :0, conn
lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.err osafimmnd[426]: MDTM:SOCKET recd_bytes :0, conn
lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.err osafimmd[415]: MDTM:SOCKET recd_bytes :0, conn
lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.err osaffmd[405]: MDTM:SOCKET recd_bytes :0, conn
lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.err osafrded[392]: MDTM:SOCKET recd_bytes :0, conn
lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.notice osafdtmd[378]: NO Established contact with
'SC-1'
Sep 13 14:25:58 SC-2 local0.notice osafdtmd[378]: NO Established contact with
'PL-4'
Sep 13 14:25:58 SC-2 local0.notice osafdtmd[378]: NO Established contact with
'PL-5'
Sep 13 14:25:58 SC-2 local0.notice osafdtmd[378]: NO Established contact with
'PL-3'
Sep 13 14:25:59 SC-2 user.notice osafdtmd: osafdtmd Process down, Rebooting the
node
Sep 13 14:25:59 SC-2 user.notice opensaf_reboot: Rebooting local node;
timeout=60
~~~
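
To check quickly whether a test run hit the same symptoms, the syslog could be searched for the three osafdtmd messages quoted in this ticket. The sketch below assumes that each entry is a single line in the real syslog file (the lines above are wrapped by the mail client) and takes the log file path as an argument, since its location depends on the system. The function name `find_dtm_duplicate_node` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: prints the osafdtmd lines that match the messages
# quoted in this ticket. Returns 0 if at least one line matches.
find_dtm_duplicate_node() {
    grep -E \
        -e 'osafdtmd\[[0-9]+\]: ER DTM: Node already exit in' \
        -e 'osafdtmd\[[0-9]+\]: ER DTM: dtm_node_add failed' \
        -e 'osafdtmd\[[0-9]+\]: dtm_node\.c:[0-9]+: .*Assertion' \
        "$1"
}
```

For example, `find_dtm_duplicate_node /path/to/syslog` prints the matching entries, and the exit status can be used in a test script to decide whether the run reproduced the problem.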
Update: it seems I forgot to do "./opensaf nodestop" between the two "./opensaf
nodestart" commands above. Thus, there were probably two SC-2 nodes running at
the same time, and the error message "Node already exit in the cluster with
smiler configuration" should be interpreted as "duplicate node detected in the
network". Reducing the priority of this defect to "minor". Two problems still
ought to be fixed: the error message should be changed so that its meaning is
clear, and osafdtmd should not assert (it could call opensaf_reboot() if there
is a configuration problem, but an assertion indicates a software problem).