[tickets] [opensaf:tickets] Re: #1072 Sync stop after few payload nodes joining the cluster (TCP)

2014-10-07 Thread Hans Feldt
What is TCP broadcast? I have never heard of that... My guess is that DTM_MCAST_ADDR allows you to specify the UDP multicast address to be used for discovery. In Adrian's case there is no broadcast address on the eth0 interface in each container, so he has to specify it instead of using the one f
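As a quick sanity check on the multicast guess above, Python's standard `ipaddress` module can confirm whether a configured value such as the 224.0.0.6 quoted elsewhere in this thread falls in a multicast range. This is only a sketch: the helper name `is_multicast_address` is hypothetical, and the check says nothing about how DTM itself interprets DTM_MCAST_ADDR.

```python
import ipaddress


def is_multicast_address(addr: str) -> bool:
    """Return True if addr parses as an IPv4/IPv6 multicast address.

    Hypothetical helper for illustration; not part of OpenSAF.
    """
    try:
        return ipaddress.ip_address(addr).is_multicast
    except ValueError:
        # Empty or malformed values are treated as "not multicast".
        return False


# Values quoted in this thread.
print(is_multicast_address("224.0.0.6"))
print(is_multicast_address("172.17.0.109"))
```

An empty string, as in the "leave DTM_MCAST_ADDR empty" setup discussed in the thread, is simply reported as not multicast by this helper.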

[tickets] [opensaf:tickets] Re: #1072 Sync stop after few payload nodes joining the cluster (TCP)

2014-10-07 Thread Hans Feldt
This backtrace indicates it is the same as https://sourceforge.net/p/opensaf/tickets/1157/ which is a duplicate of https://sourceforge.net/p/opensaf/tickets/607/ --- **[tickets:#1072] Sync stop after few payload nodes joining the cluster (TCP)** **Status:** unassigned **Milestone:** 4.4.2 **Created:**

[tickets] [opensaf:tickets] Re: #1072 Sync stop after few payload nodes joining the cluster (TCP)

2014-10-06 Thread Adrian Szwej
2) No; I have DTM_MCAST_ADDR=224.0.0.6. Leaving it empty does not work for me. Is there any difference? Why do I need to change to broadcast mode? 3) With 58860 I can only bring up 5-6 payloads; then TRY_AGAIN happens every time. 4) Yes, reducing to a lower value allows me to get past 5-6 nodes, up to ~50 wit

[tickets] [opensaf:tickets] Re: #1072 Sync stop after few payload nodes joining the cluster (TCP)

2014-10-06 Thread A V Mahesh (AVM)
Based on your data, I understand the following is your current status: 1) You are running OpenSAF in Docker containers, and the containers have addresses 172.17.0.1 - 172.17.0.150. 2) You configured TCP Broadcast (that means DTM_MCAST_ADDR= is empty); you only updated the `DTM_NODE_IP`

[tickets] [opensaf:tickets] Re: #1072 Sync stop after few payload nodes joining the cluster (TCP)

2014-10-06 Thread Adrian Szwej
Hi, I am starting to think it is a bug in the batch sync logic, or in combination with MDS fragmentation. Changing the default value 55388 to something lower, like 4096, doesn't trigger the bug. immcfg -a opensafImmSyncBatchSize=4096 opensafImm=opensafImm,safApp=safImmService With this configura

[tickets] [opensaf:tickets] Re: #1072 Sync stop after few payload nodes joining the cluster (TCP)

2014-10-06 Thread Adrian Szwej
I am running OpenSAF in Docker containers: * one cluster. * I don't have any iptables rules. * can reach the internet from my containers * can multicast to other nodes in my network. All containers are connected to the docker0 bridge: inet addr:172.17.42.1 bridge name brid

[tickets] [opensaf:tickets] Re: #1072 Sync stop after few payload nodes joining the cluster (TCP)

2014-10-05 Thread A V Mahesh (AVM)
It seems that very fundamental TCP cluster bring-up with Broadcast is not working for you, so let us start from the basic configuration. 1) Please make sure all of your nodes are in the same subnet, for example: SC-1 : 192.168.56.101 slot -1 SC-2 : 192.168.56.102 slot -2 PL-3 : 192.168.56.103 slot -3 P
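The same-subnet requirement in step 1 can be checked mechanically. The sketch below uses Python's standard `ipaddress` module with the three example addresses from the message; the `/24` prefix length and the helper name `all_in_subnet` are assumptions made for illustration, since the message does not state a netmask.

```python
import ipaddress


def all_in_subnet(addresses, cidr):
    """Return True if every address belongs to the network given by cidr.

    Hypothetical helper for illustration; not part of OpenSAF.
    """
    network = ipaddress.ip_network(cidr, strict=False)
    return all(ipaddress.ip_address(a) in network for a in addresses)


# Example node addresses from the message above.
nodes = {
    "SC-1": "192.168.56.101",
    "SC-2": "192.168.56.102",
    "PL-3": "192.168.56.103",
}

# The /24 prefix is an assumption; substitute the real netmask.
print(all_in_subnet(nodes.values(), "192.168.56.0/24"))
```

Running the same check against the container addresses quoted elsewhere in the thread would need the Docker bridge's actual prefix length, which the snippets here do not show.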

[tickets] [opensaf:tickets] Re: #1072 Sync stop after few payload nodes joining the cluster (TCP)

2014-10-03 Thread Adrian Szwej
It does not work for me with an empty DTM_MCAST_ADDR. The payload node just loops with: Oct 3 19:08:42.162880 osafimmnd [3275:immnd_proc.c:0393] TR First immnd_introduceMe, sending pbeEnabled:3 WITH params Oct 3 19:08:42.163181 osafimmnd [3275:immnd_proc.c:0413] TR Possibly extended intro

[tickets] [opensaf:tickets] Re: #1072 Sync stop after few payload nodes joining the cluster (TCP)

2014-10-03 Thread A V Mahesh (AVM)
>Have you tried to start 7 nodes in container setup joining them one by one? I am assuming that at a given point in time one node is rebooted in the cluster. If yes, I did test rebooting some payloads and it works for me with TIPC Broadcast; if no, please provide the sequence of reboots that yo

[tickets] [opensaf:tickets] Re: #1072 Sync stop after few payload nodes joining the cluster (TCP)

2014-10-03 Thread A V Mahesh (AVM)
On 10/3/2014 12:11 PM, Adrian Szwej wrote: > Yes; I meant #1036. I got instruction to test this patch to see if it helps. This bug fix is exclusively for TIPC, so it has no effect on TCP in any manner. > DTMD config; > DTM_NODE_IP=172.17.0.109 > DTM_MCAST_ADDR=224.0.0.6 It is news to me that y

[tickets] [opensaf:tickets] Re: #1072 Sync stop after few payload nodes joining the cluster (TCP)

2014-10-02 Thread Adrian Szwej
Hi Mahesh, Yes; I meant #1036. I got instruction to test this patch to see if it helps. BR **DTMD config**; DTM_NODE_IP=172.17.0.109 DTM_MCAST_ADDR=224.0.0.6 **imm.xml** Default generated 7-70 nodes. Does not matter. It is reproducible with around 6-8 nodes. immnd tracing seems to trigg

[tickets] [opensaf:tickets] Re: #1072 Sync stop after few payload nodes joining the cluster (TCP)

2014-10-02 Thread Adrian Szwej
Mahesh; I have managed to bring up 30 containers on one VM for quite some time with export IMMSV_NUM_NODES=30 export IMMSV_MAX_WAIT=50. So initial loading seems to work differently than syncing a node on join. The biggest concern I have is the fault analysis here. I have gone through logs, mds log,

[tickets] [opensaf:tickets] Re: #1072 Sync stop after few payload nodes joining the cluster (TCP)

2014-10-01 Thread A V Mahesh (AVM)
>On 10/2/2014 12:09 AM, Adrian Szwej wrote: > I have now applied patch for #1032 on top of 4.6 changeset 5969:ead18326c13b. You mean [#1036] ? > [devel] [PATCH 1 of 1] mds: use correct buff-length to distinguish > mcast or multi-unicast [#1036] > This patch does not resolve the problem. This pa

[tickets] [opensaf:tickets] Re: #1072 Sync stop after few payload nodes joining the cluster (TCP)

2014-09-17 Thread Anders Bjornerstedt
Hi Adrian, I have re-opened the ticket and changed the component to MDS. Whoever is responsible for MDS may be able to diagnose the cause just based on the coredump. I have not checked the MDS backlog to see if there is any older ticket documenting similar symptoms. https://sourceforge.net/p/opensaf/tickets/search/?q

[tickets] [opensaf:tickets] Re: #1072 Sync stop after few payload nodes joining the cluster (TCP)

2014-09-17 Thread Adrian Szwej
#0 0x7fe7eba49bb9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56 #1 0x7fe7eba4cfc8 in __GI_abort () at abort.c:89 #2 0x7fe7eba42a76 in __assert_fail_base (fmt=0x7fe7ebb94370 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x7fe7

[tickets] [opensaf:tickets] Re: #1072 Sync stop after few payload nodes joining the cluster (TCP)

2014-09-17 Thread Adrian Szwej
It is the IMMD that is crashing, causing the messages to become pending. I am attaching the coredump and the immnd and immd trace files from SC-1, where 7 nodes join one by one. When PL-8 joins, the IMMD coredumps. The code used was changeset 5828:df7bef2079b1 + change of IMMSV_DEFAULT_FEVS_MAX_PENDING to

[tickets] [opensaf:tickets] Re: #1072 Sync stop after few payload nodes joining the cluster (TCP)

2014-09-15 Thread Anders Bjornerstedt
Instead of blindly changing other configuration parameters, please first try to find out what the PROBLEM is. Go back to OpenSAF defaults on all settings, except IMMSV_FEVS_MAX_PENDING which you had increased to 255 (the maximum possible). You said you had "managed to overcome the performance iss

[tickets] [opensaf:tickets] Re: #1072 Sync stop after few payload nodes joining the cluster (TCP)

2014-09-15 Thread A V Mahesh (AVM)
Some time back, I brought up 30 nodes with TCP transport without any issue. At that time, in addition to increasing the MDS buffers (MDS_SOCK_SND_RCV_BUF_SIZE & DTM_SOCK_SND_RCV_BUF_SIZE), I also increased wmem_max & rmem_max; you could also give that a try. sysctl -w net.core.wmem_max=33554432 sysctl -

[tickets] [opensaf:tickets] Re: #1072 Sync stop after few payload nodes joining the cluster (TCP)

2014-09-15 Thread Anders Bjornerstedt
Well, a hint is that you managed to bypass the problem (temporarily) by increasing a queue size. The error: Sep 6 6:58:02.096641 osafimmnd [502:ImmModel.cc:1366] T2 ERR_TRY_AGAIN: Too many pending incoming fevs messages (> 16) rejecting sync iteration next request is very rarely seen, but can hap
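To see how often this rejection occurs and which pending-message limit was in effect, the trace line quoted above can be matched with a small regular expression. This is a sketch that assumes the message text appears exactly as quoted; the helper name `pending_limit` is hypothetical and not part of any OpenSAF tool.

```python
import re

# Pattern built from the trace line quoted in the message above.
PENDING_RE = re.compile(
    r"ERR_TRY_AGAIN: Too many pending incoming fevs messages \(> (\d+)\)"
)


def pending_limit(line):
    """Return the pending-message limit named in a trace line, or None."""
    match = PENDING_RE.search(line)
    return int(match.group(1)) if match else None


line = (
    "Sep 6 6:58:02.096641 osafimmnd [502:ImmModel.cc:1366] T2 "
    "ERR_TRY_AGAIN: Too many pending incoming fevs messages (> 16) "
    "rejecting sync iteration next request"
)
print(pending_limit(line))
```

Applied line by line to an immnd trace file, counting the non-None results gives a rough measure of how many sync iterations were rejected.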

[tickets] [opensaf:tickets] Re: #1072 Sync stop after few payload nodes joining the cluster (TCP)

2014-09-15 Thread Adrian Szwej
I don't think it is a performance problem. There is nothing indicating a bottleneck in CPU load, memory, or IO bandwidth. A simple node join just seems to trigger some "logical" bug. There is no application, just pure OpenSAF. I am now trying to experiment with different MDS configuration options and