Hi list,
I ran into an issue while testing corosync (v1.4.5) with 11 nodes. I am not very
familiar with corosync, so please correct me if I am wrong. The steps are as
follows:
1. Make sure corosync debug is off.
2. Start openais on every node; all of them come up fine.
3. Stop openais on 5 nodes. This takes a very long time, and the retransmit
   list starts growing.
I got a piece of log from one node via corosync-blackbox:
rec=[79224] Tracing(1) Messsage=Received ringid(192.168.100.1:6564) seq 1fd
rec=[79225] Tracing(1) Messsage=Delivering 1fc to 1fd
rec=[79226] Tracing(1) Messsage=Delivering MCAST message with seq 1fd to pending delivery queue
rec=[79227] Tracing(1) Messsage=releasing messages up to and including 1fb
rec=[79228] Tracing(1) Messsage=releasing messages up to and including 1fd
rec=[79229] Log Message=got quorate request on 0x6d0980
rec=[79230] Log Message=got quorate request on 0x6d0980
rec=[79231] Log Message=Retransmit List 1
rec=[79232] Log Message=Retransmit List: 201
rec=[79233] Tracing(1) Messsage=mcasted message added to pending queue
rec=[79234] Log Message=Retransmit List 1
rec=[79235] Log Message=Retransmit List: 201
rec=[79236] Tracing(1) Messsage=Delivering 1fd to 205
rec=[79237] Tracing(1) Messsage=Received ringid(192.168.100.1:6564) seq 205
rec=[79238] Tracing(1) Messsage=Delivering 1fd to 205
rec=[79239] Log Message=Retransmit List 1
rec=[79240] Log Message=Retransmit List: 201
rec=[79241] Tracing(1) Messsage=Delivering 1fd to 205
rec=[79242] Log Message=Retransmit List 2
rec=[79243] Log Message=Retransmit List: 201 202
rec=[79244] Tracing(1) Messsage=Delivering 1fd to 205
rec=[79245] Log Message=Retransmit List 2
rec=[79246] Log Message=Retransmit List: 201 202
There is a piece of code in exec/totemsrp.c:
if (range) {
        TRACE1 ("Delivering %x to %x\n", instance->my_high_delivered,
                end_point);
}
...
for (i = 1; i <= range; i++) {

        void *ptr = 0;

        /*
         * If out of range of sort queue, stop assembly
         */
        res = sq_in_range (&instance->regular_sort_queue,
                my_high_delivered_stored + i);
        if (res == 0) {
                break;
        }

        res = sq_item_get (&instance->regular_sort_queue,
                my_high_delivered_stored + i, &ptr);
        /*
         * If hole, stop assembly
         */
        if (res != 0 && skip == 0) {
                break;
        }

        instance->my_high_delivered = my_high_delivered_stored + i;
...
        /*
         * Message found
         */
        TRACE1 ("Delivering MCAST message with seq %x to pending delivery queue\n",
                mcast_header.seq);
From these logs and the code, we can see that messages 1fe, 1ff and 200 have
not been delivered, so the loop must have exited through one of the two break
statements. The first if only checks the sequence ID range, so the second one
is the most likely suspect.
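
To make the reasoning above concrete, here is a minimal toy model of the
assembly loop. It is only a sketch: toy_mark, toy_item_get and toy_deliver are
hypothetical names, the queue is a plain array rather than corosync's sort
queue, and it assumes that 1fe is the sequence number reported as a hole.

```c
#include <assert.h>

#define TOY_QUEUE_SIZE 16u          /* hypothetical toy size */
#define TOY_BASE_SEQ   0x1fcu       /* sequence number kept in slot 0 */

/* Toy stand-in for the sort queue: nonzero means "message present". */
static unsigned int toy_in_use[TOY_QUEUE_SIZE];

/* Mark a sequence number as present (1) or missing (0). */
static void toy_mark(unsigned int seq, unsigned int present)
{
    assert(seq - TOY_BASE_SEQ < TOY_QUEUE_SIZE);
    toy_in_use[seq - TOY_BASE_SEQ] = present;
}

/* Returns 0 when the message is present, nonzero when there is a hole. */
static int toy_item_get(unsigned int seq)
{
    assert(seq - TOY_BASE_SEQ < TOY_QUEUE_SIZE);
    return toy_in_use[seq - TOY_BASE_SEQ] ? 0 : 1;
}

/*
 * Mimics the shape of the loop quoted above: walk from high_delivered + 1
 * to end_point and stop at the first hole. Returns the new high-delivered
 * sequence number.
 */
static unsigned int toy_deliver(unsigned int high_delivered,
                                unsigned int end_point)
{
    unsigned int stored = high_delivered;
    unsigned int range = end_point - stored;
    unsigned int i;

    for (i = 1; i <= range; i++) {
        if (toy_item_get(stored + i) != 0) {
            break;                  /* hole: stop assembly */
        }
        high_delivered = stored + i;
    }
    return high_delivered;
}
```

With every message from 1fc to 205 marked present except 1fe,
toy_deliver(0x1fd, 0x205) returns 0x1fd. The call would still be traced as
"Delivering 1fd to 205", but nothing after 1fd is handed on, which matches the
repeated trace lines in the log.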
include/corosync/sq.h:
static inline unsigned int sq_item_get (
        const struct sq *sq,
        unsigned int seq_id,
        void **sq_item_out)
...
        if (sq->items_inuse[sq_position] == 0) {
                return (ENOENT);
        }
I suspect the items_inuse entry is sometimes cleared, so it reads as 0 when we
access it and sq_item_get returns ENOENT. However, I have not been able to dig
any deeper, so could anyone give me some hints?
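
To separate the two exit conditions, here is a simplified stand-in for a sort
queue. It is a sketch under assumptions, not the real sq.h: the toy_sq type
and all toy_sq_* functions are hypothetical, the slot index is simply
seq_id modulo the queue size, and sequence number wraparound is ignored.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define TOY_SQ_SIZE 8u                       /* hypothetical queue size */

struct toy_sq {
    unsigned int head_seqid;                 /* lowest sequence ID still held */
    void *items[TOY_SQ_SIZE];
    unsigned int items_inuse[TOY_SQ_SIZE];   /* 0 means the slot is empty */
};

static void toy_sq_init(struct toy_sq *sq, unsigned int head_seqid)
{
    memset(sq, 0, sizeof(*sq));
    sq->head_seqid = head_seqid;
}

/* Nonzero when seq_id lies in [head_seqid, head_seqid + TOY_SQ_SIZE). */
static int toy_sq_in_range(const struct toy_sq *sq, unsigned int seq_id)
{
    /* Unsigned subtraction wraps to a huge value when seq_id < head_seqid. */
    return (seq_id - sq->head_seqid) < TOY_SQ_SIZE;
}

static int toy_sq_item_add(struct toy_sq *sq, void *item, unsigned int seq_id)
{
    if (!toy_sq_in_range(sq, seq_id)) {
        return ERANGE;
    }
    sq->items[seq_id % TOY_SQ_SIZE] = item;
    sq->items_inuse[seq_id % TOY_SQ_SIZE] = 1;
    return 0;
}

/* Returns 0 on success, ERANGE when out of range, ENOENT for a hole. */
static int toy_sq_item_get(const struct toy_sq *sq, unsigned int seq_id,
                           void **item_out)
{
    if (!toy_sq_in_range(sq, seq_id)) {
        return ERANGE;
    }
    if (sq->items_inuse[seq_id % TOY_SQ_SIZE] == 0) {
        return ENOENT;
    }
    *item_out = sq->items[seq_id % TOY_SQ_SIZE];
    return 0;
}

/* Release everything up to and including seq_id, clearing the in-use flags. */
static void toy_sq_items_release(struct toy_sq *sq, unsigned int seq_id)
{
    while (sq->head_seqid <= seq_id) {
        sq->items_inuse[sq->head_seqid % TOY_SQ_SIZE] = 0;
        sq->head_seqid++;
    }
}
```

In this toy model a released sequence ID falls below head_seqid, so it fails
the range check before its cleared flag is ever read. ENOENT for an ID that is
still in range therefore means the message was never added to the queue.
Whether the real sort queue behaves the same way in this scenario is the open
question.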
--
Best regards,
Guangliang