On Tue, Mar 19, 2013 at 07:44:21AM -0700, Steven Dake wrote:
> On 03/19/2013 03:18 AM, Guangliang Zhao wrote:
> >Hi list,

Hi Steven,

Thanks for your reply.

> >
> >There is an issue when I tested corosync (v1.4.5) with 11 nodes. I am not
> >very familiar with corosync, so please correct me if I am wrong. The steps
> >are as follows:
> >
> >1. Make sure corosync debug is off.
> >2. Start openais on every node; all of them come up fine.
> >3. Stop openais on 5 nodes. This takes a very long time, and the retransmit
> >list starts growing.
> >
> >I got this log excerpt from one node via corosync-blackbox:
> >
> >rec=[79224] Tracing(1) Messsage=Received ringid(192.168.100.1:6564) seq 1fd
> >rec=[79225] Tracing(1) Messsage=Delivering 1fc to 1fd
> >rec=[79226] Tracing(1) Messsage=Delivering MCAST message with seq 1fd to 
> >pending delivery queue
> >rec=[79227] Tracing(1) Messsage=releasing messages up to and including 1fb
> >rec=[79228] Tracing(1) Messsage=releasing messages up to and including 1fd
> >rec=[79229] Log Message=got quorate request on 0x6d0980
> >rec=[79230] Log Message=got quorate request on 0x6d0980
> >rec=[79231] Log Message=Retransmit List 1
> >rec=[79232] Log Message=Retransmit List: 201
> >rec=[79233] Tracing(1) Messsage=mcasted message added to pending queue
> >rec=[79234] Log Message=Retransmit List 1
> >rec=[79235] Log Message=Retransmit List: 201
> >rec=[79236] Tracing(1) Messsage=Delivering 1fd to 205
> >rec=[79237] Tracing(1) Messsage=Received ringid(192.168.100.1:6564) seq 205
> >rec=[79238] Tracing(1) Messsage=Delivering 1fd to 205
> >rec=[79239] Log Message=Retransmit List 1
> >rec=[79240] Log Message=Retransmit List: 201
> >rec=[79241] Tracing(1) Messsage=Delivering 1fd to 205
> >rec=[79242] Log Message=Retransmit List 2
> >rec=[79243] Log Message=Retransmit List: 201 202
> >rec=[79244] Tracing(1) Messsage=Delivering 1fd to 205
> >rec=[79245] Log Message=Retransmit List 2
> >rec=[79246] Log Message=Retransmit List: 201 202
> >
> >There is a piece of code in exec/totemsrp.c:
> >
> >        if (range) {
> >                TRACE1 ("Delivering %x to %x\n", instance->my_high_delivered,
> >                        end_point);
> >        }
> >
> >...
> >
> >        for (i = 1; i <= range; i++) {
> >
> >                void *ptr = 0;
> >
> >                /*
> >                 * If out of range of sort queue, stop assembly
> >                 */
> >                res = sq_in_range (&instance->regular_sort_queue,
> >                        my_high_delivered_stored + i);
> >                if (res == 0) {
> >                        break;
> >                }
> >
> >                res = sq_item_get (&instance->regular_sort_queue,
> >                        my_high_delivered_stored + i, &ptr);
> >                /*
> >                 * If hole, stop assembly
> >                 */
> >                if (res != 0 && skip == 0) {
> >                        break;
> >                }
> >
> >                instance->my_high_delivered = my_high_delivered_stored + i;
> >
> >...
> >
> >                /*
> >                 * Message found
> >                 */
> >                TRACE1 ("Delivering MCAST message with seq %x to pending delivery queue\n",
> >                        mcast_header.seq);
> >
> > From this log and code, we can see that messages 1fe, 1ff and 200 have
> > not been delivered, so the loop must have exited through one of the two
> > break statements.
> >
> >The first if only checks the seq id range, so the second one is the most
> >likely suspect.
> >
> >include/corosync/sq.h:
> >
> >static inline unsigned int sq_item_get (
> >        const struct sq *sq,
> >        unsigned int seq_id,
> >        void **sq_item_out)
> >
> >...
> >
> >        if (sq->items_inuse[sq_position] == 0) {
> >                return (ENOENT);
> >        }
> >
> >I think the items_inuse array may sometimes be cleared, so it reads 0 when
> >we access it. However, I could not dig any deeper, so could anyone give
> >me some hints?
> >
> 
> items_inuse[sq_position] should contain zero if there is no entry.
> If there is no entry, we want to stop processing in the above code
> because it is a hole in the messages.

If we want to skip the hole in the messages, I think my_high_delivered (or
some other state) should be updated, but it is not. So the code always tries
to deliver messages starting from my_high_delivered + 1 and never succeeds,
because message my_high_delivered + 1 is itself a hole?
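
To make the question concrete, here is a minimal sketch of how I understand
the delivery loop. It is not the corosync code: struct demo_sq, demo_sq_add,
demo_deliver and DEMO_SQ_SIZE are hypothetical names made up for this mail,
the sq_in_range check is left out, and the queue is just a flag array indexed
by seq modulo its size.

```c
#include <stddef.h>

#define DEMO_SQ_SIZE 16	/* hypothetical size, not the corosync value */

/*
 * Toy stand-in for the sort queue: inuse[seq % DEMO_SQ_SIZE] is 1 when
 * that sequence number has been received and 0 when it is a hole.
 */
struct demo_sq {
	unsigned char inuse[DEMO_SQ_SIZE];
};

static void demo_sq_add(struct demo_sq *sq, unsigned int seq)
{
	sq->inuse[seq % DEMO_SQ_SIZE] = 1;
}

/*
 * Walk from high_delivered + 1 up to end_point in order.
 * With skip == 0 the walk stops at the first hole, mirroring the
 * "If hole, stop assembly" break in the quoted loop. With skip != 0
 * holes are passed over. Returns the new high_delivered value.
 */
static unsigned int demo_deliver(const struct demo_sq *sq,
	unsigned int high_delivered, unsigned int end_point, int skip)
{
	unsigned int stored = high_delivered;
	unsigned int range = end_point - stored;
	unsigned int i;

	for (i = 1; i <= range; i++) {
		int hole = (sq->inuse[(stored + i) % DEMO_SQ_SIZE] == 0);

		if (hole && skip == 0) {
			break;	/* hole: stop, high_delivered is not advanced */
		}
		high_delivered = stored + i;
	}
	return high_delivered;
}
```

In this model, if 1fe, 1ff and 200 are present and 201 is the hole, a call
with high_delivered = 1fd and end_point = 205 returns 200. If 1fe itself is
missing, the same call returns 1fd every time, which is the kind of stall I
think I am seeing in the log.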

I collected the corosync-blackbox output from one of the nodes, but it is a
pretty big log. I can send it as an attachment in my next mail if you need it.

> 
> The sort queue is a circular array which is cleared as
> sq_item_release is called.  This should only occur after the message
> has been delivered to all nodes on the ring in
> totemsrp.c:messages_free.
> 
> Regards
> -steve
> 
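
For reference, this is how I currently picture the circular array you
describe, as a rough sketch only. All toy_* names and TOY_SQ_SIZE are
hypothetical and are not the API in include/corosync/sq.h; the item payload
is reduced to an int, and I assume release clears every slot up to and
including the given seq and moves the head forward.

```c
#include <errno.h>
#include <string.h>

#define TOY_SQ_SIZE 8	/* hypothetical size */

struct toy_sq {
	unsigned int head_seqid;		/* lowest seq still held */
	unsigned char items_inuse[TOY_SQ_SIZE];	/* 0 means no entry (hole) */
	int items[TOY_SQ_SIZE];
};

static void toy_sq_init(struct toy_sq *sq, unsigned int head_seqid)
{
	memset(sq, 0, sizeof(*sq));
	sq->head_seqid = head_seqid;
}

/* Unsigned subtraction also rejects seq ids below the head. */
static int toy_sq_in_range(const struct toy_sq *sq, unsigned int seq_id)
{
	return (seq_id - sq->head_seqid) < TOY_SQ_SIZE;
}

static int toy_sq_item_add(struct toy_sq *sq, unsigned int seq_id, int value)
{
	if (!toy_sq_in_range(sq, seq_id)) {
		return ERANGE;
	}
	sq->items[seq_id % TOY_SQ_SIZE] = value;
	sq->items_inuse[seq_id % TOY_SQ_SIZE] = 1;
	return 0;
}

static int toy_sq_item_get(const struct toy_sq *sq, unsigned int seq_id,
	int *value_out)
{
	if (!toy_sq_in_range(sq, seq_id)) {
		return ERANGE;
	}
	if (sq->items_inuse[seq_id % TOY_SQ_SIZE] == 0) {
		return ENOENT;	/* hole: nothing stored for this seq */
	}
	*value_out = sq->items[seq_id % TOY_SQ_SIZE];
	return 0;
}

/* Clear every slot up to and including seq_id and advance the head. */
static void toy_sq_items_release(struct toy_sq *sq, unsigned int seq_id)
{
	while (toy_sq_in_range(sq, seq_id)) {
		sq->items_inuse[sq->head_seqid % TOY_SQ_SIZE] = 0;
		sq->head_seqid++;
	}
}
```

If this picture is right, a slot can only read as empty inside the valid
range when the message was never received, since a released seq falls below
the head and is reported as out of range instead.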

-- 
Best regards,
Guangliang