On Fri, Nov 15, 2013 at 2:29 AM, Steven Dake <[email protected]> wrote:
> On 11/14/2013 02:22 AM, Christine Caulfield wrote:
>> On 14/11/13 05:01, John Thompson wrote:
>>> Hi,
>>>
>>> I am using corosync in a cluster that includes both big and little
>>> endian systems and am coming across crashes when there are
>>> retransmits in the cluster.
>>>
>>> I wondered therefore if others had tried this previously?
>>>
>>> As part of this I have identified that totempg_deliver_fn modifies the
>>> mcast msg in place to convert for endian purposes, even though it
>>> might still be on a sort queue and used for retransmission.
>>> This means that if there are different endian systems operating and a
>>> retransmission of the msg is performed, it will have been endian
>>> converted in-place and so what the node receives is a message that
>>> has some endian converted fields.
>>>
>>> I will submit a patch for this.
>
> Endian conversion happens on receipt of the message and is based upon a
> field in the message indicating which endian the message was originated
> with. If a message is changed in a retransmit queue, I would expect its
> endian field is also modified, resulting in newly transmitted messages
> being correctly decoded by the receivers.
>
> When totem was originally written in Corosync, we had ppc, arm, and x86_64
> as the major platforms for Corosync. But corosync hasn't been tried in
> years on these platforms. It did work grand at one point ;) Most of the
> world has moved to x86_64 so the need hasn't presented itself to focus on
> this area of the code base lately.
>
>> I suspect it hasn't been tried for a very long time! If you have a patch
>> that fixes the bug it will be gratefully received :-)
>>
>> Chrissie

Thanks for the responses.

I was trying out corosync in a cluster with a big endian system and 4 little endian systems. When there was a degree of packet loss that led to retransmissions, a crash would occur.

I worked out that this was in totempg_deliver_fn, where the mcast->msg_count field was VERY high. When checking the number out it looked to be endian swapped. So I tried endian swapping into a local variable in this function, and the totempg_deliver_fn crashes no longer occur.

I have looked into it further and believe this is because totemsrp.c:messages_deliver_to_app (which ends up calling totempg_deliver_fn) delivers the msg while it remains on the regular_sort_queue, which can be used for retransmission. This means that if the msg_count gets endian swapped in place and the message then has to be retransmitted, the node that requested the retransmission gets a message whose msg_count has already been endian swapped.

I have sent in a patch that resolves this problem. The only problem I have with it is what I have changed around the fragmentation case. I think I have this wrong and am reworking the patch to get this right.

Thanks,
John
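
To illustrate the effect described above, here is a minimal sketch of the two delivery patterns. It is not corosync code: struct demo_mcast, swab16, deliver_in_place, deliver_with_local and simulate_retransmit are all hypothetical names made up for this example, and the real mcast header has more fields than the single msg_count modelled here. It also assumes the retransmitted copy is still decoded by the receiver as if it were in the originator's byte order, which is what the crash suggests.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for a queued multicast message.
 * msg_count is held exactly as it arrived, in the sender's byte order. */
struct demo_mcast {
    uint16_t msg_count;
};

/* Swap the two bytes of a 16-bit value. */
static uint16_t swab16(uint16_t x)
{
    return (uint16_t)(((x & 0xffu) << 8) | (x >> 8));
}

/* Problem pattern: the converted value is written back into the
 * queued message, so the queue no longer holds the original bytes. */
static uint16_t deliver_in_place(struct demo_mcast *queued)
{
    queued->msg_count = swab16(queued->msg_count);
    return queued->msg_count;
}

/* Workaround pattern: convert into a local variable and leave the
 * queued message untouched. */
static uint16_t deliver_with_local(const struct demo_mcast *queued)
{
    uint16_t msg_count = swab16(queued->msg_count);
    return msg_count;
}

/* Returns the msg_count decoded by a second opposite-endian receiver
 * that is sent the first receiver's queued copy of the message. */
static uint16_t simulate_retransmit(uint16_t count, int convert_in_place)
{
    /* The message as it sits on the first receiver's queue. */
    struct demo_mcast queued = { swab16(count) };

    uint16_t first = convert_in_place ? deliver_in_place(&queued)
                                      : deliver_with_local(&queued);
    assert(first == count); /* the first delivery always decodes correctly */

    /* Retransmission sends whatever is on the queue now. */
    struct demo_mcast resent = queued;
    return deliver_with_local(&resent);
}
```

With a real count of 3, simulate_retransmit(3, 1) gives 768 at the second receiver because the field has been swapped twice, whereas simulate_retransmit(3, 0) gives 3. That matches the "VERY high" msg_count seen after a retransmit.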
