Re: [corosync] Has anyone used corosync with both big & little endian systems in a single cluster?

John Thompson Sun, 17 Nov 2013 20:40:59 -0800

On Fri, Nov 15, 2013 at 2:29 AM, Steven Dake <[email protected]> wrote:

> On 11/14/2013 02:22 AM, Christine Caulfield wrote:
>
>> On 14/11/13 05:01, John Thompson wrote:
>>
>>> Hi,
>>>
>>> I am using corosync in a cluster that includes both big and little
>>> endian systems and am coming
>>> across crashes when there are retransmits in the cluster.
>>>
>>> I wondered therefore if others had tried this previously?
>>>
>>> As part of this I have identified that totempg_deliver_fn modifies the
>>> mcast msg in place to
>>> convert for endian purposes, even though it might still be on a sort
>>> queue and used for retransmission.
>>> This means that if there are different endian systems operating and a
>>> retransmission of the msg
>>> is performed, it will have been endian converted in-place and so what
>>> the node receives is a message that has some endian converted fields.
>>>
>>> I will submit a patch for this.
>>>
>>
> Endian conversion happens on receipt of the message and is based upon a
> field in the message indicating which endian the message was originated
> with.  If a message is changed in a retransmit queue, I would expect its
> endian field is also modified, resulting in newly transmitted messages
> being correctly decoded by the receivers.
>
> When totem was originally written in Corosync, we had ppc, arm, and x86_64
> as all major platforms for Corosync.  But corosync hasn't been tried in
> years on these platforms.  It did work grand at one point ;)  Most of the
> world has moved to x86_64 so the need hasn't presented itself to focus on
> this area of the code base lately.
>
>
>
>> I suspect it hasn't been tried for a very long time! If you have a patch
>> that fixes the bug it will be gratefully received :-)
>>
>> Chrissie
>>
>>
Thanks for the responses.

I was trying out corosync in a cluster with a big-endian and 4 little-endian
systems.  When packet loss led to retransmissions, a crash would occur.  I
traced this to totempg_deliver_fn, where the mcast->msg_count field was VERY
high.  On inspection, the value looked to be endian swapped.  So I tried
endian swapping into a local variable in this function, and the
totempg_deliver_fn crashes no longer occur.

I have looked into it further and believe this is because
totemsrp.c:messages_deliver_to_app (which ends up calling
totempg_deliver_fn) is delivering whilst the msg remains on the
regular_sort_queue, which can be used for retransmission purposes.  This
means that if the msg_count gets endian swapped in place and the message
has to be retransmitted, then the node that requested the retransmission
gets a message where the msg_count has already been endian swapped.
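
To illustrate the failure mode described above, here is a minimal sketch in C.
It is not corosync code: struct demo_msg, the foreign_endian flag, and the
function names are all hypothetical, and the real mcast header and the way
corosync detects a foreign byte order are more involved.  It only shows why
converting a field in place on a buffer that is still queued for
retransmission differs from converting into a local variable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified message; NOT the real corosync mcast layout. */
struct demo_msg {
    int      foreign_endian; /* assumption: nonzero if the originator's byte order differs */
    uint16_t msg_count;      /* stored in the originator's byte order */
};

/* Swap the two bytes of a 16-bit value. */
static uint16_t swab16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Problematic pattern: converts in place, so the queued copy is altered. */
static uint16_t deliver_in_place(struct demo_msg *queued)
{
    assert(queued != NULL);
    if (queued->foreign_endian) {
        queued->msg_count = swab16(queued->msg_count);
    }
    return queued->msg_count;
}

/* Safer pattern: convert into a local and leave the queued copy untouched. */
static uint16_t deliver_with_local(const struct demo_msg *queued)
{
    uint16_t count;

    assert(queued != NULL);
    count = queued->msg_count;
    if (queued->foreign_endian) {
        count = swab16(count);
    }
    return count;
}
```

In this sketch, calling deliver_in_place twice on the same buffer stands in
for a local delivery followed by a retransmission that the requesting node
then decodes: the second conversion undoes the first and yields a wrong
count.  deliver_with_local returns the same count however many times the
queued buffer is read.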

I have sent in a patch that resolves this problem.  The one part I am unsure
of is my change around the fragmentation case.  I think I have this wrong and
am reworking the patch to get it right.

Thanks,
John

_______________________________________________
discuss mailing list
[email protected]
http://lists.corosync.org/mailman/listinfo/discuss
