Re: [DISCUSSION] 3.0/3.1 compatibility on aarch64 and some other platforms

Aleksandr Polovtsev Wed, 23 Jul 2025 11:50:18 -0700

Thank you, Ivan, for the detailed explanation.
Do I understand correctly that the affected users are mostly those who
currently run Ignite on aarch64 (and some other, rarer architectures)?
If so, your proposal makes sense: Ignite mostly targets servers, which
usually run on x86_64.
However, I think that providing a log storage conversion tool along with
the release may be a good idea from the UX standpoint: if I were an
affected user, I would expect to be able to run the tool straight away
rather than write to the dev list and wait for it to be delivered.


On Wed, Jul 23, 2025 at 5:00 PM Ivan Bessonov <[email protected]> wrote:

> Hello, Igniters!
>
> Recently we encountered an unexpected issue. Let me start with its root
> cause before discussing potential fixes.
>
> We noticed that certain benchmarks showed some inefficiencies when run on
> new MacBooks. They were related to low-level serialization code, and the
> cause was an unaligned read in GridUnsafe. "aarch64" allows unaligned
> reads, but the architecture is not included in the "GridUnsafe#unaligned"
> check, which resulted in the execution of fall-back code that reads and
> writes everything byte by byte.
>
> The fix seemed trivial, and we did it in [1] by adding "aarch64" to the
> list of architectures that support unaligned memory access. After a
> while, when we enabled "ItCompatibilityTest#testCompatibility", we
> realized that compatibility on MacBooks was broken. The incompatibility
> was caused by [1], and as a hotfix, it has been temporarily reverted
> in [2].
>
> How was that possible?
> When we finished the investigation, it turned out that
> "DirectByteBufferStreamImplV1#writeUuid" and
> "DirectByteBufferStreamImplV1#readUuid" have a particularly nasty bug in
> them. This is how these methods behave in 3.0:
>  - If we run on "i386", "x86", "amd64", or "x86_64", we write the parts
>    of a UUID in Big Endian.
>  - If we run on other Little Endian architectures, we write these parts
>    in Little Endian.
>  - If we run on a Big Endian architecture, we write these parts in Big
>    Endian.
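
To make the three cases above concrete, here is a minimal Java sketch of
the effective behavior. It is not the actual DirectByteBufferStreamImplV1
code: the class and method names are hypothetical, and it assumes that the
"parts" of a UUID are its two 64-bit halves, written most significant half
first.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Set;
import java.util.UUID;

/** Hypothetical illustration only; not the real Ignite implementation. */
final class UuidOrderSketch {
    /** Architectures that, per the list above, write UUID parts in Big Endian on 3.0. */
    static final Set<String> X86_FAMILY = Set.of("i386", "x86", "amd64", "x86_64");

    /** Byte order of the UUID parts that 3.0 effectively produces on a given platform. */
    static ByteOrder effectiveOrder(String arch, ByteOrder nativeOrder) {
        if (X86_FAMILY.contains(arch)) {
            return ByteOrder.BIG_ENDIAN; // case 1: x86 family always ends up with BE
        }
        return nativeOrder; // cases 2 and 3: other platforms follow the native order
    }

    /** Assumption: a UUID is written as two longs, most significant half first. */
    static byte[] encode(UUID id, ByteOrder order) {
        return ByteBuffer.allocate(16).order(order)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    /** Reads two longs in the given byte order and rebuilds the UUID. */
    static UUID decode(byte[] bytes, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(order);
        long most = buf.getLong();
        long least = buf.getLong();
        return new UUID(most, least);
    }
}
```

With this model, bytes produced for "aarch64" (Little Endian) do not decode
back to the same UUID on a reader that assumes Big Endian, which matches
the mismatch between 3.0 and "main" described in this thread.
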
>
> When we added "aarch64" to the list of "unaligned" architectures, we
> started treating its data as BE in "main", while Ignite 3.0 treats it as
> LE. To clarify, this stream is used for:
> - Network communication, runtime only.
> - Serialization of raft commands; this data is written to the storage.
> That's why fix [1] broke compatibility.
>
> Such behavior is a problem, because the network protocol and raft
> serialization must be architecture-independent:
> - It is possible that nodes in the same cluster run in different
>   environments with different architectures.
> - It is possible, and almost guaranteed, that raft command serialization
>   happens on a different node, and thus must also be
>   architecture-independent (node A does the serialization, node B writes
>   the resulting payload into the log storage).
>
> That's issue number 1. Issue number 2 was found when we inspected the
> code of "DirectByteBufferStreamImplV1". The parity of the
> "writeFixedInt"/"readFixedInt" methods (and their long counterparts) is
> violated on BE architectures: writes are always LE, but reads use the
> native byte order.
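
The write/read mismatch described here can be reproduced with a small Java
sketch. The class below is hypothetical and only borrows the method names
from the text; the read order is passed as a parameter so that a Big
Endian platform can be simulated on any machine.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical illustration only; not the real Ignite implementation. */
final class FixedIntParitySketch {
    /** Writes are always Little Endian, as described above. */
    static byte[] writeFixedInt(int value) {
        return ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(value).array();
    }

    /**
     * Reads with the given byte order. Passing the platform's native order
     * models the 3.0 behavior; passing LITTLE_ENDIAN models a read that
     * matches the write.
     */
    static int readFixedInt(byte[] bytes, ByteOrder readOrder) {
        return ByteBuffer.wrap(bytes).order(readOrder).getInt();
    }
}
```

On a Little Endian platform the native order happens to match the write
order, so the value round-trips. With ByteOrder.BIG_ENDIAN as the read
order, the same bytes come back as Integer.reverseBytes(value).
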
>
> In other words, Ignite 3.0 doesn't really work on Big Endian
> architectures. Fixing this place in particular is trivial; we will do it
> in 3.1. Fixing the broken Little Endian architectures might not be as
> trivial.
>
> My proposal is the following:
> - We fix the bug in UUID serialization and always use Big Endian for
>   encoding there. This will make our protocols correct on all
>   architectures at once.
>   This fix will break backwards compatibility on Little Endian
>   architectures that are NOT included in the following list: "i386",
>   "x86", "amd64", and "x86_64".
>   This means that an upgrade from 3.0 to 3.1 will be impossible* on those
>   architectures.
> - We add "aarch64" to the list of architectures that support unaligned
>   memory access.
> - We explicitly disable "ItCompatibilityTest#testCompatibility" on a
>   number of architectures.
> - * If it turns out that we have a user who uses one of those
>   architectures and who must upgrade their cluster from 3.0, we will
>   prepare and provide a log storage conversion tool that will convert all
>   Little Endian UUIDs to Big Endian format. As far as I'm aware, only the
>   log storage is affected at the moment.
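
For reference, the per-UUID part of such a conversion is small. The Java
sketch below is hypothetical: it assumes a UUID occupies 16 bytes stored as
two 64-bit halves in the same order on both sides, so the conversion only
reverses the bytes within each half. Locating UUIDs inside log storage
entries is the hard part and is not covered here.

```java
/** Hypothetical illustration only; not an actual Ignite conversion tool. */
final class UuidLeToBeSketch {
    /**
     * Converts one 16-byte UUID at the given offset from Little Endian
     * halves to Big Endian halves, in place.
     */
    static void convertInPlace(byte[] buf, int offset) {
        reverse(buf, offset, 8);     // first 64-bit half
        reverse(buf, offset + 8, 8); // second 64-bit half
    }

    /** Reverses len bytes of buf starting at index from. */
    private static void reverse(byte[] buf, int from, int len) {
        for (int i = from, j = from + len - 1; i < j; i++, j--) {
            byte tmp = buf[i];
            buf[i] = buf[j];
            buf[j] = tmp;
        }
    }
}
```

Applying convertInPlace twice restores the original bytes, so the same
routine also converts in the opposite direction.
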
>
> It's better to fix it in 3.1, because it will be more widely adopted than
> 3.0. I will do that in [3].
> Please provide your feedback on the proposal. What are your thoughts?
> Thank you!
>
> [1] https://issues.apache.org/jira/browse/IGNITE-25564
> [2] https://issues.apache.org/jira/browse/IGNITE-25796
> [3] https://issues.apache.org/jira/browse/IGNITE-25797
>
> --
> Sincerely yours,
> Ivan Bessonov
>


-- 
With regards,
Aleksandr Polovtsev
