Re: Miscomputation with big-endian arm asm

Michael Weiser Sat, 10 Feb 2018 14:31:37 -0800

Hi Niels,

On Wed, Feb 07, 2018 at 01:13:32PM +0100, Niels Möller wrote:


> >> What's the host triplet?
> >
> > armv7veb-hardfloat-linux-gnueabi
>         ^^

> And the "eb" is for big-endian?

Only the b actually. ve stands for virtualization extensions:
http://gcc.gnu.org/ml/gcc-patches/2013-12/msg01783.html.
But that's just my fancy. More common triples would most likely use
armv7b or armv7eb and the above should perhaps have been armv7veeb. :)

> > #  define WORDS_BIGENDIAN 1
> Can you check if it's detected correctly also when cross-compiling?

# ./configure --host=armv7veb-hardfloat-linux-gnueabi
checking build system type... x86_64-unknown-linux-gnu
checking host system type... armv7veb-hardfloat-linux-gnueabi
[...]
configure: summary of build options:

  Version:           nettle 3.4
  Host type:         armv7veb-hardfloat-linux-gnueabi
  ABI:               standard
  Assembly files:    arm/v6 arm
  Install prefix:    /usr/local
  Library directory: ${exec_prefix}/lib
  Compiler:          armv7veb-hardfloat-linux-gnueabi-gcc
  Static libraries:  yes
  Shared libraries:  yes
  Public key crypto: no
  Using mini-gmp:    no
  Documentation:     yes

# grep WORDS_BIG config.h
/* Define WORDS_BIGENDIAN to 1 if your processor stores words with the most
#  define WORDS_BIGENDIAN 1
# ifndef WORDS_BIGENDIAN
#  define WORDS_BIGENDIAN 1

Seems fine.
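
As an independent cross-check of what configure reports, the byte order can also be probed at run time from a small C helper built with the same cross compiler. This is only a sketch; host_is_big_endian is a made-up name and is not part of nettle or autoconf:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the CPU stores the most
 * significant byte of a word at the lowest address, 0 otherwise. */
static int host_is_big_endian(void)
{
    const uint32_t probe = 0x01020304u;
    unsigned char bytes[sizeof probe];

    /* Copy the in-memory representation and look at the first byte. */
    memcpy(bytes, &probe, sizeof probe);
    return bytes[0] == 0x01;
}
```

Called from a main() that prints the result, this should report 1 for a binary built for a big-endian triple and 0 for a little-endian one.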

> > FAIL: memxor
> This also does some tricks with word reads and rotate. (The C code does
> that too, but with conditions on WORDS_BIGENDIAN).

I think I got memxor, sha1 and sha256 sorted. Patch below.
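
To illustrate why the shift directions in memxor have to flip, here is a rough C model of the alignment correction (not the assembly itself). Two adjacent aligned words are merged into the word that starts a few bytes into the first one. The names merge_unaligned and SKETCH_BIG_ENDIAN are invented for this sketch, and the GCC/Clang predefined byte-order macros stand in for WORDS_BIGENDIAN:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: GCC/Clang predefined macros replace config.h here. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
# define SKETCH_BIG_ENDIAN 1
#else
# define SKETCH_BIG_ENDIAN 0
#endif

/* Merge two adjacent aligned words (loaded in host order) into the
 * word that starts `offset` bytes into w0. offset must be 1, 2 or 3,
 * so both shift counts stay in the range 8..24. */
static uint32_t merge_unaligned(uint32_t w0, uint32_t w1, unsigned offset)
{
    assert(offset >= 1 && offset <= 3);
    unsigned cnt = 8 * offset;   /* CNT in the assembly */
    unsigned tnc = 32 - cnt;     /* TNC in the assembly */

    if (SKETCH_BIG_ENDIAN)
        /* Lowest address is the most significant byte: drop the
         * leading bytes of w0 by shifting left. */
        return (w0 << cnt) | (w1 >> tnc);
    else
        /* Lowest address is the least significant byte: drop the
         * leading bytes of w0 by shifting right. */
        return (w0 >> cnt) | (w1 << tnc);
}

/* Reference result: a byte-wise unaligned load via memcpy. */
static uint32_t load_unaligned(const unsigned char *p)
{
    uint32_t w;
    memcpy(&w, p, sizeof w);
    return w;
}
```

On either byte order, merge_unaligned(w0, w1, k) should equal load_unaligned(buf + k) when w0 and w1 were loaded from buf and buf + 4.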

> > FAIL: chacha
> The chacha code doesn't look endian-dependent to me. I'd guess it's a
> consequence of incorrect memxor (below).

This one is still failing, even though memxor and sha are fixed. I've
been looking at the code and can't find any apparent reason. In
chacha-core-internal.c I see the following bit of code that does seem to
do endianness handling:

      dst[i] = LE_SWAP32 (t);

Would this apply to chacha-core-internal.asm, too?
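
For context: ChaCha defines its output as the state words serialized in little-endian byte order, so code that stores whole words has to byte-swap on a big-endian host. The following C sketch shows what a macro like LE_SWAP32 presumably amounts to. All names ending in _sketch, store_word_le and SKETCH_BIG_ENDIAN are invented here and are not nettle's definitions:

```c
#include <stdint.h>
#include <string.h>

/* Assumption: GCC/Clang predefined macros replace config.h here. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
# define SKETCH_BIG_ENDIAN 1
#else
# define SKETCH_BIG_ENDIAN 0
#endif

/* Reverse the four bytes of a 32-bit word. */
static uint32_t bswap32_sketch(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u)
         | ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Convert between host order and little-endian order:
 * a no-op on little-endian hosts, a byte reverse on big-endian ones. */
static uint32_t le_swap32_sketch(uint32_t v)
{
    return SKETCH_BIG_ENDIAN ? bswap32_sketch(v) : v;
}

/* Store one output word so that memory holds it least significant
 * byte first, whatever the host byte order is. */
static void store_word_le(unsigned char *dst, uint32_t t)
{
    uint32_t w = le_swap32_sketch(t);
    memcpy(dst, &w, sizeof w);
}
```

If the assembly stores its result words with plain word stores and no equivalent swap, that would be one candidate for the remaining failure, but this is a guess rather than a diagnosis.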

> > FAIL: umac
> Similar problem, I would guess. But this time, loading 64 bits at a time
> into neon registers.

I'm drawing a bit of a blank on this one. It fails on the very first
test case of umac32 where only umac-nh is used and all the input is
zeroes. So there does seem to be another endianness dependency in the
actual computation code. Have I understood correctly that vld1.8 reads
a byte stream and should be endianness-neutral anyway, and that the keys
are in host endianness?

> If you feel like,
> v6/aes-*.asm could also use better code for aligned reading of input
> data. 

Huh, getting existing code to work again is one thing. But actual better
code is certainly beyond me. :-/

> Aarch64 assembly (for both endian flavors) would be nice, but it's a
> separate project. I haven't yet looked into aarch64-assembly. I made an
> attempt to build nettle under termux on my android phone a while ago,
> but it failed because it didn't provide /bin/sh at the expected place.

Sorry, I think I had confused nettle with another library I came across
during debugging which had armv8 code. Again, I think I should leave
producing actually working and efficient assembler code to someone who
knows what they're doing. :)

> >> Before attempting to support big-endian arm, I'd need some idea on how
> >> to test it.
> >
> > Any halfway current ARM cross toolchain should be able to also output
> > big-endian arm binaries (-mbig-endian). Then you could test those with
> > qemu-user-armeb, which is very light-weight in that it doesn't need a
> > kernel or emulated system and allows to run binaries directly.
> Sounds good. I hope the needed tools are packaged in debian, I'll have
> to check that.

I was wrong: While the compiler is able to output big-endian objects
with -mbig-endian, it needs matching libs as well (e.g. libgcc_s).
Debian doesn't have anything precompiled for armeb. They refer you to
Linaro's toolchains or rebootstrap for building from scratch instead (I do
something similar with crossdev on Gentoo).

This Linaro toolchain works for me:
https://releases.linaro.org/components/toolchain/binaries/latest/armeb-linux-gnueabihf/gcc-linaro-7.2.1-2017.11-x86_64_armeb-linux-gnueabihf.tar.xz

michael@debian:~/nettle$ PATH=$HOME/gcc-linaro-7.2.1-2017.11-x86_64_armeb-linux-gnueabihf/bin:$PATH ./configure --host=armeb-linux-gnueabihf
michael@debian:~/nettle$ make
[...]
michael@debian:~/nettle$ file libnettle.so
libnettle.so: ELF 32-bit MSB shared object, ARM, EABI5 BE8 version 1 (SYSV), dynamically linked, BuildID[sha1]=1a8daa9c1d3e61b9d99d34f462337d02c47c9d74, with debug_info, not stripped
michael@debian:~/nettle$ make testsuite/sha1-test

Now qemu can be installed, which automatically registers with binfmt so
that arm binaries can just be executed:

michael@debian:~/nettle$ sudo apt-get install qemu-user-static
michael@debian:~/nettle$ file testsuite/sha1-test
testsuite/sha1-test: ELF 32-bit MSB executable, ARM, EABI5 BE8 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, for GNU/Linux 3.2.0, BuildID[sha1]=ec39b7153f4c09d11cac92d34c8e509bb1f4d0a0, with debug_info, not stripped
michael@debian:~/nettle$ testsuite/sha1-test
/lib/ld-linux-armhf.so.3: No such file or directory
michael@debian:~/nettle$ QEMU_LD_PREFIX=$HOME/gcc-linaro-7.2.1-2017.11-x86_64_armeb-linux-gnueabihf/armeb-linux-gnueabihf/libc testsuite/sha1-test
qemu: uncaught target signal 11 (Segmentation fault) - core dumped
Segmentation fault

This segfaults because of a bug in qemu where it tries to use the host's
/etc/ld.so.cache. Deleting that file "solves" it. Alternatively, the test
could be run in a chroot to avoid the segfault, but that would require some
fiddling with the compiler's sysroot.

michael@debian:~/nettle$ sudo rm /etc/ld.so.cache 
michael@debian:~/nettle$ QEMU_LD_PREFIX=$HOME/gcc-linaro-7.2.1-2017.11-x86_64_armeb-linux-gnueabihf/armeb-linux-gnueabihf/libc testsuite/sha1-test
testsuite/sha1-test: error while loading shared libraries: libnettle.so.6: cannot open shared object file: No such file or directory
michael@debian:~/nettle$ ln -sfn libnettle.so libnettle.so.6
michael@debian:~/nettle$ QEMU_LD_PREFIX=$HOME/gcc-linaro-7.2.1-2017.11-x86_64_armeb-linux-gnueabihf/armeb-linux-gnueabihf/libc LD_LIBRARY_PATH=. testsuite/sha1-test

This worked because configure detected only generic arm support:

  Assembly files:    arm

So plain arm assembly seems to be BE-safe. :)
After hacking configure to also enable arm/v6 with this triple I get:

michael@debian:~/nettle$ QEMU_LD_PREFIX=$HOME/gcc-linaro-7.2.1-2017.11-x86_64_armeb-linux-gnueabihf/armeb-linux-gnueabihf/libc LD_LIBRARY_PATH=. testsuite/sha1-test

Got:

9844f81e1408f6ec b932137d33bed7cf
dcf518a3

Expected:

da39a3ee5e6b4b0d 3255bfef95601890
afd80709
qemu: uncaught target signal 6 (Aborted) - core dumped
Aborted

Which seems about right. With the patch that goes away:

michael@debian:~/nettle$ git am 0001-Support-big-endian-arm-in-sha1-armv6-assembly-code.patch
Applying: Support big-endian arm in sha1 armv6 assembly code
[make && make check]
michael@debian:~/nettle$ QEMU_LD_PREFIX=$HOME/gcc-linaro-7.2.1-2017.11-x86_64_armeb-linux-gnueabihf/armeb-linux-gnueabihf/libc LD_LIBRARY_PATH=. testsuite/sha1-test
michael@debian:~/nettle$

I also tried rebootstrap but this quickly got really involved.

> > --- a/configure.ac
> > +++ b/configure.ac
> > @@ -691,6 +691,7 @@ ASM_TYPE_FUNCTION='@function'
> >  ASM_TYPE_PROGBITS='@progbits'
> >  ASM_MARK_NOEXEC_STACK=''
> >  ASM_ALIGN_LOG=''
> > +ASM_WORDS_BIGENDIAN="$ac_cv_c_bigendian"
> If you have the time, it would be good to file an autoconf bug report,
> asking them to document (and support) that AC_C_BIGENDIAN sets the shell
> variable ac_cv_c_bigendian.

Instead I augmented the default action (which is documented and
shouldn't change) by setting ASM_WORDS_BIGENDIAN directly. Also this
should make the explicit value checking in IF_BE redundant because we
now know for sure configure will never emit anything other than yes and
no. Documentation says that AC_C_BIGENDIAN will abort if endianness
can't be determined.

From db70ecccdc65a97c103f3900b4f45d8370c1dd62 Mon Sep 17 00:00:00 2001
From: Michael Weiser <michael.wei...@gmx.de>
Date: Wed, 7 Feb 2018 00:11:24 +0100
Subject: [PATCH] Support big-endian arm in assembly code

Introduce m4 macros to conditionally handle differences of little- and
big-endian arm in assembler code. Adjust sha1-compress, sha256-compress
and memxor for arm to work in big-endian mode.
---
 arm/memxor.asm             | 21 +++++++++++++++-----
 arm/memxor3.asm            | 49 ++++++++++++++++++++++++++++++----------------
 arm/v6/sha1-compress.asm   |  8 ++++++--
 arm/v6/sha256-compress.asm | 14 ++++++++-----
 asm.m4                     |  3 +++
 config.m4.in               |  1 +
 configure.ac               |  5 ++++-
 7 files changed, 71 insertions(+), 30 deletions(-)

diff --git a/arm/memxor.asm b/arm/memxor.asm
index a50e91bc..239a4034 100644
--- a/arm/memxor.asm
+++ b/arm/memxor.asm
@@ -44,6 +44,11 @@ define(<N>, <r2>)
 define(<CNT>, <r6>)
 define(<TNC>, <r12>)
 
+C little-endian and big-endian need to shift in different directions for
+C alignment correction
+define(<S0ADJ>, IF_LE(<lsr>, <lsl>))
+define(<S1ADJ>, IF_LE(<lsl>, <lsr>))
+
        .syntax unified
 
        .file "memxor.asm"
@@ -99,6 +104,8 @@ PROLOGUE(nettle_memxor)
        C
        C With little-endian, we need to do
        C DST[i] ^= (SRC[i] >> CNT) ^ (SRC[i+1] << TNC)
+       C With big-endian, we need to do
+       C DST[i] ^= (SRC[i] << CNT) ^ (SRC[i+1] >> TNC)
 
        push    {r4,r5,r6}
        
@@ -117,14 +124,14 @@ PROLOGUE(nettle_memxor)
 .Lmemxor_word_loop:
        ldr     r5, [SRC], #+4
        ldr     r3, [DST]
-       eor     r3, r3, r4, lsr CNT
-       eor     r3, r3, r5, lsl TNC
+       eor     r3, r3, r4, S0ADJ CNT
+       eor     r3, r3, r5, S1ADJ TNC
        str     r3, [DST], #+4
 .Lmemxor_odd:
        ldr     r4, [SRC], #+4
        ldr     r3, [DST]
-       eor     r3, r3, r5, lsr CNT
-       eor     r3, r3, r4, lsl TNC
+       eor     r3, r3, r5, S0ADJ CNT
+       eor     r3, r3, r4, S1ADJ TNC
        str     r3, [DST], #+4
        subs    N, #8
        bcs     .Lmemxor_word_loop
@@ -132,10 +139,14 @@ PROLOGUE(nettle_memxor)
        beq     .Lmemxor_odd_done
 
        C We have TNC/8 left-over bytes in r4, high end
-       lsr     r4, CNT
+       S0ADJ   r4, CNT
        ldr     r3, [DST]
        eor     r3, r4
 
+       C memxor_leftover does an LSB store
+       C so we need to reverse if actually BE
+IF_BE(<        rev     r3, r3>)
+
        pop     {r4,r5,r6}
 
        C Store bytes, one by one.
diff --git a/arm/memxor3.asm b/arm/memxor3.asm
index 139fd208..69598e1c 100644
--- a/arm/memxor3.asm
+++ b/arm/memxor3.asm
@@ -49,6 +49,11 @@ define(<ATNC>, <r10>)
 define(<BCNT>, <r11>)
 define(<BTNC>, <r12>)
 
+C little-endian and big-endian need to shift in different directions for
+C alignment correction
+define(<S0ADJ>, IF_LE(<lsr>, <lsl>))
+define(<S1ADJ>, IF_LE(<lsl>, <lsr>))
+
        .syntax unified
 
        .file "memxor3.asm"
@@ -124,6 +129,8 @@ PROLOGUE(nettle_memxor3)
        C
        C With little-endian, we need to do
        C DST[i-i] ^= (SRC[i-i] >> CNT) ^ (SRC[i] << TNC)
+       C With big-endian, we need to do
+       C DST[i-i] ^= (SRC[i-i] << CNT) ^ (SRC[i] >> TNC)
        rsb     ATNC, ACNT, #32
        bic     BP, #3
 
@@ -138,14 +145,14 @@ PROLOGUE(nettle_memxor3)
 .Lmemxor3_au_loop:
        ldr     r5, [BP, #-4]!
        ldr     r6, [AP, #-4]!
-       eor     r6, r6, r4, lsl ATNC
-       eor     r6, r6, r5, lsr ACNT
+       eor     r6, r6, r4, S1ADJ ATNC
+       eor     r6, r6, r5, S0ADJ ACNT
        str     r6, [DST, #-4]!
 .Lmemxor3_au_odd:
        ldr     r4, [BP, #-4]!
        ldr     r6, [AP, #-4]!
-       eor     r6, r6, r5, lsl ATNC
-       eor     r6, r6, r4, lsr ACNT
+       eor     r6, r6, r5, S1ADJ ATNC
+       eor     r6, r6, r4, S0ADJ ACNT
        str     r6, [DST, #-4]!
        subs    N, #8
        bcs     .Lmemxor3_au_loop
@@ -154,7 +161,11 @@ PROLOGUE(nettle_memxor3)
 
        C Leftover bytes in r4, low end
        ldr     r5, [AP, #-4]
-       eor     r4, r5, r4, lsl ATNC
+       eor     r4, r5, r4, S1ADJ ATNC
+
+       C leftover does an LSB store
+       C so we need to reverse if actually BE
+IF_BE(<        rev     r4, r4>)
 
 .Lmemxor3_au_leftover:
        C Store a byte at a time
@@ -247,21 +258,25 @@ PROLOGUE(nettle_memxor3)
        ldr     r5, [AP, #-4]!
        ldr     r6, [BP, #-4]!
        eor     r5, r6
-       lsl     r4, ATNC
-       eor     r4, r4, r5, lsr ACNT
+       S1ADJ   r4, ATNC
+       eor     r4, r4, r5, S0ADJ ACNT
        str     r4, [DST, #-4]!
 .Lmemxor3_uu_odd:
        ldr     r4, [AP, #-4]!
        ldr     r6, [BP, #-4]!
        eor     r4, r6
-       lsl     r5, ATNC
-       eor     r5, r5, r4, lsr ACNT
+       S1ADJ   r5, ATNC
+       eor     r5, r5, r4, S0ADJ ACNT
        str     r5, [DST, #-4]!
        subs    N, #8
        bcs     .Lmemxor3_uu_loop
        adds    N, #8
        beq     .Lmemxor3_done
 
+       C leftover does an LSB store
+       C so we need to reverse if actually BE
+IF_BE(<        rev     r4, r4>)
+
        C Leftover bytes in a4, low end
        ror     r4, ACNT
 .Lmemxor3_uu_leftover:
@@ -290,18 +305,18 @@ PROLOGUE(nettle_memxor3)
 .Lmemxor3_uud_loop:
        ldr     r5, [AP, #-4]!
        ldr     r7, [BP, #-4]!
-       lsl     r4, ATNC
-       eor     r4, r4, r6, lsl BTNC
-       eor     r4, r4, r5, lsr ACNT
-       eor     r4, r4, r7, lsr BCNT
+       S1ADJ   r4, ATNC
+       eor     r4, r4, r6, S1ADJ BTNC
+       eor     r4, r4, r5, S0ADJ ACNT
+       eor     r4, r4, r7, S0ADJ BCNT
        str     r4, [DST, #-4]!
 .Lmemxor3_uud_odd:
        ldr     r4, [AP, #-4]!
        ldr     r6, [BP, #-4]!
-       lsl     r5, ATNC
-       eor     r5, r5, r7, lsl BTNC
-       eor     r5, r5, r4, lsr ACNT
-       eor     r5, r5, r6, lsr BCNT
+       S1ADJ   r5, ATNC
+       eor     r5, r5, r7, S1ADJ BTNC
+       eor     r5, r5, r4, S0ADJ ACNT
+       eor     r5, r5, r6, S0ADJ BCNT
        str     r5, [DST, #-4]!
        subs    N, #8
        bcs     .Lmemxor3_uud_loop
diff --git a/arm/v6/sha1-compress.asm b/arm/v6/sha1-compress.asm
index 59d6297e..52739b69 100644
--- a/arm/v6/sha1-compress.asm
+++ b/arm/v6/sha1-compress.asm
@@ -52,7 +52,7 @@ define(<LOAD>, <
        sel     W, WPREV, T0
        ror     W, W, SHIFT
        mov     WPREV, T0
-       rev     W, W
+IF_LE(< rev    W, W>)
        str     W, [SP,#eval(4*$1)]
 >)
 define(<EXPN>, <
@@ -127,8 +127,12 @@ PROLOGUE(_nettle_sha1_compress)
        lsl     SHIFT, SHIFT, #3
        mov     T0, #0
        movne   T0, #-1
-       lsl     W, T0, SHIFT
+IF_LE(<        lsl     W, T0, SHIFT>)
+IF_BE(<        lsr     W, T0, SHIFT>)
        uadd8   T0, T0, W               C Sets APSR.GE bits
+       C on BE rotate right by 32-SHIFT bits
+       C because there is no rotate left
+IF_BE(<        rsb     SHIFT, SHIFT, #32>)
        
        ldr     K, .LK1
        ldm     STATE, {SA,SB,SC,SD,SE}
diff --git a/arm/v6/sha256-compress.asm b/arm/v6/sha256-compress.asm
index e6f4e1e9..324730c7 100644
--- a/arm/v6/sha256-compress.asm
+++ b/arm/v6/sha256-compress.asm
@@ -137,8 +137,12 @@ PROLOGUE(_nettle_sha256_compress)
        lsl     SHIFT, SHIFT, #3
        mov     T0, #0
        movne   T0, #-1
-       lsl     I1, T0, SHIFT
+IF_LE(<        lsl     I1, T0, SHIFT>)
+IF_BE(<        lsr     I1, T0, SHIFT>)
        uadd8   T0, T0, I1              C Sets APSR.GE bits
+       C on BE rotate right by 32-SHIFT bits
+       C because there is no rotate left
+IF_BE(<        rsb     SHIFT, SHIFT, #32>)
 
        mov     DST, sp
        mov     ILEFT, #4
@@ -146,16 +150,16 @@ PROLOGUE(_nettle_sha256_compress)
        ldm     INPUT!, {I1,I2,I3,I4}
        sel     I0, I0, I1
        ror     I0, I0, SHIFT
-       rev     I0, I0
+IF_LE(<        rev     I0, I0>)
        sel     I1, I1, I2
        ror     I1, I1, SHIFT
-       rev     I1, I1
+IF_LE(<        rev     I1, I1>)
        sel     I2, I2, I3
        ror     I2, I2, SHIFT
-       rev     I2, I2
+IF_LE(<        rev     I2, I2>)
        sel     I3, I3, I4
        ror     I3, I3, SHIFT
-       rev     I3, I3
+IF_LE(<        rev     I3, I3>)
        subs    ILEFT, ILEFT, #1
        stm     DST!, {I0,I1,I2,I3}
        mov     I0, I4  
diff --git a/asm.m4 b/asm.m4
index 4018c235..343a55fc 100644
--- a/asm.m4
+++ b/asm.m4
@@ -51,6 +51,9 @@ define(<ALIGN>,
 <.align ifelse(ALIGN_LOG,yes,<m4_log2($1)>,$1)
 >)
 
+define(<IF_BE>, <ifelse(WORDS_BIGENDIAN,yes,<$1>,<$2>)>)
+define(<IF_LE>, <IF_BE(<$2>, <$1>)>)
+
 dnl Struct defining macros
 
 dnl STRUCTURE(prefix) 
diff --git a/config.m4.in b/config.m4.in
index e39c880c..11f90a40 100644
--- a/config.m4.in
+++ b/config.m4.in
@@ -7,6 +7,7 @@ define(<TYPE_PROGBITS>, <@ASM_TYPE_PROGBITS@>)dnl
 define(<ALIGN_LOG>, <@ASM_ALIGN_LOG@>)dnl
 define(<W64_ABI>, <@W64_ABI@>)dnl
 define(<RODATA>, <@ASM_RODATA@>)dnl
+define(<WORDS_BIGENDIAN>, <@ASM_WORDS_BIGENDIAN@>)dnl
 divert(1)
 @ASM_MARK_NOEXEC_STACK@
 divert
diff --git a/configure.ac b/configure.ac
index 41bf0998..21eba3b5 100644
--- a/configure.ac
+++ b/configure.ac
@@ -201,7 +201,9 @@ LSH_FUNC_STRERROR
 # getenv_secure is used for fat overrides,
 # getline is used in the testsuite
 AC_CHECK_FUNCS(secure_getenv getline)
-AC_C_BIGENDIAN
+AC_C_BIGENDIAN([AC_DEFINE([WORDS_BIGENDIAN], 1)
+       [ASM_WORDS_BIGENDIAN=yes]],
+       [ASM_WORDS_BIGENDIAN=no])
 
 AC_CACHE_CHECK([for __builtin_bswap64],
                nettle_cv_c_builtin_bswap64,
@@ -811,6 +813,7 @@ AC_SUBST(ASM_TYPE_PROGBITS)
 AC_SUBST(ASM_MARK_NOEXEC_STACK)
 AC_SUBST(ASM_ALIGN_LOG)
 AC_SUBST(W64_ABI)
+AC_SUBST(ASM_WORDS_BIGENDIAN)
 AC_SUBST(EMULATOR)
 
 AC_SUBST(LIBNETTLE_MAJOR)
-- 
2.16.1

-- 
bye, Micha
_______________________________________________
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs