This is an automated email from the ASF dual-hosted git repository.
zeroshade pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-go.git
The following commit(s) were added to refs/heads/main by this push:
new 65f1182a perf: optimize ARM64 NEON min/max assembly (#748)
65f1182a is described below
commit 65f1182a39997dd117aa19130640bd2455975862
Author: Matt Topol <[email protected]>
AuthorDate: Wed Apr 15 14:07:07 2026 -0400
perf: optimize ARM64 NEON min/max assembly (#748)
### Rationale for this change
The NEON assembly in `internal/utils/min_max_neon_arm64.s` was
machine-translated from compiler output (via asm2plan9s) and had two
significant inefficiencies:
1. **32-bit functions used half the available NEON register width** —
`.2s` (64-bit D-registers, 2 lanes) instead of `.4s` (128-bit
Q-registers, 4 lanes), leaving half the hardware throughput on the
table.
2. **64-bit functions wasted 4 MOV instructions per loop iteration** —
`BSL` (bit select) is destructive to its mask operand, forcing register
saves before each compare+select. ARM64 provides `BIT`/`BIF` (bit insert
if true/false) which are destructive to the *accumulator* instead,
eliminating the need for saves entirely.
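The select-instruction semantics can be modeled in scalar Go (a sketch for illustration only; the functions below are not part of the change):

```go
package main

import "fmt"

// bsl models NEON BSL: the result lands in the mask operand, so a loop
// that reuses the mask must copy it first (the 4 MOVs removed here).
//   mask = (mask & a) | (^mask & b)
func bsl(mask, a, b uint64) uint64 {
	return (mask & a) | (^mask & b)
}

// bit models NEON BIT: the result lands in the destination (the running
// min/max accumulator), so no save is needed between iterations.
//   dst = (mask & src) | (^mask & dst)
func bit(dst, src, mask uint64) uint64 {
	return (mask & src) | (^mask & dst)
}

func main() {
	acc, lane := uint64(10), uint64(42)
	take := ^uint64(0) // all-ones mask: select the new lane
	fmt.Println(bsl(take, lane, acc)) // 42, but the mask register is clobbered
	fmt.Println(bit(acc, lane, take)) // 42, accumulator updated in place
	fmt.Println(bit(acc, lane, 0))    // 10, accumulator kept
}
```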
### What changes are included in this PR?
**Assembly optimizations (`min_max_neon_arm64.s`):**
- **32-bit (int32/uint32):** Widen all NEON operations from `.2s` to
`.4s`, processing 8 elements per loop iteration instead of 4. Use
`sminv`/`smaxv`/`uminv`/`umaxv` for single-instruction horizontal
reduction instead of manual `dup` + compare pairs. Adjust loop mask from
`0xfffffffc` (multiples of 4) to `0xfffffff8` (multiples of 8) and
scalar tail threshold from 3 to 7.
- **64-bit (int64/uint64):** Replace `BSL` + 4×`MOV` register saves with
`BIT`/`BIF` instructions. Restructure the 4 independent comparisons to
be grouped together for maximum instruction-level parallelism on
out-of-order cores, followed by 4 independent select operations.
- **Readability:** Replace `LBB0_3` style labels with descriptive names
(`int32_neon`, `int32_loop`, `int32_scalar`, etc.).
**New test file (`min_max_test.go`):**
- Correctness tests for all 4 types (int32, uint32, int64, uint64)
validating NEON results against the pure Go implementation across 15
boundary sizes (0, 1, 2, 3, 4, 7, 8, 9, 15, 16, 31, 63, 64, 100, 1024),
including the NEON/scalar transition points.
- Benchmarks for all 4 types at 5 input sizes (64, 256, 1024, 8192,
65536) with throughput reporting.
### Benchmark results (Apple M4, 6 iterations, benchstat):
```
                         │   before    │                after                │
                         │   sec/op    │    sec/op     vs base               │
MinMaxInt32/n=64-10 5.992n ± 1% 3.675n ± 0% -38.67% (p=0.002 n=6)
MinMaxInt32/n=256-10 20.80n ± 1% 10.75n ± 1% -48.35% (p=0.002 n=6)
MinMaxInt32/n=1024-10 107.20n ± 0% 50.70n ± 0% -52.71% (p=0.002 n=6)
MinMaxInt32/n=8192-10 921.6n ± 0% 466.5n ± 0% -49.39% (p=0.002 n=6)
MinMaxInt32/n=65536-10 7.570µ ± 1% 3.909µ ± 0% -48.37% (p=0.002 n=6)
MinMaxUint32/n=64-10 6.039n ± 1% 3.694n ± 0% -38.83% (p=0.002 n=6)
MinMaxUint32/n=256-10 21.25n ± 0% 10.89n ± 0% -48.76% (p=0.002 n=6)
MinMaxUint32/n=1024-10 109.75n ± 0% 51.81n ± 0% -52.79% (p=0.002 n=6)
MinMaxUint32/n=8192-10 936.9n ± 0% 474.6n ± 0% -49.34% (p=0.002 n=6)
MinMaxUint32/n=65536-10 7.667µ ± 0% 3.960µ ± 0% -48.36% (p=0.002 n=6)
MinMaxInt64/n=64-10 11.18n ± 0% 11.10n ± 0% -0.72% (p=0.002 n=6)
MinMaxInt64/n=256-10 51.09n ± 0% 50.96n ± 0% -0.24% (p=0.022 n=6)
MinMaxInt64/n=1024-10 233.2n ± 0% 232.2n ± 0% -0.41% (p=0.013 n=6)
MinMaxInt64/n=8192-10 1.917µ ± 0% 1.910µ ± 1% -0.37% (p=0.002 n=6)
MinMaxInt64/n=65536-10 15.59µ ± 0% 15.53µ ± 0% -0.40% (p=0.004 n=6)
MinMaxUint64/n=64-10 11.10n ± 0% 11.06n ± 0% -0.41% (p=0.004 n=6)
MinMaxUint64/n=256-10 51.29n ± 0% 51.11n ± 0% ~ (p=0.052 n=6)
MinMaxUint64/n=1024-10 233.9n ± 1% 233.1n ± 0% ~ (p=0.219 n=6)
MinMaxUint64/n=8192-10 1.929µ ± 0% 1.917µ ± 0% -0.60% (p=0.006 n=6)
MinMaxUint64/n=65536-10 15.65µ ± 0% 15.59µ ± 0% -0.38% (p=0.024 n=6)
geomean 228.5n 164.8n -27.87%
```
**32-bit: ~2× throughput** (38 GB/s → 81 GB/s at n=1024). **Geomean:
-27.9% latency, +38.7% throughput.**
The 64-bit improvement is small (~0.4%) because the M4's out-of-order
engine already absorbs MOV latency via register renaming. On in-order or
narrower cores (Cortex-A55/A76) the BIT/BIF optimization would show a
larger improvement.
### Are these changes tested?
Yes. New correctness tests validate all 4 NEON functions against the
pure Go reference implementation across 15 input sizes that exercise:
- Empty input (length 0)
- Scalar-only paths (length 1–7 for 32-bit, 1–3 for 64-bit)
- Exact NEON boundary (length 8 for 32-bit, length 4 for 64-bit)
- NEON + scalar tail (length 9, 15, 31, 63, 100)
- Pure NEON (length 16, 64, 1024)
Each test forces `MinInt`/`MaxInt` values at random positions to verify
extreme values are handled correctly.
### Are there any user-facing changes?
No API changes. This is a pure performance improvement to internal SIMD
routines used by Parquet statistics computation and Arrow dictionary
operations.
---
internal/utils/min_max_avx2_amd64.go | 16 +-
internal/utils/min_max_neon_arm64.go | 8 +-
internal/utils/min_max_neon_arm64.s | 350 +++++++++++++++++------------------
internal/utils/min_max_sse4_amd64.go | 16 +-
internal/utils/min_max_test.go | 187 +++++++++++++++++++
5 files changed, 375 insertions(+), 202 deletions(-)
diff --git a/internal/utils/min_max_avx2_amd64.go b/internal/utils/min_max_avx2_amd64.go
index af672624..fbb03b52 100644
--- a/internal/utils/min_max_avx2_amd64.go
+++ b/internal/utils/min_max_avx2_amd64.go
@@ -29,7 +29,7 @@ import (
func _int8_max_min_avx2(values unsafe.Pointer, length int, minout, maxout
unsafe.Pointer)
func int8MaxMinAVX2(values []int8) (min, max int8) {
- _int8_max_min_avx2(unsafe.Pointer(&values[0]), len(values),
unsafe.Pointer(&min), unsafe.Pointer(&max))
+ _int8_max_min_avx2(unsafe.Pointer(unsafe.SliceData(values)),
len(values), unsafe.Pointer(&min), unsafe.Pointer(&max))
return
}
@@ -37,7 +37,7 @@ func int8MaxMinAVX2(values []int8) (min, max int8) {
func _uint8_max_min_avx2(values unsafe.Pointer, length int, minout, maxout
unsafe.Pointer)
func uint8MaxMinAVX2(values []uint8) (min, max uint8) {
- _uint8_max_min_avx2(unsafe.Pointer(&values[0]), len(values),
unsafe.Pointer(&min), unsafe.Pointer(&max))
+ _uint8_max_min_avx2(unsafe.Pointer(unsafe.SliceData(values)),
len(values), unsafe.Pointer(&min), unsafe.Pointer(&max))
return
}
@@ -45,7 +45,7 @@ func uint8MaxMinAVX2(values []uint8) (min, max uint8) {
func _int16_max_min_avx2(values unsafe.Pointer, length int, minout, maxout
unsafe.Pointer)
func int16MaxMinAVX2(values []int16) (min, max int16) {
- _int16_max_min_avx2(unsafe.Pointer(&values[0]), len(values),
unsafe.Pointer(&min), unsafe.Pointer(&max))
+ _int16_max_min_avx2(unsafe.Pointer(unsafe.SliceData(values)),
len(values), unsafe.Pointer(&min), unsafe.Pointer(&max))
return
}
@@ -53,7 +53,7 @@ func int16MaxMinAVX2(values []int16) (min, max int16) {
func _uint16_max_min_avx2(values unsafe.Pointer, length int, minout, maxout
unsafe.Pointer)
func uint16MaxMinAVX2(values []uint16) (min, max uint16) {
- _uint16_max_min_avx2(unsafe.Pointer(&values[0]), len(values),
unsafe.Pointer(&min), unsafe.Pointer(&max))
+ _uint16_max_min_avx2(unsafe.Pointer(unsafe.SliceData(values)),
len(values), unsafe.Pointer(&min), unsafe.Pointer(&max))
return
}
@@ -61,7 +61,7 @@ func uint16MaxMinAVX2(values []uint16) (min, max uint16) {
func _int32_max_min_avx2(values unsafe.Pointer, length int, minout, maxout
unsafe.Pointer)
func int32MaxMinAVX2(values []int32) (min, max int32) {
- _int32_max_min_avx2(unsafe.Pointer(&values[0]), len(values),
unsafe.Pointer(&min), unsafe.Pointer(&max))
+ _int32_max_min_avx2(unsafe.Pointer(unsafe.SliceData(values)),
len(values), unsafe.Pointer(&min), unsafe.Pointer(&max))
return
}
@@ -69,7 +69,7 @@ func int32MaxMinAVX2(values []int32) (min, max int32) {
func _uint32_max_min_avx2(values unsafe.Pointer, length int, minout, maxout
unsafe.Pointer)
func uint32MaxMinAVX2(values []uint32) (min, max uint32) {
- _uint32_max_min_avx2(unsafe.Pointer(&values[0]), len(values),
unsafe.Pointer(&min), unsafe.Pointer(&max))
+ _uint32_max_min_avx2(unsafe.Pointer(unsafe.SliceData(values)),
len(values), unsafe.Pointer(&min), unsafe.Pointer(&max))
return
}
@@ -77,7 +77,7 @@ func uint32MaxMinAVX2(values []uint32) (min, max uint32) {
func _int64_max_min_avx2(values unsafe.Pointer, length int, minout, maxout
unsafe.Pointer)
func int64MaxMinAVX2(values []int64) (min, max int64) {
- _int64_max_min_avx2(unsafe.Pointer(&values[0]), len(values),
unsafe.Pointer(&min), unsafe.Pointer(&max))
+ _int64_max_min_avx2(unsafe.Pointer(unsafe.SliceData(values)),
len(values), unsafe.Pointer(&min), unsafe.Pointer(&max))
return
}
@@ -85,6 +85,6 @@ func int64MaxMinAVX2(values []int64) (min, max int64) {
func _uint64_max_min_avx2(values unsafe.Pointer, length int, minout, maxout
unsafe.Pointer)
func uint64MaxMinAVX2(values []uint64) (min, max uint64) {
- _uint64_max_min_avx2(unsafe.Pointer(&values[0]), len(values),
unsafe.Pointer(&min), unsafe.Pointer(&max))
+ _uint64_max_min_avx2(unsafe.Pointer(unsafe.SliceData(values)),
len(values), unsafe.Pointer(&min), unsafe.Pointer(&max))
return
}
diff --git a/internal/utils/min_max_neon_arm64.go b/internal/utils/min_max_neon_arm64.go
index f9d3c44e..043201ad 100644
--- a/internal/utils/min_max_neon_arm64.go
+++ b/internal/utils/min_max_neon_arm64.go
@@ -27,7 +27,7 @@ import "unsafe"
func _int32_max_min_neon(values unsafe.Pointer, length int, minout, maxout
unsafe.Pointer)
func int32MaxMinNEON(values []int32) (min, max int32) {
- _int32_max_min_neon(unsafe.Pointer(&values[0]), len(values),
unsafe.Pointer(&min), unsafe.Pointer(&max))
+ _int32_max_min_neon(unsafe.Pointer(unsafe.SliceData(values)),
len(values), unsafe.Pointer(&min), unsafe.Pointer(&max))
return
}
@@ -35,7 +35,7 @@ func int32MaxMinNEON(values []int32) (min, max int32) {
func _uint32_max_min_neon(values unsafe.Pointer, length int, minout, maxout
unsafe.Pointer)
func uint32MaxMinNEON(values []uint32) (min, max uint32) {
- _uint32_max_min_neon(unsafe.Pointer(&values[0]), len(values),
unsafe.Pointer(&min), unsafe.Pointer(&max))
+ _uint32_max_min_neon(unsafe.Pointer(unsafe.SliceData(values)),
len(values), unsafe.Pointer(&min), unsafe.Pointer(&max))
return
}
@@ -43,7 +43,7 @@ func uint32MaxMinNEON(values []uint32) (min, max uint32) {
func _int64_max_min_neon(values unsafe.Pointer, length int, minout, maxout
unsafe.Pointer)
func int64MaxMinNEON(values []int64) (min, max int64) {
- _int64_max_min_neon(unsafe.Pointer(&values[0]), len(values),
unsafe.Pointer(&min), unsafe.Pointer(&max))
+ _int64_max_min_neon(unsafe.Pointer(unsafe.SliceData(values)),
len(values), unsafe.Pointer(&min), unsafe.Pointer(&max))
return
}
@@ -51,6 +51,6 @@ func int64MaxMinNEON(values []int64) (min, max int64) {
func _uint64_max_min_neon(values unsafe.Pointer, length int, minout, maxout
unsafe.Pointer)
func uint64MaxMinNEON(values []uint64) (min, max uint64) {
- _uint64_max_min_neon(unsafe.Pointer(&values[0]), len(values),
unsafe.Pointer(&min), unsafe.Pointer(&max))
+ _uint64_max_min_neon(unsafe.Pointer(unsafe.SliceData(values)),
len(values), unsafe.Pointer(&min), unsafe.Pointer(&max))
return
}
diff --git a/internal/utils/min_max_neon_arm64.s b/internal/utils/min_max_neon_arm64.s
index a31c5d2e..078971d1 100644
--- a/internal/utils/min_max_neon_arm64.s
+++ b/internal/utils/min_max_neon_arm64.s
@@ -1,9 +1,8 @@
//+build !noasm !appengine
-// ARROW-15336
-// (C2GOASM doesn't work correctly for Arm64)
-// Partly GENERATED BY asm2plan9s.
-
+// ARROW-15336: optimized NEON min/max for ARM64
+// 32-bit functions use .4s (128-bit Q registers, 4 lanes) processing 8
elements/iteration
+// 64-bit functions use BIT/BIF instead of BSL+MOV to eliminate register saves
// func _int32_max_min_neon(values unsafe.Pointer, length int, minout, maxout
unsafe.Pointer)
TEXT ·_int32_max_min_neon(SB), $0-32
@@ -13,76 +12,74 @@ TEXT ·_int32_max_min_neon(SB), $0-32
MOVD minout+16(FP), R2
MOVD maxout+24(FP), R3
- // The Go ABI saves the frame pointer register one word below the
- // caller's frame. Make room so we don't overwrite it. Needs to stay
- // 16-byte aligned
+ // The Go ABI saves the frame pointer register one word below the
+ // caller's frame. Make room so we don't overwrite it. Needs to stay
+ // 16-byte aligned
SUB $16, RSP
- WORD $0xa9bf7bfd // stp x29, x30, [sp, #-16]!
+ WORD $0xa9bf7bfd // stp x29, x30, [sp, #-16]!
WORD $0x7100043f // cmp w1, #1
WORD $0x910003fd // mov x29, sp
- BLT LBB0_3
+ BLT int32_early_exit
- WORD $0x71000c3f // cmp w1, #3
+ WORD $0x71001c3f // cmp w1, #7
WORD $0x2a0103e8 // mov w8, w1
- BHI LBB0_4
+ BHI int32_neon
WORD $0xaa1f03e9 // mov x9, xzr
WORD $0x52b0000b // mov w11, #-2147483648
WORD $0x12b0000a // mov w10, #2147483647
- JMP LBB0_7
-LBB0_3:
+ JMP int32_scalar
+int32_early_exit:
WORD $0x12b0000a // mov w10, #2147483647
WORD $0x52b0000b // mov w11, #-2147483648
WORD $0xb900006b // str w11, [x3]
WORD $0xb900004a // str w10, [x2]
WORD $0xa8c17bfd // ldp x29, x30, [sp], #16
- // Put the stack pointer back where it was
+ // Put the stack pointer back where it was
ADD $16, RSP
RET
-LBB0_4:
- WORD $0x927e7509 // and x9, x8, #0xfffffffc
- WORD $0x9100200a // add x10, x0, #8
- WORD $0x0f046402 // movi v2.2s, #128, lsl #24
- WORD $0x2f046400 // mvni v0.2s, #128, lsl #24
- WORD $0x2f046401 // mvni v1.2s, #128, lsl #24
+int32_neon:
+ WORD $0x927d7109 // and x9, x8, #0xfffffff8
+ WORD $0x9100400a // add x10, x0, #16
+ WORD $0x4f046402 // movi v2.4s, #128, lsl #24
+ WORD $0x6f046400 // mvni v0.4s, #128, lsl #24
+ WORD $0x6f046401 // mvni v1.4s, #128, lsl #24
WORD $0xaa0903eb // mov x11, x9
- WORD $0x0f046403 // movi v3.2s, #128, lsl #24
-LBB0_5:
- WORD $0x6d7f9544 // ldp d4, d5, [x10, #-8]
- WORD $0xf100116b // subs x11, x11, #4
- WORD $0x9100414a // add x10, x10, #16
- WORD $0x0ea46c00 // smin v0.2s, v0.2s, v4.2s
- WORD $0x0ea56c21 // smin v1.2s, v1.2s, v5.2s
- WORD $0x0ea46442 // smax v2.2s, v2.2s, v4.2s
- WORD $0x0ea56463 // smax v3.2s, v3.2s, v5.2s
- BNE LBB0_5
+ WORD $0x4f046403 // movi v3.4s, #128, lsl #24
+int32_loop:
+ WORD $0xad7f9544 // ldp q4, q5, [x10, #-16]
+ WORD $0xf100216b // subs x11, x11, #8
+ WORD $0x9100814a // add x10, x10, #32
+ WORD $0x4ea46c00 // smin v0.4s, v0.4s, v4.4s
+ WORD $0x4ea56c21 // smin v1.4s, v1.4s, v5.4s
+ WORD $0x4ea46442 // smax v2.4s, v2.4s, v4.4s
+ WORD $0x4ea56463 // smax v3.4s, v3.4s, v5.4s
+ BNE int32_loop
- WORD $0x0ea36442 // smax v2.2s, v2.2s, v3.2s
- WORD $0x0ea16c00 // smin v0.2s, v0.2s, v1.2s
- WORD $0x0e0c0441 // dup v1.2s, v2.s[1]
- WORD $0x0e0c0403 // dup v3.2s, v0.s[1]
- WORD $0x0ea16441 // smax v1.2s, v2.2s, v1.2s
- WORD $0x0ea36c00 // smin v0.2s, v0.2s, v3.2s
+ WORD $0x4ea36442 // smax v2.4s, v2.4s, v3.4s
+ WORD $0x4ea16c00 // smin v0.4s, v0.4s, v1.4s
+ WORD $0x4eb0a842 // smaxv s2, v2.4s
+ WORD $0x4eb1a800 // sminv s0, v0.4s
WORD $0xeb08013f // cmp x9, x8
- WORD $0x1e26002b // fmov w11, s1
- WORD $0x1e26000a // fmov w10, s0
- BEQ LBB0_9
-LBB0_7:
+ WORD $0x1e26004b // fmov w11, s2
+ WORD $0x1e26000a // fmov w10, s0
+ BEQ int32_done
+int32_scalar:
WORD $0x8b09080c // add x12, x0, x9, lsl #2
WORD $0xcb090108 // sub x8, x8, x9
-LBB0_8:
+int32_scalar_loop:
WORD $0xb8404589 // ldr w9, [x12], #4
WORD $0x6b09015f // cmp w10, w9
- WORD $0x1a89b14a // csel w10, w10, w9, lt
+ WORD $0x1a89b14a // csel w10, w10, w9, lt
WORD $0x6b09017f // cmp w11, w9
- WORD $0x1a89c16b // csel w11, w11, w9, gt
- WORD $0xf1000508 // subs x8, x8, #1
- BNE LBB0_8
-LBB0_9:
+ WORD $0x1a89c16b // csel w11, w11, w9, gt
+ WORD $0xf1000508 // subs x8, x8, #1
+ BNE int32_scalar_loop
+int32_done:
WORD $0xb900006b // str w11, [x3]
WORD $0xb900004a // str w10, [x2]
WORD $0xa8c17bfd // ldp x29, x30, [sp], #16
- // Put the stack pointer back where it was
+ // Put the stack pointer back where it was
ADD $16, RSP
RET
@@ -93,115 +90,113 @@ TEXT ·_uint32_max_min_neon(SB), $0-32
MOVD length+8(FP), R1
MOVD minout+16(FP), R2
MOVD maxout+24(FP), R3
-
- // The Go ABI saves the frame pointer register one word below the
- // caller's frame. Make room so we don't overwrite it. Needs to stay
- // 16-byte aligned
+
+ // The Go ABI saves the frame pointer register one word below the
+ // caller's frame. Make room so we don't overwrite it. Needs to stay
+ // 16-byte aligned
SUB $16, RSP
- WORD $0xa9bf7bfd // stp x29, x30, [sp, #-16]!
+ WORD $0xa9bf7bfd // stp x29, x30, [sp, #-16]!
WORD $0x7100043f // cmp w1, #1
WORD $0x910003fd // mov x29, sp
- BLT LBB1_3
+ BLT uint32_early_exit
- WORD $0x71000c3f // cmp w1, #3
+ WORD $0x71001c3f // cmp w1, #7
WORD $0x2a0103e8 // mov w8, w1
- BHI LBB1_4
+ BHI uint32_neon
WORD $0xaa1f03e9 // mov x9, xzr
WORD $0x2a1f03ea // mov w10, wzr
WORD $0x1280000b // mov w11, #-1
- JMP LBB1_7
-LBB1_3:
+ JMP uint32_scalar
+uint32_early_exit:
WORD $0x2a1f03ea // mov w10, wzr
WORD $0x1280000b // mov w11, #-1
WORD $0xb900006a // str w10, [x3]
WORD $0xb900004b // str w11, [x2]
WORD $0xa8c17bfd // ldp x29, x30, [sp], #16
- // Put the stack pointer back where it was
+ // Put the stack pointer back where it was
ADD $16, RSP
RET
-LBB1_4:
- WORD $0x927e7509 // and x9, x8, #0xfffffffc
- WORD $0x6f00e401 // movi v1.2d, #0000000000000000
- WORD $0x6f07e7e0 // movi v0.2d, #0xffffffffffffffff
- WORD $0x9100200a // add x10, x0, #8
- WORD $0x6f07e7e2 // movi v2.2d, #0xffffffffffffffff
+uint32_neon:
+ WORD $0x927d7109 // and x9, x8, #0xfffffff8
+ WORD $0x6f00e401 // movi v1.2d, #0000000000000000
+ WORD $0x6f07e7e0 // movi v0.2d, #0xffffffffffffffff
+ WORD $0x9100400a // add x10, x0, #16
+ WORD $0x6f07e7e2 // movi v2.2d, #0xffffffffffffffff
WORD $0xaa0903eb // mov x11, x9
- WORD $0x6f00e403 // movi v3.2d, #0000000000000000
-LBB1_5:
- WORD $0x6d7f9544 // ldp d4, d5, [x10, #-8]
- WORD $0xf100116b // subs x11, x11, #4
- WORD $0x9100414a // add x10, x10, #16
- WORD $0x2ea46c00 // umin v0.2s, v0.2s, v4.2s
- WORD $0x2ea56c42 // umin v2.2s, v2.2s, v5.2s
- WORD $0x2ea46421 // umax v1.2s, v1.2s, v4.2s
- WORD $0x2ea56463 // umax v3.2s, v3.2s, v5.2s
- BNE LBB1_5
+ WORD $0x6f00e403 // movi v3.2d, #0000000000000000
+uint32_loop:
+ WORD $0xad7f9544 // ldp q4, q5, [x10, #-16]
+ WORD $0xf100216b // subs x11, x11, #8
+ WORD $0x9100814a // add x10, x10, #32
+ WORD $0x6ea46c00 // umin v0.4s, v0.4s, v4.4s
+ WORD $0x6ea56c42 // umin v2.4s, v2.4s, v5.4s
+ WORD $0x6ea46421 // umax v1.4s, v1.4s, v4.4s
+ WORD $0x6ea56463 // umax v3.4s, v3.4s, v5.4s
+ BNE uint32_loop
- WORD $0x2ea36421 // umax v1.2s, v1.2s, v3.2s
- WORD $0x2ea26c00 // umin v0.2s, v0.2s, v2.2s
- WORD $0x0e0c0422 // dup v2.2s, v1.s[1]
- WORD $0x0e0c0403 // dup v3.2s, v0.s[1]
- WORD $0x2ea26421 // umax v1.2s, v1.2s, v2.2s
- WORD $0x2ea36c00 // umin v0.2s, v0.2s, v3.2s
+ WORD $0x6ea36421 // umax v1.4s, v1.4s, v3.4s
+ WORD $0x6ea26c00 // umin v0.4s, v0.4s, v2.4s
+ WORD $0x6eb0a821 // umaxv s1, v1.4s
+ WORD $0x6eb1a800 // uminv s0, v0.4s
WORD $0xeb08013f // cmp x9, x8
- WORD $0x1e26002a // fmov w10, s1
- WORD $0x1e26000b // fmov w11, s0
- BEQ LBB1_9
-LBB1_7:
+ WORD $0x1e26002a // fmov w10, s1
+ WORD $0x1e26000b // fmov w11, s0
+ BEQ uint32_done
+uint32_scalar:
WORD $0x8b09080c // add x12, x0, x9, lsl #2
WORD $0xcb090108 // sub x8, x8, x9
-LBB1_8:
+uint32_scalar_loop:
WORD $0xb8404589 // ldr w9, [x12], #4
WORD $0x6b09017f // cmp w11, w9
- WORD $0x1a89316b // csel w11, w11, w9, lo
+ WORD $0x1a89316b // csel w11, w11, w9, lo
WORD $0x6b09015f // cmp w10, w9
- WORD $0x1a89814a // csel w10, w10, w9, hi
- WORD $0xf1000508 // subs x8, x8, #1
- BNE LBB1_8
-LBB1_9:
+ WORD $0x1a89814a // csel w10, w10, w9, hi
+ WORD $0xf1000508 // subs x8, x8, #1
+ BNE uint32_scalar_loop
+uint32_done:
WORD $0xb900006a // str w10, [x3]
WORD $0xb900004b // str w11, [x2]
WORD $0xa8c17bfd // ldp x29, x30, [sp], #16
- // Put the stack pointer back where it was
+ // Put the stack pointer back where it was
ADD $16, RSP
RET
// func _int64_max_min_neon(values unsafe.Pointer, length int, minout, maxout
unsafe.Pointer)
TEXT ·_int64_max_min_neon(SB), $0-32
- MOVD values+0(FP), R0
- MOVD length+8(FP), R1
- MOVD minout+16(FP), R2
- MOVD maxout+24(FP), R3
+ MOVD values+0(FP), R0
+ MOVD length+8(FP), R1
+ MOVD minout+16(FP), R2
+ MOVD maxout+24(FP), R3
- // The Go ABI saves the frame pointer register one word below the
- // caller's frame. Make room so we don't overwrite it. Needs to stay
- // 16-byte aligned
+ // The Go ABI saves the frame pointer register one word below the
+ // caller's frame. Make room so we don't overwrite it. Needs to stay
+ // 16-byte aligned
SUB $16, RSP
WORD $0xa9bf7bfd // stp x29, x30, [sp, #-16]!
WORD $0x7100043f // cmp w1, #1
WORD $0x910003fd // mov x29, sp
- BLT LBB2_3
+ BLT int64_early_exit
WORD $0x2a0103e8 // mov w8, w1
WORD $0xd2f0000b // mov x11, #-9223372036854775808
WORD $0x71000c3f // cmp w1, #3
WORD $0x92f0000a // mov x10, #9223372036854775807
- BHI LBB2_4
+ BHI int64_neon
WORD $0xaa1f03e9 // mov x9, xzr
- JMP LBB2_7
-LBB2_3:
+ JMP int64_scalar
+int64_early_exit:
WORD $0x92f0000a // mov x10, #9223372036854775807
WORD $0xd2f0000b // mov x11, #-9223372036854775808
WORD $0xf900006b // str x11, [x3]
WORD $0xf900004a // str x10, [x2]
WORD $0xa8c17bfd // ldp x29, x30, [sp], #16
- // Put the stack pointer back where it was
+ // Put the stack pointer back where it was
ADD $16, RSP
RET
-LBB2_4:
+int64_neon:
WORD $0x927e7509 // and x9, x8, #0xfffffffc
WORD $0x4e080d61 // dup v1.2d, x11
WORD $0x4e080d40 // dup v0.2d, x10
@@ -209,54 +204,50 @@ LBB2_4:
WORD $0xaa0903eb // mov x11, x9
WORD $0x4ea01c02 // mov v2.16b, v0.16b
WORD $0x4ea11c23 // mov v3.16b, v1.16b
-LBB2_5:
+int64_loop:
WORD $0xad7f9544 // ldp q4, q5, [x10, #-16]
- WORD $0x4ea31c66 // mov v6.16b, v3.16b
- WORD $0x4ea11c27 // mov v7.16b, v1.16b
- WORD $0x4ea21c43 // mov v3.16b, v2.16b
- WORD $0x4ea01c01 // mov v1.16b, v0.16b
- WORD $0x4ee03480 // cmgt v0.2d, v4.2d, v0.2d
- WORD $0x4ee234a2 // cmgt v2.2d, v5.2d, v2.2d
- WORD $0x6e641c20 // bsl v0.16b, v1.16b, v4.16b
- WORD $0x4ee434e1 // cmgt v1.2d, v7.2d, v4.2d
- WORD $0x6e651c62 // bsl v2.16b, v3.16b, v5.16b
- WORD $0x4ee534c3 // cmgt v3.2d, v6.2d, v5.2d
- WORD $0xf100116b // subs x11, x11, #4
- WORD $0x6e641ce1 // bsl v1.16b, v7.16b, v4.16b
- WORD $0x6e651cc3 // bsl v3.16b, v6.16b, v5.16b
+ WORD $0x4ee03486 // cmgt v6.2d, v4.2d, v0.2d
+ WORD $0x4ee234a7 // cmgt v7.2d, v5.2d, v2.2d
+ WORD $0x4ee13490 // cmgt v16.2d, v4.2d, v1.2d
+ WORD $0x4ee334b1 // cmgt v17.2d, v5.2d, v3.2d
+ WORD $0x6ee61c80 // bif v0.16b, v4.16b, v6.16b
+ WORD $0x6ee71ca2 // bif v2.16b, v5.16b, v7.16b
+ WORD $0x6eb01c81 // bit v1.16b, v4.16b, v16.16b
+ WORD $0x6eb11ca3 // bit v3.16b, v5.16b, v17.16b
+ WORD $0xf100116b // subs x11, x11, #4
WORD $0x9100814a // add x10, x10, #32
- BNE LBB2_5
+ BNE int64_loop
- WORD $0x4ee33424 // cmgt v4.2d, v1.2d, v3.2d
- WORD $0x4ee03445 // cmgt v5.2d, v2.2d, v0.2d
+ WORD $0x4ee33424 // cmgt v4.2d, v1.2d, v3.2d
+ WORD $0x4ee03445 // cmgt v5.2d, v2.2d, v0.2d
WORD $0x6e631c24 // bsl v4.16b, v1.16b, v3.16b
WORD $0x6e621c05 // bsl v5.16b, v0.16b, v2.16b
WORD $0x4e180480 // dup v0.2d, v4.d[1]
WORD $0x4e1804a1 // dup v1.2d, v5.d[1]
- WORD $0x4ee03482 // cmgt v2.2d, v4.2d, v0.2d
- WORD $0x4ee53423 // cmgt v3.2d, v1.2d, v5.2d
+ WORD $0x4ee03482 // cmgt v2.2d, v4.2d, v0.2d
+ WORD $0x4ee53423 // cmgt v3.2d, v1.2d, v5.2d
WORD $0x6e601c82 // bsl v2.16b, v4.16b, v0.16b
WORD $0x6e611ca3 // bsl v3.16b, v5.16b, v1.16b
WORD $0xeb08013f // cmp x9, x8
- WORD $0x9e66004b // fmov x11, d2
- WORD $0x9e66006a // fmov x10, d3
- BEQ LBB2_9
-LBB2_7:
+ WORD $0x9e66004b // fmov x11, d2
+ WORD $0x9e66006a // fmov x10, d3
+ BEQ int64_done
+int64_scalar:
WORD $0x8b090c0c // add x12, x0, x9, lsl #3
WORD $0xcb090108 // sub x8, x8, x9
-LBB2_8:
+int64_scalar_loop:
WORD $0xf8408589 // ldr x9, [x12], #8
WORD $0xeb09015f // cmp x10, x9
- WORD $0x9a89b14a // csel x10, x10, x9, lt
+ WORD $0x9a89b14a // csel x10, x10, x9, lt
WORD $0xeb09017f // cmp x11, x9
- WORD $0x9a89c16b // csel x11, x11, x9, gt
- WORD $0xf1000508 // subs x8, x8, #1
- BNE LBB2_8
-LBB2_9:
+ WORD $0x9a89c16b // csel x11, x11, x9, gt
+ WORD $0xf1000508 // subs x8, x8, #1
+ BNE int64_scalar_loop
+int64_done:
WORD $0xf900006b // str x11, [x3]
WORD $0xf900004a // str x10, [x2]
WORD $0xa8c17bfd // ldp x29, x30, [sp], #16
- // Put the stack pointer back where it was
+ // Put the stack pointer back where it was
ADD $16, RSP
RET
@@ -264,93 +255,88 @@ LBB2_9:
// func _uint64_max_min_neon(values unsafe.Pointer, length int, minout, maxout
unsafe.Pointer)
TEXT ·_uint64_max_min_neon(SB), $0-32
- MOVD values+0(FP), R0
- MOVD length+8(FP), R1
- MOVD minout+16(FP), R2
- MOVD maxout+24(FP), R3
+ MOVD values+0(FP), R0
+ MOVD length+8(FP), R1
+ MOVD minout+16(FP), R2
+ MOVD maxout+24(FP), R3
- // The Go ABI saves the frame pointer register one word below the
- // caller's frame. Make room so we don't overwrite it. Needs to stay
- // 16-byte aligned
+ // The Go ABI saves the frame pointer register one word below the
+ // caller's frame. Make room so we don't overwrite it. Needs to stay
+ // 16-byte aligned
SUB $16, RSP
WORD $0xa9bf7bfd // stp x29, x30, [sp, #-16]!
WORD $0x7100043f // cmp w1, #1
WORD $0x910003fd // mov x29, sp
- BLT LBB3_3
+ BLT uint64_early_exit
WORD $0x71000c3f // cmp w1, #3
WORD $0x2a0103e8 // mov w8, w1
- BHI LBB3_4
+ BHI uint64_neon
WORD $0xaa1f03e9 // mov x9, xzr
WORD $0xaa1f03ea // mov x10, xzr
WORD $0x9280000b // mov x11, #-1
- JMP LBB3_7
-LBB3_3:
+ JMP uint64_scalar
+uint64_early_exit:
WORD $0xaa1f03ea // mov x10, xzr
WORD $0x9280000b // mov x11, #-1
WORD $0xf900006a // str x10, [x3]
WORD $0xf900004b // str x11, [x2]
WORD $0xa8c17bfd // ldp x29, x30, [sp], #16
- // Put the stack pointer back where it was
+ // Put the stack pointer back where it was
ADD $16, RSP
RET
-LBB3_4:
+uint64_neon:
WORD $0x927e7509 // and x9, x8, #0xfffffffc
WORD $0x9100400a // add x10, x0, #16
- WORD $0x6f00e401 // movi v1.2d, #0000000000000000
- WORD $0x6f07e7e0 // movi v0.2d, #0xffffffffffffffff
- WORD $0x6f07e7e2 // movi v2.2d, #0xffffffffffffffff
+ WORD $0x6f00e401 // movi v1.2d, #0000000000000000
+ WORD $0x6f07e7e0 // movi v0.2d, #0xffffffffffffffff
+ WORD $0x6f07e7e2 // movi v2.2d, #0xffffffffffffffff
WORD $0xaa0903eb // mov x11, x9
- WORD $0x6f00e403 // movi v3.2d, #0000000000000000
-LBB3_5:
+ WORD $0x6f00e403 // movi v3.2d, #0000000000000000
+uint64_loop:
WORD $0xad7f9544 // ldp q4, q5, [x10, #-16]
- WORD $0x4ea31c66 // mov v6.16b, v3.16b
- WORD $0x4ea11c27 // mov v7.16b, v1.16b
- WORD $0x4ea21c43 // mov v3.16b, v2.16b
- WORD $0x4ea01c01 // mov v1.16b, v0.16b
- WORD $0x6ee03480 // cmhi v0.2d, v4.2d, v0.2d
- WORD $0x6ee234a2 // cmhi v2.2d, v5.2d, v2.2d
- WORD $0x6e641c20 // bsl v0.16b, v1.16b, v4.16b
- WORD $0x6ee434e1 // cmhi v1.2d, v7.2d, v4.2d
- WORD $0x6e651c62 // bsl v2.16b, v3.16b, v5.16b
- WORD $0x6ee534c3 // cmhi v3.2d, v6.2d, v5.2d
- WORD $0xf100116b // subs x11, x11, #4
- WORD $0x6e641ce1 // bsl v1.16b, v7.16b, v4.16b
- WORD $0x6e651cc3 // bsl v3.16b, v6.16b, v5.16b
+ WORD $0x6ee03486 // cmhi v6.2d, v4.2d, v0.2d
+ WORD $0x6ee234a7 // cmhi v7.2d, v5.2d, v2.2d
+ WORD $0x6ee13490 // cmhi v16.2d, v4.2d, v1.2d
+ WORD $0x6ee334b1 // cmhi v17.2d, v5.2d, v3.2d
+ WORD $0x6ee61c80 // bif v0.16b, v4.16b, v6.16b
+ WORD $0x6ee71ca2 // bif v2.16b, v5.16b, v7.16b
+ WORD $0x6eb01c81 // bit v1.16b, v4.16b, v16.16b
+ WORD $0x6eb11ca3 // bit v3.16b, v5.16b, v17.16b
+ WORD $0xf100116b // subs x11, x11, #4
WORD $0x9100814a // add x10, x10, #32
- BNE LBB3_5
+ BNE uint64_loop
- WORD $0x6ee33424 // cmhi v4.2d, v1.2d, v3.2d
- WORD $0x6ee03445 // cmhi v5.2d, v2.2d, v0.2d
+ WORD $0x6ee33424 // cmhi v4.2d, v1.2d, v3.2d
+ WORD $0x6ee03445 // cmhi v5.2d, v2.2d, v0.2d
WORD $0x6e631c24 // bsl v4.16b, v1.16b, v3.16b
WORD $0x6e621c05 // bsl v5.16b, v0.16b, v2.16b
WORD $0x4e180480 // dup v0.2d, v4.d[1]
WORD $0x4e1804a1 // dup v1.2d, v5.d[1]
- WORD $0x6ee03482 // cmhi v2.2d, v4.2d, v0.2d
- WORD $0x6ee53423 // cmhi v3.2d, v1.2d, v5.2d
+ WORD $0x6ee03482 // cmhi v2.2d, v4.2d, v0.2d
+ WORD $0x6ee53423 // cmhi v3.2d, v1.2d, v5.2d
WORD $0x6e601c82 // bsl v2.16b, v4.16b, v0.16b
WORD $0x6e611ca3 // bsl v3.16b, v5.16b, v1.16b
WORD $0xeb08013f // cmp x9, x8
- WORD $0x9e66004a // fmov x10, d2
- WORD $0x9e66006b // fmov x11, d3
- BEQ LBB3_9
-LBB3_7:
+ WORD $0x9e66004a // fmov x10, d2
+ WORD $0x9e66006b // fmov x11, d3
+ BEQ uint64_done
+uint64_scalar:
WORD $0x8b090c0c // add x12, x0, x9, lsl #3
WORD $0xcb090108 // sub x8, x8, x9
-LBB3_8:
+uint64_scalar_loop:
WORD $0xf8408589 // ldr x9, [x12], #8
WORD $0xeb09017f // cmp x11, x9
- WORD $0x9a89316b // csel x11, x11, x9, lo
+ WORD $0x9a89316b // csel x11, x11, x9, lo
WORD $0xeb09015f // cmp x10, x9
- WORD $0x9a89814a // csel x10, x10, x9, hi
- WORD $0xf1000508 // subs x8, x8, #1
- BNE LBB3_8
-LBB3_9:
+ WORD $0x9a89814a // csel x10, x10, x9, hi
+ WORD $0xf1000508 // subs x8, x8, #1
+ BNE uint64_scalar_loop
+uint64_done:
WORD $0xf900006a // str x10, [x3]
WORD $0xf900004b // str x11, [x2]
WORD $0xa8c17bfd // ldp x29, x30, [sp], #16
- // Put the stack pointer back where it was
+ // Put the stack pointer back where it was
ADD $16, RSP
RET
-
diff --git a/internal/utils/min_max_sse4_amd64.go b/internal/utils/min_max_sse4_amd64.go
index 1e12a8d1..536cb1ad 100644
--- a/internal/utils/min_max_sse4_amd64.go
+++ b/internal/utils/min_max_sse4_amd64.go
@@ -27,7 +27,7 @@ import "unsafe"
func _int8_max_min_sse4(values unsafe.Pointer, length int, minout, maxout
unsafe.Pointer)
func int8MaxMinSSE4(values []int8) (min, max int8) {
- _int8_max_min_sse4(unsafe.Pointer(&values[0]), len(values),
unsafe.Pointer(&min), unsafe.Pointer(&max))
+ _int8_max_min_sse4(unsafe.Pointer(unsafe.SliceData(values)),
len(values), unsafe.Pointer(&min), unsafe.Pointer(&max))
return
}
@@ -35,7 +35,7 @@ func int8MaxMinSSE4(values []int8) (min, max int8) {
func _uint8_max_min_sse4(values unsafe.Pointer, length int, minout, maxout
unsafe.Pointer)
func uint8MaxMinSSE4(values []uint8) (min, max uint8) {
- _uint8_max_min_sse4(unsafe.Pointer(&values[0]), len(values),
unsafe.Pointer(&min), unsafe.Pointer(&max))
+ _uint8_max_min_sse4(unsafe.Pointer(unsafe.SliceData(values)),
len(values), unsafe.Pointer(&min), unsafe.Pointer(&max))
return
}
@@ -43,7 +43,7 @@ func uint8MaxMinSSE4(values []uint8) (min, max uint8) {
func _int16_max_min_sse4(values unsafe.Pointer, length int, minout, maxout
unsafe.Pointer)
func int16MaxMinSSE4(values []int16) (min, max int16) {
- _int16_max_min_sse4(unsafe.Pointer(&values[0]), len(values),
unsafe.Pointer(&min), unsafe.Pointer(&max))
+ _int16_max_min_sse4(unsafe.Pointer(unsafe.SliceData(values)),
len(values), unsafe.Pointer(&min), unsafe.Pointer(&max))
return
}
@@ -51,7 +51,7 @@ func int16MaxMinSSE4(values []int16) (min, max int16) {
func _uint16_max_min_sse4(values unsafe.Pointer, length int, minout, maxout
unsafe.Pointer)
func uint16MaxMinSSE4(values []uint16) (min, max uint16) {
- _uint16_max_min_sse4(unsafe.Pointer(&values[0]), len(values),
unsafe.Pointer(&min), unsafe.Pointer(&max))
+ _uint16_max_min_sse4(unsafe.Pointer(unsafe.SliceData(values)),
len(values), unsafe.Pointer(&min), unsafe.Pointer(&max))
return
}
@@ -59,7 +59,7 @@ func uint16MaxMinSSE4(values []uint16) (min, max uint16) {
func _int32_max_min_sse4(values unsafe.Pointer, length int, minout, maxout
unsafe.Pointer)
func int32MaxMinSSE4(values []int32) (min, max int32) {
- _int32_max_min_sse4(unsafe.Pointer(&values[0]), len(values),
unsafe.Pointer(&min), unsafe.Pointer(&max))
+ _int32_max_min_sse4(unsafe.Pointer(unsafe.SliceData(values)),
len(values), unsafe.Pointer(&min), unsafe.Pointer(&max))
return
}
@@ -67,7 +67,7 @@ func int32MaxMinSSE4(values []int32) (min, max int32) {
func _uint32_max_min_sse4(values unsafe.Pointer, length int, minout, maxout
unsafe.Pointer)
func uint32MaxMinSSE4(values []uint32) (min, max uint32) {
- _uint32_max_min_sse4(unsafe.Pointer(&values[0]), len(values),
unsafe.Pointer(&min), unsafe.Pointer(&max))
+ _uint32_max_min_sse4(unsafe.Pointer(unsafe.SliceData(values)),
len(values), unsafe.Pointer(&min), unsafe.Pointer(&max))
return
}
@@ -75,7 +75,7 @@ func uint32MaxMinSSE4(values []uint32) (min, max uint32) {
func _int64_max_min_sse4(values unsafe.Pointer, length int, minout, maxout unsafe.Pointer)
func int64MaxMinSSE4(values []int64) (min, max int64) {
- _int64_max_min_sse4(unsafe.Pointer(&values[0]), len(values), unsafe.Pointer(&min), unsafe.Pointer(&max))
+ _int64_max_min_sse4(unsafe.Pointer(unsafe.SliceData(values)), len(values), unsafe.Pointer(&min), unsafe.Pointer(&max))
return
}
@@ -83,6 +83,6 @@ func int64MaxMinSSE4(values []int64) (min, max int64) {
func _uint64_max_min_sse4(values unsafe.Pointer, length int, minout, maxout unsafe.Pointer)
func uint64MaxMinSSE4(values []uint64) (min, max uint64) {
- _uint64_max_min_sse4(unsafe.Pointer(&values[0]), len(values), unsafe.Pointer(&min), unsafe.Pointer(&max))
+ _uint64_max_min_sse4(unsafe.Pointer(unsafe.SliceData(values)), len(values), unsafe.Pointer(&min), unsafe.Pointer(&max))
return
}
diff --git a/internal/utils/min_max_test.go b/internal/utils/min_max_test.go
new file mode 100644
index 00000000..59da6910
--- /dev/null
+++ b/internal/utils/min_max_test.go
@@ -0,0 +1,187 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package utils
+
+import (
+ "fmt"
+ "math"
+ "math/rand/v2"
+ "testing"
+)
+
+func TestMinMaxInt32(t *testing.T) {
+ for _, size := range []int{0, 1, 2, 3, 4, 7, 8, 9, 15, 16, 31, 63, 64, 100, 1024} {
+ t.Run(fmt.Sprintf("n=%d", size), func(t *testing.T) {
+ r := rand.New(&rand.PCG{}) // zero-seed for reproducibility
+ values := make([]int32, size)
+ for i := range values {
+ values[i] = r.Int32() - math.MaxInt32/2
+ }
+ if size > 0 {
+ values[r.IntN(size)] = math.MinInt32
+ values[r.IntN(size)] = math.MaxInt32
+ }
+
+ goMin, goMax := int32MinMax(values)
+ min, max := GetMinMaxInt32(values)
+ if min != goMin || max != goMax {
+ t.Errorf("n=%d: got min=%d max=%d, want min=%d max=%d", size, min, max, goMin, goMax)
+ }
+ })
+ }
+}
+
+func TestMinMaxUint32(t *testing.T) {
+ for _, size := range []int{0, 1, 2, 3, 4, 7, 8, 9, 15, 16, 31, 63, 64, 100, 1024} {
+ t.Run(fmt.Sprintf("n=%d", size), func(t *testing.T) {
+ values := make([]uint32, size)
+ r := rand.New(&rand.PCG{}) // zero-seed for reproducibility
+ for i := range values {
+ values[i] = r.Uint32()
+ }
+ if size > 0 {
+ values[r.IntN(size)] = 0
+ values[r.IntN(size)] = math.MaxUint32
+ }
+
+ goMin, goMax := uint32MinMax(values)
+ min, max := GetMinMaxUint32(values)
+ if min != goMin || max != goMax {
+ t.Errorf("n=%d: got min=%d max=%d, want min=%d max=%d", size, min, max, goMin, goMax)
+ }
+ })
+ }
+}
+
+func TestMinMaxInt64(t *testing.T) {
+ for _, size := range []int{0, 1, 2, 3, 4, 7, 8, 9, 15, 16, 31, 63, 64, 100, 1024} {
+ t.Run(fmt.Sprintf("n=%d", size), func(t *testing.T) {
+ r := rand.New(&rand.PCG{}) // zero-seed for reproducibility
+ values := make([]int64, size)
+ for i := range values {
+ values[i] = r.Int64() - math.MaxInt64/2
+ }
+ if size > 0 {
+ values[r.IntN(size)] = math.MinInt64
+ values[r.IntN(size)] = math.MaxInt64
+ }
+
+ goMin, goMax := int64MinMax(values)
+ min, max := GetMinMaxInt64(values)
+ if min != goMin || max != goMax {
+ t.Errorf("n=%d: got min=%d max=%d, want min=%d max=%d", size, min, max, goMin, goMax)
+ }
+ })
+ }
+}
+
+func TestMinMaxUint64(t *testing.T) {
+ for _, size := range []int{0, 1, 2, 3, 4, 7, 8, 9, 15, 16, 31, 63, 64, 100, 1024} {
+ t.Run(fmt.Sprintf("n=%d", size), func(t *testing.T) {
+ r := rand.New(&rand.PCG{}) // zero-seed for reproducibility
+ values := make([]uint64, size)
+ for i := range values {
+ values[i] = r.Uint64()
+ }
+ if size > 0 {
+ values[r.IntN(size)] = 0
+ values[r.IntN(size)] = math.MaxUint64
+ }
+
+ goMin, goMax := uint64MinMax(values)
+ min, max := GetMinMaxUint64(values)
+ if min != goMin || max != goMax {
+ t.Errorf("n=%d: got min=%d max=%d, want min=%d max=%d", size, min, max, goMin, goMax)
+ }
+ })
+ }
+}
+
+var (
+ benchMinI32 int32
+ benchMaxI32 int32
+ benchMinU32 uint32
+ benchMaxU32 uint32
+ benchMinI64 int64
+ benchMaxI64 int64
+ benchMinU64 uint64
+ benchMaxU64 uint64
+)
+
+func BenchmarkMinMaxInt32(b *testing.B) {
+ for _, size := range []int{64, 256, 1024, 8192, 65536} {
+ values := make([]int32, size)
+ r := rand.New(&rand.PCG{}) // zero-seed for reproducibility
+ for i := range values {
+ values[i] = r.Int32() - math.MaxInt32/2
+ }
+ b.Run(fmt.Sprintf("n=%d", size), func(b *testing.B) {
+ b.SetBytes(int64(size) * 4)
+ for i := 0; i < b.N; i++ {
+ benchMinI32, benchMaxI32 = GetMinMaxInt32(values)
+ }
+ })
+ }
+}
+
+func BenchmarkMinMaxUint32(b *testing.B) {
+ for _, size := range []int{64, 256, 1024, 8192, 65536} {
+ values := make([]uint32, size)
+ r := rand.New(&rand.PCG{}) // zero-seed for reproducibility
+ for i := range values {
+ values[i] = r.Uint32()
+ }
+ b.Run(fmt.Sprintf("n=%d", size), func(b *testing.B) {
+ b.SetBytes(int64(size) * 4)
+ for i := 0; i < b.N; i++ {
+ benchMinU32, benchMaxU32 = GetMinMaxUint32(values)
+ }
+ })
+ }
+}
+
+func BenchmarkMinMaxInt64(b *testing.B) {
+ for _, size := range []int{64, 256, 1024, 8192, 65536} {
+ values := make([]int64, size)
+ r := rand.New(&rand.PCG{}) // zero-seed for reproducibility
+ for i := range values {
+ values[i] = r.Int64() - math.MaxInt64/2
+ }
+ b.Run(fmt.Sprintf("n=%d", size), func(b *testing.B) {
+ b.SetBytes(int64(size) * 8)
+ for i := 0; i < b.N; i++ {
+ benchMinI64, benchMaxI64 = GetMinMaxInt64(values)
+ }
+ })
+ }
+}
+
+func BenchmarkMinMaxUint64(b *testing.B) {
+ for _, size := range []int{64, 256, 1024, 8192, 65536} {
+ values := make([]uint64, size)
+ r := rand.New(&rand.PCG{}) // zero-seed for reproducibility
+ for i := range values {
+ values[i] = r.Uint64()
+ }
+ b.Run(fmt.Sprintf("n=%d", size), func(b *testing.B) {
+ b.SetBytes(int64(size) * 8)
+ for i := 0; i < b.N; i++ {
+ benchMinU64, benchMaxU64 = GetMinMaxUint64(values)
+ }
+ })
+ }
+}