Dear gem5 users/devs,

We are using gem5 for architectural exploration based on the Fujitsu A64FX
chip (using the O3_Arm model).
We are running into a puzzling error: a relatively simple program fails with
a page fault on gem5, but runs fine on the real hardware.
After extending gem5's gdb stub so that it works in SE mode, we narrowed the
issue down to the following problem.

The application code around the failure is:

   0x00000000004041ec <+1828>:  b.le    0x4046d4 <diff+3084>
   0x00000000004041f0 <+1832>:  and     x14, x10, #0x7
   0x00000000004041f4 <+1836>:  adrp    x13, 0x422000 <matrix+3964>
   0x00000000004041f8 <+1840>:  ldr     x0, [x13, #160]
   0x00000000004041fc <+1844>:  neg     x17, x14
   0x0000000000404200 <+1848>:  str     x17, [sp, #120]
   0x0000000000404204 <+1852>:  ldr     w17, [sp, #104]
   0x0000000000404208 <+1856>:  sxtw    x12, w2

Setting a breakpoint at 0x0000000000404208, we observe the following register
state:
Breakpoint 2, 0x0000000000404208 in diff () at omp-tasks/alignment/alignment_for/alignment.c:338
338           RR[N] = hh = t = t-gh;
(gdb) i r
x0             0x432380            4400000
x1             0x438365            4424549
x2             0x4b                75
x3             0x26                38
x4             0xffffffa6          4294967206
x5             0x0                 0
x6             0x421084            4329604
x7             0x7ffffe0b68        549755685736
x8             0x26                38
x9             0xffffffdb          4294967259
x10            0x26                38
x11            0x0                 0
x12            0xffffffda          4294967258
x13            0x422000            4333568
x14            0x6                 6
x15            0x7ffffcd1f0        549755605488
x16            0x0                 0
x17            0x0                 0
x18            0x7ffffd1fe8        549755625448
x19            0xfffffe4f          4294966863
x20            0xfffff9bb          4294965691
x21            0xfffffd31          4294966577
x22            0x43aeb8            4435640
x23            0xfffffc23          4294966307
x24            0xfffff9bb          4294965691
x25            0x1                 1
x26            0x956               2390
x27            0x25                37
x28            0x7ffffd6ecc        549755645644
x29            0x7ffffdbcec        549755665644
x30            0x0                 0
sp             0x7ffffcd170        0x7ffffcd170
pc             0x404208            0x404208 <diff+1856>
cpsr           0x0                 [ EL=0 ]
fpsr           <unavailable>
fpcr           <unavailable>

As the code above shows, the previous instruction (at 0x0000000000404204)
loads the 32-bit word at $sp+104 into $w17. However, the memory contents and
the register value differ:
(gdb) p *(int *)($sp+104)
$651 = 1
(gdb) p $w17
$652 = 0

Setting a breakpoint at the same location on real hardware, we get 1 in $w17
(as expected from the memory contents).
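
For reference, our expectation can be written down as a minimal Python sketch
of the architectural effect of the three instructions at 0x4041fc through
0x404204 in the listing above. This is only our reading of the AArch64
semantics, not gem5 code; the helper names (store, load, model_sequence) are
made up for illustration, and the input values are taken from the gdb session
above.

```python
MASK64 = (1 << 64) - 1


def store(mem, addr, value, size):
    """Write `size` bytes of `value` to `mem` in little-endian order."""
    for i, byte in enumerate(value.to_bytes(size, "little")):
        mem[addr + i] = byte


def load(mem, addr, size):
    """Read `size` little-endian bytes from `mem` (unset bytes read as 0)."""
    return int.from_bytes(bytes(mem.get(addr + i, 0) for i in range(size)), "little")


def model_sequence(x14, sp, mem):
    """Model neg / str / ldr from the listing and return the final x17."""
    x17 = (-x14) & MASK64            # neg x17, x14
    store(mem, sp + 120, x17, 8)     # str x17, [sp, #120]  (8-byte store)
    x17 = load(mem, sp + 104, 4)     # ldr w17, [sp, #104]  (zero-extends into x17)
    return x17


# Values observed in the gdb session: x14 = 6, sp = 0x7ffffcd170,
# and the word at sp+104 is 1.
sp = 0x7FFFFCD170
mem = {}
store(mem, sp + 104, 1, 4)

print(model_sequence(6, sp, mem))    # value we expect to see in w17
```

The store writes bytes sp+120 to sp+127 and the load reads bytes sp+104 to
sp+107, so the two accesses do not overlap and the loaded value should simply
be the word already in memory.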

Any ideas or suggestions as to what might be causing this, and how we could
fix it?


Thanks and best regards,
Balazs

-------------
Balazs Gerofi
Research Scientist
High-Performance Artificial Intelligence Systems Research Team
RIKEN Center for Computational Science, Japan
https://bgerofi.github.io/