By the way, I also tried skipping the checkpoint entirely: if I just boot with 
the KVM CPU and run lulesh to completion, it succeeds:


/mydata/gem5/build/X86/gem5.opt --outdir=m5out_p16 -r -e fs.py 
--disk-image=./x86-64-system/disks/base.img 
--kernel=./x86-64-system/binaries/vmlinux-5.4.49 --num-cpus=16 
--cpu-type=X86KvmCPU --mem-size=8GB --script=./test.rcS



test.rcS:
/bin/lulesh2.0  -p -s 100
m5 exit



According to the output, lulesh2.0 runs to completion and gem5 exits 
normally.



                       
Original
                       
                     

From: "YongjieHuang via gem5-users" <gem5-users@gem5.org>

Date: 2024/8/16 15:57

To: "gem5-users" <gem5-users@gem5.org>

CC: "YongjieHuang" <876167...@qq.com>

Subject: [gem5-users] BUG: kernel NULL pointer dereference occurs when restoring 
a checkpoint generated by KVM core in FS mode



Dear all,




I want to write checkpoints with the KVM CPU and restore them with the O3 CPU, 
but I hit a kernel BUG.

My gem5 version is v23.0.0.1. The image and kernel were downloaded 
from https://www.gem5.org/project/2020/03/09/boot-tests.html.

The kernel is 'vmlinux-5.4.49' and the image is 'disk image (GZIPPED)'.




I used X86KvmCPU to write a checkpoint while lulesh2.0 was running with 
OpenMP. Below is the command that boots the system and writes a checkpoint 
after 9 billion instructions:
/mydata/gem5/build/X86/gem5.opt --outdir=m5out_p16 -r -e fs.py 
--disk-image=./x86-64-system/disks/disk.img 
--kernel=./x86-64-system/binaries/vmlinux-5.4.49 --num-cpus=16 
--cpu-type=X86KvmCPU --mem-size=8GB --checkpoint-dir=ckptest --at-instruction 
--take-checkpoints 9000000000 --script=./test.rcS



test.rcS is the script that runs lulesh2.0, which I copied manually into /bin 
of the disk image after mounting it with sudo mount:


/bin/lulesh2.0  -p -s 100
m5 exit



However, when I restore that checkpoint with the O3 CPU using the command line 
below, I see a kernel BUG in the system.pc.com_1.device file instead of the 
lulesh process continuing.


command line: /mydata/gem5/build/X86/gem5.opt --outdir=m5out_p16 -r -e fs.py 
--disk-image=./x86-64-system/disks/disk.img 
--kernel=./x86-64-system/binaries/vmlinux-5.4.49 --num-cpus=16 
--cpu-type=X86O3CPU --caches --cpu-clock=2.4GHz --l1i_size=32kB --l1i_assoc=8 
--l1d_size=64kB --l1d_assoc=8 --l2cache --l2_size=1MB --l2_assoc=16 --l3cache 
--l3_size=16MB --l3_assoc=16 --mem-size=8GB --checkpoint-dir=ckptest -r 1
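To make it easier to confirm that the checkpoint and restore runs share the same disk image, kernel, CPU count and memory size, here is a minimal Python sketch that builds both command lines from one shared option list. The helper names (COMMON, checkpoint_cmd, restore_cmd) are hypothetical; every flag and value is copied from the commands in this message, and the sketch assumes the order of the fs.py options does not matter.

```python
import shlex

# Options shared by both runs (values copied from the commands in this message).
COMMON = [
    "/mydata/gem5/build/X86/gem5.opt", "--outdir=m5out_p16", "-r", "-e", "fs.py",
    "--disk-image=./x86-64-system/disks/disk.img",
    "--kernel=./x86-64-system/binaries/vmlinux-5.4.49",
    "--num-cpus=16", "--mem-size=8GB", "--checkpoint-dir=ckptest",
]


def checkpoint_cmd():
    """Command line for the KVM run that writes the checkpoint."""
    return COMMON + [
        "--cpu-type=X86KvmCPU", "--at-instruction",
        "--take-checkpoints", "9000000000", "--script=./test.rcS",
    ]


def restore_cmd():
    """Command line for the O3 run that restores checkpoint 1."""
    return COMMON + [
        "--cpu-type=X86O3CPU", "--caches", "--cpu-clock=2.4GHz",
        "--l1i_size=32kB", "--l1i_assoc=8", "--l1d_size=64kB", "--l1d_assoc=8",
        "--l2cache", "--l2_size=1MB", "--l2_assoc=16",
        "--l3cache", "--l3_size=16MB", "--l3_assoc=16", "-r", "1",
    ]


if __name__ == "__main__":
    # Print both commands in a form that can be pasted into a shell.
    print(shlex.join(checkpoint_cmd()))
    print(shlex.join(restore_cmd()))
```

Running the file prints the two commands; the only differences between them are the CPU model, the cache options, and the checkpoint write/restore flags.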



BUG: kernel NULL pointer dereference, address: 0000000000000040
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP NOPTI
CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.4.49 #8
Hardware name:  , BIOS  06/08/2008
Workqueue:  0x0 (events)
RIP: 0010:set_next_entity+0x9/0x65
Code: 48 89 df 5b 5d 41 5c e9 fb a0 ff ff 59 48 89 df 31 d2 5b 5d 41 5c e9 35 
a4 ff ff 58 5b 5d 41 5c c3 41 55 41 54 55 53 48 89 fd <83> 7e 40 00 48 89 f3 
74 35 4c 8d 66 18 4c 3b 67 40 4c 8d 6f 38 75
RSP: 0018:ffffc90000037e30 EFLAGS: 0000006e
RAX: 0000000000000000 RBX: ffff888238a26440 RCX: ffffffff81a19d80
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888238a26480
RBP: ffff888238a26480 R08: 00000003e5663c00 R09: 00000000000000ff
R10: 00000000fffbad80 R11: 0000000000000800 R12: 0000000000000000
R13: ffff888238a26480 R14: ffff8882379749b0 R15: 0000000000000000
FS:  00007faa47824700(0000) GS:ffff888238a00000(0000) 
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000b0 CR3: 00000002358e8000 CR4: 00000000000006f0
Call Trace:
 pick_next_task_fair+0xe5/0x18c
 __schedule+0x1e3/0x40a
 ? do_raw_spin_lock+0x2b/0x52
 ? create_worker+0x16a/0x16a
 schedule+0x75/0x9f
 worker_thread+0x1e7/0x22f
 kthread+0xf0/0xf5
 ? kthread_destroy_worker+0x39/0x39
 ret_from_fork+0x22/0x40
Modules linked in:
CR2: 0000000000000040
---[ end trace 25f0872c331972c4 ]---
BUG: kernel NULL pointer dereference, address: 0000000000000040
RIP: 0010:set_next_entity+0x9/0x65
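For triage, the key fields of the oops can be pulled out of the serial console output with a few lines of Python. This is only a sketch: the function name summarize_oops and the SAMPLE excerpt are illustrative, and the regular expressions assume the log layout shown above. As far as I can tell, the marked bytes in the Code line (83 7e 40 00) decode to `cmpl $0x0,0x40(%rsi)`, and RSI is 0 in the register dump, which matches the reported fault address of 0x40.

```python
import re


def summarize_oops(console_text):
    """Extract the fault address, RIP symbol and RSI value from the first oops found."""
    addr = re.search(r"NULL pointer dereference, address: ([0-9a-fA-F]+)", console_text)
    rip = re.search(r"RIP: [0-9a-fA-F]+:(\S+)", console_text)
    rsi = re.search(r"RSI: ([0-9a-fA-F]+)", console_text)
    return {
        "address": int(addr.group(1), 16) if addr else None,
        "rip": rip.group(1) if rip else None,
        "rsi": int(rsi.group(1), 16) if rsi else None,
    }


# Excerpt copied from the log above; a real run would read the whole
# system.pc.com_1.device file from the gem5 output directory instead.
SAMPLE = """BUG: kernel NULL pointer dereference, address: 0000000000000040
RIP: 0010:set_next_entity+0x9/0x65
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888238a26480
"""

if __name__ == "__main__":
    summary = summarize_oops(SAMPLE)
    # The faulting instruction reads at offset 0x40 from RSI, so the two should agree.
    print(summary, summary["address"] == summary["rsi"] + 0x40)
```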



In addition, I am sure that lulesh2.0 was running successfully in the guest 
system while the checkpoint was being generated, according to the output of 
lulesh's -p option.


Can anyone tell me what I should do?
I would really appreciate your help!


Best,
Yongjie

_______________________________________________
gem5-users mailing list -- gem5-users@gem5.org
To unsubscribe send an email to gem5-users-le...@gem5.org
