Cédric Le Goater <c...@kaod.org> writes:
> On 6/29/22 20:24, Alex Bennée wrote:
>> Cédric Le Goater <c...@kaod.org> writes:
>>
>>> On 6/29/22 16:14, Alex Bennée wrote:
>>>> Cédric Le Goater <c...@kaod.org> writes:
>>>>
>>>>> On 6/24/22 18:50, Cédric Le Goater wrote:
>>>>>> On 6/23/22 20:43, Peter Delevoryas wrote:
>>>>>>>
>>>>>>>> On Jun 23, 2022, at 8:09 AM, Cédric Le Goater <c...@kaod.org> wrote:
>>>>>>>>
>>>>>>>> On 6/23/22 12:26, Peter Delevoryas wrote:
>>>>>>>>> Signed-off-by: Peter Delevoryas <p...@fb.com>
>>>>>>>>
>>>>>>>> Let's start simple without flash support. We should be able to
>>>>>>>> load FW blobs in each CPU address space using loader devices.
>>>>>>>
>>>>>>> Actually, I was unable to do this, perhaps because the fb OpenBMC
>>>>>>> boot sequence is a little weird. I specifically _needed_ to have
>>>>>>> a flash device which maps the firmware in at 0x2000_0000, because
>>>>>>> the fb OpenBMC U-Boot SPL jumps to that address to start executing
>>>>>>> from flash. I think this is also why fb OpenBMC machines can be so
>>>>>>> slow:
>>>>>>>
>>>>>>>     $ ./build/qemu-system-arm -machine fby35 \
>>>>>>>         -device loader,file=fby35.mtd,addr=0,cpu-num=0 -nographic \
>>>>>>>         -d int -drive file=fby35.mtd,format=raw,if=mtd
>>>>>>
>>>>>> Ideally we should be booting from the flash device directly, using
>>>>>> the machine option '-M ast2600-evb,execute-in-place=true', like the
>>>>>> HW does: instructions are fetched using SPI transfers. But the
>>>>>> amount of code generated is tremendous.
>>>>
>>>> Yeah, that is because there is a potential race when reading from HW,
>>>> so we throw away TBs after executing them: we have no way of knowing
>>>> if the underlying contents have changed under our feet. See
>>>> 873d64ac30 (accel/tcg: re-factor non-RAM execution code) which
>>>> cleaned up this handling.
>>>>
>>>>>> See some profiling below for a
>>>>>> run which barely reaches DRAM training in U-Boot.
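The cost asymmetry Alex describes (cached TBs versus TBs discarded after every execution) can be sketched with a toy model. The block counts below are illustrative assumptions, not measurements from the thread:

```python
# Rough cost model for cached vs. execute-in-place (XIP) translation.
# QEMU discards TBs translated from non-RAM (MMIO/flash) regions after
# executing them, so translation work scales with *executed* blocks,
# not with the number of unique blocks in the firmware.

def translation_work(executed_blocks: int, unique_blocks: int, cached: bool) -> int:
    """Number of block translations performed under each policy."""
    # Cached: each unique block is translated once and reused from the
    # TB cache. Uncached (XIP from MMIO): every execution retranslates.
    return unique_blocks if cached else executed_blocks

# Hypothetical boot loop: 1,000 unique blocks, each executed 700 times.
unique, executed = 1_000, 700_000
print(translation_work(executed, unique, cached=True))   # 1000
print(translation_work(executed, unique, cached=False))  # 700000
```

Under this model the translator does 700x more work in XIP mode for the same guest progress, which is consistent with the "tremendous" code generation observed above.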
>>>>>
>>>>> Some more profiling on both ast2500 and ast2600 machines shows:
>>>>>
>>>>> * ast2600-evb,execute-in-place=true :
>>>>>
>>>>>  Type       Object          Call site                Wait Time (s)     Count  Average (us)
>>>>>  -----------------------------------------------------------------------------------------
>>>>>  BQL mutex  0x564dc03922e0  accel/tcg/cputlb.c:1365       14.21443  32909927          0.43
>>>>
>>>> This is unavoidable, as a HW access needs the BQL held, so we go
>>>> through this cycle for every executed instruction.
>>>>
>>>> Did I miss why the flash contents are not mapped into the physical
>>>> address space? Isn't that how it appears to the processor?
>>>
>>> There are two modes:
>>>
>>>     if (ASPEED_MACHINE(machine)->mmio_exec) {
>>>         memory_region_init_alias(boot_rom, NULL, "aspeed.boot_rom",
>>>                                  &fl->mmio, 0, size);
>>>         memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
>>>                                     boot_rom);
>>>     } else {
>>>         memory_region_init_rom(boot_rom, NULL, "aspeed.boot_rom",
>>>                                size, &error_abort);
>>>         memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
>>>                                     boot_rom);
>>>         write_boot_rom(drive0, FIRMWARE_ADDR, size, &error_abort);
>>>     }
>>>
>>> The default boot mode uses the ROM: no issue there.
>>>
>>> The "execute-in-place=true" option creates an alias of the flash
>>> contents region, and each instruction is then fetched from the flash
>>> drive with SPI transactions. I wonder if we were keeping the
>>> generated code before?
>>>
>>> With old FW images, using an older U-Boot, the machine boots in a
>>> couple of seconds. See the profiling below for a witherspoon-bmc
>>> machine using U-Boot 2016.07.
>>>
>>>     qemu-system-arm -M witherspoon-bmc,execute-in-place=true \
>>>         -drive file=./flash-witherspoon-bmc,format=raw,if=mtd \
>>>         -drive file=./flash-witherspoon-bmc2,format=raw,if=mtd \
>>>         -nographic -nodefaults -snapshot -serial mon:stdio \
>>>         -enable-sync-profile
>>> ...
>>> U-Boot 2016.07-00040-g8425e96e2e27-dirty (Jun 24 2022 - 23:21:57 +0200)
>>> Watchdog enabled
>>> DRAM:  496 MiB
>>> Flash: 32 MiB
>>> In:    serial
>>> Out:   serial
>>> Err:   serial
>>> Net:
>>>
>>> (qemu) info sync-profile
>>>  Type       Object          Call site                Wait Time (s)     Count  Average (us)
>>>  -----------------------------------------------------------------------------------------
>>>  BQL mutex  0x56189610b2e0  accel/tcg/cputlb.c:1365        0.25311  12346237          0.02
>>>  condvar    0x5618970cf220  softmmu/cpus.c:423             0.05506         2      27530.78
>>>  BQL mutex  0x56189610b2e0  util/rcu.c:269                 0.04709         2      23544.26
>>>  condvar    0x561896d0fc78  util/thread-pool.c:90          0.01340        83        161.47
>>>  condvar    0x56189610b240  softmmu/cpus.c:571             0.00005         1         54.93
>>>  condvar    0x56189610b280  softmmu/cpus.c:642             0.00003         1         32.88
>>>  BQL mutex  0x56189610b2e0  util/main-loop.c:318           0.00003        34          0.76
>>>  mutex      0x561896eade00  tcg/region.c:204               0.00002       995          0.02
>>>  rec_mutex  [           2]  util/async.c:682               0.00002       493          0.03
>>>  mutex      [           2]  chardev/char.c:118             0.00001       404          0.03
>>>  -----------------------------------------------------------------------------------------
>>>
>>> However, with recent U-Boots it takes quite a while to reach DRAM
>>> training: close to a minute. See the profiling below for an
>>> ast2500-evb machine using U-Boot 2019.04.
>>>
>>>     qemu-system-arm -M ast2500-evb,execute-in-place=true \
>>>         -net nic,macaddr=C0:FF:EE:00:00:03,netdev=net0 \
>>>         -drive file=./flash-ast2500-evb,format=raw,if=mtd \
>>>         -nographic -nodefaults -snapshot -serial mon:stdio \
>>>         -enable-sync-profile
>>>
>>> qemu-system-arm: warning: Aspeed iBT has no chardev backend
>>> qemu-system-arm: warning: nic ftgmac100.1 has no peer
>>> QEMU 7.0.50 monitor - type 'help' for more information
>>>
>>> U-Boot 2019.04-00080-g6ca27db3f97b-dirty (Jun 24 2022 - 23:22:03 +0200)
>>>
>>> SOC : AST2500-A1
>>> RST : Power On
>>> LPC Mode : SIO:Enable : SuperIO-2e
>>> Eth : MAC0: RGMII, , MAC1: RGMII,
>>> Model: AST2500 EVB
>>> DRAM:  448 MiB (capacity:512 MiB, VGA:64 MiB, ECC:off)
>>> MMC:   sdhci_slot0@100: 0, sdhci_slot1@200: 1
>>> Loading Environment from SPI Flash... SF: Detected mx25l25635e with
>>> page size 256 Bytes, erase size 64 KiB, total 32 MiB
>>> *** Warning - bad CRC, using default environment
>>>
>>> In:    serial@1e784000
>>> Out:   serial@1e784000
>>> Err:   serial@1e784000
>>> Net:   eth0: ethernet@1e660000
>>> Warning: ethernet@1e680000 (eth1) using random MAC address -
>>> 4a:e5:9a:4a:c7:c5
>>> , eth1: ethernet@1e680000
>>> Hit any key to stop autoboot:  2
>>>
>>> (qemu) info sync-profile
>>>  Type       Object          Call site                Wait Time (s)     Count  Average (us)
>>>  -----------------------------------------------------------------------------------------
>>>  condvar    0x561f10c9ef88  util/thread-pool.c:90         10.01196        28     357570.00
>>>  BQL mutex  0x561f102362e0  accel/tcg/cputlb.c:1365        0.29496  14248621          0.02
>>>  condvar    0x561f110325a0  softmmu/cpus.c:423             0.02231         2      11152.57
>>>  BQL mutex  0x561f102362e0  util/rcu.c:269                 0.01447         4       3618.60
>>>  condvar    0x561f10236240  softmmu/cpus.c:571             0.00010         1        102.19
>>>  mutex      0x561f10e9f1c0  tcg/region.c:204               0.00007      3052          0.02
>>>  mutex      [           2]  chardev/char.c:118             0.00003      1486          0.02
>>>  condvar    0x561f10236280  softmmu/cpus.c:642             0.00003         1         29.38
>>>  BQL mutex  0x561f102362e0  accel/tcg/cputlb.c:1426        0.00002       973          0.02
>>>  BQL mutex  0x561f102362e0  util/main-loop.c:318           0.00001        34          0.41
>>>  -----------------------------------------------------------------------------------------
>>>
>>> Something in the layout of the FW is making a big difference. One
>>> thing that could be relevant is that the recent versions use a
>>> device tree.
>>>
>>> There might be no good solution to this issue, but I am failing to
>>> analyze it correctly. Is there a way to collect information on the
>>> usage of Translation Blocks?
>>
>> You could expand the data we collect in tb_tree_stats and expose it
>> via info jit.

> The "fast" run, U-Boot 2016.07, gives:
>
> Translation buffer state:
> gen code size       254880371/1073736704
> TB count            1089
> TB avg target size  16 max=356 bytes
> TB avg host size    278 bytes (expansion ratio: 17.2)
> cross page TB count 0 (0%)
> direct jump count   501 (46%) (2 jumps=372 34%)
> TB hash buckets     1025/8192 (12.51% head buckets used)
> TB hash occupancy   3.32% avg chain occ. Histogram: [0.0,7.5)%|█ ▁ ▁ ▁|[67.5,75.0]%
> TB hash avg chain   1.000 buckets. Histogram: 1|█|1
>
> Statistics:
> TB flush count      0
> TB invalidate count 0
> TLB full flushes    0
> TLB partial flushes 2
> TLB elided flushes  2338
> JIT cycles          2221788409 (0.926 s at 2.4 GHz)
> translated TBs      738520 (aborted=0 0.0%)
> avg ops/TB          15.7 max=459
> deleted ops/TB      2.72
> avg temps/TB        32.89 max=88
> avg host code/TB    113.7
> avg search data/TB  5.2
> cycles/op           192.0
> cycles/in byte      748.7
> cycles/out byte     26.4
> cycles/search byte  582.8
> gen_interm time     57.6%
> gen_code time       42.4%
> optim./code time    19.4%
> liveness/code time  26.1%
> cpu_restore count   0
> avg cycles          0.0
>
> and the "slow" run, U-Boot 2019.04:
>
> Translation buffer state:
> gen code size       368603795/1073736704
> TB count            3052
> TB avg target size  16 max=360 bytes
> TB avg host size    293 bytes (expansion ratio: 17.6)
> cross page TB count 0 (0%)
> direct jump count   1431 (46%) (2 jumps=1104 36%)
> TB hash buckets     2559/8192 (31.24% head buckets used)
> TB hash occupancy   9.31% avg chain occ. Histogram: [0,10)%|█ ▃ ▁ ▁ ▁|[90,100]%
> TB hash avg chain   1.000 buckets. Histogram: 1|█|1
>
> Statistics:
> TB flush count      3

Something triggers 3 complete TB flushes in the slow run (3 x ~1000 live
TBs would account for the ~3000 final TB count). This could be the
translation buffer overflowing, but:

  fast: gen code size 254880371/1073736704
  slow: gen code size 368603795/1073736704

doesn't look like we are getting anywhere close to overflowing. It must
be something else.

> TB invalidate count 0
> TLB full flushes    0
> TLB partial flushes 3
> TLB elided flushes  2367
> JIT cycles          26479044772 (11.033 s at 2.4 GHz)
> translated TBs      10552169 (aborted=0 0.0%)
> avg ops/TB          15.0 max=464
> deleted ops/TB      2.44
> avg temps/TB        32.43 max=89
> avg host code/TB    99.0
> avg search data/TB  5.0
> cycles/op           167.7
> cycles/in byte      626.8
> cycles/out byte     25.4
> cycles/search byte  499.4
> gen_interm time     50.4%
> gen_code time       49.6%
> optim./code time    19.5%
> liveness/code time  27.7%
> cpu_restore count   0
> avg cycles          0.0
>
> A lot more TBs.
>
> C.

-- 
Alex Bennée
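As a quick cross-check, the "info jit" figures quoted in the thread can be reproduced from the raw counters. This reading assumes "translated TBs" counts every translation ever performed while "TB count" is the number currently live, and uses the 2.4 GHz clock that QEMU's own report assumes:

```python
# Values copied verbatim from the "info jit" output quoted in the thread.
runs = {
    # name: (translated_tbs, live_tbs, jit_cycles)
    "fast (U-Boot 2016.07)": (738_520, 1_089, 2_221_788_409),
    "slow (U-Boot 2019.04)": (10_552_169, 3_052, 26_479_044_772),
}

for name, (translated, live, cycles) in runs.items():
    # Seconds spent in the JIT, at the 2.4 GHz clock QEMU's report uses.
    secs = cycles / 2.4e9
    # Average number of times each currently-live TB was (re)translated.
    churn = translated / live
    print(f"{name}: {secs:.3f} s JIT, ~{churn:.0f} translations per live TB")
```

This reproduces the reported 0.926 s and 11.033 s JIT times, and shows roughly 678 translations per live TB in the fast run versus roughly 3457 in the slow one, which quantifies the "a lot more TBs" observation.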