Re: Display update issue on M1 Macs

Akihiko Odaki Mon, 30 Jan 2023 23:39:07 -0800

On 2023/01/31 8:58, BALATON Zoltan wrote:

On Sat, 28 Jan 2023, Akihiko Odaki wrote:
On 2023/01/23 8:28, BALATON Zoltan wrote:
On Thu, 19 Jan 2023, Akihiko Odaki wrote:
On 2023/01/15 3:11, BALATON Zoltan wrote:
On Sat, 14 Jan 2023, Akihiko Odaki wrote:
On 2023/01/13 22:43, BALATON Zoltan wrote:
On Thu, 5 Jan 2023, BALATON Zoltan wrote:
Hello,
I got reports from several users trying to run AmigaOS4 on sam460ex on Apple silicon Macs that they get missing graphics that I can't reproduce on x86_64. With help from the users who get the problem we've narrowed it down to the following:
It looks like data written to the sm501's RAM in qemu/hw/display/sm501.c::sm501_2d_operation() is then not seen from sm501_update_display() in the same file. The sm501_2d_operation() function is called when the guest accesses the emulated card so it may run in a different thread than sm501_update_display(), which is called by the ui backend, but I'm not sure how QEMU calls these. Is device code running in iothread and display update in main thread? The problem is also independent of the display backend and was reproduced with both -display cocoa and -display sdl.
We have confirmed it's not the pixman routines that sm501_2d_operation() uses as the same issue is seen also with QEMU 4.x where pixman wasn't used and with all versions up to 7.2 so it's also not some bisectable change in QEMU. It also happens with --enable-debug so it doesn't seem to be related to optimisation either and I don't get it on x86_64 but even x86_64 QEMU builds run on Apple M1 with Rosetta 2 show the problem. It also only seems to affect graphics written from sm501_2d_operation() which AmigaOS4 uses extensively but other OSes don't and just render graphics with the vcpu which work without problem also on the M1 Macs that show this problem with AmigaOS4. Theoretically this could be some missing synchronisation which is something ARM and PPC may need while x86 doesn't but I don't know if this is really the reason (and if so where and how to fix it). Any idea what may cause this and what could be a fix to try?
Any idea anyone? At least some explanation if the above is plausible or if there's an option to disable the iothread and run everything in a single thread to verify the theory could help. I've got reports from at least 3 people getting this problem but I can't do much to fix it without some help.
(Info on how to run it is here:
http://zero.eik.bme.hu/~balaton/qemu/amiga/#amigaos
but AmigaOS4 is not freely distributable so it's a bit hard to reproduce. Some Linux X servers that support sm501/sm502 may also use the card's 2d engine but I don't know about any live CDs that readily run on sam460ex.)
Thank you,
BALATON Zoltan
Sorry, I missed the email.
Indeed the ui backend should call sm501_update_display() in the main thread, which should be different from the thread calling sm501_2d_operation(). However, if I understand it correctly, both of the functions should be called with iothread lock held so there should be no race condition in theory.
But there is an exception: memory_region_snapshot_and_clear_dirty() releases iothread lock, and that broke raspi3b display device:
https://lore.kernel.org/qemu-devel/CAFEAcA9odnPo2LPip295Uztri7JfoVnQbkJ=wn+k8dqneb_...@mail.gmail.com/T/
It is unexpected that gfx_update() callback releases iothread lock so it may break things in peculiar ways.
Peter, is there any change in the situation regarding the race introduced by memory_region_snapshot_and_clear_dirty()?
For now, to work around the issue, I think you can create another mutex and make the entire sm501_2d_engine_write() and sm501_update_display() critical sections.
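
The workaround suggested above can be sketched outside QEMU roughly as follows. This is only an illustration under stated assumptions: the buffer vram, the mutex sm501_lock and the functions engine_write(), update_display() and run_demo() are made-up stand-ins, not the real sm501 code, and plain pthread calls are used instead of QEMU's own mutex wrappers.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the sm501 local memory (not QEMU's real state). */
#define VRAM_SIZE 4096
static uint8_t vram[VRAM_SIZE];

/* The extra mutex from the suggestion; the name is invented for this sketch. */
static pthread_mutex_t sm501_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for sm501_2d_engine_write(): the whole write is one critical section. */
static void *engine_write(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&sm501_lock);
    memset(vram, 0xAB, VRAM_SIZE);
    pthread_mutex_unlock(&sm501_lock);
    return NULL;
}

/* Stand-in for sm501_update_display(): counts the bytes the writer has filled,
 * again as one critical section under the same mutex. */
static size_t update_display(void)
{
    size_t seen = 0;
    pthread_mutex_lock(&sm501_lock);
    for (size_t i = 0; i < VRAM_SIZE; i++) {
        seen += (vram[i] == 0xAB);
    }
    pthread_mutex_unlock(&sm501_lock);
    return seen;
}

/* Starts the writer on its own thread and reads once while it may still be
 * running. With both sections under the mutex, the racing read observes
 * either nothing or the complete write, and the read after the join
 * observes everything. Returns 1 if both conditions hold. */
static int run_demo(void)
{
    pthread_t writer;
    int rc = pthread_create(&writer, NULL, engine_write, NULL);
    assert(rc == 0);
    (void)rc;
    size_t racing = update_display();
    pthread_join(writer, NULL);
    size_t final = update_display();
    return (racing == 0 || racing == VRAM_SIZE) && final == VRAM_SIZE;
}
```

Calling run_demo() from a test main should return 1. In QEMU itself the lock would have to be taken in the two real functions, and whether that is enough depends on the race actually being the cause.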
Interesting thread but not sure it's the same problem so this workaround may not be enough to fix my issue. Here's a video posted by one of the people who reported it showing the problem on M1 Mac:
https://www.youtube.com/watch?v=FDqoNbp6PQs

and here's how it looks on other machines:

https://www.youtube.com/watch?v=ML7-F4HNFKQ
There are also videos showing it running on RPi 4 and G5 Mac without this issue so it seems to only happen on Apple Silicon M1 Macs. What's strange is that graphics elements are not just delayed which I think should happen with missing thread synchronisation where the update callback would miss some pixels rendered during its running but subsequent update callbacks would eventually draw those, wouldn't they? Also setting full_update to 1 in sm501_update_display() callback to disable dirty tracking does not fix the problem. So it looks as if sm501_2d_operation() running on one CPU core only writes data to the local cache of that core which sm501_update_display() running on another core can't see, so maybe some cache synchronisation is needed in memory_region_set_dirty() or if that's already there maybe I should call it for all changes not only those in the visible display area? I'm still not sure I understand the problem and don't know what could be a fix for it so anything to test to identify the issue better might also bring us closer to a solution.
Regards,
BALATON Zoltan
If you set full_update to 1, you may also comment out memory_region_snapshot_and_clear_dirty() and memory_region_snapshot_get_dirty() to avoid the iothread mutex being unlocked. The iothread mutex should ensure cache coherency as well.
But as you say, it's weird that the rendered result is not just delayed but missed. That may imply other possibilities (e.g., the results are overwritten by someone else). If the problem persists after commenting out memory_region_snapshot_and_clear_dirty() and memory_region_snapshot_get_dirty(), I think you can assume the inter-thread coherency between sm501_2d_operation() and sm501_update_display() is not causing the problem.
I've asked people who reported and can reproduce it to test this but it did not change anything, so that confirms it's not that race condition but looks more like some cache inconsistency maybe. Any other ideas?
Regards,
BALATON Zoltan
I can come up with two important differences between x86 and Arm which can affect the execution of QEMU:
1. Memory model. Arm uses a memory model more relaxed than x86 so it is more sensitive to synchronization failures among threads.
2. Different instructions. TCG uses JIT so differences in instructions matter.
We should be able to exclude 1) as a potential cause of the problem. The iothread mutex should take care of race conditions and even cache coherency problems; a mutex includes memory barrier functionality.
Where is this barrier in QEMU code? Does this also ensure cache coherency between different cores or only memory sync in one core? From the testing I suspect it's probably not because of the weak ordering of ARM but something to do with different threads writing and reading the memory area. Is there a way to disable the separate vcpu thread and run everything in a single thread to verify this theory? (We only have one vcpu so it's not an MTTCG issue but something between the vcpu and main thread maybe.)

QEMU uses pthread_mutex for macOS, and pthread_mutex (or any sane mutex implementation for SMP systems) should also ensure memory synchronization across different cores.

That said, it is still possible that we miss something that prevents memory synchronization. Ideally the theory should be confirmed by experiments, but it is not easy with Mac.

The easiest option is to run QEMU/sam460ex on Linux on QEMU/hvf. Running the entire Linux system without the -smp option may be too slow so you may use the taskset command on Linux to pin the QEMU/sam460ex process to a particular vCPU. This is somewhat incomplete as virtualization interferes with caches and may hide problems or trigger other bugs. The difference of the operating systems is also concerning.
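
For reference, what taskset does can also be done from inside a program on Linux. The sketch below is an illustration only, not part of QEMU: the helper name pin_to_single_cpu() is made up, and it relies on the Linux-specific sched_getaffinity()/sched_setaffinity() calls from glibc, so it does not apply to macOS.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Hypothetical helper: restrict the calling thread to the first CPU it is
 * currently allowed to run on. Threads created afterwards inherit the mask.
 * Returns the number of CPUs left in the affinity mask, or -1 on error. */
static int pin_to_single_cpu(void)
{
    cpu_set_t allowed, one;

    /* Read the current mask so we only pick a CPU we may actually use. */
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) {
        return -1;
    }

    CPU_ZERO(&one);
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            CPU_SET(cpu, &one);
            break;
        }
    }

    if (sched_setaffinity(0, sizeof(one), &one) != 0) {
        return -1;
    }

    /* Read the mask back to confirm what the kernel applied. */
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) {
        return -1;
    }
    return CPU_COUNT(&allowed);
}
```

If it is called early in main(), before any other thread is started, everything that follows runs on one core, which is the condition the experiment needs.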

Another option is to use the taskset command on Asahi Linux. Installing Asahi Linux is easy, but uninstalling it is a bit complicated.

The m1n1 hypervisor from the Asahi Linux project allows restricting the CPUs in use, and I think it also allows changing the memory model to x86 TSO. Unlike QEMU/hvf on macOS, it is very minimalistic so its interference with e.g. caches is limited. It is very useful for debugging XNU or Linux, but hard to set up and requires another computer to control it.


Finally, you can patch XNU kernel, but this is obviously not easy.

For difference 2), you may try to use TCI. You can find details of TCI in tcg/tci/README.
This was tested, and TCI gave the same results, just much slower.
Common sense says, however, that the memory model is usually the cause of the problem when you see behavioral differences between x86 and Arm, and TCG should work fine with both x86 and Arm as they should have been tested well.
It's not only between x86 and ARM but also between different ARM CPUs it seems, as there are videos of this test case running on Raspberry Pi 4 but all QEMU versions failed on Apple M1 so maybe it's something specific to that CPU.

It is likely that the combination of Apple's microarchitecture and the Arm instruction set causes the problem. For example, even though the memory model in Arm is weaker than x86, such a difference may not surface depending on the design of the load/store unit or the size of the load/store buffers.

Fortunately macOS provides Rosetta 2 for x86 emulation on Apple M1, which makes it possible to compare x86 and Arm without worrying about the difference of the microarchitecture.


Regards,
Akihiko Odaki


Regards,
BALATON Zoltan
