[Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
On 07/23/2010 01:02 PM, Stefan Hajnoczi wrote:
>> In fact, we solved this problem with a really simple method. In our prototype, we removed this piece of code:
>>
>> void *qemu_get_ram_ptr(ram_addr_t addr)
>> {
>>     ...
>>
>>     /* Move this entry to the start of the list. */
>> #ifndef CONFIG_COREMU
>>     /* Different cores can call this function at the same time.
>>      * For COREMU, disable this optimization to avoid a data race.
>>      * XXX or use a spin lock here if the performance impact is big. */
>>     if (prev) {
>>         prev->next = block->next;
>>         block->next = *prevp;
>>         *prevp = block;
>>     }
>> #endif
>>     return block->host + (addr - block->offset);
>> }
>>
>> CONFIG_COREMU is defined when TCG parallel mode is configured. And the list is mostly read-only when no device is hot-plugged, so we don't use a lock to protect it. Reimplementing the list as a lock-free list would also be reasonable, but seems unnecessary. :-)
>
> Ah, good :).

For this one in particular, you could just use circular lists (without a "head" node, unlike the Linux kernel's list data type, as there's always a RAM entry) and start iteration at "prev".

Paolo
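Paolo's circular-list idea can be sketched roughly as follows. This is an illustrative, simplified model, not QEMU's actual code: the RAMBlock fields and the names find_ram_block/ram_list_start are stand-ins. The point is that with a circular list and no head node, the MRU optimization becomes a plain store to the search-start pointer rather than a relink, so a racing reader at worst starts its walk from a stale block and still terminates.

```c
/* Sketch of a circular RAM block list with an MRU start pointer.
 * Simplified stand-in types; not QEMU's real structures. */
#include <assert.h>
#include <stddef.h>

typedef struct RAMBlock {
    struct RAMBlock *next;   /* circular: the last block points back to the first */
    unsigned long offset;
    unsigned long length;
} RAMBlock;

/* "MRU" state is a single pointer; updating it is one word store, so a
 * concurrent lookup never observes a half-relinked chain. */
static RAMBlock *ram_list_start;

static RAMBlock *find_ram_block(unsigned long addr)
{
    RAMBlock *block = ram_list_start;

    do {
        /* Unsigned subtraction wraps for addr < offset, failing the test. */
        if (addr - block->offset < block->length) {
            ram_list_start = block;   /* start here next time (MRU) */
            return block;
        }
        block = block->next;
    } while (block != ram_list_start);

    return NULL;                      /* address not backed by RAM */
}
```

Strictly speaking, unsynchronized concurrent access to ram_list_start is still a C-level data race and would want an atomic pointer in real code, but the structural point stands: no reader can ever see a broken chain.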
Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
2010/7/23 wang Tiger:
> On 2010-07-23 at 5:13 PM, Stefan Hajnoczi wrote:
>> 2010/7/23 Alexander Graf:
>>> On 23.07.2010, at 09:53, Jan Kiszka wrote:
>>>> wang Tiger wrote:
>>>>> On 2010-07-22 at 11:47 PM, Stefan Hajnoczi wrote:
>>>>>> 2010/7/22 wang Tiger:
>>>>>>> In our implementation for the x86_64 target, all devices except the LAPIC are emulated in a separate thread. VCPUs are emulated in other threads (one thread per VCPU). By observing some device drivers in Linux, we have a hypothesis that drivers in the OS have already ensured correct synchronization on concurrent hardware accesses.
>>>>>> This hypothesis is too optimistic. If hardware emulation code assumes it is only executed in a single-threaded fashion, but guests can execute it in parallel, then this opens up the possibility of race conditions that malicious guests can exploit. There needs to be isolation: a guest should not be able to cause QEMU to crash.
>>>>> In our prototype, we assume the guest behaves correctly. If hardware emulation code can ensure atomic access (behaving like real hardware), VCPUs can access devices freely. We actually refined some hardware emulation code (e.g. BMDMA, IOAPIC) to ensure the atomicity of hardware access.
>>>> This approach is surely helpful for a prototype to explore the limits. But it's not applicable to production systems. It would create a huge source of potential subtle regressions for other guest OSes, specifically those that you cannot analyze regarding synchronized hardware access. We must play safe.
>>>>
>>>> That's why we currently have the global mutex. Its conversion can only happen step-wise, e.g. by establishing an infrastructure to declare the need of device models for that Big Lock. Then you can start converting individual models to private locks or even smart lock-less patterns.
>>> But isn't that independent from making TCG atomic-capable and parallel? At that point a TCG vCPU would have the exact same issues and interfaces as a KVM vCPU, right? And then we can tackle the concurrent device access issues together.
>> An issue that might affect COREMU today is core QEMU subsystems that are not thread-safe and are used from hardware emulation, for example: cpu_physical_memory_read/write() to RAM will use qemu_get_ram_ptr(). This function moves the found RAMBlock to the head of the global RAM block list in a non-atomic way. Therefore, two unrelated hardware devices executing cpu_physical_memory_*() simultaneously face a race condition. I have seen this happen when playing with parallel hardware emulation.
>>
>> Tiger: If you are only locking the hardware thread for the ARM target, your hardware emulation is not safe for other targets. Have I missed something in the COREMU patch that defends against this problem?
>>
>> Stefan
> In fact, we solved this problem with a really simple method. In our prototype, we removed this piece of code:
>
> void *qemu_get_ram_ptr(ram_addr_t addr)
> {
>     ...
>
>     /* Move this entry to the start of the list. */
> #ifndef CONFIG_COREMU
>     /* Different cores can call this function at the same time.
>      * For COREMU, disable this optimization to avoid a data race.
>      * XXX or use a spin lock here if the performance impact is big. */
>     if (prev) {
>         prev->next = block->next;
>         block->next = *prevp;
>         *prevp = block;
>     }
> #endif
>     return block->host + (addr - block->offset);
> }
>
> CONFIG_COREMU is defined when TCG parallel mode is configured. And the list is mostly read-only when no device is hot-plugged, so we don't use a lock to protect it. Reimplementing the list as a lock-free list would also be reasonable, but seems unnecessary. :-)

Ah, good :).

Stefan

> --
> Zhaoguo Wang, Parallel Processing Institute, Fudan University
> Address: Room 320, Software Building, 825 Zhangheng Road, Shanghai, China
> tigerwang1...@gmail.com
> http://ppi.fudan.edu.cn/zhaoguo_wang
Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
On 2010-07-23 at 5:13 PM, Stefan Hajnoczi wrote:
> 2010/7/23 Alexander Graf:
>> On 23.07.2010, at 09:53, Jan Kiszka wrote:
>>> wang Tiger wrote:
>>>> On 2010-07-22 at 11:47 PM, Stefan Hajnoczi wrote:
>>>>> 2010/7/22 wang Tiger:
>>>>>> In our implementation for the x86_64 target, all devices except the LAPIC are emulated in a separate thread. VCPUs are emulated in other threads (one thread per VCPU). By observing some device drivers in Linux, we have a hypothesis that drivers in the OS have already ensured correct synchronization on concurrent hardware accesses.
>>>>> This hypothesis is too optimistic. If hardware emulation code assumes it is only executed in a single-threaded fashion, but guests can execute it in parallel, then this opens up the possibility of race conditions that malicious guests can exploit. There needs to be isolation: a guest should not be able to cause QEMU to crash.
>>>> In our prototype, we assume the guest behaves correctly. If hardware emulation code can ensure atomic access (behaving like real hardware), VCPUs can access devices freely. We actually refined some hardware emulation code (e.g. BMDMA, IOAPIC) to ensure the atomicity of hardware access.
>>> This approach is surely helpful for a prototype to explore the limits. But it's not applicable to production systems. It would create a huge source of potential subtle regressions for other guest OSes, specifically those that you cannot analyze regarding synchronized hardware access. We must play safe.
>>>
>>> That's why we currently have the global mutex. Its conversion can only happen step-wise, e.g. by establishing an infrastructure to declare the need of device models for that Big Lock. Then you can start converting individual models to private locks or even smart lock-less patterns.
>> But isn't that independent from making TCG atomic-capable and parallel? At that point a TCG vCPU would have the exact same issues and interfaces as a KVM vCPU, right? And then we can tackle the concurrent device access issues together.
> An issue that might affect COREMU today is core QEMU subsystems that are not thread-safe and are used from hardware emulation, for example: cpu_physical_memory_read/write() to RAM will use qemu_get_ram_ptr(). This function moves the found RAMBlock to the head of the global RAM block list in a non-atomic way. Therefore, two unrelated hardware devices executing cpu_physical_memory_*() simultaneously face a race condition. I have seen this happen when playing with parallel hardware emulation.
>
> Tiger: If you are only locking the hardware thread for the ARM target, your hardware emulation is not safe for other targets. Have I missed something in the COREMU patch that defends against this problem?
>
> Stefan

In fact, we solved this problem with a really simple method. In our prototype, we removed this piece of code:

void *qemu_get_ram_ptr(ram_addr_t addr)
{
    ...

    /* Move this entry to the start of the list. */
#ifndef CONFIG_COREMU
    /* Different cores can call this function at the same time.
     * For COREMU, disable this optimization to avoid a data race.
     * XXX or use a spin lock here if the performance impact is big. */
    if (prev) {
        prev->next = block->next;
        block->next = *prevp;
        *prevp = block;
    }
#endif
    return block->host + (addr - block->offset);
}

CONFIG_COREMU is defined when TCG parallel mode is configured. And the list is mostly read-only when no device is hot-plugged, so we don't use a lock to protect it. Reimplementing the list as a lock-free list would also be reasonable, but seems unnecessary. :-)

--
Zhaoguo Wang, Parallel Processing Institute, Fudan University
Address: Room 320, Software Building, 825 Zhangheng Road, Shanghai, China
tigerwang1...@gmail.com
http://ppi.fudan.edu.cn/zhaoguo_wang
[Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
On 2010-07-23 at 3:53 PM, Jan Kiszka wrote:
> wang Tiger wrote:
>> On 2010-07-22 at 11:47 PM, Stefan Hajnoczi wrote:
>>> 2010/7/22 wang Tiger:
>>>> In our implementation for the x86_64 target, all devices except the LAPIC are emulated in a separate thread. VCPUs are emulated in other threads (one thread per VCPU). By observing some device drivers in Linux, we have a hypothesis that drivers in the OS have already ensured correct synchronization on concurrent hardware accesses.
>>> This hypothesis is too optimistic. If hardware emulation code assumes it is only executed in a single-threaded fashion, but guests can execute it in parallel, then this opens up the possibility of race conditions that malicious guests can exploit. There needs to be isolation: a guest should not be able to cause QEMU to crash.
>> In our prototype, we assume the guest behaves correctly. If hardware emulation code can ensure atomic access (behaving like real hardware), VCPUs can access devices freely. We actually refined some hardware emulation code (e.g. BMDMA, IOAPIC) to ensure the atomicity of hardware access.
> This approach is surely helpful for a prototype to explore the limits. But it's not applicable to production systems. It would create a huge source of potential subtle regressions for other guest OSes, specifically those that you cannot analyze regarding synchronized hardware access. We must play safe.
>
> That's why we currently have the global mutex. Its conversion can only happen step-wise, e.g. by establishing an infrastructure to declare the need of device models for that Big Lock. Then you can start converting individual models to private locks or even smart lock-less patterns.
>
> Jan

I agree with you on this point. The approach we used is really helpful for a research prototype, but it needs a lot of work to become applicable to production systems. It's my pleasure if we can tackle this issue together.

--
Zhaoguo Wang, Parallel Processing Institute, Fudan University
Address: Room 320, Software Building, 825 Zhangheng Road, Shanghai, China
tigerwang1...@gmail.com
http://ppi.fudan.edu.cn/zhaoguo_wang
Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
Stefan Hajnoczi wrote:
> 2010/7/23 Alexander Graf:
>> On 23.07.2010, at 09:53, Jan Kiszka wrote:
>>> wang Tiger wrote:
>>>> On 2010-07-22 at 11:47 PM, Stefan Hajnoczi wrote:
>>>>> 2010/7/22 wang Tiger:
>>>>>> In our implementation for the x86_64 target, all devices except the LAPIC are emulated in a separate thread. VCPUs are emulated in other threads (one thread per VCPU). By observing some device drivers in Linux, we have a hypothesis that drivers in the OS have already ensured correct synchronization on concurrent hardware accesses.
>>>>> This hypothesis is too optimistic. If hardware emulation code assumes it is only executed in a single-threaded fashion, but guests can execute it in parallel, then this opens up the possibility of race conditions that malicious guests can exploit. There needs to be isolation: a guest should not be able to cause QEMU to crash.
>>>> In our prototype, we assume the guest behaves correctly. If hardware emulation code can ensure atomic access (behaving like real hardware), VCPUs can access devices freely. We actually refined some hardware emulation code (e.g. BMDMA, IOAPIC) to ensure the atomicity of hardware access.
>>> This approach is surely helpful for a prototype to explore the limits. But it's not applicable to production systems. It would create a huge source of potential subtle regressions for other guest OSes, specifically those that you cannot analyze regarding synchronized hardware access. We must play safe.
>>>
>>> That's why we currently have the global mutex. Its conversion can only happen step-wise, e.g. by establishing an infrastructure to declare the need of device models for that Big Lock. Then you can start converting individual models to private locks or even smart lock-less patterns.
>> But isn't that independent from making TCG atomic-capable and parallel? At that point a TCG vCPU would have the exact same issues and interfaces as a KVM vCPU, right? And then we can tackle the concurrent device access issues together.
> An issue that might affect COREMU today is core QEMU subsystems that are not thread-safe and are used from hardware emulation, for example: cpu_physical_memory_read/write() to RAM will use qemu_get_ram_ptr(). This function moves the found RAMBlock to the head of the global RAM block list in a non-atomic way. Therefore, two unrelated hardware devices executing cpu_physical_memory_*() simultaneously face a race condition. I have seen this happen when playing with parallel hardware emulation.

Those issues need to be identified and, in a first step, worked around by holding dedicated locks or just the global mutex. Maybe the above conflict can also be resolved directly by creating per-VCPU lookup lists (likely more efficient than stepping on other VCPUs' toes by constantly reordering a global list). Likely a good example for a self-contained preparatory patch.

However, getting concurrency right is tricky enough. We should really be careful not to turn too much upside down in a rush. Even if TCG has some deeper hooks into the device model or thread-unsafe core parts than KVM does, parallelizing it can and should remain a separate topic. And we also have to keep an eye on performance when somewhat fewer than 255 VCPUs are emulated.

Jan
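Jan's per-VCPU lookup idea could look roughly like this. It is a sketch under the assumption that the shared block list only changes under the global lock (e.g. during hotplug); the type and names (RAMBlock, lookup_ram_block, mru_block) are illustrative stand-ins, not QEMU's actual code. Each VCPU thread keeps a private most-recently-used cache in front of the shared, read-mostly list, so the hot path never writes shared state.

```c
/* Sketch: per-thread MRU cache in front of a shared, read-mostly list.
 * Simplified stand-in types; not QEMU's real structures. */
#include <assert.h>
#include <stddef.h>

typedef struct RAMBlock {
    struct RAMBlock *next;
    unsigned long offset;
    unsigned long length;
} RAMBlock;

static RAMBlock *ram_blocks;          /* shared; written only at (un)plug time */
static __thread RAMBlock *mru_block;  /* per-VCPU thread: no cross-thread writes */

static RAMBlock *lookup_ram_block(unsigned long addr)
{
    RAMBlock *block = mru_block;

    /* Fast path: this thread's last hit. */
    if (block && addr - block->offset < block->length)
        return block;

    /* Slow path: walk the shared list strictly read-only. */
    for (block = ram_blocks; block; block = block->next) {
        if (addr - block->offset < block->length) {
            mru_block = block;        /* private write, so no data race */
            return block;
        }
    }
    return NULL;
}
```

Compared with reordering a global list, the only state the lookup mutates is thread-local, which is exactly why it would make a self-contained preparatory patch.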
Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
2010/7/23 Alexander Graf:
> On 23.07.2010, at 09:53, Jan Kiszka wrote:
>> wang Tiger wrote:
>>> On 2010-07-22 at 11:47 PM, Stefan Hajnoczi wrote:
>>>> 2010/7/22 wang Tiger:
>>>>> In our implementation for the x86_64 target, all devices except the LAPIC are emulated in a separate thread. VCPUs are emulated in other threads (one thread per VCPU). By observing some device drivers in Linux, we have a hypothesis that drivers in the OS have already ensured correct synchronization on concurrent hardware accesses.
>>>> This hypothesis is too optimistic. If hardware emulation code assumes it is only executed in a single-threaded fashion, but guests can execute it in parallel, then this opens up the possibility of race conditions that malicious guests can exploit. There needs to be isolation: a guest should not be able to cause QEMU to crash.
>>> In our prototype, we assume the guest behaves correctly. If hardware emulation code can ensure atomic access (behaving like real hardware), VCPUs can access devices freely. We actually refined some hardware emulation code (e.g. BMDMA, IOAPIC) to ensure the atomicity of hardware access.
>> This approach is surely helpful for a prototype to explore the limits. But it's not applicable to production systems. It would create a huge source of potential subtle regressions for other guest OSes, specifically those that you cannot analyze regarding synchronized hardware access. We must play safe.
>>
>> That's why we currently have the global mutex. Its conversion can only happen step-wise, e.g. by establishing an infrastructure to declare the need of device models for that Big Lock. Then you can start converting individual models to private locks or even smart lock-less patterns.
> But isn't that independent from making TCG atomic-capable and parallel? At that point a TCG vCPU would have the exact same issues and interfaces as a KVM vCPU, right? And then we can tackle the concurrent device access issues together.

An issue that might affect COREMU today is core QEMU subsystems that are not thread-safe and are used from hardware emulation, for example: cpu_physical_memory_read/write() to RAM will use qemu_get_ram_ptr(). This function moves the found RAMBlock to the head of the global RAM block list in a non-atomic way. Therefore, two unrelated hardware devices executing cpu_physical_memory_*() simultaneously face a race condition. I have seen this happen when playing with parallel hardware emulation.

Tiger: If you are only locking the hardware thread for the ARM target, your hardware emulation is not safe for other targets. Have I missed something in the COREMU patch that defends against this problem?

Stefan
Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
On 23.07.2010, at 09:53, Jan Kiszka wrote:
> wang Tiger wrote:
>> On 2010-07-22 at 11:47 PM, Stefan Hajnoczi wrote:
>>> 2010/7/22 wang Tiger:
>>>> In our implementation for the x86_64 target, all devices except the LAPIC are emulated in a separate thread. VCPUs are emulated in other threads (one thread per VCPU). By observing some device drivers in Linux, we have a hypothesis that drivers in the OS have already ensured correct synchronization on concurrent hardware accesses.
>>> This hypothesis is too optimistic. If hardware emulation code assumes it is only executed in a single-threaded fashion, but guests can execute it in parallel, then this opens up the possibility of race conditions that malicious guests can exploit. There needs to be isolation: a guest should not be able to cause QEMU to crash.
>> In our prototype, we assume the guest behaves correctly. If hardware emulation code can ensure atomic access (behaving like real hardware), VCPUs can access devices freely. We actually refined some hardware emulation code (e.g. BMDMA, IOAPIC) to ensure the atomicity of hardware access.
> This approach is surely helpful for a prototype to explore the limits. But it's not applicable to production systems. It would create a huge source of potential subtle regressions for other guest OSes, specifically those that you cannot analyze regarding synchronized hardware access. We must play safe.
>
> That's why we currently have the global mutex. Its conversion can only happen step-wise, e.g. by establishing an infrastructure to declare the need of device models for that Big Lock. Then you can start converting individual models to private locks or even smart lock-less patterns.

But isn't that independent from making TCG atomic-capable and parallel? At that point a TCG vCPU would have the exact same issues and interfaces as a KVM vCPU, right? And then we can tackle the concurrent device access issues together.

Alex
[Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
wang Tiger wrote:
> On 2010-07-22 at 11:47 PM, Stefan Hajnoczi wrote:
>> 2010/7/22 wang Tiger:
>>> In our implementation for the x86_64 target, all devices except the LAPIC are emulated in a separate thread. VCPUs are emulated in other threads (one thread per VCPU). By observing some device drivers in Linux, we have a hypothesis that drivers in the OS have already ensured correct synchronization on concurrent hardware accesses.
>> This hypothesis is too optimistic. If hardware emulation code assumes it is only executed in a single-threaded fashion, but guests can execute it in parallel, then this opens up the possibility of race conditions that malicious guests can exploit. There needs to be isolation: a guest should not be able to cause QEMU to crash.
> In our prototype, we assume the guest behaves correctly. If hardware emulation code can ensure atomic access (behaving like real hardware), VCPUs can access devices freely. We actually refined some hardware emulation code (e.g. BMDMA, IOAPIC) to ensure the atomicity of hardware access.

This approach is surely helpful for a prototype to explore the limits. But it's not applicable to production systems. It would create a huge source of potential subtle regressions for other guest OSes, specifically those that you cannot analyze regarding synchronized hardware access. We must play safe.

That's why we currently have the global mutex. Its conversion can only happen step-wise, e.g. by establishing an infrastructure to declare the need of device models for that Big Lock. Then you can start converting individual models to private locks or even smart lock-less patterns.

Jan
Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
On 2010-07-22 at 11:47 PM, Stefan Hajnoczi wrote:
> 2010/7/22 wang Tiger:
>> On 2010-07-22 at 9:00 PM, Jan Kiszka wrote:
>>> Stefan Hajnoczi wrote:
>>>> On Thu, Jul 22, 2010 at 9:48 AM, Chen Yufei wrote:
>>>>> On 2010-07-22, at 1:04 AM, Stefan Weil wrote:
>>>>>> Am 21.07.2010 09:03, schrieb Chen Yufei:
>>>>>>> On 2010-07-21, at 5:43 AM, Blue Swirl wrote:
>>>>>>>> On Sat, Jul 17, 2010 at 10:27 AM, Chen Yufei wrote:
>>>>>>>>> We are pleased to announce COREMU, which is a "multicore-on-multicore" full-system emulator built on QEMU. (Simply speaking, we made QEMU parallel.) The project web page is located at http://ppi.fudan.edu.cn/coremu. You can also download the source code and disk images to play with on SourceForge: http://sf.net/p/coremu. COREMU is composed of (1) a parallel emulation library and (2) a set of patches to QEMU (we worked on the master branch, commit 54d7cf136f040713095cbc064f62d753bff6f9d2). It currently supports full-system emulation of x64 and ARM MPCore platforms. By leveraging the underlying multicore resources, it can emulate up to 255 cores running commodity operating systems (even on a 4-core machine). Enjoy,
>>>>>>>> Nice work. Do you plan to submit the improvements back to upstream QEMU?
>>>>>>> It would be great if we could submit our code to QEMU, but we do not know the process. Would you please give us some instructions? -- Best regards, Chen Yufei
>>>>>> Some hints can be found here: http://wiki.qemu.org/Contribute/StartHere -- Kind regards, Stefan Weil
>>>>> The patch is in the attachment, produced with the command: git diff 54d7cf136f040713095cbc064f62d753bff6f9d2
>>>>>
>>>>> In order to separate what needs to be done to make QEMU parallel, we created a separate library, and the patched QEMU needs to be compiled and linked with that library. To submit our enhancement to QEMU, maybe we need to incorporate this library into QEMU. I don't know what would be the best solution. Our approach to making QEMU parallel can be found at http://ppi.fudan.edu.cn/coremu. I will give a short summary here:
>>>>>
>>>>> 1. Each emulated core thread runs a separate binary translator engine and has a private code cache. We marked some variables in TCG as thread-local. We also modified the TB invalidation mechanism.
>>>>>
>>>>> 2. Each core has a queue holding pending interrupts. The COREMU library provides this queue, and interrupt notification is done by sending realtime signals to the emulated core thread.
>>>>>
>>>>> 3. Atomic instruction emulation has to be modified for parallel emulation. We use a lightweight memory transaction, which requires only a compare-and-swap instruction, to emulate atomic instructions.
>>>>>
>>>>> 4. Some code in the original QEMU may cause data race bugs after we make it parallel. We fixed these problems.
>>>>>
>>>>> -- Best regards, Chen Yufei
>>>> Looking at the patch it seems there is a global lock for hardware access via coremu_spin_lock(&cm_hw_lock). How many cores have you tried running, and do you have lock contention data for cm_hw_lock?
>> The global lock for hardware access is only for the ARM target in our implementation, mainly because we are not quite familiar with ARM. Four ARM cores (a Cortex-A9 limitation) can be emulated that way. For the x86_64 target, we have already made hardware emulation concurrently accessible. We can emulate 255 cores on a quad-core machine.
>>> BTW, this kind of lock is called qemu_global_mutex in QEMU; there it is a sleepy lock, which is likely better for the code paths it protects in upstream. Are they shorter in COREMU?
>>>> Have you thought about making hardware emulation concurrent? These are issues that qemu-kvm faces today, since it executes vcpu threads in parallel. Both qemu-kvm and the COREMU patches could benefit from a solution for concurrent hardware access.
>> In our implementation for the x86_64 target, all devices except the LAPIC are emulated in a separate thread. VCPUs are emulated in other threads (one thread per VCPU). By observing some device drivers in Linux, we have a hypothesis that drivers in the OS have already ensured correct synchronization on concurrent hardware accesses.
> This hypothesis is too optimistic. If hardware emulation code assumes it is only executed in a single-threaded fashion, but guests can execute it in parallel, then this opens up the possibility of race conditions that malicious guests can exploit. There needs to be isolation: a guest should not be able to cause QEMU to crash.
Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
2010/7/22 wang Tiger:
> On 2010-07-22 at 9:00 PM, Jan Kiszka wrote:
>> Stefan Hajnoczi wrote:
>>> On Thu, Jul 22, 2010 at 9:48 AM, Chen Yufei wrote:
>>>> On 2010-07-22, at 1:04 AM, Stefan Weil wrote:
>>>>> Am 21.07.2010 09:03, schrieb Chen Yufei:
>>>>>> On 2010-07-21, at 5:43 AM, Blue Swirl wrote:
>>>>>>> On Sat, Jul 17, 2010 at 10:27 AM, Chen Yufei wrote:
>>>>>>>> We are pleased to announce COREMU, which is a "multicore-on-multicore" full-system emulator built on QEMU. (Simply speaking, we made QEMU parallel.) The project web page is located at http://ppi.fudan.edu.cn/coremu. You can also download the source code and disk images to play with on SourceForge: http://sf.net/p/coremu. COREMU is composed of (1) a parallel emulation library and (2) a set of patches to QEMU (we worked on the master branch, commit 54d7cf136f040713095cbc064f62d753bff6f9d2). It currently supports full-system emulation of x64 and ARM MPCore platforms. By leveraging the underlying multicore resources, it can emulate up to 255 cores running commodity operating systems (even on a 4-core machine). Enjoy,
>>>>>>> Nice work. Do you plan to submit the improvements back to upstream QEMU?
>>>>>> It would be great if we could submit our code to QEMU, but we do not know the process. Would you please give us some instructions? -- Best regards, Chen Yufei
>>>>> Some hints can be found here: http://wiki.qemu.org/Contribute/StartHere -- Kind regards, Stefan Weil
>>>> The patch is in the attachment, produced with the command: git diff 54d7cf136f040713095cbc064f62d753bff6f9d2
>>>>
>>>> In order to separate what needs to be done to make QEMU parallel, we created a separate library, and the patched QEMU needs to be compiled and linked with that library. To submit our enhancement to QEMU, maybe we need to incorporate this library into QEMU. I don't know what would be the best solution. Our approach to making QEMU parallel can be found at http://ppi.fudan.edu.cn/coremu. I will give a short summary here:
>>>>
>>>> 1. Each emulated core thread runs a separate binary translator engine and has a private code cache. We marked some variables in TCG as thread-local. We also modified the TB invalidation mechanism.
>>>>
>>>> 2. Each core has a queue holding pending interrupts. The COREMU library provides this queue, and interrupt notification is done by sending realtime signals to the emulated core thread.
>>>>
>>>> 3. Atomic instruction emulation has to be modified for parallel emulation. We use a lightweight memory transaction, which requires only a compare-and-swap instruction, to emulate atomic instructions.
>>>>
>>>> 4. Some code in the original QEMU may cause data race bugs after we make it parallel. We fixed these problems.
>>>>
>>>> -- Best regards, Chen Yufei
>>> Looking at the patch it seems there is a global lock for hardware access via coremu_spin_lock(&cm_hw_lock). How many cores have you tried running, and do you have lock contention data for cm_hw_lock?
> The global lock for hardware access is only for the ARM target in our implementation, mainly because we are not quite familiar with ARM. Four ARM cores (a Cortex-A9 limitation) can be emulated that way. For the x86_64 target, we have already made hardware emulation concurrently accessible. We can emulate 255 cores on a quad-core machine.
>> BTW, this kind of lock is called qemu_global_mutex in QEMU; there it is a sleepy lock, which is likely better for the code paths it protects in upstream. Are they shorter in COREMU?
>>> Have you thought about making hardware emulation concurrent? These are issues that qemu-kvm faces today, since it executes vcpu threads in parallel. Both qemu-kvm and the COREMU patches could benefit from a solution for concurrent hardware access.
> In our implementation for the x86_64 target, all devices except the LAPIC are emulated in a separate thread. VCPUs are emulated in other threads (one thread per VCPU). By observing some device drivers in Linux, we have a hypothesis that drivers in the OS have already ensured correct synchronization on concurrent hardware accesses.

This hypothesis is too optimistic. If hardware emulation code assumes it is only executed in a single-threaded fashion, but guests can execute it in parallel, then this opens up the possibility of race conditions that malicious guests can exploit. There needs to be isolation: a guest should not be able to cause QEMU to crash.

If you have one hardware thread that handles all device emulation and vcpu threads do no hardware emulation, t
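Point 3 of the quoted summary (emulating guest atomic instructions with only a host compare-and-swap) can be illustrated with a minimal sketch. This is not COREMU's actual "lightweight memory transaction" code; it only shows the retry-loop shape, here for a guest-style atomic fetch-and-add written in C11:

```c
/* Sketch: emulate a guest "lock xadd"-style instruction with a host
 * compare-and-swap loop instead of a lock. Illustrative only. */
#include <stdatomic.h>
#include <stdint.h>

/* Atomically add 'val' to a guest memory word and return the old value. */
static uint32_t emu_atomic_xadd(_Atomic uint32_t *guest_mem, uint32_t val)
{
    uint32_t old = atomic_load(guest_mem);

    /* If another emulated VCPU raced with us, 'old' is refreshed with the
     * current value and we simply retry. */
    while (!atomic_compare_exchange_weak(guest_mem, &old, old + val)) {
        /* loop until the CAS succeeds */
    }
    return old;
}
```

The same pattern generalizes to other read-modify-write guest instructions: read the old value, compute the new one, and commit with a CAS, retrying on contention.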
Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
在 2010年7月22日 下午9:00,Jan Kiszka 写道: > Stefan Hajnoczi wrote: >> On Thu, Jul 22, 2010 at 9:48 AM, Chen Yufei wrote: >>> On 2010-7-22, at 上午1:04, Stefan Weil wrote: >>> Am 21.07.2010 09:03, schrieb Chen Yufei: > On 2010-7-21, at 上午5:43, Blue Swirl wrote: > > >> On Sat, Jul 17, 2010 at 10:27 AM, Chen Yufei wrote: >> >>> We are pleased to announce COREMU, which is a "multicore-on-multicore" >>> full-system emulator built on Qemu. (Simply speaking, we made Qemu >>> parallel.) >>> >>> The project web page is located at: >>> http://ppi.fudan.edu.cn/coremu >>> >>> You can also download the source code, images for playing on sourceforge >>> http://sf.net/p/coremu >>> >>> COREMU is composed of >>> 1. a parallel emulation library >>> 2. a set of patches to qemu >>> (We worked on the master branch, commit >>> 54d7cf136f040713095cbc064f62d753bff6f9d2) >>> >>> It currently supports full-system emulation of x64 and ARM MPcore >>> platforms. >>> >>> By leveraging the underlying multicore resources, it can emulate up to >>> 255 cores running commodity operating systems (even on a 4-core >>> machine). >>> >>> Enjoy, >>> >> Nice work. Do you plan to submit the improvements back to upstream QEMU? >> > It would be great if we can submit our code to QEMU, but we do not know > the process. > Would you please give us some instructions? > > -- > Best regards, > Chen Yufei > Some hints can be found here: http://wiki.qemu.org/Contribute/StartHere Kind regards, Stefan Weil >>> The patch is in the attachment, produced with command >>> git diff 54d7cf136f040713095cbc064f62d753bff6f9d2 >>> >>> In order to separate what need to be done to make QEMU parallel, we created >>> a separate library, and the patched QEMU need to be compiled and linked >>> with that library. To submit our enhancement to QEMU, maybe we need to >>> incorporate this library into QEMU. I don't know what would be the best >>> solution. 
>>> Our approach to make QEMU parallel can be found at
>>> http://ppi.fudan.edu.cn/coremu
>>>
>>> I will give a short summary here:
>>>
>>> 1. Each emulated core thread runs a separate binary translator engine and
>>> has a private code cache. We marked some variables in TCG as thread-local.
>>> We also modified the TB invalidation mechanism.
>>>
>>> 2. Each core has a queue holding pending interrupts. The COREMU library
>>> provides this queue, and interrupt notification is done by sending realtime
>>> signals to the emulated core thread.
>>>
>>> 3. Atomic instruction emulation has to be modified for parallel emulation.
>>> We use a lightweight memory transaction which requires only the
>>> compare-and-swap instruction to emulate atomic instructions.
>>>
>>> 4. Some code in the original QEMU may cause data race bugs after we make it
>>> parallel. We fixed these problems.
>>>
>>> --
>>> Best regards,
>>> Chen Yufei
>>
>> Looking at the patch it seems there is a global lock for hardware
>> access via coremu_spin_lock(&cm_hw_lock). How many cores have you
>> tried running and do you have lock contention data for cm_hw_lock?

The global lock for hardware access is only used for the ARM target in our
implementation, mainly because we are not quite familiar with ARM. Only 4 ARM
cores (a Cortex-A9 limitation) could be emulated this way. For the x86_64
target, we have already made hardware emulation concurrently accessible: we can
emulate 255 cores on a quad-core machine.

> BTW, this kind of lock is called qemu_global_mutex in QEMU, thus it is a
> sleepy lock here which is likely better for the code paths protected by
> it in upstream. Are they shorter in COREMU?
>
>> Have you thought about making hardware emulation concurrent?
>>
>> These are issues that qemu-kvm faces today since it executes vcpu
>> threads in parallel. Both qemu-kvm and the COREMU patches could
>> benefit from a solution for concurrent hardware access.
In our implementation for the x86_64 target, all devices except the LAPIC are
emulated in a separate thread. VCPUs are emulated in other threads (one thread
per VCPU).

By observing some device drivers in Linux, we have a hypothesis that drivers in
the OS have already ensured correct synchronization on concurrent hardware
accesses. For example, when emulating IDE with bus-master DMA:

1. Two VCPUs will not send disk read/write requests at the same time.
2. A new DMA request will not be sent until the previous one has completed.

These two points guarantee that the emulated IDE with DMA can be concurrently
accessed by either a VCPU thread or the hw thread with no additional locks. The
only work we need to do is to fix some misbehaving emulated devices in current
Qemu. For example, in the function ide_write_dma_cb of Qemu:

    if (s->nsector == 0) {
        s->status = READY_STAT | SEEK_STAT;
        ide_set_irq(s->bus);
        /* In parallel emulation, OS ma[...]
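To make the kind of fix being described concrete - updating a device status
word safely when a VCPU thread and the hw thread may touch it concurrently,
without adding a lock - here is a minimal sketch using C11 atomics. The status
bit names and the `dev_state` structure are illustrative assumptions, not
COREMU's or QEMU's actual code:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical device status bits, loosely modelled on the IDE status
 * register discussed above.  Purely illustrative names. */
#define DEV_BUSY  0x80u
#define DEV_READY 0x40u
#define DEV_IRQ   0x04u

typedef struct {
    _Atomic uint32_t status;  /* visible to both VCPU and hw threads */
} dev_state;

/* Set bits in the status word without a lock: a compare-and-swap loop
 * retries until the update is applied to an unmodified value. */
static uint32_t dev_set_status(dev_state *s, uint32_t bits)
{
    uint32_t old = atomic_load(&s->status);
    /* On CAS failure, 'old' is reloaded with the current value. */
    while (!atomic_compare_exchange_weak(&s->status, &old, old | bits))
        ;
    return old | bits;  /* the value this update produced */
}

/* Clear bits the same way. */
static uint32_t dev_clear_status(dev_state *s, uint32_t bits)
{
    uint32_t old = atomic_load(&s->status);
    while (!atomic_compare_exchange_weak(&s->status, &old, old & ~bits))
        ;
    return old & ~bits;
}
```

With this pattern, a plain `s->status = READY_STAT | SEEK_STAT` store becomes
an atomic read-modify-write, so a concurrent update from another thread is
never silently overwritten.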
[Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
2010/7/22 Jan Kiszka:
> Stefan Hajnoczi wrote:
>> On Thu, Jul 22, 2010 at 9:48 AM, Chen Yufei wrote:

[snipped: quoted COREMU announcement and contribution discussion]
[snipped: quoted summary of the COREMU approach and earlier lock discussion]

> While we are all looking forward to seeing more scalable hardware models
> :), I think it is a topic that can be addressed widely independent of
> parallelizing TCG VCPUs. The latter can benefit from the former, for
> sure, but it first of all has to solve its own issues.

Right, but it's worth discussing with people who have worked on parallel
vcpus from a different angle.
> Note that --enable-io-thread provides truly parallel KVM VCPUs also in
> upstream these days. Just for TCG, we need that slightly suboptimal CPU
> scheduling inside single-threaded tcg_cpu_exec (was renamed to
> cpu_exec_all today).
>
> Jan
>
> --
> Siemens AG, Corporate Technology, CT T DE IT 1
> Corporate Competence Center Embedded Linux
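Point 2 of the COREMU summary quoted in this thread - a per-core queue of
pending interrupts, with the device thread notifying the VCPU thread - can be
sketched as a lock-free single-producer/single-consumer ring. The layout,
names, and queue size below are illustrative assumptions, not COREMU's actual
library; the realtime-signal kick (e.g. via pthread_kill) is noted in a
comment but omitted:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define INTR_QUEUE_LEN 64u   /* must be a power of two; illustrative size */

/* One queue per emulated core.  The device thread is the only producer,
 * the VCPU thread the only consumer, so no lock is needed. */
typedef struct {
    int vector[INTR_QUEUE_LEN];
    _Atomic unsigned head;   /* advanced by the consumer (VCPU thread) */
    _Atomic unsigned tail;   /* advanced by the producer (device thread) */
} intr_queue;

/* Producer side: post an interrupt vector.  Returns false if full. */
static bool intr_post(intr_queue *q, int vector)
{
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    if (tail - atomic_load_explicit(&q->head, memory_order_acquire)
            == INTR_QUEUE_LEN)
        return false;                        /* queue full */
    q->vector[tail % INTR_QUEUE_LEN] = vector;
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    /* A real implementation would now send a realtime signal
     * (pthread_kill) to kick the VCPU thread out of its code cache. */
    return true;
}

/* Consumer side: fetch the next pending vector.  Returns false if empty. */
static bool intr_fetch(intr_queue *q, int *vector)
{
    unsigned head = atomic_load_explicit(&q->head, memory_order_relaxed);
    if (head == atomic_load_explicit(&q->tail, memory_order_acquire))
        return false;                        /* queue empty */
    *vector = q->vector[head % INTR_QUEUE_LEN];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}
```

The acquire/release pairing ensures the consumer never reads a vector slot
before the producer's write to it is visible.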
[Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
Stefan Hajnoczi wrote:
> On Thu, Jul 22, 2010 at 9:48 AM, Chen Yufei wrote:
>> On 2010-7-22, at 1:04 AM, Stefan Weil wrote:

[snipped: quoted COREMU announcement and contribution discussion]

>> The patch is in the attachment, produced with command
>> git diff 54d7cf136f040713095cbc064f62d753bff6f9d2
>>
>> In order to separate what needs to be done to make QEMU parallel, we created
>> a separate library, and the patched QEMU needs to be compiled and linked with
>> that library. To submit our enhancement to QEMU, maybe we need to
>> incorporate this library into QEMU. I don't know what would be the best
>> solution.
>>
>> Our approach to make QEMU parallel can be found at
>> http://ppi.fudan.edu.cn/coremu
>>
>> I will give a short summary here:
>>
>> 1.
>> Each emulated core thread runs a separate binary translator engine and has a
>> private code cache. We marked some variables in TCG as thread-local. We also
>> modified the TB invalidation mechanism.
>>
>> 2. Each core has a queue holding pending interrupts. The COREMU library
>> provides this queue, and interrupt notification is done by sending realtime
>> signals to the emulated core thread.
>>
>> 3. Atomic instruction emulation has to be modified for parallel emulation.
>> We use a lightweight memory transaction which requires only the
>> compare-and-swap instruction to emulate atomic instructions.
>>
>> 4. Some code in the original QEMU may cause data race bugs after we make it
>> parallel. We fixed these problems.
>>
>> --
>> Best regards,
>> Chen Yufei
>
> Looking at the patch it seems there is a global lock for hardware
> access via coremu_spin_lock(&cm_hw_lock). How many cores have you
> tried running and do you have lock contention data for cm_hw_lock?

BTW, this kind of lock is called qemu_global_mutex in QEMU, thus it is a
sleepy lock here which is likely better for the code paths protected by it in
upstream. Are they shorter in COREMU?

> Have you thought about making hardware emulation concurrent?
>
> These are issues that qemu-kvm faces today since it executes vcpu
> threads in parallel. Both qemu-kvm and the COREMU patches could
> benefit from a solution for concurrent hardware access.

While we are all looking forward to seeing more scalable hardware models :), I
think it is a topic that can be addressed widely independent of parallelizing
TCG VCPUs. The latter can benefit from the former, for sure, but it first of
all has to solve its own issues.

Note that --enable-io-thread provides truly parallel KVM VCPUs also in
upstream these days. Just for TCG, we need that slightly suboptimal CPU
scheduling inside single-threaded tcg_cpu_exec (was renamed to cpu_exec_all
today).
Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
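Point 3 of the summary above - emulating guest atomic instructions with only a
host compare-and-swap, the "lightweight memory transaction" - can be sketched
as follows. The helper names and the idea of modelling x86 `lock xadd` and
`lock cmpxchg` are illustrative assumptions, not COREMU's actual helpers:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper emulating something like x86 'lock xadd' on a
 * 32-bit guest memory location.  The load-then-CAS loop is the
 * lightweight transaction: it retries if any other VCPU thread
 * modified the location between the load and the CAS. */
static uint32_t emu_atomic_xadd32(_Atomic uint32_t *guest_addr, uint32_t inc)
{
    uint32_t old = atomic_load(guest_addr);
    while (!atomic_compare_exchange_weak(guest_addr, &old, old + inc))
        ;                      /* 'old' is refreshed on each failure */
    return old;                /* xadd yields the previous value */
}

/* Guest 'lock cmpxchg' maps directly onto a single host CAS.  The
 * returned value is the old memory contents, as the real instruction
 * would leave in the accumulator. */
static uint32_t emu_atomic_cmpxchg32(_Atomic uint32_t *guest_addr,
                                     uint32_t expected, uint32_t desired)
{
    atomic_compare_exchange_strong(guest_addr, &expected, desired);
    return expected;           /* updated to the old value on failure */
}
```

Because only compare-and-swap is required of the host, this approach stays
portable across host architectures that provide a CAS primitive, which fits
the thread's point that no heavier host synchronization is needed per guest
atomic instruction.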
[Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
Chen Yufei wrote:
> On 2010-7-22, at 1:04 AM, Stefan Weil wrote:

[snipped: quoted COREMU announcement and contribution discussion]

> The patch is in the attachment, produced with command
> git diff 54d7cf136f040713095cbc064f62d753bff6f9d2
>
> In order to separate what needs to be done to make QEMU parallel, we created
> a separate library, and the patched QEMU needs to be compiled and linked with
> that library. To submit our enhancement to QEMU, maybe we need to incorporate
> this library into QEMU. I don't know what would be the best solution.

For upstream QEMU, the goal should be to integrate your modifications and
enhancements into the existing architecture in a mostly seamless way.
The library approach may help with maintaining your changes out of tree, but it
likely cannot contribute any benefit to an in-tree extension of QEMU for
parallel TCG VCPUs.

> Our approach to make QEMU parallel can be found at
> http://ppi.fudan.edu.cn/coremu
>
> I will give a short summary here:
>
> 1. Each emulated core thread runs a separate binary translator engine and has
> a private code cache. We marked some variables in TCG as thread-local. We
> also modified the TB invalidation mechanism.
>
> 2. Each core has a queue holding pending interrupts. The COREMU library
> provides this queue, and interrupt notification is done by sending realtime
> signals to the emulated core thread.
>
> 3. Atomic instruction emulation has to be modified for parallel emulation. We
> use a lightweight memory transaction which requires only the compare-and-swap
> instruction to emulate atomic instructions.
>
> 4. Some code in the original QEMU may cause data race bugs after we make it
> parallel. We fixed these problems.

Upstream integration requires such iterative steps as well - in the form of
ideally small, focused patches that finally convert QEMU into a parallel
emulator.

Also note that upstream already supports threaded VCPUs - in KVM mode. You
obviously have resolved the major blocking points to apply this in TCG mode as
well. But I don't see yet why we may need a new VCPU threading infrastructure
for this. Rather, only small tuning of what KVM already uses should suffice -
if that's required at all.

To give it a start, you could identify some more trivial changes in your
patches, split them out and rebase them over latest qemu.git, then post them as
a patch series for inclusion (see the mailing list for various examples). Make
sure to describe the reason for your changes as clearly as possible,
specifically if they are not (yet) obvious in the absence of COREMU features in
upstream QEMU.
Be prepared for merging your code to be a lengthy process, with quite a few
discussions about why and how things are done, likely also with requests to
change your current solution in some aspects. However, the result should be an
optimal solution for the overall goal, parallel VCPU emulation - and no longer
any need to maintain your private set of patches against the quickly evolving
QEMU.

Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux