[Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator

2010-07-25 Thread Paolo Bonzini
On 07/23/2010 01:02 PM, Stefan Hajnoczi wrote:
>> In fact, we solved this problem with a really simple method.
>> In our prototype, we disabled this piece of code, like this:
>> void *qemu_get_ram_ptr(ram_addr_t addr)
>> {
>> ..
>>
>> /* Move this entry to the start of the list.  */
>> #ifndef CONFIG_COREMU
>> /* Different cores can access this function at the same time.
>>  * For coremu, disable this optimization to avoid data races.
>>  * XXX or use a spin lock here if the performance impact is big. */
>> if (prev) {
>> prev->next = block->next;
>> block->next = *prevp;
>> *prevp = block;
>> }
>> #endif
>> return block->host + (addr - block->offset);
>> }
>>
>> CONFIG_COREMU is defined when TCG parallel mode is configured.
>> And without device hotplug the list is effectively read-only, so
>> we don't use a lock to protect it.
>> Reimplementing this list as a lock-free list would also be
>> reasonable, but seems unnecessary. :-)
> 
> Ah, good :).

For this one in particular, you could just use circular lists (without a
"head" node, unlike the Linux kernel's list data type, as there's always
a RAM entry) and start iteration at "prev".
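A minimal sketch of that idea (illustrative names, not QEMU's actual code): keep the blocks on a circular singly linked list with no head node, and remember the last hit in a single cursor pointer. A lookup then never splices the list, so the multi-pointer move-to-front race disappears; the only write is a single pointer store.

```c
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t ram_addr_t;

/* Simplified stand-in for QEMU's RAMBlock; the circle has no head node. */
typedef struct RAMBlock {
    struct RAMBlock *next;      /* last block links back to the first */
    ram_addr_t offset;
    ram_addr_t length;
} RAMBlock;

/* Most-recently-hit block; updating it is a single pointer store,
 * unlike the multi-pointer splice that move-to-front needs. */
static RAMBlock *ram_cursor;

static RAMBlock *find_ram_block(ram_addr_t addr)
{
    RAMBlock *start = ram_cursor;
    RAMBlock *block = start;

    do {
        if (addr - block->offset < block->length) {
            ram_cursor = block;     /* keep locality without reordering */
            return block;
        }
        block = block->next;
    } while (block != start);

    return NULL;                    /* address not in any RAM block */
}
```

In a real patch the cursor would be an atomic pointer so concurrent readers on a weakly ordered host always load some valid block; the sketch above only shows the data-structure shape.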

Paolo



Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator

2010-07-23 Thread Stefan Hajnoczi
2010/7/23 wang Tiger :
> On July 23, 2010 at 5:13 PM, Stefan Hajnoczi wrote:
>> 2010/7/23 Alexander Graf :
>>>
>>> On 23.07.2010, at 09:53, Jan Kiszka wrote:
>>>
 wang Tiger wrote:
> On July 22, 2010 at 11:47 PM, Stefan Hajnoczi wrote:
>> 2010/7/22 wang Tiger :
>>> In our implementation for the x86_64 target, all devices except the LAPIC are
>>> emulated in a separate thread. VCPUs are emulated in other threads
>>> (one thread per VCPU).
>>> By observing some device drivers in Linux, we have a hypothesis that
>>> drivers in the OS have already ensured correct synchronization of
>>> concurrent hardware accesses.
>> This hypothesis is too optimistic.  If hardware emulation code assumes
>> it is only executed in a single-threaded fashion, but guests can
>> execute it in parallel, then this opens up the possibility of race
>> conditions that malicious guests can exploit.  There needs to be
>> isolation: a guest should not be able to cause QEMU to crash.
>
> In our prototype, we assume the guest behaves correctly. If hardware
> emulation code can ensure atomic access (behaving like real hardware),
> VCPUs can access devices freely.  We actually refined some hardware
> emulation code (e.g. BMDMA, IOAPIC) to ensure the atomicity of
> hardware accesses.

 This approach is surely helpful for a prototype to explore the limits.
 But it's not applicable to production systems. It would create a huge
 source of potential subtle regressions for other guest OSes,
 specifically those that you cannot analyze regarding synchronized
 hardware access. We must play safe.

 That's why we currently have the global mutex. Its conversion can only
 happen step-wise, e.g. by establishing an infrastructure to declare the
 need of device models for that Big Lock. Then you can start converting
 individual models to private locks or even smart lock-less patterns.
>>>
>>> But isn't that independent from making TCG atomic capable and parallel? At 
>>> that point a TCG vCPU would have the exact same issues and interfaces as a 
>>> KVM vCPU, right? And then we can tackle the concurrent device access issues 
>>> together.
>>
>> An issue that might affect COREMU today is core QEMU subsystems that
>> are not thread-safe and used from hardware emulation, for example:
>>
>> cpu_physical_memory_read/write() to RAM will use qemu_get_ram_ptr().
>> This function moves the found RAMBlock to the head of the global RAM
>> blocks list in a non-atomic way.  Therefore, two unrelated hardware
>> devices executing cpu_physical_memory_*() simultaneously face a race
>> condition.  I have seen this happen when playing with parallel
>> hardware emulation.
>>
>> Tiger: If you are only locking the hardware thread for the ARM target,
>> your hardware emulation is not safe for other targets.  Have I missed
>> something in the COREMU patch that defends against this problem?
>>
>> Stefan
>>
> In fact, we solved this problem with a really simple method.
> In our prototype, we disabled this piece of code, like this:
> void *qemu_get_ram_ptr(ram_addr_t addr)
> {
>..
>
>    /* Move this entry to the start of the list.  */
> #ifndef CONFIG_COREMU
>    /* Different cores can access this function at the same time.
>     * For coremu, disable this optimization to avoid data races.
>     * XXX or use a spin lock here if the performance impact is big. */
>if (prev) {
>prev->next = block->next;
>block->next = *prevp;
>*prevp = block;
>}
> #endif
>return block->host + (addr - block->offset);
> }
>
> CONFIG_COREMU is defined when TCG parallel mode is configured.
> And without device hotplug the list is effectively read-only, so
> we don't use a lock to protect it.
> Reimplementing this list as a lock-free list would also be
> reasonable, but seems unnecessary. :-)

Ah, good :).

Stefan

> --
> Zhaoguo Wang, Parallel Processing Institute, Fudan University
>
> Address: Room 320, Software Building, 825 Zhangheng Road, Shanghai, China
>
> tigerwang1...@gmail.com
> http://ppi.fudan.edu.cn/zhaoguo_wang
>



Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator

2010-07-23 Thread wang Tiger
On July 23, 2010 at 5:13 PM, Stefan Hajnoczi wrote:
> 2010/7/23 Alexander Graf :
>>
>> On 23.07.2010, at 09:53, Jan Kiszka wrote:
>>
>>> wang Tiger wrote:
 On July 22, 2010 at 11:47 PM, Stefan Hajnoczi wrote:
> 2010/7/22 wang Tiger :
>> In our implementation for the x86_64 target, all devices except the LAPIC are
>> emulated in a separate thread. VCPUs are emulated in other threads
>> (one thread per VCPU).
>> By observing some device drivers in Linux, we have a hypothesis that
>> drivers in the OS have already ensured correct synchronization of
>> concurrent hardware accesses.
> This hypothesis is too optimistic.  If hardware emulation code assumes
> it is only executed in a single-threaded fashion, but guests can
> execute it in parallel, then this opens up the possibility of race
> conditions that malicious guests can exploit.  There needs to be
> isolation: a guest should not be able to cause QEMU to crash.

 In our prototype, we assume the guest behaves correctly. If hardware
 emulation code can ensure atomic access (behaving like real hardware),
 VCPUs can access devices freely.  We actually refined some hardware
 emulation code (e.g. BMDMA, IOAPIC) to ensure the atomicity of
 hardware accesses.
>>>
>>> This approach is surely helpful for a prototype to explore the limits.
>>> But it's not applicable to production systems. It would create a huge
>>> source of potential subtle regressions for other guest OSes,
>>> specifically those that you cannot analyze regarding synchronized
>>> hardware access. We must play safe.
>>>
>>> That's why we currently have the global mutex. Its conversion can only
>>> happen step-wise, e.g. by establishing an infrastructure to declare the
>>> need of device models for that Big Lock. Then you can start converting
>>> individual models to private locks or even smart lock-less patterns.
>>
>> But isn't that independent from making TCG atomic capable and parallel? At 
>> that point a TCG vCPU would have the exact same issues and interfaces as a 
>> KVM vCPU, right? And then we can tackle the concurrent device access issues 
>> together.
>
> An issue that might affect COREMU today is core QEMU subsystems that
> are not thread-safe and used from hardware emulation, for example:
>
> cpu_physical_memory_read/write() to RAM will use qemu_get_ram_ptr().
> This function moves the found RAMBlock to the head of the global RAM
> blocks list in a non-atomic way.  Therefore, two unrelated hardware
> devices executing cpu_physical_memory_*() simultaneously face a race
> condition.  I have seen this happen when playing with parallel
> hardware emulation.
>
> Tiger: If you are only locking the hardware thread for the ARM target,
> your hardware emulation is not safe for other targets.  Have I missed
> something in the COREMU patch that defends against this problem?
>
> Stefan
>
In fact, we solved this problem with a really simple method.
In our prototype, we disabled this piece of code, like this:
void *qemu_get_ram_ptr(ram_addr_t addr)
{
..

/* Move this entry to the start of the list.  */
#ifndef CONFIG_COREMU
/* Different cores can access this function at the same time.
 * For coremu, disable this optimization to avoid data races.
 * XXX or use a spin lock here if the performance impact is big. */
if (prev) {
prev->next = block->next;
block->next = *prevp;
*prevp = block;
}
#endif
return block->host + (addr - block->offset);
}

CONFIG_COREMU is defined when TCG parallel mode is configured.
And without device hotplug the list is effectively read-only, so
we don't use a lock to protect it.
Reimplementing this list as a lock-free list would also be
reasonable, but seems unnecessary. :-)
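For reference, the spin-lock variant that the XXX comment mentions could look roughly like the sketch below (illustrative names, not COREMU's or QEMU's actual code). Note that the lock has to cover the lookup too, not just the splice: an unlocked reader could otherwise follow a next pointer that a concurrent move-to-front is rewriting.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct Block {
    struct Block *next;
    int key;
} Block;

static Block *list_head;
static atomic_flag list_lock = ATOMIC_FLAG_INIT;

/* Look up a block and move it to the front (MRU), all under a spin lock. */
static Block *lookup_mtf(int key)
{
    while (atomic_flag_test_and_set_explicit(&list_lock,
                                             memory_order_acquire)) {
        /* spin until the flag is clear */
    }

    Block *prev = NULL;
    Block *block = list_head;
    while (block && block->key != key) {
        prev = block;
        block = block->next;
    }
    if (block && prev) {            /* splice the hit to the list head */
        prev->next = block->next;
        block->next = list_head;
        list_head = block;
    }

    atomic_flag_clear_explicit(&list_lock, memory_order_release);
    return block;
}
```

Whether this beats simply disabling the reordering (as above) depends on how hot the lock is; for VCPU counts in the hundreds, a contended global spin lock on every RAM access could easily cost more than the move-to-front optimization saves.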
-- 
Zhaoguo Wang, Parallel Processing Institute, Fudan University

Address: Room 320, Software Building, 825 Zhangheng Road, Shanghai, China

tigerwang1...@gmail.com
http://ppi.fudan.edu.cn/zhaoguo_wang



[Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator

2010-07-23 Thread wang Tiger
On July 23, 2010 at 3:53 PM, Jan Kiszka wrote:
> wang Tiger wrote:
>> On July 22, 2010 at 11:47 PM, Stefan Hajnoczi wrote:
>>> 2010/7/22 wang Tiger :
 In our implementation for the x86_64 target, all devices except the LAPIC are
 emulated in a separate thread. VCPUs are emulated in other threads
 (one thread per VCPU).
 By observing some device drivers in Linux, we have a hypothesis that
 drivers in the OS have already ensured correct synchronization of
 concurrent hardware accesses.
>>> This hypothesis is too optimistic.  If hardware emulation code assumes
>>> it is only executed in a single-threaded fashion, but guests can
>>> execute it in parallel, then this opens up the possibility of race
>>> conditions that malicious guests can exploit.  There needs to be
>>> isolation: a guest should not be able to cause QEMU to crash.
>>
>> In our prototype, we assume the guest behaves correctly. If hardware
>> emulation code can ensure atomic access (behaving like real hardware),
>> VCPUs can access devices freely.  We actually refined some hardware
>> emulation code (e.g. BMDMA, IOAPIC) to ensure the atomicity of
>> hardware accesses.
>
> This approach is surely helpful for a prototype to explore the limits.
> But it's not applicable to production systems. It would create a huge
> source of potential subtle regressions for other guest OSes,
> specifically those that you cannot analyze regarding synchronized
> hardware access. We must play safe.
>
> That's why we currently have the global mutex. Its conversion can only
> happen step-wise, e.g. by establishing an infrastructure to declare the
> need of device models for that Big Lock. Then you can start converting
> individual models to private locks or even smart lock-less patterns.
>
> Jan
>
>
I agree with you on this point. The approach we used is really helpful
for a research prototype, but it needs a lot of work to make it
applicable to production systems.
It would be my pleasure if we can tackle this issue together.

-- 
Zhaoguo Wang, Parallel Processing Institute, Fudan University

Address: Room 320, Software Building, 825 Zhangheng Road, Shanghai, China

tigerwang1...@gmail.com
http://ppi.fudan.edu.cn/zhaoguo_wang



Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator

2010-07-23 Thread Jan Kiszka
Stefan Hajnoczi wrote:
> 2010/7/23 Alexander Graf :
>> On 23.07.2010, at 09:53, Jan Kiszka wrote:
>>
>>> wang Tiger wrote:
 On July 22, 2010 at 11:47 PM, Stefan Hajnoczi wrote:
> 2010/7/22 wang Tiger :
>> In our implementation for the x86_64 target, all devices except the LAPIC are
>> emulated in a separate thread. VCPUs are emulated in other threads
>> (one thread per VCPU).
>> By observing some device drivers in Linux, we have a hypothesis that
>> drivers in the OS have already ensured correct synchronization of
>> concurrent hardware accesses.
> This hypothesis is too optimistic.  If hardware emulation code assumes
> it is only executed in a single-threaded fashion, but guests can
> execute it in parallel, then this opens up the possibility of race
> conditions that malicious guests can exploit.  There needs to be
> isolation: a guest should not be able to cause QEMU to crash.
 In our prototype, we assume the guest behaves correctly. If hardware
 emulation code can ensure atomic access (behaving like real hardware),
 VCPUs can access devices freely.  We actually refined some hardware
 emulation code (e.g. BMDMA, IOAPIC) to ensure the atomicity of
 hardware accesses.
>>> This approach is surely helpful for a prototype to explore the limits.
>>> But it's not applicable to production systems. It would create a huge
>>> source of potential subtle regressions for other guest OSes,
>>> specifically those that you cannot analyze regarding synchronized
>>> hardware access. We must play safe.
>>>
>>> That's why we currently have the global mutex. Its conversion can only
>>> happen step-wise, e.g. by establishing an infrastructure to declare the
>>> need of device models for that Big Lock. Then you can start converting
>>> individual models to private locks or even smart lock-less patterns.
>> But isn't that independent from making TCG atomic capable and parallel? At 
>> that point a TCG vCPU would have the exact same issues and interfaces as a 
>> KVM vCPU, right? And then we can tackle the concurrent device access issues 
>> together.
> 
> An issue that might affect COREMU today is core QEMU subsystems that
> are not thread-safe and used from hardware emulation, for example:
> 
> cpu_physical_memory_read/write() to RAM will use qemu_get_ram_ptr().
> This function moves the found RAMBlock to the head of the global RAM
> blocks list in a non-atomic way.  Therefore, two unrelated hardware
> devices executing cpu_physical_memory_*() simultaneously face a race
> condition.  I have seen this happen when playing with parallel
> hardware emulation.

Those issues need to be identified and, in a first step, worked around
by holding dedicated locks or just the global mutex. Maybe the above
conflict can also be resolved directly by creating per-VCPU lookup lists
(likely more efficient than stepping on other VCPUs' toes by constantly
reordering a global list). This is likely a good example of a
self-contained preparatory patch.
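One way to read the per-VCPU suggestion (a sketch under assumed names; none of this is actual QEMU code): leave the shared block list strictly read-only after setup and give each VCPU thread only a thread-local "last hit" hint, so lookups never write any shared state at all.

```c
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t ram_addr_t;

typedef struct RAMBlock {
    struct RAMBlock *next;
    ram_addr_t offset;
    ram_addr_t length;
} RAMBlock;

/* Shared list, treated as read-only after setup (no RAM hotplug). */
static RAMBlock *ram_blocks;

/* Per-thread hint: each VCPU thread caches its own last hit, so no
 * thread ever writes to the shared list links. */
static _Thread_local RAMBlock *last_hit;

static RAMBlock *vcpu_find_ram_block(ram_addr_t addr)
{
    if (last_hit && addr - last_hit->offset < last_hit->length) {
        return last_hit;            /* fast path, entirely thread-private */
    }
    for (RAMBlock *b = ram_blocks; b != NULL; b = b->next) {
        if (addr - b->offset < b->length) {
            last_hit = b;           /* thread-local write only */
            return b;
        }
    }
    return NULL;
}
```

This keeps the locality benefit that the move-to-front hack was after, without any cross-thread writes; supporting hotplug would reintroduce the need for synchronization on list updates.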

However, getting concurrency right is tricky enough. We should really be
careful about turning too much upside down in a rush. Even if TCG has
some deeper hooks into the device model or thread-unsafe core parts than
KVM, parallelizing it can and should remain a separate topic. And we
also have to keep an eye on performance when somewhat fewer than 255
VCPUs are being emulated.

Jan





Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator

2010-07-23 Thread Stefan Hajnoczi
2010/7/23 Alexander Graf :
>
> On 23.07.2010, at 09:53, Jan Kiszka wrote:
>
>> wang Tiger wrote:
>>> On July 22, 2010 at 11:47 PM, Stefan Hajnoczi wrote:
 2010/7/22 wang Tiger :
> In our implementation for the x86_64 target, all devices except the LAPIC are
> emulated in a separate thread. VCPUs are emulated in other threads
> (one thread per VCPU).
> By observing some device drivers in Linux, we have a hypothesis that
> drivers in the OS have already ensured correct synchronization of
> concurrent hardware accesses.
 This hypothesis is too optimistic.  If hardware emulation code assumes
 it is only executed in a single-threaded fashion, but guests can
 execute it in parallel, then this opens up the possibility of race
 conditions that malicious guests can exploit.  There needs to be
 isolation: a guest should not be able to cause QEMU to crash.
>>>
>>> In our prototype, we assume the guest behaves correctly. If hardware
>>> emulation code can ensure atomic access (behaving like real hardware),
>>> VCPUs can access devices freely.  We actually refined some hardware
>>> emulation code (e.g. BMDMA, IOAPIC) to ensure the atomicity of
>>> hardware accesses.
>>
>> This approach is surely helpful for a prototype to explore the limits.
>> But it's not applicable to production systems. It would create a huge
>> source of potential subtle regressions for other guest OSes,
>> specifically those that you cannot analyze regarding synchronized
>> hardware access. We must play safe.
>>
>> That's why we currently have the global mutex. Its conversion can only
>> happen step-wise, e.g. by establishing an infrastructure to declare the
>> need of device models for that Big Lock. Then you can start converting
>> individual models to private locks or even smart lock-less patterns.
>
> But isn't that independent from making TCG atomic capable and parallel? At 
> that point a TCG vCPU would have the exact same issues and interfaces as a 
> KVM vCPU, right? And then we can tackle the concurrent device access issues 
> together.

An issue that might affect COREMU today is core QEMU subsystems that
are not thread-safe and used from hardware emulation, for example:

cpu_physical_memory_read/write() to RAM will use qemu_get_ram_ptr().
This function moves the found RAMBlock to the head of the global RAM
blocks list in a non-atomic way.  Therefore, two unrelated hardware
devices executing cpu_physical_memory_*() simultaneously face a race
condition.  I have seen this happen when playing with parallel
hardware emulation.

Tiger: If you are only locking the hardware thread for the ARM target,
your hardware emulation is not safe for other targets.  Have I missed
something in the COREMU patch that defends against this problem?

Stefan



Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator

2010-07-23 Thread Alexander Graf

On 23.07.2010, at 09:53, Jan Kiszka wrote:

> wang Tiger wrote:
>> On July 22, 2010 at 11:47 PM, Stefan Hajnoczi wrote:
>>> 2010/7/22 wang Tiger :
 In our implementation for the x86_64 target, all devices except the LAPIC are
 emulated in a separate thread. VCPUs are emulated in other threads
 (one thread per VCPU).
 By observing some device drivers in Linux, we have a hypothesis that
 drivers in the OS have already ensured correct synchronization of
 concurrent hardware accesses.
>>> This hypothesis is too optimistic.  If hardware emulation code assumes
>>> it is only executed in a single-threaded fashion, but guests can
>>> execute it in parallel, then this opens up the possibility of race
>>> conditions that malicious guests can exploit.  There needs to be
>>> isolation: a guest should not be able to cause QEMU to crash.
>> 
>> In our prototype, we assume the guest behaves correctly. If hardware
>> emulation code can ensure atomic access (behaving like real hardware),
>> VCPUs can access devices freely.  We actually refined some hardware
>> emulation code (e.g. BMDMA, IOAPIC) to ensure the atomicity of
>> hardware accesses.
> 
> This approach is surely helpful for a prototype to explore the limits.
> But it's not applicable to production systems. It would create a huge
> source of potential subtle regressions for other guest OSes,
> specifically those that you cannot analyze regarding synchronized
> hardware access. We must play safe.
> 
> That's why we currently have the global mutex. Its conversion can only
> happen step-wise, e.g. by establishing an infrastructure to declare the
> need of device models for that Big Lock. Then you can start converting
> individual models to private locks or even smart lock-less patterns.

But isn't that independent from making TCG atomic capable and parallel? At that 
point a TCG vCPU would have the exact same issues and interfaces as a KVM vCPU, 
right? And then we can tackle the concurrent device access issues together.


Alex




[Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator

2010-07-23 Thread Jan Kiszka
wang Tiger wrote:
> On July 22, 2010 at 11:47 PM, Stefan Hajnoczi wrote:
>> 2010/7/22 wang Tiger :
>>> In our implementation for the x86_64 target, all devices except the LAPIC are
>>> emulated in a separate thread. VCPUs are emulated in other threads
>>> (one thread per VCPU).
>>> By observing some device drivers in Linux, we have a hypothesis that
>>> drivers in the OS have already ensured correct synchronization of
>>> concurrent hardware accesses.
>> This hypothesis is too optimistic.  If hardware emulation code assumes
>> it is only executed in a single-threaded fashion, but guests can
>> execute it in parallel, then this opens up the possibility of race
>> conditions that malicious guests can exploit.  There needs to be
>> isolation: a guest should not be able to cause QEMU to crash.
> 
> In our prototype, we assume the guest behaves correctly. If hardware
> emulation code can ensure atomic access (behaving like real hardware),
> VCPUs can access devices freely.  We actually refined some hardware
> emulation code (e.g. BMDMA, IOAPIC) to ensure the atomicity of
> hardware accesses.

This approach is surely helpful for a prototype to explore the limits.
But it's not applicable to production systems. It would create a huge
source of potential subtle regressions for other guest OSes,
specifically those that you cannot analyze regarding synchronized
hardware access. We must play safe.

That's why we currently have the global mutex. Its conversion can only
happen step-wise, e.g. by establishing an infrastructure to declare the
need of device models for that Big Lock. Then you can start converting
individual models to private locks or even smart lock-less patterns.

Jan





Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator

2010-07-22 Thread wang Tiger
On July 22, 2010 at 11:47 PM, Stefan Hajnoczi wrote:
> 2010/7/22 wang Tiger :
>> On July 22, 2010 at 9:00 PM, Jan Kiszka wrote:
>>> Stefan Hajnoczi wrote:
 On Thu, Jul 22, 2010 at 9:48 AM, Chen Yufei  wrote:
> On 2010-7-22, at 1:04 AM, Stefan Weil wrote:
>
>> On 21.07.2010 09:03, Chen Yufei wrote:
>>> On 2010-7-21, at 5:43 AM, Blue Swirl wrote:
>>>
>>>
 On Sat, Jul 17, 2010 at 10:27 AM, Chen Yufei  
 wrote:

> We are pleased to announce COREMU, which is a 
> "multicore-on-multicore" full-system emulator built on Qemu. (Simply 
> speaking, we made Qemu parallel.)
>
> The project web page is located at:
> http://ppi.fudan.edu.cn/coremu
>
> You can also download the source code, images for playing on 
> sourceforge
> http://sf.net/p/coremu
>
> COREMU is composed of
> 1. a parallel emulation library
> 2. a set of patches to qemu
> (We worked on the master branch, commit 
> 54d7cf136f040713095cbc064f62d753bff6f9d2)
>
> It currently supports full-system emulation of x64 and ARM MPcore 
> platforms.
>
> By leveraging the underlying multicore resources, it can emulate up 
> to 255 cores running commodity operating systems (even on a 4-core 
> machine).
>
> Enjoy,
>
 Nice work. Do you plan to submit the improvements back to upstream 
 QEMU?

>>> It would be great if we could submit our code to QEMU, but we do not know 
>>> the process.
>>> Would you please give us some instructions?
>>>
>>> --
>>> Best regards,
>>> Chen Yufei
>>>
>> Some hints can be found here:
>> http://wiki.qemu.org/Contribute/StartHere
>>
>> Kind regards,
>> Stefan Weil
> The patch is in the attachment, produced with command
> git diff 54d7cf136f040713095cbc064f62d753bff6f9d2
>
> In order to separate what needs to be done to make QEMU parallel, we 
> created a separate library, and the patched QEMU needs to be compiled and 
> linked with that library. To submit our enhancement to QEMU, maybe we 
> need to incorporate this library into QEMU. I don't know what would be 
> the best solution.
>
> Our approach to make QEMU parallel can be found at 
> http://ppi.fudan.edu.cn/coremu
>
> I will give a short summary here:
>
> 1. Each emulated core thread runs a separate binary translator engine and 
> has a private code cache. We marked some variables in TCG as thread-local. 
> We also modified the TB invalidation mechanism.
>
> 2. Each core has a queue holding pending interrupts. The COREMU library 
> provides this queue, and interrupt notification is done by sending 
> realtime signals to the emulated core thread.
>
> 3. Atomic instruction emulation has to be modified for parallel 
> emulation. We use a lightweight memory transaction, which requires only a 
> compare-and-swap instruction, to emulate atomic instructions.
>
> 4. Some code in the original QEMU may cause data race bugs after we make 
> it parallel. We fixed these problems.
>
>
>
>
> --
> Best regards,
> Chen Yufei

 Looking at the patch, it seems there is a global lock for hardware
 access via coremu_spin_lock(&cm_hw_lock).  How many cores have you
 tried running, and do you have lock contention data for cm_hw_lock?
>>
>> The global lock for hardware access is only for the ARM target in our
>> implementation, mainly because we are not quite familiar
>> with ARM. 4 ARM cores (the Cortex A9 limit) can be emulated in this
>> way.
>> For the x86_64 target, we have already made hardware emulation
>> support concurrent access. We can emulate 255 cores on a quad-core
>> machine.
>>
>>>
>>> BTW, this kind of lock is called qemu_global_mutex in QEMU; there it is a
>>> sleeping lock, which is likely better for the code paths it protects in
>>> upstream. Are they shorter in COREMU?
>>>
 Have you thought about making hardware emulation concurrent?

 These are issues that qemu-kvm faces today since it executes vcpu
 threads in parallel.  Both qemu-kvm and the COREMU patches could
 benefit from a solution for concurrent hardware access.
>>
>> In our implementation for the x86_64 target, all devices except the LAPIC are
>> emulated in a separate thread. VCPUs are emulated in other threads
>> (one thread per VCPU).
>> By observing some device drivers in Linux, we have a hypothesis that
>> drivers in the OS have already ensured correct synchronization of
>> concurrent hardware accesses.
>
> This hypothesis is too optimistic.  If hardware emulation code assumes
> it is only executed in a single-threaded fashion, but guests can
> execute it in parallel, then this opens up the possibility of race
> conditions that malicious guests can exploit.  There needs to be
> isolation: a guest should not be able to cause QEMU to crash.
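The "lightweight memory transaction" mentioned in the quoted summary — emulating a guest atomic read-modify-write using only a host compare-and-swap — can be sketched like this (illustrative code, not COREMU's actual implementation):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Emulate a guest atomic fetch-and-add (e.g. x86 "lock xadd") with a
 * host compare-and-swap retry loop: read the old value, compute the
 * new one privately, and commit only if memory is still unchanged. */
static uint32_t emu_atomic_fetch_add(_Atomic uint32_t *guest_mem,
                                     uint32_t val)
{
    uint32_t old = atomic_load_explicit(guest_mem, memory_order_relaxed);
    uint32_t new_val;

    do {
        new_val = old + val;        /* the "transaction" body */
        /* On failure, atomic_compare_exchange_weak reloads "old". */
    } while (!atomic_compare_exchange_weak_explicit(
                 guest_mem, &old, new_val,
                 memory_order_seq_cst, memory_order_relaxed));

    return old;                     /* value before the add */
}
```

The same retry-loop shape covers any guest atomic whose result is a pure function of the old value, which is why a single compare-and-swap primitive on the host suffices.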

Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator

2010-07-22 Thread Stefan Hajnoczi
2010/7/22 wang Tiger :
> On July 22, 2010 at 9:00 PM, Jan Kiszka wrote:
>> Stefan Hajnoczi wrote:
>>> On Thu, Jul 22, 2010 at 9:48 AM, Chen Yufei  wrote:
 On 2010-7-22, at 1:04 AM, Stefan Weil wrote:

> On 21.07.2010 09:03, Chen Yufei wrote:
>> On 2010-7-21, at 5:43 AM, Blue Swirl wrote:
>>
>>
>>> On Sat, Jul 17, 2010 at 10:27 AM, Chen Yufei  wrote:
>>>
 We are pleased to announce COREMU, which is a "multicore-on-multicore" 
 full-system emulator built on Qemu. (Simply speaking, we made Qemu 
 parallel.)

 The project web page is located at:
 http://ppi.fudan.edu.cn/coremu

 You can also download the source code, images for playing on 
 sourceforge
 http://sf.net/p/coremu

 COREMU is composed of
 1. a parallel emulation library
 2. a set of patches to qemu
 (We worked on the master branch, commit 
 54d7cf136f040713095cbc064f62d753bff6f9d2)

 It currently supports full-system emulation of x64 and ARM MPcore 
 platforms.

 By leveraging the underlying multicore resources, it can emulate up to 
 255 cores running commodity operating systems (even on a 4-core 
 machine).

 Enjoy,

>>> Nice work. Do you plan to submit the improvements back to upstream QEMU?
>>>
>> It would be great if we could submit our code to QEMU, but we do not know 
>> the process.
>> Would you please give us some instructions?
>>
>> --
>> Best regards,
>> Chen Yufei
>>
> Some hints can be found here:
> http://wiki.qemu.org/Contribute/StartHere
>
> Kind regards,
> Stefan Weil
 The patch is in the attachment, produced with command
 git diff 54d7cf136f040713095cbc064f62d753bff6f9d2

 In order to separate what needs to be done to make QEMU parallel, we 
 created a separate library, and the patched QEMU needs to be compiled and 
 linked with that library. To submit our enhancement to QEMU, maybe we need 
 to incorporate this library into QEMU. I don't know what would be the best 
 solution.

 Our approach to make QEMU parallel can be found at 
 http://ppi.fudan.edu.cn/coremu

 I will give a short summary here:

 1. Each emulated core thread runs a separate binary translator engine and 
 has a private code cache. We marked some variables in TCG as thread-local. 
 We also modified the TB invalidation mechanism.

 2. Each core has a queue holding pending interrupts. The COREMU library 
 provides this queue, and interrupt notification is done by sending 
 realtime signals to the emulated core thread.

 3. Atomic instruction emulation has to be modified for parallel emulation. 
 We use a lightweight memory transaction, which requires only a 
 compare-and-swap instruction, to emulate atomic instructions.

 4. Some code in the original QEMU may cause data race bugs after we make it 
 parallel. We fixed these problems.




 --
 Best regards,
 Chen Yufei
>>>
>>> Looking at the patch, it seems there is a global lock for hardware
>>> access via coremu_spin_lock(&cm_hw_lock).  How many cores have you
>>> tried running, and do you have lock contention data for cm_hw_lock?
>
> The global lock for hardware access is only for the ARM target in our
> implementation, mainly because we are not quite familiar
> with ARM. 4 ARM cores (the Cortex A9 limit) can be emulated in this
> way.
> For the x86_64 target, we have already made hardware emulation
> support concurrent access. We can emulate 255 cores on a quad-core
> machine.
>
>>
>> BTW, this kind of lock is called qemu_global_mutex in QEMU; there it is a
>> sleeping lock, which is likely better for the code paths it protects in
>> upstream. Are they shorter in COREMU?
>>
>>> Have you thought about making hardware emulation concurrent?
>>>
>>> These are issues that qemu-kvm faces today since it executes vcpu
>>> threads in parallel.  Both qemu-kvm and the COREMU patches could
>>> benefit from a solution for concurrent hardware access.
>
> In our implementation for the x86_64 target, all devices except the LAPIC are
> emulated in a separate thread. VCPUs are emulated in other threads
> (one thread per VCPU).
> By observing some device drivers in Linux, we have a hypothesis that
> drivers in the OS have already ensured correct synchronization of
> concurrent hardware accesses.

This hypothesis is too optimistic.  If hardware emulation code assumes
it is only executed in a single-threaded fashion, but guests can
execute it in parallel, then this opens up the possibility of race
conditions that malicious guests can exploit.  There needs to be
isolation: a guest should not be able to cause QEMU to crash.

If you have one hardware thread that handles all device emulation and
vcpu threads do no hardware emulation, t

Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator

2010-07-22 Thread wang Tiger
On July 22, 2010 at 9:00 PM, Jan Kiszka wrote:
> Stefan Hajnoczi wrote:
>> On Thu, Jul 22, 2010 at 9:48 AM, Chen Yufei  wrote:
>>> On 2010-7-22, at 1:04 AM, Stefan Weil wrote:
>>>
 On 21.07.2010 09:03, Chen Yufei wrote:
> On 2010-7-21, at 5:43 AM, Blue Swirl wrote:
>
>
>> On Sat, Jul 17, 2010 at 10:27 AM, Chen Yufei  wrote:
>>
>>> We are pleased to announce COREMU, which is a "multicore-on-multicore" 
>>> full-system emulator built on Qemu. (Simply speaking, we made Qemu 
>>> parallel.)
>>>
>>> The project web page is located at:
>>> http://ppi.fudan.edu.cn/coremu
>>>
>>> You can also download the source code, images for playing on sourceforge
>>> http://sf.net/p/coremu
>>>
>>> COREMU is composed of
>>> 1. a parallel emulation library
>>> 2. a set of patches to qemu
>>> (We worked on the master branch, commit 
>>> 54d7cf136f040713095cbc064f62d753bff6f9d2)
>>>
>>> It currently supports full-system emulation of x64 and ARM MPcore 
>>> platforms.
>>>
>>> By leveraging the underlying multicore resources, it can emulate up to 
>>> 255 cores running commodity operating systems (even on a 4-core 
>>> machine).
>>>
>>> Enjoy,
>>>
>> Nice work. Do you plan to submit the improvements back to upstream QEMU?
>>
> It would be great if we could submit our code to QEMU, but we do not
> know the process.
> Would you please give us some instructions?
>
> --
> Best regards,
> Chen Yufei
>
 Some hints can be found here:
 http://wiki.qemu.org/Contribute/StartHere

 Kind regards,
 Stefan Weil
>>> The patch is in the attachment, produced with command
>>> git diff 54d7cf136f040713095cbc064f62d753bff6f9d2
>>>
>>> In order to separate what needs to be done to make QEMU parallel, we created 
>>> a separate library, and the patched QEMU needs to be compiled and linked 
>>> with that library. To submit our enhancement to QEMU, maybe we need to 
>>> incorporate this library into QEMU. I don't know what would be the best 
>>> solution.
>>>
>>> Our approach to make QEMU parallel can be found at 
>>> http://ppi.fudan.edu.cn/coremu
>>>
>>> I will give a short summary here:
>>>
>>> 1. Each emulated core thread runs a separate binary translator engine and 
>>> has a private code cache. We marked some variables in TCG as thread local. 
>>> We also modified the TB invalidation mechanism.
>>>
>>> 2. Each core has a queue holding pending interrupts. The COREMU library 
>>> provides this queue, and interrupt notification is done by sending realtime 
>>> signals to the emulated core thread.
>>>
>>> 3. Atomic instruction emulation has to be modified for parallel emulation. 
>>> We use lightweight memory transactions, which require only the 
>>> compare-and-swap instruction, to emulate atomic instructions.
>>>
>>> 4. Some code in the original QEMU may cause data race bugs after we make it 
>>> parallel. We fixed these problems.
>>>
>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Chen Yufei
>>
>> Looking at the patch it seems there is a global lock for hardware
>> access via coremu_spin_lock(&cm_hw_lock).  How many cores have you
>> tried running and do you have lock contention data for cm_hw_lock?

The global lock for hardware access is only used for the ARM target in
our implementation, mainly because we are not very familiar with ARM.
Up to 4 ARM cores (the Cortex-A9 limit) can be emulated this way.
For the x86_64 target, hardware emulation can already be accessed
concurrently, and we can emulate 255 cores on a quad-core machine.

>
> BTW, this kind of lock is called qemu_global_mutex in QEMU; there it is
> a sleeping lock, which is likely better for the code paths it protects
> upstream. Are they shorter in COREMU?
>
>> Have you thought about making hardware emulation concurrent?
>>
>> These are issues that qemu-kvm faces today since it executes vcpu
>> threads in parallel.  Both qemu-kvm and the COREMU patches could
>> benefit from a solution for concurrent hardware access.

In our implementation for the x86_64 target, all devices except the
LAPIC are emulated in a separate thread. VCPUs are emulated in other
threads (one thread per VCPU).
By observing some device drivers in Linux, we have a hypothesis that
drivers in the OS have already ensured correct synchronization on
concurrent hardware accesses.

For example, when emulating IDE with bus master DMA:
1. Two VCPUs will not send disk read/write requests at the same time.
2. A new DMA request will not be sent until the previous one has completed.
These two points guarantee that the emulated IDE with DMA can be
accessed concurrently by either the VCPU threads or the hardware thread
with no additional locks.

The only work we need to do is to fix some misbehaving emulated devices
in the current QEMU.
For example, in the function ide_write_dma_cb of QEMU:

if (s->nsector == 0) {
    s->status = READY_STAT | SEEK_STAT;
    ide_set_irq(s->bus);
    /* In parallel emulation, OS ma

[Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator

2010-07-22 Thread Stefan Hajnoczi
2010/7/22 Jan Kiszka :
> Stefan Hajnoczi wrote:
>> [...]
>>
>> Looking at the patch it seems there is a global lock for hardware
>> access via coremu_spin_lock(&cm_hw_lock).  How many cores have you
>> tried running and do you have lock contention data for cm_hw_lock?
>
> BTW, this kind of lock is called qemu_global_mutex in QEMU; there it is
> a sleeping lock, which is likely better for the code paths it protects
> upstream. Are they shorter in COREMU?
>
>> Have you thought about making hardware emulation concurrent?
>>
>> These are issues that qemu-kvm faces today since it executes vcpu
>> threads in parallel.  Both qemu-kvm and the COREMU patches could
>> benefit from a solution for concurrent hardware access.
>
> While we are all looking forward to seeing more scalable hardware models
> :), I think it is a topic that can be addressed largely independently of
> parallelizing TCG VCPUs. The latter can benefit from the former, for
> sure, but it first of all has to solve its own issues.

Right, but it's worth discussing with people who have worked on
parallel vcpus from a different angle.

> Note that --enable-io-thread provides truly parallel KVM VCPUs also in
> upstream these days. Just for TCG, we still need the slightly suboptimal
> CPU scheduling inside the single-threaded tcg_cpu_exec (which was
> renamed to cpu_exec_all today).
>
> Jan
>
> --
> Siemens AG, Corporate Technology, CT T DE IT 1
> Corporate Competence Center Embedded Linux
>



[Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator

2010-07-22 Thread Jan Kiszka
Stefan Hajnoczi wrote:
> [...]
> 
> Looking at the patch it seems there is a global lock for hardware
> access via coremu_spin_lock(&cm_hw_lock).  How many cores have you
> tried running and do you have lock contention data for cm_hw_lock?

BTW, this kind of lock is called qemu_global_mutex in QEMU; there it is
a sleeping lock, which is likely better for the code paths it protects
upstream. Are they shorter in COREMU?

> Have you thought about making hardware emulation concurrent?
> 
> These are issues that qemu-kvm faces today since it executes vcpu
> threads in parallel.  Both qemu-kvm and the COREMU patches could
> benefit from a solution for concurrent hardware access.

While we are all looking forward to seeing more scalable hardware models
:), I think it is a topic that can be addressed largely independently of
parallelizing TCG VCPUs. The latter can benefit from the former, for
sure, but it first of all has to solve its own issues.

Note that --enable-io-thread provides truly parallel KVM VCPUs also in
upstream these days. Just for TCG, we still need the slightly suboptimal
CPU scheduling inside the single-threaded tcg_cpu_exec (which was
renamed to cpu_exec_all today).

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux



[Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator

2010-07-22 Thread Jan Kiszka
Chen Yufei wrote:
> On 2010-7-22, at 上午1:04, Stefan Weil wrote:
> 
>> Am 21.07.2010 09:03, schrieb Chen Yufei:
>>> On 2010-7-21, at 上午5:43, Blue Swirl wrote:
>>>
>>>   
 On Sat, Jul 17, 2010 at 10:27 AM, Chen Yufei  wrote:
 
> We are pleased to announce COREMU, which is a "multicore-on-multicore" 
> full-system emulator built on Qemu. (Simply speaking, we made Qemu 
> parallel.)
>
> The project web page is located at:
> http://ppi.fudan.edu.cn/coremu
>
> You can also download the source code and images to play with on SourceForge:
> http://sf.net/p/coremu
>
> COREMU is composed of
> 1. a parallel emulation library
> 2. a set of patches to qemu
> (We worked on the master branch, commit 
> 54d7cf136f040713095cbc064f62d753bff6f9d2)
>
> It currently supports full-system emulation of x64 and ARM MPcore 
> platforms.
>
> By leveraging the underlying multicore resources, it can emulate up to 
> 255 cores running commodity operating systems (even on a 4-core machine).
>
> Enjoy,
>   
 Nice work. Do you plan to submit the improvements back to upstream QEMU?
 
>>> It would be great if we could submit our code to QEMU, but we do not
>>> know the process.
>>> Would you please give us some instructions?
>>>
>>> --
>>> Best regards,
>>> Chen Yufei
>>>   
>> Some hints can be found here:
>> http://wiki.qemu.org/Contribute/StartHere
>>
>> Kind regards,
>> Stefan Weil
> 
> The patch is in the attachment, produced with command
> git diff 54d7cf136f040713095cbc064f62d753bff6f9d2
> 
> In order to separate what needs to be done to make QEMU parallel, we created a 
> separate library, and the patched QEMU needs to be compiled and linked with 
> that library. To submit our enhancement to QEMU, maybe we need to incorporate 
> this library into QEMU. I don't know what would be the best solution.

For upstream QEMU, the goal should be to integrate your modifications
and enhancements into the existing architecture in a mostly seamless
way. The library approach may help with maintaining your changes out of
tree, but it likely cannot contribute any benefit to an in-tree
extension of QEMU for parallel TCG VCPUs.

> 
> Our approach to make QEMU parallel can be found at 
> http://ppi.fudan.edu.cn/coremu
> 
> I will give a short summary here:
> 
> 1. Each emulated core thread runs a separate binary translator engine and has 
> a private code cache. We marked some variables in TCG as thread local. We also 
> modified the TB invalidation mechanism.
> 
> 2. Each core has a queue holding pending interrupts. The COREMU library 
> provides this queue, and interrupt notification is done by sending realtime 
> signals to the emulated core thread.
> 
> 3. Atomic instruction emulation has to be modified for parallel emulation. We 
> use lightweight memory transactions, which require only the compare-and-swap 
> instruction, to emulate atomic instructions.
> 
> 4. Some code in the original QEMU may cause data race bugs after we make it 
> parallel. We fixed these problems.
> 

Upstream integration requires such iterative steps as well - in the form
of ideally small, focused patches that finally convert QEMU into a
parallel emulator.

Also note that upstream already supports threaded VCPUs - in KVM mode.
You have obviously resolved the major blocking points to apply this to
TCG mode as well. But I don't yet see why we would need a new VCPU
threading infrastructure for this. Rather, only small tuning of what KVM
already uses should suffice - if that's required at all.

To give it a start, you could identify some of the more trivial changes
in your patches, split them out and rebase them onto the latest
qemu.git, then post them as a patch series for inclusion (see the
mailing list for various examples). Make sure to describe the reasons
for your changes as clearly as possible, specifically if they are not
(yet) obvious in the absence of COREMU features in upstream QEMU.

Be prepared for merging your code to be a lengthy process with quite a
few discussions about why and how things are done, likely also with
requests to change your current solution in some aspects. However, the
result should be an optimal solution for the overall goal, parallel VCPU
emulation - and no longer any need to maintain your private set of
patches against the quickly evolving QEMU.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux