Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Chris Wright
* Anthony Liguori ([EMAIL PROTECTED]) wrote:
> Logically speaking, virtio is a bus.  virtio supports all of the features 
> of a bus (discover, hot add, hot remove).
>
> Right now, we map virtio devices directly onto the PCI bus.
>
> The problem we're trying to address is limitations of the PCI bus.  We have 
> a couple options:

First question is if we have a real limitation with multiple busses?

> 1) add a virtio device that supports multiple disks.  we need to reinvent 
> hotplug within this device.
>
> 2) add a new PCI virtio transport that supports multiple virtio-blk devices 
> within a single PCI slot
>
> 3) add a generic PCI virtio transport that supports multiple virtio devices 
> within a single PCI slot

compare and contrast above with HBA and disks (makes most sense from my
point of view).  for 2 and 3, only difference is whether you want to be
able to support nics, balloons, and block devices on same pci slot (at
which point it's a bridge, how is it different from 4?)

> 4) add a generic virtio "bridge" that supports multiple virtio devices 
> within a single virtio device.
>
> #4 may seem strange, but it's no different from a PCI-to-PCI bridge.
>
> I like #4 the most, but #2 is probably the most practical.

Also, your current patch does not work for hotplug disk.

thanks,
-chris

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Avi Kivity
Anthony Liguori wrote:
> BTW, I've never been that convinced that hotplugging devices is as 
> useful as people make it out to be.  I also think that's particularly 
> true when it comes to hot adding/removing very large numbers of disks.
>
>   

On the contrary, the more disks you have, the more likely one is to 
fail, so you'd need to hotreplace it (think a setup with redundancy like 
zfs).

-- 
Any sufficiently difficult bug is indistinguishable from a feature.




Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Avi Kivity
Ian Kirk wrote:
> Avi Kivity wrote:
>
>   
>> For mass storage, we should follow the SCSI model with a single device
>> serving multiple disks, similar to what you suggest.  Not sure if the
>> device should have a single queue or one queue per disk.
>> 
>
> Don't you just end up re-implementing SCSI then, at which point you might
> as well stick with a 'fake' SCSI device in the guest?
>   

A virtio-scsi controller is indeed useful as it can control tapes, media 
changers, and other fancy stuff in addition to ordinary disks.  For 
disks, I'd like to avoid the overhead of scsi command generation and 
parsing.

-- 
Any sufficiently difficult bug is indistinguishable from a feature.




Re: [kvm-devel] WARNING: at /usr/src/modules/kvm/mmu.c:390 account_shadowed()

2008-04-22 Thread Avi Kivity
Thomas Cataldo wrote:
> On Mon, Apr 21, 2008 at 9:57 PM, Thomas Cataldo
> <[EMAIL PROTECTED]> wrote:
>   
>> Hi,
>>
>>  I am running kvm-66 on top of a debian sid host with 2.6.24 (intel 32bit 
>> host).
>>
>>  Got the following in my logs today :
>>
>>  Apr 21 17:55:01 buffy kernel: WARNING: at
>>  /usr/src/modules/kvm/mmu.c:390 account_shadowed()
>>  Apr 21 17:55:01 buffy kernel: Pid: 21416, comm: kvm Tainted: P
>>  2.6.24-1-686 #1
>>  Apr 21 17:55:01 buffy kernel:  [] kvm_mmu_get_page+0x42d/0x447 
>> [kvm]
>>  Apr 21 17:55:01 buffy kernel:  [] kvm_mmu_load+0xdf/0x15c [kvm]
>>  Apr 21 17:55:01 buffy kernel:  []
>>  vmx_queue_exception+0x0/0x33 [kvm_intel]
>>  Apr 21 17:55:01 buffy kernel:  []
>>  kvm_arch_vcpu_ioctl_run+0x233/0x5a9 [kvm]
>>  Apr 21 17:55:01 buffy kernel:  [] kvm_vcpu_ioctl+0xe4/0x34c [kvm]
>>  Apr 21 17:55:01 buffy kernel:  [] delayacct_end+0x70/0x77
>>  Apr 21 17:55:01 buffy kernel:  [] sync_page+0x0/0x3b
>>  Apr 21 17:55:01 buffy kernel:  [] __delayacct_blkio_end+0x5b/0x5f
>>  Apr 21 17:55:01 buffy kernel:  [] io_schedule+0x64/0x80
>>  Apr 21 17:55:01 buffy kernel:  [] enqueue_entity+0x2b/0x3d
>>  Apr 21 17:55:01 buffy kernel:  [] apic_wait_icr_idle+0xe/0x15
>>  Apr 21 17:55:01 buffy kernel:  [] enqueue_task_fair+0x16/0x24
>>  Apr 21 17:55:01 buffy kernel:  [] enqueue_task+0x52/0x5d
>>  Apr 21 17:55:01 buffy kernel:  [] resched_task+0x52/0x54
>>  Apr 21 17:55:01 buffy kernel:  [] try_to_wake_up+0x2b8/0x2c2
>>  Apr 21 17:55:01 buffy kernel:  [] __wake_up_common+0x32/0x5c
>>  Apr 21 17:55:01 buffy kernel:  [] __wake_up+0x32/0x42
>>  Apr 21 17:55:01 buffy kernel:  [] wake_futex+0x3b/0x45
>>  Apr 21 17:55:01 buffy kernel:  [] futex_wake+0x81/0xb0
>>  Apr 21 17:55:01 buffy kernel:  [] do_futex+0x77/0x983
>>  Apr 21 17:55:01 buffy kernel:  [] update_curr+0x62/0xef
>>  Apr 21 17:55:01 buffy kernel:  [] __switch_to+0x9d/0x11d
>>  Apr 21 17:55:01 buffy kernel:  [] kvm_vcpu_ioctl+0x0/0x34c [kvm]
>>  Apr 21 17:55:01 buffy kernel:  [] do_ioctl+0x1f/0x62
>>  Apr 21 17:55:01 buffy kernel:  [] vfs_ioctl+0x237/0x249
>>  Apr 21 17:55:01 buffy kernel:  [] sys_ioctl+0x45/0x5d
>>  Apr 21 17:55:01 buffy kernel:  [] sysenter_past_esp+0x6b/0xa1
>>
>>
>>  Regards,
>>  Thomas.
>>
>> 
>
> as I got no reply, I guess it is a bad setup on my part. If that might
> help, this happened while I was doing a "make -j" on webkit svn tree
> (ie. heavy c++ compilation workload) .
>
>   

No, this is not a bad setup.  No amount of bad setup should give this warning.

You didn't get a reply because no one knows what to make of it, and 
because it's much more fun to debate endianness or contemplate guests 
with eighty thousand disks than to fix those impossible bugs.  If you 
can give clear instructions on how to reproduce this, we will try it 
out.  Please be sure to state OS name and versions for the guest as well 
as the host.

-- 
Any sufficiently difficult bug is indistinguishable from a feature.




Re: [kvm-devel] Stupid Newbee Questions...

2008-04-22 Thread Avi Kivity
Stuart Sheldon wrote:
> 2) When I started writing the management scripts to start and stop the
> guests from the command line, I was using KVM-63 which allowed me to
> send a "system_powerdown" to the console, this would send a PWR to the
> guest's acpid that would bring the guest down gracefully. This stopped
> working in kvm-64 and is still not working in kvm-66. Is there a better
> way to do this? How can the host bring down a guest without messing with
> its open programs?
>   

This is not a question, it's a bug report.  It's a clear regression that 
needs to be fixed, not worked around.

Please send timely reports of such issues when you encounter them.

-- 
Any sufficiently difficult bug is indistinguishable from a feature.




Re: [kvm-devel] pv clock: kvm is incompatible with xen :-(

2008-04-22 Thread Gerd Hoffmann
Glauber Costa wrote:
> Gerd Hoffmann wrote:
>> Jeremy Fitzhardinge wrote:
>>> Xen could change the parameters in the instant after
>>> get_time_values(). That change could be as a result of
>>> suspend-resume, so the parameters
>>> and the tsc could be wildly different.
>>
>> Ah, ok, forgot the rdtsc in the picture.  With that in mind I fully
>> agree that the loop is needed.  I think kvm guests can even hit that one
>> with the vcpu migrating to a different physical cpu, so we better handle
>> it correctly ;)
> 
> It's probably not needed for kvm, since we update everything every time
> we get scheduled in the host side, which would cover the case for
> migration between physical cpus. 

No, it wouldn't.  The corner case we must catch is: guest reads time
info, kvm reschedules the guest to another pcpu, guest reads the tsc.
The time info used by the guest for the tsc delta is stale then, it
belongs to the previous pcpu.
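The corner case above can be sketched as a retry loop. This is only an illustration under stated assumptions, not the kvm or Xen code: the `pv_time_info` layout, the integer `ns_per_tick` scaling, and every function name below are hypothetical. The idea is that the guest snapshots the time info, reads the tsc, and then re-checks a version counter; if the host changed the parameters in between (for example after moving the vcpu to another pcpu), the snapshot is discarded and the read is repeated.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical time-info record shared by host and guest. */
struct pv_time_info {
    volatile uint32_t version;  /* bumped by the host on every update; odd = update in progress */
    uint64_t tsc_timestamp;     /* tsc value at the moment system_time_ns was sampled */
    uint64_t system_time_ns;    /* system time at tsc_timestamp */
    uint32_t ns_per_tick;       /* simplified integer scale factor (assumption) */
};

/* The tsc source is injectable so the race can be simulated without real hardware. */
typedef uint64_t (*read_tsc_fn)(void *ctx);

uint64_t pv_clock_read_ns(const struct pv_time_info *ti, read_tsc_fn read_tsc, void *ctx)
{
    uint32_t before, after, scale;
    uint64_t base_tsc, base_ns, tsc;

    do {
        before = ti->version;
        atomic_thread_fence(memory_order_acquire);
        base_tsc = ti->tsc_timestamp;
        base_ns  = ti->system_time_ns;
        scale    = ti->ns_per_tick;
        tsc      = read_tsc(ctx);   /* must lie inside the version-checked window */
        atomic_thread_fence(memory_order_acquire);
        after = ti->version;
    } while ((before & 1u) || before != after);  /* stale or torn snapshot: retry */

    return base_ns + (tsc - base_tsc) * scale;
}

/* Simulation of the race: the "host" rewrites the record during the first tsc read. */
struct fake_host {
    struct pv_time_info *ti;
    int calls;
};

static uint64_t fake_tsc(void *ctx)
{
    struct fake_host *h = ctx;

    if (h->calls++ == 0) {
        /* vcpu rescheduled onto another pcpu: new parameters, new tsc domain */
        h->ti->version += 2;
        h->ti->tsc_timestamp = 5000;
        h->ti->system_time_ns = 2000000;
        return 5010;
    }
    return 5020;
}

uint64_t demo_migration_race(void)
{
    struct pv_time_info ti = {
        .version = 2, .tsc_timestamp = 100,
        .system_time_ns = 1000000, .ns_per_tick = 1,
    };
    struct fake_host h = { &ti, 0 };

    return pv_clock_read_ns(&ti, fake_tsc, &h);
}
```

In this simulation the first pass pairs the old parameters with a tsc from the new pcpu, which would yield 1004910 ns; the version check rejects that pass, and the second pass returns 2000020 ns from a consistent snapshot.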

cheers,
  Gerd

-- 
http://kraxel.fedorapeople.org/xenner/



Re: [kvm-devel] Stupid Newbee Questions...

2008-04-22 Thread Amit Shah
* On Wednesday 23 Apr 2008 05:20:03 Stuart Sheldon wrote:
> I've looked around but can't seem to find these answers.
>
> I'm using KVM to run multiple servers on the same hardware, but it seems
> that most of the documentation written is for desktop use.
>
> I'm currently running KVM-66 on a 2.6.24.4 kernel and using the KVM
> provided modules. This is sitting on a Debian Lenny install with Intel
> hardware.
>
> Here are my questions:
>
> 1) Is there a way to give one virtual host priority over another? Would
> I just add -cpu 2 on the host I want to give priority?

Renicing the guest you want to give more priority to is a straightforward way 
of boosting the priority. It's easy to imagine that, since each guest is just 
a process on the host.
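Because each guest is an ordinary host process, a management tool could adjust its priority with the standard POSIX `setpriority()` call instead of shelling out to `renice`. The sketch below is a minimal example; the helper name `set_guest_nice` is made up for illustration. Note that lowering the nice value (raising priority) normally needs privileges, and that on Linux the nice value is a per-thread attribute, so a multi-threaded guest process may need each thread adjusted.

```c
#include <errno.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Set the nice value of the process that runs a guest.
 * Returns 0 on success or a negative errno value on failure.
 * (Hypothetical helper, not part of kvm-userspace.)
 */
int set_guest_nice(pid_t guest_pid, int nice_value)
{
    if (setpriority(PRIO_PROCESS, (id_t)guest_pid, nice_value) != 0)
        return -errno;
    return 0;
}
```

For example, `set_guest_nice(pid, 10)` would deprioritize a less important guest without requiring root, which has the same relative effect as boosting the other guests.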

> 2) When I started writing the management scripts to start and stop the
> guests from the command line, I was using KVM-63 which allowed me to
> send a "system_powerdown" to the console, this would send a PWR to the
> guest's acpid that would bring the guest down gracefully. This stopped
> working in kvm-64 and is still not working in kvm-66. Is there a better
> way to do this? How can the host bring down a guest without messing with
> its open programs?

I've no clue about this one; will let someone else answer it.

> 3) Are there any statistics available to the host OS that can be
> monitored, such as something in /sys or /proc?

Can you mention what kinds of stats you're looking for? Since each guest is a 
normal process, a lot of information can already be examined with ps, top 
and /proc/<pid>/...

There is also kvm_stat, which comes with kvm-userspace and lets you 
monitor guest exits into the host. Those are mainly for debugging purposes 
though.
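As one example of what /proc already exposes, the CPU time consumed by a guest process can be read from `/proc/<pid>/stat`. The sketch below is a minimal, Linux-only illustration; the function names are hypothetical. It relies on the documented field order of that file, where fields 14 and 15 are `utime` and `stime` in clock ticks.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read user and system CPU time (in clock ticks) of a process from
 * /proc/<pid>/stat. Returns 0 on success, -1 on failure.
 */
int read_cpu_ticks(pid_t pid, unsigned long *utime, unsigned long *stime)
{
    char path[64];
    char buf[1024];
    size_t n;
    char *p;
    FILE *f;

    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
    f = fopen(path, "r");
    if (!f)
        return -1;
    n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';

    /* Field 2 (comm) may contain spaces or ')', so resume after the last ')'. */
    p = strrchr(buf, ')');
    if (!p)
        return -1;

    /* Skip state (3), five integers (4-8) and five unsigned counters (9-13). */
    if (sscanf(p + 1, " %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
               utime, stime) != 2)
        return -1;
    return 0;
}

/* Convenience wrapper for the calling process, handy for a quick check. */
int read_own_cpu_ticks(unsigned long *utime, unsigned long *stime)
{
    return read_cpu_ticks(getpid(), utime, stime);
}
```

Dividing the tick counts by `sysconf(_SC_CLK_TCK)` converts them to seconds; sampling twice and taking the difference gives a per-guest CPU usage figure.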



Re: [kvm-devel] Bug 1895893 inquiry (KVM-60+ halts, when using SCSI)

2008-04-22 Thread Avi Kivity
Marcelo Tosatti wrote:
> On Tue, Apr 22, 2008 at 12:39:57PM -0600, Alberto Treviño wrote:
>   
>> Thanks for all those who work on KVM.  It is a wonderful product and I 
>> have been very impressed with its features, performance, and the level 
>> of activity in this project.
>>
>> Back in February a bug was filed.  I've been hit by this bug as well, 
>> but there hasn't been much activity with it in the last little bit.  I 
>> wanted to know if anyone had a fix for it, or a workaround (other than 
>> using IDE), or whether it was on someone's radar.  Here is a link to 
>> the bug:
>>
>> http://sourceforge.net/tracker/index.php?func=detail&aid=1895893&group_id=180599&atid=893831
>> 
>
> http://article.gmane.org/gmane.comp.emulators.qemu/24192
>
> BTW, Avi, this patch should be included in kvm-userspace.
>   

I've tried this out about a week ago and didn't get very good results.

-- 
Any sufficiently difficult bug is indistinguishable from a feature.




[kvm-devel] Inventory classification

2008-04-22 Thread Supplier evaluation
  An informational course on inventory logistics will be held in
  Saint Petersburg from May 19 to 21.

 --
 Inventory management in a modern company
 --

  May 19 - 21, 2008

  Saint Petersburg

  Why the topic is relevant:
  -
  ∙ Structure existing experience and knowledge of modern logistics
in order to optimize inventory management at the enterprise;
  ∙ Learn a method for working with a business-process diagram;
  ∙ Obtain tools for carrying out work operations at specific
stages of procurement;
  ∙ Become familiar with methods for optimizing the procurement/storage process.

  Topics included in the program:
  

  The logistics system in a company:
  ∙ Functions of the logistics department.
  ∙ Main building blocks of logistics.
  ∙ The main result of the company's logistics work: cost optimization.

  Optimal choice of business partners, minimum costs:
  ∙ Technology for selecting and evaluating a supplier.
  ∙ Defining criteria for supplier selection.
  ∙ The supplier selection process. Ranking suppliers.
  ∙ Methodology for evaluating supplier performance.
  ∙ Controlling warehouse stock by turnover indicators.

  Inventory management issues:
  ∙ Purposes of holding inventory and reasons it grows.
  ∙ Risks of creating and maintaining inventory.
  ∙ Opportunities for reducing inventory levels.
 
  Assortment analysis and differentiation: 
  ∙ Dead stock. Consequences of shortages. Interchangeable goods.
  ∙ ABC analysis in inventory management. 
  ∙ Using the XYZ method.
  ∙ The ABC-XYZ matrix and its use.

  Demand forecasting in inventory management: 
  ∙ Statistical forecasting methods. 
  ∙ Accounting for seasonal fluctuations. 
  ∙ Expert assessment.
 
  Reporting in inventory management: 
  ∙ Indicators for monitoring and analysing inventory management activity. 
  ∙ Calculating the costs associated with inventory management. 
  ∙ Ways to reduce costs. 
  ∙ Designing reports and how often to produce them.
 
  Supply parameters:
  ∙ Types of inventory. Inventory classification.
  ∙ The inventory logistics cycle.
  ∙ Determining the optimal order size.
  ∙ The need for safety stock.
 
  Inventory management systems. 
  ∙ Fixed order quantity model. 
  ∙ Fixed interval between orders model. 
  ∙ Inventory management model with set periodicity. 
  ∙ "Min-max" inventory management model. 
  ∙ JIT - Just in time. 
  ∙ Managing multi-item inventories.
 
  Logistics forecasting: fundamentals, methods, units for measuring inventory at
  the various stages of the logistics process.
 
  Logistics planning methods:
  ∙ Scheduling.
  ∙ Inventory valuation.
  ∙ Methodology for calculating optimal inventory. 

Detailed program, participation terms and
additional information: (812) 98-35-439







Re: [kvm-devel] Bug 1895893 inquiry (KVM-60+ halts, when using SCSI)

2008-04-22 Thread Marcelo Tosatti
On Tue, Apr 22, 2008 at 12:39:57PM -0600, Alberto Treviño wrote:
> Thanks for all those who work on KVM.  It is a wonderful product and I 
> have been very impressed with its features, performance, and the level 
> of activity in this project.
> 
> Back in February a bug was filed.  I've been hit by this bug as well, 
> but there hasn't been much activity with it in the last little bit.  I 
> wanted to know if anyone had a fix for it, or a workaround (other than 
> using IDE), or whether it was on someone's radar.  Here is a link to 
> the bug:
> 
> http://sourceforge.net/tracker/index.php?func=detail&aid=1895893&group_id=180599&atid=893831

http://article.gmane.org/gmane.comp.emulators.qemu/24192

BTW, Avi, this patch should be included in kvm-userspace.



Re: [kvm-devel] [PATCH 00 of 12] mmu notifier #v13

2008-04-22 Thread Jack Steiner
On Tue, Apr 22, 2008 at 03:51:16PM +0200, Andrea Arcangeli wrote:
> Hello,
> 
> This is the latest and greatest version of the mmu notifier patch #v13.
> 

FWIW, I have updated the GRU driver to use this patch (plus the fixups).
No problems. AFAICT, everything works.


--- jack



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Jack Steiner
On Tue, Apr 22, 2008 at 06:07:27PM -0500, Robin Holt wrote:
> > The only other change I did has been to move mmu_notifier_unregister
> > at the end of the patchset after getting more questions about its
> > reliability and I documented a bit the rmmod requirements for
> > ->release. we'll think later if it makes sense to add it, nobody's
> > using it anyway.
> 
> XPMEM is using it.  GRU will be as well (probably already does).

Yeppp.

The GRU driver unregisters the notifier when all GRU mappings
are unmapped. I could make it work either way - either with or without
an unregister function. However, unregister is the most logical
action to take when all mappings have been destroyed.
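The pattern described above, unregistering once the last mapping disappears, can be sketched with a simple mapping count. This is a hedged illustration only: the structure and function names are hypothetical, registering on the first mapping is an assumption, and the register/unregister stand-ins merely flip a flag rather than calling the real mmu notifier API. A real driver would also need locking around the counter.

```c
#include <stdbool.h>

/* Hypothetical per-address-space state; not the real GRU driver structures. */
struct gru_mm_state {
    int  mappings;    /* number of live GRU mappings in this mm */
    bool registered;  /* is the notifier currently registered? */
};

/* Stand-ins for the mmu notifier register/unregister calls. */
static void notifier_register(struct gru_mm_state *s)   { s->registered = true; }
static void notifier_unregister(struct gru_mm_state *s) { s->registered = false; }

/* Called when a new GRU mapping is created (assumption: first mapping registers). */
void gru_map(struct gru_mm_state *s)
{
    if (s->mappings++ == 0)
        notifier_register(s);
}

/* Called when a GRU mapping is destroyed; the last one unregisters. */
void gru_unmap(struct gru_mm_state *s)
{
    if (s->mappings > 0 && --s->mappings == 0)
        notifier_unregister(s);
}
```

Without an unregister function, the same driver would have to leave the notifier registered until the address space is torn down, which is the alternative mentioned above.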


--- jack



Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.

2008-04-22 Thread Hollis Blanchard
On Tuesday 22 April 2008 17:13:01 Anthony Liguori wrote:
> Hollis Blanchard wrote:
> > On Tuesday 22 April 2008 16:05:38 Rusty Russell wrote:
> >   
> >> On Wednesday 23 April 2008 06:29:14 Hollis Blanchard wrote:
> >> 
> >>> On Tuesday 22 April 2008 09:31:35 Rusty Russell wrote:
> >>>   
> >>>> We may still regret not doing *everything* little-endian, but this
> >>>> doesn't make it worse.
> >>>>
> >>> Hmm, why *don't* we just do everything LE, including the ring?
> >>>   
> >> Mainly because when requirements are in doubt, simplicity wins, I think.
> >> 
> >
> > Well, I think the definition of simplicity is up for debate in this 
> > case... "LE everywhere" is much simpler than "it depends", IMHO.
> >   
> 
> You couldn't use the vringfd direct ring mapping optimization in KVM for 
> PPC without teaching the kernel to access a vring in LE format.  I'm 
> pretty sure the latter would get rejected on LKML anyway for vringfd as a 
> generic mechanism.

(Since the IA64 guys have already implemented BE guests on LE hosts, they 
should be aware of this discussion too, which is why I've CCed them.)

After a short but torturous whiteboard session, followed by a much longer but 
less painful discussion, I'm fine with the virtio device config space being 
BE for PowerPC and LE for x86.

In the future, we can use a feature bit to indicate that PCI config space 
contains an explicit endianness flag. (This will be set to BE or LE, *not* 
to "opposite of normal", because "normal" is also too vague.)
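To make the explicit-flag idea concrete, here is a minimal sketch of how a driver might decode a 32-bit config field once it knows the byte order. Everything here is an assumption for illustration: the function names are hypothetical, and how the flag would actually be advertised is not specified in this thread. The point is that the flag names an absolute byte order (BE or LE), so the decoder never needs to know what "normal" means for the host or guest.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decode four bytes stored least-significant byte first. */
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode four bytes stored most-significant byte first. */
static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Read a 32-bit config field. 'big_endian' is the hypothetical explicit
 * endianness flag: true means BE, false means LE, regardless of the CPU
 * the code runs on.
 */
uint32_t config_read32(const uint8_t *cfg, bool big_endian)
{
    return big_endian ? load_be32(cfg) : load_le32(cfg);
}
```

Because the helpers assemble the value byte by byte, the same code behaves identically on BE and LE machines, which is what makes an absolute flag easier to reason about than a relative one.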

-- 
Hollis Blanchard
IBM Linux Technology Center



[kvm-devel] Stupid Newbee Questions...

2008-04-22 Thread Stuart Sheldon

I've looked around but can't seem to find these answers.

I'm using KVM to run multiple servers on the same hardware, but it seems
that most of the documentation written is for desktop use.

I'm currently running KVM-66 on a 2.6.24.4 kernel and using the KVM
provided modules. This is sitting on a Debian Lenny install with Intel
hardware.

Here are my questions:

1) Is there a way to give one virtual host priority over another? Would
I just add -cpu 2 on the host I want to give priority?

2) When I started writing the management scripts to start and stop the
guests from the command line, I was using KVM-63 which allowed me to
send a "system_powerdown" to the console, this would send a PWR to the
guest's acpid that would bring the guest down gracefully. This stopped
working in kvm-64 and is still not working in kvm-66. Is there a better
way to do this? How can the host bring down a guest without messing with
its open programs?

3) Are there any statistics available to the host OS that can be
monitored, such as something in /sys or /proc?

Thanks in advance to all that answer!

Stu



- --
Open up the window Let some air into this room I think I'm almost
chokin' From the smell of stale perfume And that cigarette you're
smoking 'Bout scared me half to death Open up the window, sucker
Let me catch my breath
  -- Three Dog Night - "Mama Told Me Not to Come - Lyrics"



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Christoph Lameter
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:

> I'll send an update in any case to Andrew way before Saturday so
> hopefully we'll finally get mmu-notifiers-core merged before next
> week. Also I'm not updating my mmu-notifier-core patch anymore except
> for strict bugfixes so don't worry about any more cosmetical bugs
> being introduced while optimizing the code like it happened this time.

I guess I have to prepare another patchset then?




Re: [kvm-devel] [PATCH 10 of 12] Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock

2008-04-22 Thread Christoph Lameter
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:

> The right patch ordering isn't necessarily the one that reduces the
> total number of lines in the patchsets. The mmu-notifier-core is
> already converged and can go in. The rest isn't converged at
> all... nearly nobody commented on the other part (the few comments so
> far were negative), so there's no good reason to delay indefinitely
> what is already converged, given it's already feature complete for
> certain users of the code. My patch ordering looks more natural to
> me. What is finished goes in, the rest is orthogonal anyway.

I would not want to review code that is later reverted or essentially 
changed in later patches. I only review your patches because we have a 
high interest in the patch. I suspect that others will be more willing to 
review this material if it would be done the right way.

If you cannot produce an easily reviewable and properly formatted patchset 
that follows conventions then I will have to do it because we really need 
to get this merged.



Re: [kvm-devel] [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last

2008-04-22 Thread Christoph Lameter
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:

> On Tue, Apr 22, 2008 at 01:24:21PM -0700, Christoph Lameter wrote:
> > Reverts a part of an earlier patch. Why isn't this merged into 1 of 12?
> 
> To give zero regression risk to 1/12 when MMU_NOTIFIER=y or =n and the
> mmu notifiers aren't registered by GRU or KVM. Keep in mind that the
> whole point of my proposed patch ordering from day 0, is to keep as
> 1/N, the absolute minimum change that fully satisfies GRU and KVM
> requirements. 4/12 isn't required by GRU/KVM so I keep it in a later
> patch. I now moved mmu_notifier_unregister in a later patch too for
> the same reason.

We want a full solution and this kind of patching makes the patches 
difficult to review because later patches revert earlier ones.
 



Re: [kvm-devel] [PATCH 03 of 12] get_task_mm should not succeed if mmput() is running and has reduced

2008-04-22 Thread Christoph Lameter
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:

> On Tue, Apr 22, 2008 at 01:23:16PM -0700, Christoph Lameter wrote:
> > Missing signoff by you.
> 
> I thought I had to signoff if I contributed with anything that could
> resemble copyright? Given I only merged that patch, I can add an
> Acked-by if you like, but merging this in my patchset was already an
> implicit ack ;-).

No you have to include a signoff if the patch goes through your custody 
chain. This one did.

Also add a 

From: Christoph Lameter <[EMAIL PROTECTED]>

somewhere if you want to signify that the patch came from me. 



Re: [kvm-devel] [PATCH 02 of 12] Fix ia64 compilation failure because of common code include bug

2008-04-22 Thread Robin Holt
On Wed, Apr 23, 2008 at 12:43:52AM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 01:22:55PM -0700, Christoph Lameter wrote:
> > Looks like this is not complete. There are numerous .h files missing which 
> > means that various structs are undefined (fs.h and rmap.h are needed 
> > f.e.) which leads to surprises when dereferencing fields of these structs.
> > 
> > It seems that mm_types.h is expected to be included only in certain 
> > contexts. Could you make sure to include all necessary .h files? Or add
> > some docs to clarify the situation here.
> 
> Robin, what other changes did you need to compile? I only did that one
> because I didn't hear any more feedback from you after I sent that
> patch, so I assumed it was enough.

It was perfect.  Nothing else was needed.

Thanks,
Robin



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Robin Holt
> The only other change I did has been to move mmu_notifier_unregister
> at the end of the patchset after getting more questions about its
> reliability and I documented a bit the rmmod requirements for
> ->release. we'll think later if it makes sense to add it, nobody's
> using it anyway.

XPMEM is using it.  GRU will be as well (probably already does).



Re: [kvm-devel] [PATCH 10 of 12] Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock

2008-04-22 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 01:26:13PM -0700, Christoph Lameter wrote:
> Doing the right patch ordering would have avoided this patch and allow 
> better review.

I didn't actually write this patch myself. This did it instead:

s/anon_vma_lock/anon_vma_sem/
s/i_mmap_lock/i_mmap_sem/
s/locks/sems/
s/spinlock_t/struct rw_semaphore/

so it didn't look like a big deal to redo it indefinitely.

The right patch ordering isn't necessarily the one that reduces the
total number of lines in the patchsets. The mmu-notifier-core is
already converged and can go in. The rest isn't converged at
all... nearly nobody commented on the other part (the few comments so
far were negative), so there's no good reason to delay indefinitely
what is already converged, given it's already feature complete for
certain users of the code. My patch ordering looks more natural to
me. What is finished goes in, the rest is orthogonal anyway.



Re: [kvm-devel] [PATCH 02 of 12] Fix ia64 compilation failure because of common code include bug

2008-04-22 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 01:22:55PM -0700, Christoph Lameter wrote:
> Looks like this is not complete. There are numerous .h files missing which 
> means that various structs are undefined (fs.h and rmap.h are needed 
> f.e.) which leads to surprises when dereferencing fields of these struct.
> 
> It seems that mm_types.h is expected to be included only in certain 
> contexts. Could you make sure to include all necessary .h files? Or add
> some docs to clarify the situation here.

Robin, what other changes did you need to compile? I only did that one
because I didn't hear any more feedback from you after I sent that
patch, so I assumed it was enough.



Re: [kvm-devel] [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last

2008-04-22 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 01:24:21PM -0700, Christoph Lameter wrote:
> Reverts a part of an earlier patch. Why isn't this merged into 1 of 12?

To give zero regression risk to 1/12 when MMU_NOTIFIER=y or =n and the
mmu notifiers aren't registered by GRU or KVM. Keep in mind that the
whole point of my proposed patch ordering from day 0 is to keep as
1/N the absolute minimum change that fully satisfies the GRU and KVM
requirements. 4/12 isn't required by GRU/KVM, so I keep it in a later
patch. I have now moved mmu_notifier_unregister to a later patch too,
for the same reason.



Re: [kvm-devel] [PATCH 03 of 12] get_task_mm should not succeed if mmput() is running and has reduced

2008-04-22 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 01:23:16PM -0700, Christoph Lameter wrote:
> Missing signoff by you.

I thought I had to sign off if I contributed anything that could
resemble copyright? Given I only merged that patch, I can add an
Acked-by if you like, but merging this in my patchset was already an
implicit ack ;-).



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 01:19:29PM -0700, Christoph Lameter wrote:
> 3. As noted by Eric and also contained in private post from yesterday by 
>me: The cmp function needs to retrieve the value before
>doing comparisons which is not done for the == of a and b.

I retrieved the value, which is why mm_lock works perfectly on #v13 as
well as #v12. It's not mandatory to ever return 0, so it won't produce
any runtime error (there is a bugcheck for wrong sort ordering in my
patch just in case it would generate any runtime error and it never
did, or I would have noticed before submission), which is why I didn't
need to release any hotfix yet and I'm waiting more time to get more
comments before sending an update to clean up that bit.
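
For anyone following the cmp point, here is a minimal user-space sketch of
a comparator over an array of lock pointers, written for qsort(). It is not
the code from the patch: struct fake_lock, lock_ptr_cmp and sort_locks are
made-up names for illustration only. The comparator receives pointers to
array slots, so it has to load the stored pointer before comparing in every
branch, including the equality case:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a lock; only its address matters here. */
struct fake_lock { int dummy; };

/* qsort() passes pointers to the array slots, so each argument must be
 * dereferenced once to obtain the lock address that is being ordered. */
static int lock_ptr_cmp(const void *a, const void *b)
{
    uintptr_t la = (uintptr_t)*(struct fake_lock *const *)a;
    uintptr_t lb = (uintptr_t)*(struct fake_lock *const *)b;

    if (la < lb)
        return -1;
    if (la > lb)
        return 1;
    return 0; /* same lock: compared by value, not by slot address */
}

/* Sort an array of lock pointers into ascending address order. */
static void sort_locks(struct fake_lock **locks, size_t n)
{
    assert(locks != NULL || n == 0);
    qsort(locks, n, sizeof(*locks), lock_ptr_cmp);
}
```

After sorting, duplicate entries for the same lock end up adjacent, which
is one way a caller could skip taking the same lock twice.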

Mentioning this as the third and last point, I guess, shows how strong
your arguments against merging my mmu-notifier-core now are, so in the
end making that cosmetic error paid off somehow.

I'll send an update in any case to Andrew way before Saturday so
hopefully we'll finally get mmu-notifier-core merged before next
week. Also I'm not updating my mmu-notifier-core patch anymore except
for strict bugfixes, so don't worry about any more cosmetic bugs
being introduced while optimizing the code like it happened this time.

The only other change I did has been to move mmu_notifier_unregister
at the end of the patchset after getting more questions about its
reliability and I documented a bit the rmmod requirements for
->release. we'll think later if it makes sense to add it, nobody's
using it anyway.



Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.

2008-04-22 Thread Hollis Blanchard
On Tuesday 22 April 2008 17:13:01 Anthony Liguori wrote:
> Hollis Blanchard wrote:
> > On Tuesday 22 April 2008 16:05:38 Rusty Russell wrote:
> >   
> >> On Wednesday 23 April 2008 06:29:14 Hollis Blanchard wrote:
> >> 
> >>> On Tuesday 22 April 2008 09:31:35 Rusty Russell wrote:
> >>>   
> >>>> We may still regret not doing *everything* little-endian, but this
> >>>> doesn't make it worse.
> >>>>
> >>> Hmm, why *don't* we just do everything LE, including the ring?
> >>>   
> >> Mainly because when requirements are in doubt, simplicity wins, I think.
> >> 
> >
> > Well, I think the definition of simplicity is up for debate in this 
> > case... "LE everywhere" is much simpler than "it depends", IMHO.
> 
> You couldn't use the vringfd direct ring mapping optimization in KVM for 
> PPC without teaching the kernel to access a vring in LE format.  I'm 
> pretty sure the latter would get rejected on LKML anyway for vringfd as a 
> generic mechanism.

You mean vringfd for use cases other than virtual IO drivers? I have a poor 
imagination; can you give some examples?

Even then, it should be possible to have VIO drivers use a different set of 
accessors, just like there are swapping and non-swapping accessors for real 
IO, so I still don't see the problem.
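
A rough sketch of what such a set of accessors could look like, in portable
user-space C. The names vio_get_le16, vio_put_le16 and vio_ring_get_le16 are
hypothetical and do not come from any posted patch. The idea is only that a
fixed little-endian layout can be read and written byte by byte, so the same
code gives the same result on LE and BE hosts:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Load a 16-bit little-endian value regardless of host byte order. */
static inline uint16_t vio_get_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Store a 16-bit value in little-endian byte order. */
static inline void vio_put_le16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)(v >> 8);
}

/* Bounds-checked read of a 16-bit LE field at byte offset off. */
static inline uint16_t vio_ring_get_le16(const uint8_t *ring, size_t len,
                                         size_t off)
{
    assert(off + 2 <= len);
    return vio_get_le16(ring + off);
}
```

On an LE host a compiler can usually reduce these to plain loads and stores,
so the cost would fall mainly on BE guests.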

-- 
Hollis Blanchard
IBM Linux Technology Center



Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.

2008-04-22 Thread Anthony Liguori
Hollis Blanchard wrote:
> On Tuesday 22 April 2008 16:05:38 Rusty Russell wrote:
>   
>> On Wednesday 23 April 2008 06:29:14 Hollis Blanchard wrote:
>> 
>>> On Tuesday 22 April 2008 09:31:35 Rusty Russell wrote:
>>>   
>>>> We may still regret not doing *everything* little-endian, but this
>>>> doesn't make it worse.
>>>>
>>> Hmm, why *don't* we just do everything LE, including the ring?
>>>   
>> Mainly because when requirements are in doubt, simplicity wins, I think.
>> 
>
> Well, I think the definition of simplicity is up for debate in this 
> case... "LE everywhere" is much simpler than "it depends", IMHO.
>   

You couldn't use the vringfd direct ring mapping optimization in KVM for 
PPC without teaching the kernel to access a vring in LE format.  I'm 
pretty sure the latter would get rejected on LKML anyway for vringfd as a 
generic mechanism.

Regards,

Anthony Liguori




Re: [kvm-devel] WARNING: at /usr/src/modules/kvm/mmu.c:390 account_shadowed()

2008-04-22 Thread Thomas Cataldo
On Mon, Apr 21, 2008 at 9:57 PM, Thomas Cataldo
<[EMAIL PROTECTED]> wrote:
> Hi,
>
>  I am running kvm-66 on top of a debian sid host with 2.6.24 (intel 32bit 
> host).
>
>  Got the following in my logs today :
>
>  Apr 21 17:55:01 buffy kernel: WARNING: at
>  /usr/src/modules/kvm/mmu.c:390 account_shadowed()
>  Apr 21 17:55:01 buffy kernel: Pid: 21416, comm: kvm Tainted: P
>  2.6.24-1-686 #1
>  Apr 21 17:55:01 buffy kernel:  [] kvm_mmu_get_page+0x42d/0x447 
> [kvm]
>  Apr 21 17:55:01 buffy kernel:  [] kvm_mmu_load+0xdf/0x15c [kvm]
>  Apr 21 17:55:01 buffy kernel:  []
>  vmx_queue_exception+0x0/0x33 [kvm_intel]
>  Apr 21 17:55:01 buffy kernel:  []
>  kvm_arch_vcpu_ioctl_run+0x233/0x5a9 [kvm]
>  Apr 21 17:55:01 buffy kernel:  [] kvm_vcpu_ioctl+0xe4/0x34c [kvm]
>  Apr 21 17:55:01 buffy kernel:  [] delayacct_end+0x70/0x77
>  Apr 21 17:55:01 buffy kernel:  [] sync_page+0x0/0x3b
>  Apr 21 17:55:01 buffy kernel:  [] __delayacct_blkio_end+0x5b/0x5f
>  Apr 21 17:55:01 buffy kernel:  [] io_schedule+0x64/0x80
>  Apr 21 17:55:01 buffy kernel:  [] enqueue_entity+0x2b/0x3d
>  Apr 21 17:55:01 buffy kernel:  [] apic_wait_icr_idle+0xe/0x15
>  Apr 21 17:55:01 buffy kernel:  [] enqueue_task_fair+0x16/0x24
>  Apr 21 17:55:01 buffy kernel:  [] enqueue_task+0x52/0x5d
>  Apr 21 17:55:01 buffy kernel:  [] resched_task+0x52/0x54
>  Apr 21 17:55:01 buffy kernel:  [] try_to_wake_up+0x2b8/0x2c2
>  Apr 21 17:55:01 buffy kernel:  [] __wake_up_common+0x32/0x5c
>  Apr 21 17:55:01 buffy kernel:  [] __wake_up+0x32/0x42
>  Apr 21 17:55:01 buffy kernel:  [] wake_futex+0x3b/0x45
>  Apr 21 17:55:01 buffy kernel:  [] futex_wake+0x81/0xb0
>  Apr 21 17:55:01 buffy kernel:  [] do_futex+0x77/0x983
>  Apr 21 17:55:01 buffy kernel:  [] update_curr+0x62/0xef
>  Apr 21 17:55:01 buffy kernel:  [] __switch_to+0x9d/0x11d
>  Apr 21 17:55:01 buffy kernel:  [] kvm_vcpu_ioctl+0x0/0x34c [kvm]
>  Apr 21 17:55:01 buffy kernel:  [] do_ioctl+0x1f/0x62
>  Apr 21 17:55:01 buffy kernel:  [] vfs_ioctl+0x237/0x249
>  Apr 21 17:55:01 buffy kernel:  [] sys_ioctl+0x45/0x5d
>  Apr 21 17:55:01 buffy kernel:  [] sysenter_past_esp+0x6b/0xa1
>
>
>  Regards,
>  Thomas.
>

As I got no reply, I guess it is a bad setup on my part. In case it
helps, this happened while I was doing a "make -j" on the webkit svn tree
(i.e. a heavy C++ compilation workload).



Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.

2008-04-22 Thread Hollis Blanchard
On Tuesday 22 April 2008 16:05:38 Rusty Russell wrote:
> On Wednesday 23 April 2008 06:29:14 Hollis Blanchard wrote:
> > On Tuesday 22 April 2008 09:31:35 Rusty Russell wrote:
> > > We may still regret not doing *everything* little-endian, but this
> > > doesn't make it worse.
> >
> > Hmm, why *don't* we just do everything LE, including the ring?
> 
> Mainly because when requirements are in doubt, simplicity wins, I think.

Well, I think the definition of simplicity is up for debate in this 
case... "LE everywhere" is much simpler than "it depends", IMHO.

-- 
Hollis Blanchard
IBM Linux Technology Center



Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.

2008-04-22 Thread Rusty Russell
On Wednesday 23 April 2008 06:29:14 Hollis Blanchard wrote:
> On Tuesday 22 April 2008 09:31:35 Rusty Russell wrote:
> > We may still regret not doing *everything* little-endian, but this
> > doesn't make it worse.
>
> Hmm, why *don't* we just do everything LE, including the ring?

Mainly because when requirements are in doubt, simplicity wins, I think.

Cheers,
Rusty.



[kvm-devel] [PATCH] Make virtio devices multi-function (v2)

2008-04-22 Thread Anthony Liguori
This patch changes virtio devices to be multi-function devices whenever
possible.  This increases the number of virtio devices we can support now by
a factor of 8.

With this patch, I've been able to launch a guest with either 220 disks or 220
network adapters.

Since v1, I've changed the way virtio devices are allocated to be as follows:

 1) Always use a slot as long as they are available.  We can extend this to
use a PCI when we get that working more reliably.

 2) When PCI slots are exhausted, fall back to adding the device as an
additional function on an existing slot.

This way, hotplug continues to work just as well as it does now.  Once you
exceed the number of PCI slots, you need an OS that can do hotplug of
individual PCI functions if you care about doing hotplug.  I think this is a
pretty reasonable trade-off.
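
The two-step allocation order above can be sketched as a toy model. This is
not the code in the diff below: struct toy_bus and toy_alloc_devfn are
hypothetical names, and the sketch leaves out the vendor/device id filter
and the multi-function header bit that the real lookup checks. It only
shows the "slot first, then function" ordering and the devfn encoding
(slot in the upper 5 bits, function in the lower 3):

```c
#include <assert.h>
#include <stddef.h>

#define DEVFN_MAX 256                      /* 32 slots x 8 functions */
#define DEVFN(slot, fn) (((slot) << 3) | (fn))

/* Toy bus: used[devfn] is nonzero when that function is occupied. */
struct toy_bus {
    unsigned char used[DEVFN_MAX];
};

/* Returns the devfn chosen for a new device, or -1 if the bus is full.
 * Pass 1 prefers an empty slot (function 0); pass 2 falls back to a
 * free function 1..7 on a slot whose function 0 is already taken. */
static int toy_alloc_devfn(struct toy_bus *bus)
{
    int slot, fn;

    assert(bus != NULL);

    for (slot = 0; slot < DEVFN_MAX / 8; slot++) {
        if (!bus->used[DEVFN(slot, 0)]) {
            bus->used[DEVFN(slot, 0)] = 1;
            return DEVFN(slot, 0);
        }
    }
    for (slot = 0; slot < DEVFN_MAX / 8; slot++) {
        for (fn = 1; fn < 8; fn++) {
            if (!bus->used[DEVFN(slot, fn)]) {
                bus->used[DEVFN(slot, fn)] = 1;
                return DEVFN(slot, fn);
            }
        }
    }
    return -1;
}
```

In this model an empty bus hands out 32 whole slots before any device has
to share a slot, which is why hotplug behaviour is unchanged until the
slots run out.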

Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]>

diff --git a/qemu/hw/pci.c b/qemu/hw/pci.c
index a23a466..5d5d1a5 100644
--- a/qemu/hw/pci.c
+++ b/qemu/hw/pci.c
@@ -146,6 +146,41 @@ int pci_device_load(PCIDevice *s, QEMUFile *f)
 return 0;
 }
 
+/* Search the bus for a multifunction device with a free function that
+ * matches vendor_id_filter and device_id_filter.  -1 can be passed as
+ * a filter value to accept any id.
+ */
+int pci_bus_find_device_function(PCIBus *bus, int vendor_id_filter,
+int device_id_filter)
+{
+int devfn;
+
+for (devfn = bus->devfn_min; devfn < 256; devfn += 8) {
+   int vendor_id, device_id;
+   PCIDevice *pci_dev;
+
+   if (!bus->devices[devfn])
+   continue;
+
+   pci_dev = bus->devices[devfn];
+   vendor_id = pci_dev->config[0x01] << 8 | pci_dev->config[0x00];
+   device_id = pci_dev->config[0x03] << 8 | pci_dev->config[0x02];
+
+   if ((vendor_id_filter == -1 || vendor_id_filter == vendor_id) &&
+   (device_id_filter == -1 || device_id_filter == device_id) &&
+   ((pci_dev->config[0x0e] & 0x80) == 0x80)) {
+   int i;
+
+   for (i = 1; i < 8; i++) {
+   if (!bus->devices[devfn + i])
+   return devfn + i;
+   }
+   }
+}
+
+return -1;
+}
+
 /* -1 for devfn means auto assign */
 PCIDevice *pci_register_device(PCIBus *bus, const char *name,
int instance_size, int devfn,
diff --git a/qemu/hw/pci.h b/qemu/hw/pci.h
index 60e4094..84d6a29 100644
--- a/qemu/hw/pci.h
+++ b/qemu/hw/pci.h
@@ -33,7 +33,7 @@ typedef struct PCIIORegion {
 #define PCI_ROM_SLOT 6
 #define PCI_NUM_REGIONS 7
 
-#define PCI_DEVICES_MAX 64
+#define PCI_DEVICES_MAX 256
 
 #define PCI_VENDOR_ID  0x00/* 16 bits */
 #define PCI_DEVICE_ID  0x02/* 16 bits */
@@ -105,6 +105,9 @@ void pci_info(void);
 PCIBus *pci_bridge_init(PCIBus *bus, int devfn, uint32_t id,
 pci_map_irq_fn map_irq, const char *name);
 
+int pci_bus_find_device_function(PCIBus *bus, int vendor_id_filter,
+int device_id_filter);
+
 /* lsi53c895a.c */
 #define LSI_MAX_DEVS 7
 void lsi_scsi_attach(void *opaque, BlockDriverState *bd, int id);
diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c
index 6a50001..361455d 100644
--- a/qemu/hw/virtio.c
+++ b/qemu/hw/virtio.c
@@ -405,12 +405,22 @@ VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,
 PCIDevice *pci_dev;
 uint8_t *config;
 uint32_t size;
+int devfn = -1;
 
-pci_dev = pci_register_device(bus, name, struct_size,
- -1, NULL, NULL);
-if (!pci_dev)
+pci_dev = pci_register_device(bus, name, struct_size, -1, NULL, NULL);
+
+if (pci_dev == NULL) {
+   devfn = pci_bus_find_device_function(bus, vendor, -1);
+   if (devfn != -1)
+   pci_dev = pci_register_device(bus, name, struct_size,
+ devfn, NULL, NULL);
+}
+
+if (pci_dev == NULL)
return NULL;
 
+devfn = pci_dev->devfn;
+
 vdev = to_virtio_device(pci_dev);
 
 vdev->status = 0;
@@ -438,6 +448,10 @@ VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,
 
 config[0x3d] = 1;
 
+/* Mark device as multi-function */
+if ((devfn % 8) == 0)
+   config[0x0e] |= 0x80;
+
 vdev->name = name;
 vdev->config_len = config_size;
 if (vdev->config_len)
diff --git a/qemu/net.h b/qemu/net.h
index 13daa27..3bada75 100644
--- a/qemu/net.h
+++ b/qemu/net.h
@@ -42,7 +42,7 @@ void net_client_uninit(NICInfo *nd);
 
 /* NIC info */
 
-#define MAX_NICS 8
+#define MAX_NICS 256
 
 struct NICInfo {
 uint8_t macaddr[6];
diff --git a/qemu/sysemu.h b/qemu/sysemu.h
index c60072d..4385802 100644
--- a/qemu/sysemu.h
+++ b/qemu/sysemu.h
@@ -149,7 +149,7 @@ typedef struct DriveInfo {
 
 #define MAX_IDE_DEVS   2
 #define MAX_SCSI_DEVS  7
-#define MAX_DRIVES 32
+#define MAX_DRIVES 256
 
 int nb_drives;
 DriveInfo drives_table[MAX_DRIVES+1];
diff --git a/qemu/vl.c b/qemu/vl.c
index 74be059..824e331 100644
--- a/qemu/vl.c

Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.

2008-04-22 Thread Hollis Blanchard
On Tuesday 22 April 2008 09:31:35 Rusty Russell wrote:
> On Tuesday 22 April 2008 21:22:48 Avi Kivity wrote:
> > > The virtio config space was originally chosen to be little-endian,
> > > because we thought the config might be part of the PCI config space
> > > for virtio_pci.  It's actually a separate mmio region, so that
> > > argument holds little water; as only x86 is currently using the virtio
> > > mechanism, we can change this (but must do so now, before the
> > > impending s390 and ppc merges).
> >
> > This will probably annoy Hollis which has guests that can go both ways.
> 
> Yes, I discussed this with Hollis.  But the virtio rings themselves already 
> have this issue: we don't do any endian conversion on them and assume 
> they're "our" endian in the guest.
> 
> We may still regret not doing *everything* little-endian, but this doesn't 
> make it worse.

Hmm, why *don't* we just do everything LE, including the ring?

-- 
Hollis Blanchard
IBM Linux Technology Center



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Robin Holt
On Tue, Apr 22, 2008 at 01:19:29PM -0700, Christoph Lameter wrote:
> Thanks for adding most of my enhancements. But
> 
> 1. There is no real need for invalidate_page(). Can be done with 
>   invalidate_start/end. Needlessly complicates the API. One
>   of the objections by Andrew was that there were multiple
>   callbacks that perform similar functions.

While I agree with that reading of Andrew's email about invalidate_page,
I think the GRU hardware makes a strong enough case to justify the two
separate callouts.

Due to the GRU hardware, we can ensure that invalidate_page terminates all
pending GRU faults (that includes faults that are just beginning) and can
therefore be completed without needing any locking.  The invalidate_page()
callout gets turned into a GRU flush instruction and we return.

Because the invalidate_range_start() leaves the page table information
available, we cannot use a single page _start to mimic that
functionality.  Therefore, there is a documented case justifying the
separate callouts.

I agree the case is fairly weak, but it does exist.  Given Andrea's
unwillingness to move and Jack's documented case, it is my opinion the
most likely compromise is to leave in the invalidate_page() callout.
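
To make the difference between the callouts concrete, here is a small
user-space mock. It is not the kernel's structure definition: struct
toy_notifier_ops and the toy_* handlers are hypothetical, modelled only on
the callout names used in this thread. The single-page callout is a
one-shot flush with no state to track, while the range callouts form a
bracket that the driver has to account for:

```c
#include <assert.h>
#include <stddef.h>

struct toy_notifier;

/* Hypothetical callback table, loosely modelled on the callouts named
 * in this thread; it is not the kernel's struct definition. */
struct toy_notifier_ops {
    void (*invalidate_page)(struct toy_notifier *n, unsigned long addr);
    void (*invalidate_range_start)(struct toy_notifier *n,
                                   unsigned long start, unsigned long end);
    void (*invalidate_range_end)(struct toy_notifier *n,
                                 unsigned long start, unsigned long end);
};

struct toy_notifier {
    const struct toy_notifier_ops *ops;
    unsigned long flushes;   /* single-page flushes issued */
    int range_depth;         /* currently open start/end brackets */
};

static void toy_page(struct toy_notifier *n, unsigned long addr)
{
    (void)addr;
    n->flushes++;            /* one flush, no bracket state to track */
}

static void toy_start(struct toy_notifier *n, unsigned long s, unsigned long e)
{
    (void)s; (void)e;
    n->range_depth++;        /* a driver would likely hold off new faults */
}

static void toy_end(struct toy_notifier *n, unsigned long s, unsigned long e)
{
    (void)s; (void)e;
    assert(n->range_depth > 0);
    n->range_depth--;        /* bracket closed */
}

static const struct toy_notifier_ops toy_ops = {
    .invalidate_page = toy_page,
    .invalidate_range_start = toy_start,
    .invalidate_range_end = toy_end,
};
```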

Thanks,
Robin



Re: [kvm-devel] [PATCH 00 of 12] mmu notifier #v13

2008-04-22 Thread Christoph Lameter
On Tue, 22 Apr 2008, Robin Holt wrote:

> putting it back into your patch/agreeing to it remaining in Andrea's
> patch?  If not, I think we can put this issue aside until Andrew gets
> out of the merge window and can decide it.  Either way, the patches
> become much more similar with this in.

One solution would be to separate the invalidate_page() callout into a
patch at the very end that can be omitted. AFAICT there is no compelling 
reason to have this callback and it complicates the API for the device 
driver writers. Not having this callback makes the way that mmu notifiers 
are called from the VM uniform, which is a desirable goal.





Re: [kvm-devel] [PATCH 00 of 12] mmu notifier #v13

2008-04-22 Thread Christoph Lameter
On Tue, 22 Apr 2008, Andrea Arcangeli wrote:

> My patch order and API backward compatible extension over the patchset
> is done to allow 2.6.26 to fully support KVM/GRU and 2.6.27 to support
> XPMEM as well. KVM/GRU won't notice any difference once the support
> for XPMEM is added, but even if the API would completely change in
> 2.6.27, that's still better than no functionality at all in 2.6.26.

Please redo the patchset with the right order. To my knowledge there is no 
chance of this getting merged for 2.6.26.




Re: [kvm-devel] [PATCH 10 of 12] Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock

2008-04-22 Thread Christoph Lameter
Doing the right patch ordering would have avoided this patch and allowed
better review.





Re: [kvm-devel] [PATCH 05 of 12] Move the tlb flushing into free_pgtables. The conversion of the locks

2008-04-22 Thread Christoph Lameter
Why are the subjects all screwed up? They are the first line of the 
description instead of the subject line of my patches.





Re: [kvm-devel] [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last

2008-04-22 Thread Christoph Lameter
Reverts a part of an earlier patch. Why isn't this merged into 1 of 12?





Re: [kvm-devel] [PATCH 03 of 12] get_task_mm should not succeed if mmput() is running and has reduced

2008-04-22 Thread Christoph Lameter
Missing signoff by you.





Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)

2008-04-22 Thread David S. Ahern
I added tracers to kvm_mmu_page_fault() that include collecting tsc cycles:

1. before vcpu->arch.mmu.page_fault()
2. after vcpu->arch.mmu.page_fault()
3. after mmu_topup_memory_caches()
4. after emulate_instruction()

So the delta in the trace reports show:
- cycles required for arch.mmu.page_fault (tracer 2)
- cycles required for mmu_topup_memory_caches(tracer 3)
- cycles required for emulate_instruction() (tracer 4)

I captured trace data for ~5-seconds during one of the usual events (again this
time it was due to kscand in the guest). I ran the formatted trace data through
an awk script to summarize:

TSC cycles           tracer2   tracer3   tracer4
      0 -  10,000:    295067    213251    115873
 10,001 -  25,000:      7682      1004     98336
 25,001 -  50,000:       201        15        36
 50,001 - 100,000:    100655         0        10
        > 100,000:       117         0        15

This means vcpu->arch.mmu.page_fault() was called 403,722 times in the roughly
5-second interval: 295,067 times it took < 10,000 cycles, but 100,772 times it
took longer than 50,000 cycles. The page_fault function getting run is
paging64_page_fault.

mmu_topup_memory_caches() and emulate_instruction() were both run 214,270 times,
most of them relatively quickly.
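
For reference, the kind of bucketing used for the summary can be sketched
in C as follows. The awk script itself was not posted, so this is only a
guess at equivalent logic; bucket_of and hist_add are hypothetical names,
and the bucket boundaries are the ones from the table above:

```c
#include <assert.h>
#include <stdint.h>

enum { NBUCKETS = 5 };

/* Upper bounds (inclusive) of the first four buckets, in TSC cycles. */
static const uint64_t bucket_max[NBUCKETS - 1] = {
    10000, 25000, 50000, 100000
};

/* Map a cycle delta to its bucket index; 4 means "> 100,000". */
static int bucket_of(uint64_t cycles)
{
    int i;

    for (i = 0; i < NBUCKETS - 1; i++)
        if (cycles <= bucket_max[i])
            return i;
    return NBUCKETS - 1;
}

/* Accumulate one sample into a histogram of NBUCKETS counters. */
static void hist_add(uint64_t hist[NBUCKETS], uint64_t cycles)
{
    int b = bucket_of(cycles);

    assert(b >= 0 && b < NBUCKETS);
    hist[b]++;
}
```

One histogram per tracer, fed with the delta between consecutive tracer
timestamps, would give one column of the table.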

Note: I bumped the scheduling priority of the qemu threads to RR 1 so that few
host processes could interrupt it.

david


Avi Kivity wrote:
> David S. Ahern wrote:
>> I added the traces and captured data over another apparent lockup of
>> the guest.
>> This seems to be representative of the sequence (pid/vcpu removed).
>>
>> (+4776)  VMEXIT [ exitcode = 0x, rip = 0x
>> c016127c ]
>> (+   0)  PAGE_FAULT [ errorcode = 0x0003, virt = 0x
>> c0009db4 ]
>> (+3632)  VMENTRY
>> (+4552)  VMEXIT [ exitcode = 0x, rip = 0x
>> c016104a ]
>> (+   0)  PAGE_FAULT [ errorcode = 0x000b, virt = 0x
>> fffb61c8 ]
>> (+   54928)  VMENTRY
>>   
> 
> Can you oprofile the host to see where the 54K cycles are spent?
> 
>> (+4568)  VMEXIT [ exitcode = 0x, rip = 0x
>> c01610e7 ]
>> (+   0)  PAGE_FAULT [ errorcode = 0x0003, virt = 0x
>> c0009db4 ]
>> (+   0)  PTE_WRITE  [ gpa = 0x 9db4 gpte = 0x
>> 41c5d363 ]
>> (+8432)  VMENTRY
>> (+3936)  VMEXIT [ exitcode = 0x, rip = 0x
>> c01610ee ]
>> (+   0)  PAGE_FAULT [ errorcode = 0x0003, virt = 0x
>> c0009db0 ]
>> (+   0)  PTE_WRITE  [ gpa = 0x 9db0 gpte = 0x
>>  ]
>> (+   13832)  VMENTRY
>>
>>
>> (+5768)  VMEXIT [ exitcode = 0x, rip = 0x
>> c016127c ]
>> (+   0)  PAGE_FAULT [ errorcode = 0x0003, virt = 0x
>> c0009db4 ]
>> (+3712)  VMENTRY
>> (+4576)  VMEXIT [ exitcode = 0x, rip = 0x
>> c016104a ]
>> (+   0)  PAGE_FAULT [ errorcode = 0x000b, virt = 0x
>> fffb61d0 ]
>> (+   0)  PTE_WRITE  [ gpa = 0x 3d5981d0 gpte = 0x
>> 3d55d047 ]
>>   
> 
> This indeed has the accessed bit clear.
> 
>> (+   65216)  VMENTRY
>> (+4232)  VMEXIT [ exitcode = 0x, rip = 0x
>> c01610e7 ]
>> (+   0)  PAGE_FAULT [ errorcode = 0x0003, virt = 0x
>> c0009db4 ]
>> (+   0)  PTE_WRITE  [ gpa = 0x 9db4 gpte = 0x
>> 3d598363 ]
>>   
> 
> This has the accessed bit set and the user bit clear, and the pte
> pointing at the previous pte_write gpa.  Looks like a kmap_atomic().
> 
>> (+8640)  VMENTRY
>> (+3936)  VMEXIT [ exitcode = 0x, rip = 0x
>> c01610ee ]
>> (+   0)  PAGE_FAULT [ errorcode = 0x0003, virt = 0x
>> c0009db0 ]
>> (+   0)  PTE_WRITE  [ gpa = 0x 9db0 gpte = 0x
>>  ]
>> (+   14160)  VMENTRY
>>
>> I can forward a more complete time snippet if you'd like. vcpu0 +
>> corresponding
>> vcpu1 files have 85000 total lines and compressed the files total ~500k.
>>
>> I did not see the FLOODED trace come out during this sample though I
>> did bump
>> the count from 3 to 4 as you suggested.
>>
>>
>>   
> 
> Bumping the count was supposed to remove the flooding...
> 
>> Correlating rip addresses to the 2.4 kernel:
>>
>> c0160d00-c0161290 = page_referenced
>>
>> It looks like the event is kscand running through the pages. I
>> suspected this
>> some time ago, and tried tweaking the kscand_work_percent sysctl
>> variable. It
>> appeared to lower the peak of the spikes, but maybe I imagined it. I
>> believe
>> lowering that value makes kscand wake up more often but do less work
>> (page
>> scanning) each time it is awakened.
>>
>>   
> 
> What does 'top' in the guest show (perhaps sorted by total cpu time
> rather than instantaneous usage)?
> 
> What host kernel are you running?  How many host cpus?
> 


Re: [kvm-devel] [PATCH 02 of 12] Fix ia64 compilation failure because of common code include bug

2008-04-22 Thread Christoph Lameter
Looks like this is not complete. There are numerous .h files missing which 
means that various structs are undefined (fs.h and rmap.h are needed 
f.e.) which leads to surprises when dereferencing fields of these struct.

It seems that mm_types.h is expected to be included only in certain 
contexts. Could you make sure to include all necessary .h files? Or add
some docs to clarify the situation here.



[kvm-devel] Problems with MAC address with e1000 on Windows 2003

2008-04-22 Thread Alberto Treviño
I was wondering if anyone could reproduce my problem.  If it is 
reproducible, then I'll file a bug.

I am using e1000 ethernet adapters on Windows 2003 and Linux guests.  
The line to set it up is something like this:

  -net nic,vlan=1,macaddr=00:ff:21:cf:91:01,model=e1000 \
-net tap,vlan=1,ifname=tap.br1.91.1

On Linux, this works just fine.  However, on Windows 2003, the MAC 
address for the device is reported as 00:ff:ff:ff:ff:ff and the packets 
carry this MAC address as well.  The corresponding tap device has the 
correct IP address, however.  This problem is definitely tied to using 
Windows 2003 with an e1000 device.  If I use the rtl8139 device, Windows 
reports the correct MAC address.  When booting the same VM with a Linux 
bootable CD and the e1000 device, Linux reports the correct MAC address 
as set in the qemu command.  It's the combination of Windows 2003 and 
the e1000 device that causes the problem.

Has anyone else seen this problem?  Thanks in advance.

-- 
Alberto Treviño
[EMAIL PROTECTED]



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Christoph Lameter
Thanks for adding most of my enhancements. But

1. There is no real need for invalidate_page(). It can be done with 
   invalidate_start/end, so it needlessly complicates the API. One
   of Andrew's objections was that there were multiple
   callbacks that perform similar functions.

2. The locks that are used are later changed to semaphores. This is
   f.e. true for mm_lock / mm_unlock. The diffs will be smaller if the
   lock conversion is done first and then mm_lock is introduced. The
   way the patches are structured means that reviewers cannot review the
   final version of mm_lock etc etc. The lock conversion needs to come 
   first.

3. As noted by Eric and also contained in private post from yesterday by 
   me: The cmp function needs to retrieve the value before
   doing comparisons which is not done for the == of a and b.
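
Point 1 can be illustrated with a minimal userspace sketch. It assumes a
hypothetical ops table with only the two range callbacks; the names
sketch_notifier_ops, sketch_notify_single_page and SKETCH_PAGE_SIZE are
illustrative and are not the API of either patch set. A single-page
invalidate is expressed as a start/end pair over a one-page range:

```c
#include <stddef.h>

/* Illustrative page size; the real value is architecture-specific. */
#define SKETCH_PAGE_SIZE 4096UL

/* Hypothetical ops table that only has the two range callbacks. */
struct sketch_notifier_ops {
	void (*invalidate_range_start)(void *ctx, unsigned long start, unsigned long end);
	void (*invalidate_range_end)(void *ctx, unsigned long start, unsigned long end);
};

/* Records what the callbacks saw, so the behaviour can be inspected. */
struct sketch_call_log {
	int starts;
	int ends;
	unsigned long last_start;
	unsigned long last_end;
};

static void sketch_log_start(void *ctx, unsigned long start, unsigned long end)
{
	struct sketch_call_log *log = ctx;

	log->starts++;
	log->last_start = start;
	log->last_end = end;
}

static void sketch_log_end(void *ctx, unsigned long start, unsigned long end)
{
	struct sketch_call_log *log = ctx;

	(void)start;
	(void)end;
	log->ends++;
}

/* Invalidate one page by bracketing it with the two range callbacks. */
static void sketch_notify_single_page(const struct sketch_notifier_ops *ops,
				      void *ctx, unsigned long address)
{
	unsigned long start = address & ~(SKETCH_PAGE_SIZE - 1);
	unsigned long end = start + SKETCH_PAGE_SIZE;

	if (ops->invalidate_range_start)
		ops->invalidate_range_start(ctx, start, end);
	/* ... the primary page table entry would be changed here ... */
	if (ops->invalidate_range_end)
		ops->invalidate_range_end(ctx, start, end);
}
```

The trade-off shows up directly: every single-page operation now makes two
callouts instead of one, which is the cost Andrea raises elsewhere in the
thread.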



Re: [kvm-devel] [PATCH 00 of 12] mmu notifier #v13

2008-04-22 Thread Robin Holt
On Tue, Apr 22, 2008 at 08:43:35PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 01:22:13PM -0500, Robin Holt wrote:
> > 1) invalidate_page:  You retain an invalidate_page() callout.  I believe
> > we have progressed that discussion to the point that it requires some
> > direction for Andrew, Linus, or somebody in authority.  The basics
> > of the difference distill down to no expected significant performance
> > difference between the two.  The invalidate_page() callout potentially
> > can simplify GRU code.  It does provide a more complex api for the
> > users of mmu_notifier which, IIRC, Christoph had interpreted from one
> > of Andrew's earlier comments as being undesirable.  I vaguely recall
> > that sentiment as having been expressed.
> 
> invalidate_page as demonstrated in KVM pseudocode doesn't change the
> locking requirements, and it has the benefit of reducing the window of
> time the secondary page fault has to be masked and at the same time
> _halves_ the number of _hooks_ in the VM every time the VM deals with
> single pages (example: do_wp_page hot path). As long as we can't fully
> converge because of point 3, I'd rather keep invalidate_page, as it is
> better. But that's by far not a priority to keep.

Christoph, Jack and I just discussed invalidate_page().  I don't think
the point Andrew was making is that compelling in this circumstance.
The code has changed fairly remarkably.  Would you have any objection to
putting it back into your patch/agreeing to it remaining in Andrea's
patch?  If not, I think we can put this issue aside until Andrew gets
out of the merge window and can decide it.  Either way, the patches
become much more similar with this in.

Thanks,
Robin



Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Daniel P. Berrange
On Tue, Apr 22, 2008 at 02:26:45PM -0300, Marcelo Tosatti wrote:
> On Tue, Apr 22, 2008 at 11:31:11AM -0500, Anthony Liguori wrote:
> > Avi Kivity wrote:
> > >Anthony Liguori wrote:
> > >>
> > >>I think we need to decide what we want to target in terms of upper 
> > >>limits.
> > >>
> > >>With a bridge or two, we can probably easily do 128.
> > >>
> > >>If we really want to push things, I think we should do a PCI based 
> > >>virtio controller.  I doubt a large number of PCI devices is ever 
> > >>going to perform very well b/c of interrupt sharing and some of the 
> > >>assumptions in virtio_pci.
> >>
> > >>If we implement a controller, we can use a single interrupt, but 
> > >>multiplex multiple notifications on that single interrupt.  We can 
> > >>also be more aggressive about using shared memory instead of PCI 
> > >>config space which would reduce the overall number of exits.
> 
> We should increase the number of interrupt lines, perhaps to 16.
> 
> Using shared memory to avoid exits sounds like a very good idea.
> 
> > >>We could easily support a very large number of devices this way.  But 
> > >>again, what do we want to target for now? 
> > >
> > >I think that for networking we should keep things as is.  I don't see 
> > >anybody using 100 virtual NICs.
> 
> The target was along the lines of 20 nics + 80 disks. Dan?

I've already had people ask for the ability to use as many as 64 disks and
32 nics with Xen, so to my mind, the more we support the better. 100's if
possible.

Dan.
-- 
|: Red Hat, Engineering, Boston   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|



Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Anthony Liguori
Marcelo Tosatti wrote:
> On Tue, Apr 22, 2008 at 05:32:45PM +0300, Avi Kivity wrote:
>   
>> Anthony Liguori wrote:
>> 
>>> This patch changes virtio devices to be multi-function devices whenever
>>> possible.  This increases the number of virtio devices we can support now by
>>> a factor of 8.
>>>
>>> With this patch, I've been able to launch a guest with either 220 disks or 
>>> 220
>>> network adapters.
>>>
>>>   
>>>   
>> Does this play well with hotplug?  Perhaps we need to allocate a new 
>> device on hotplug.
>>
>> (certainly if we have a device with one function, which then gets 
>> converted to a multifunction device)
>> 
>
> Would have to change the hotplug code to handle functions...
>   

BTW, I've never been that convinced that hotplugging devices is as 
useful as people make it out to be.  I also think that's particularly 
true when it comes to hot adding/removing very large numbers of disks.

I think if you created all virtio devices as multifunction devices, but 
didn't add additional functions until you ran out of PCI slots, it would 
be a pretty acceptable solution.  Hotplug works just as it does today 
until you get much higher than 32 devices.  Even then, hotplug still 
works with most of your devices (until you hit the absolute maximum 
number of devices of course).
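
As a rough sketch of that allocation policy (the names below are
illustrative, not taken from the patch): a PCI bus has 32 device numbers
with 8 functions each, and the device/function pair is packed as 5 bits of
slot plus 3 bits of function.  Handing out function 0 of every slot first
keeps the first 32 devices on their own slots, and only later devices
share a slot.  The sketch ignores slots that are already taken by other
devices, so a real guest tops out below 256.

```c
#define SKETCH_PCI_SLOTS     32	/* device numbers per PCI bus */
#define SKETCH_PCI_FUNCTIONS 8	/* functions per multi-function device */

/* Standard PCI packing: 5 bits of slot, 3 bits of function. */
static unsigned int sketch_devfn(unsigned int slot, unsigned int func)
{
	return ((slot & 0x1f) << 3) | (func & 0x07);
}

/*
 * Return the devfn for the n-th virtio device (0-based), or -1 when the
 * bus is full.  Function 0 of every slot is used before any slot gets a
 * second function.
 */
static int sketch_alloc_devfn(unsigned int n)
{
	if (n >= SKETCH_PCI_SLOTS * SKETCH_PCI_FUNCTIONS)
		return -1;
	return (int)sketch_devfn(n % SKETCH_PCI_SLOTS, n / SKETCH_PCI_SLOTS);
}
```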

Regards,

Anthony Liguori

> It sounds less hacky to just extend the PCI slots instead of (ab)using
> multiple functions per-slot.
>   




Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Ian Kirk
Avi Kivity wrote:

> For mass storage, we should follow the SCSI model with a single device
> serving multiple disks, similar to what you suggest.  Not sure if the
> device should have a single queue or one queue per disk.

Don't you just end up re-implementing SCSI then, at which point you might
as well stick with a 'fake' SCSI device in the guest?



Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Anthony Liguori

>>> For mass storage, we should follow the SCSI model with a single device 
>>> serving multiple disks, similar to what you suggest.  Not sure if the 
>>> device should have a single queue or one queue per disk.
>>>   
>> My latest thought is to do a virtio-based virtio controller.
>> 
>
> Why do you dislike multiple disks per virtio-blk controller? As
> mentioned this seems a natural way forward.
>   

Logically speaking, virtio is a bus.  virtio supports all of the 
features of a bus (discover, hot add, hot remove).

Right now, we map virtio devices directly onto the PCI bus.

The problem we're trying to address is limitations of the PCI bus.  We 
have a couple options:

1) add a virtio device that supports multiple disks.  we need to 
reinvent hotplug within this device.

2) add a new PCI virtio transport that supports multiple virtio-blk 
devices within a single PCI slot

3) add a generic PCI virtio transport that supports multiple virtio 
devices within a single PCI slot

4) add a generic virtio "bridge" that supports multiple virtio devices 
within a single virtio device.

#4 may seem strange, but it's no different from a PCI-to-PCI bridge.

I like #4 the most, but #2 is probably the most practical.


Regards,

Anthony Liguori



Re: [kvm-devel] [PATCH 00 of 12] mmu notifier #v13

2008-04-22 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 01:22:13PM -0500, Robin Holt wrote:
> 1) invalidate_page:  You retain an invalidate_page() callout.  I believe
> we have progressed that discussion to the point that it requires some
> direction for Andrew, Linus, or somebody in authority.  The basics
> of the difference distill down to no expected significant performance
> difference between the two.  The invalidate_page() callout potentially
> can simplify GRU code.  It does provide a more complex api for the
> users of mmu_notifier which, IIRC, Christoph had interpreted from one
> of Andrew's earlier comments as being undesirable.  I vaguely recall
> that sentiment as having been expressed.

invalidate_page as demonstrated in KVM pseudocode doesn't change the
locking requirements, and it has the benefit of reducing the window of
time the secondary page fault has to be masked and at the same time
_halves_ the number of _hooks_ in the VM every time the VM deals with
single pages (example: do_wp_page hot path). As long as we can't fully
converge because of point 3, I'd rather keep invalidate_page, as it is
better. But that's by far not a priority to keep.

> 2) Range callout names: Your range callouts are invalidate_range_start
> and invalidate_range_end whereas Christoph's are start and end.  I do not
> believe this has been discussed in great detail.  I know I have expressed
> a preference for your names.  I admit to having failed to follow up on
> this issue.  I certainly believe we could come to an agreement quickly
> if pressed.

I think using ->start ->end is a mistake; think of when we later add
mprotect_range_start/end. Here too I keep the better names only
because we can't converge on point 3 (the API will eventually change,
like every other kernel internal API, even core things like __free_page
have been mostly obsoleted).

> 3) The structure of the patch set:  Christoph's upcoming release orders
> the patches so the prerequisite patches are separately reviewable
> and each file is only touched by a single patch.  Additionally, that

Each file touched by a single patch? I doubt... The split is about the
same, the main difference is the merge ordering, I always had the zero
risk part at the head, he moved it at the tail when he incorporated
#v12 into his patchset.

> allows mmu_notifiers to be introduced as a single patch with sleeping
> functionality from its inception and an API which remains unchanged.
> Your patch set, however, introduces one API, then turns around and
> changes that API.  Again, the desire to make it an unchanging API was
> expressed by, IIRC, Andrew.  This does represent a risk to XPMEM as
> the non-sleeping API may become entrenched and make acceptance of the
> sleeping version less acceptable.
> 
> Can we agree upon this list of issues?

This is a kernel internal API, so it will definitely change over
time. It's nothing close to a syscall.

Also note: the API is obviously defined in mmu_notifier.h and none of
the 2-12 patches touches mmu_notifier.h. So the extension of the
method semantics is 100% backwards compatible.

My patch order and API backward compatible extension over the patchset
is done to allow 2.6.26 to fully support KVM/GRU and 2.6.27 to support
XPMEM as well. KVM/GRU won't notice any difference once the support
for XPMEM is added, but even if the API would completely change in
2.6.27, that's still better than no functionality at all in 2.6.26.



[kvm-devel] Bug 1895893 inquiry (KVM-60+ halts, when using SCSI)

2008-04-22 Thread Alberto Treviño
Thanks for all those who work on KVM.  It is a wonderful product and I 
have been very impressed with its features, performance, and the level 
of activity in this project.

Back in February a bug was filed.  I've been hit by this bug as well, 
but there hasn't been much activity with it in the last little bit.  I 
wanted to know if anyone had a fix for it, or a workaround (other than 
using IDE), or whether it was on someone's radar.  Here is a link to 
the bug:

http://sourceforge.net/tracker/index.php?func=detail&aid=1895893&group_id=180599&atid=893831

Thanks in advance.

-- 
Alberto Treviño
[EMAIL PROTECTED]



Re: [kvm-devel] [PATCH 00 of 12] mmu notifier #v13

2008-04-22 Thread Robin Holt

I believe the differences between your patch set and Christoph's need
to be understood and a compromise approach agreed upon.

Those differences, as I understand them, are:

1) invalidate_page:  You retain an invalidate_page() callout.  I believe
we have progressed that discussion to the point that it requires some
direction for Andrew, Linus, or somebody in authority.  The basics
of the difference distill down to no expected significant performance
difference between the two.  The invalidate_page() callout potentially
can simplify GRU code.  It does provide a more complex api for the
users of mmu_notifier which, IIRC, Christoph had interpreted from one
of Andrew's earlier comments as being undesirable.  I vaguely recall
that sentiment as having been expressed.

2) Range callout names: Your range callouts are invalidate_range_start
and invalidate_range_end whereas Christoph's are start and end.  I do not
believe this has been discussed in great detail.  I know I have expressed
a preference for your names.  I admit to having failed to follow up on
this issue.  I certainly believe we could come to an agreement quickly
if pressed.

3) The structure of the patch set:  Christoph's upcoming release orders
the patches so the prerequisite patches are separately reviewable
and each file is only touched by a single patch.  Additionally, that
allows mmu_notifiers to be introduced as a single patch with sleeping
functionality from its inception and an API which remains unchanged.
Your patch set, however, introduces one API, then turns around and
changes that API.  Again, the desire to make it an unchanging API was
expressed by, IIRC, Andrew.  This does represent a risk to XPMEM as
the non-sleeping API may become entrenched and make acceptance of the
sleeping version less acceptable.

Can we agree upon this list of issues?

Thank you,
Robin Holt



Re: [kvm-devel] pv clock: kvm is incompatible with xen :-(

2008-04-22 Thread Glauber Costa
Gerd Hoffmann wrote:
> Jeremy Fitzhardinge wrote:
>> Xen could change the parameters in the instant after get_time_values(). 
>> That change could be as a result of suspend-resume, so the parameters
>> and the tsc could be wildly different.
> 
> Ah, ok, forgot the rdtsc in the picture.  With that in mind I fully
> agree that the loop is needed.  I think kvm guests can even hit that one
> with the vcpu migrating to a different physical cpu, so we better handle
> it correctly ;)

It's probably not needed for kvm, since we update everything every time 
we get scheduled in the host side, which would cover the case for 
migration between physical cpus. But it's probably okay to do it to get 
a common denominator with xen, if needed.
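
For reference, the loop being discussed looks roughly like the following
userspace sketch.  The struct layout, the field names and the
sketch_read_time helper are assumptions made for illustration, and a
function pointer stands in for rdtsc so the sketch stays portable.  The
point Jeremy makes is that the tsc read has to sit inside the retry loop,
so that the parameters and the tsc value belong to the same update.

```c
#include <stdint.h>

/* Hypothetical shared time record; field names are illustrative only. */
struct sketch_time_info {
	uint32_t version;	/* changed by the host on every update */
	uint64_t tsc_timestamp;
	uint64_t system_time;
};

/* One consistent view of the parameters plus the tsc read with them. */
struct sketch_snapshot {
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint64_t tsc_now;
};

/* Stand-in for rdtsc. */
typedef uint64_t (*sketch_tsc_fn)(void);

/* Fixed-value tsc source, handy for trying the loop out. */
static uint64_t sketch_fake_tsc(void)
{
	return 1000;
}

/*
 * Copy the parameters and read the tsc, retrying if the version changed
 * in between.  A real implementation also needs read barriers between
 * these loads; they are omitted here.
 */
static void sketch_read_time(const volatile struct sketch_time_info *src,
			     sketch_tsc_fn read_tsc,
			     struct sketch_snapshot *out)
{
	uint32_t version;

	do {
		version = src->version;
		out->tsc_timestamp = src->tsc_timestamp;
		out->system_time = src->system_time;
		out->tsc_now = read_tsc();
	} while (version != src->version);
}
```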





Re: [kvm-devel] [PATCH 2/2] KVM: Handle interrupts for PCI passthrough devices

2008-04-22 Thread Amit Shah
* On Sunday 13 Apr 2008 14:06:27 Avi Kivity wrote:
> Amit Shah wrote:
> > Passthrough devices are host machine PCI devices which have
> > been handed off to the guest. Handle interrupts from these
> > devices and route them to the appropriate guest irq lines.
> > The userspace provides us with the necessary information
> > via the ioctls.
> >
> > The guest IRQ numbers can change dynamically, so we have an
> > additional ioctl that keeps track of those changes in userspace
> > and notifies us whenever that happens.
> >
> > It is expected the kernel driver for the passthrough device
> > is removed before passing it on to the guest.
> >
> >
> > +/*
> > + * Used to find a registered host PCI device (a "passthrough" device)
> > + * during interrupts or EOI
> > + */
> > +static struct kvm_pci_pt_dev_list *
> > +find_pci_pt_dev(struct list_head *head,
> > +   struct kvm_pci_pt_info *pv_pci_info, int irq, int source)
> > +{
> > +   struct list_head *ptr;
> > +   struct kvm_pci_pt_dev_list *match;
> > +
> > +   list_for_each(ptr, head) {
> > +   match = list_entry(ptr, struct kvm_pci_pt_dev_list, list);
> > +
> > +   switch (source) {
> > +   case KVM_PT_SOURCE_IRQ:
> > +   /*
> > +* Used to find a registered host device
> > +* during interrupt context on host
> > +*/
> > +   if (match->pt_dev.host.irq == irq)
> > +   return match;
> > +   break;
> > +   case KVM_PT_SOURCE_IRQ_ACK:
> > +   /*
> > +* Used to find a registered host device when
> > +* the guest acks an interrupt
> > +*/
> > +   if (match->pt_dev.guest.irq == irq)
> > +   return match;
> > +   break;
> > +   }
> > +   }
> > +   return NULL;
> > +}
>
> This would be better as two separate functions.  Also, locking?

For pvdma, there will be two more cases. Very similar functions for 
essentially looking up an entry in the same list.

Locking will be supported soon.
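
For comparison, the two-function variant Avi asks for could look roughly
like this.  It is a userspace sketch only: it uses a plain singly linked
list instead of the kernel's list_head, has no locking, and the
sketch_pt_dev type and function names are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the patch's device list entry. */
struct sketch_pt_dev {
	int host_irq;
	int guest_irq;
	struct sketch_pt_dev *next;
};

/* Lookup for the host interrupt path: match on the host irq. */
static struct sketch_pt_dev *sketch_find_by_host_irq(struct sketch_pt_dev *head, int irq)
{
	for (struct sketch_pt_dev *dev = head; dev; dev = dev->next)
		if (dev->host_irq == irq)
			return dev;
	return NULL;
}

/* Lookup for the guest ack path: match on the guest irq. */
static struct sketch_pt_dev *sketch_find_by_guest_irq(struct sketch_pt_dev *head, int irq)
{
	for (struct sketch_pt_dev *dev = head; dev; dev = dev->next)
		if (dev->guest_irq == irq)
			return dev;
	return NULL;
}
```

Each caller then names the lookup it wants instead of passing a source
selector, at the cost of one more near-identical function per lookup key.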

> > +static irqreturn_t
> > +kvm_pci_pt_dev_intr(int irq, void *dev_id)
>
> Please don't split declarations unnecessarily.

Fixed.

> > +{
> > +   struct kvm_pci_pt_dev_list *match;
> > +   struct kvm *kvm = (struct kvm *) dev_id;
> > +
> > +   if (!test_bit(irq, pt_irq_handled))
> > +   return IRQ_NONE;
> > +
> > +   if (test_bit(irq, pt_irq_pending))
> > +   return IRQ_HANDLED;
>
> Will the interrupt not fire immediately after this returns?

Hmm. This is just an optimisation so that we don't have to look up the list 
each time to find out which assigned device it is and (re)inject the 
interrupt. We also avoid taking and releasing the locks (still TODO) that 
will be needed for the list lookup.

Disabling interrupts for PCI devices isn't a good idea even if we don't 
support shared interrupts. Any other ideas to keep this from happening?

> > +   match = find_pci_pt_dev(&kvm->arch.pci_pt_dev_head, NULL,
> > +   irq, KVM_PT_SOURCE_IRQ);
> > +   if (!match)
> > +   return IRQ_NONE;
> > +
> > +   /* Not possible to detect if the guest uses the PIC or the
> > +* IOAPIC.  So set the bit in both. The guest will ignore
> > +* writes to the unused one.
> > +*/
> > +   kvm_ioapic_set_irq(kvm->arch.vioapic, match->pt_dev.guest.irq, 1);
> > +   kvm_pic_set_irq(pic_irqchip(kvm), match->pt_dev.guest.irq, 1);
>
> A function that calls both the apic and the pic is better, as it will be
> easier to port.

Done.

> > +   set_bit(irq, pt_irq_pending);
> > +   return IRQ_HANDLED;
> > +}
> > +
> > +/* Ack the irq line for a passthrough device */
> > +void
> > +kvm_pci_pt_ack_irq(struct kvm *kvm, int vector)
> > +{
> > +   int irq;
> > +   struct kvm_pci_pt_dev_list *match;
> > +
> > +   irq = get_eoi_gsi(kvm->arch.vioapic, vector);
> > +   match = find_pci_pt_dev(&kvm->arch.pci_pt_dev_head, NULL,
> > +   irq, KVM_PT_SOURCE_IRQ_ACK);
> > +   if (!match)
> > +   return;
> > +   if (test_bit(match->pt_dev.host.irq, pt_irq_pending)) {
> > +   kvm_ioapic_set_irq(kvm->arch.vioapic, irq, 0);
> > +   kvm_pic_set_irq(pic_irqchip(kvm), irq, 0);
>
> This is dangerous with smp guests, if we aren't careful with the
> ordering the interrupt may fire again and be forwarded to the other
> vcpu.  We need to call this before we redeliver interrupts.

The 'pending' bitmap ensures we don't inject an interrupt that hasn't been 
ack'ed. Once the locking is in place, this shouldn't be a worry.
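
The gating described here amounts to the following, shown as a userspace
sketch with made-up names.  It uses plain, non-atomic bit operations,
whereas the patch uses the kernel's test_bit/set_bit/clear_bit, and it
does not address the ordering concern on SMP guests.

```c
#include <limits.h>
#include <stdbool.h>

#define SKETCH_NR_IRQS 256
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_BITMAP_LONGS \
	((SKETCH_NR_IRQS + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG)

/* One bit per host irq: set while the guest has not acked it yet. */
static unsigned long sketch_pending[SKETCH_BITMAP_LONGS];

static bool sketch_is_pending(unsigned int irq)
{
	return (sketch_pending[irq / SKETCH_BITS_PER_LONG] >>
		(irq % SKETCH_BITS_PER_LONG)) & 1UL;
}

/* Host interrupt path: returns true if the irq should be injected now. */
static bool sketch_try_inject(unsigned int irq)
{
	if (irq >= SKETCH_NR_IRQS || sketch_is_pending(irq))
		return false;	/* out of range, or still waiting for an ack */
	sketch_pending[irq / SKETCH_BITS_PER_LONG] |=
		1UL << (irq % SKETCH_BITS_PER_LONG);
	return true;
}

/* Guest ack path: returns true if there was a pending irq to clear. */
static bool sketch_ack(unsigned int irq)
{
	if (irq >= SKETCH_NR_IRQS || !sketch_is_pending(irq))
		return false;
	sketch_pending[irq / SKETCH_BITS_PER_LONG] &=
		~(1UL << (irq % SKETCH_BITS_PER_LONG));
	return true;
}
```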

> > +   clear_bit(match->pt_dev.host.irq, pt_irq_pending);
> > +   }
> > +}

...

> > @@ -1671,6 +1836,30 @@ long kvm_arch_vm_ioctl(struct file *filp,
> > r = 0;
> > break;
> > }
> > +   case KVM_ASSIGN_PCI_PT_DEV: {
> > +   struct kvm_pci_passthrough_dev pci_pt_dev;
> > +
> > +   r

Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Marcelo Tosatti
On Tue, Apr 22, 2008 at 11:31:11AM -0500, Anthony Liguori wrote:
> Avi Kivity wrote:
> >Anthony Liguori wrote:
> >>
> >>I think we need to decide what we want to target in terms of upper 
> >>limits.
> >>
> >>With a bridge or two, we can probably easily do 128.
> >>
> >>If we really want to push things, I think we should do a PCI based 
> >>virtio controller.  I doubt a large number of PCI devices is ever 
> >>going to perform very well b/c of interrupt sharing and some of the 
> >>assumptions in virtio_pci.
>>
> >>If we implement a controller, we can use a single interrupt, but 
> >>multiplex multiple notifications on that single interrupt.  We can 
> >>also be more aggressive about using shared memory instead of PCI 
> >>config space which would reduce the overall number of exits.

We should increase the number of interrupt lines, perhaps to 16.

Using shared memory to avoid exits sounds like a very good idea.

> >>We could easily support a very large number of devices this way.  But 
> >>again, what do we want to target for now? 
> >
> >I think that for networking we should keep things as is.  I don't see 
> >anybody using 100 virtual NICs.

The target was along the lines of 20 nics + 80 disks. Dan?

> >For mass storage, we should follow the SCSI model with a single device 
> >serving multiple disks, similar to what you suggest.  Not sure if the 
> >device should have a single queue or one queue per disk.
> 
> My latest thought is to do a virtio-based virtio controller.

Why do you dislike multiple disks per virtio-blk controller? As
mentioned this seems a natural way forward.
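
To make the single-interrupt idea quoted above concrete, here is a
minimal sketch of how a guest-side handler might demultiplex
notifications from a bitmap in shared memory.  Everything in it is an
assumption for illustration: the sketch_shared layout, the 64-device
limit and the handler signature are not part of virtio.

```c
#include <stdint.h>

#define SKETCH_MAX_DEVICES 64

/* Hypothetical shared-memory word written by the host, read by the guest. */
struct sketch_shared {
	uint64_t pending;	/* bit i set: device i has a notification */
};

typedef void (*sketch_handler)(unsigned int dev, void *ctx);

/* Example handler: remember which devices were serviced. */
static void sketch_record(unsigned int dev, void *ctx)
{
	uint64_t *seen = ctx;

	*seen |= UINT64_C(1) << dev;
}

/*
 * Handler for the one shared interrupt: service every device whose bit
 * is set and return how many were serviced.  A real implementation would
 * fetch and clear the word with an atomic exchange.
 */
static unsigned int sketch_demux(struct sketch_shared *shm,
				 sketch_handler fn, void *ctx)
{
	uint64_t bits = shm->pending;
	unsigned int serviced = 0;

	shm->pending = 0;
	for (unsigned int dev = 0; dev < SKETCH_MAX_DEVICES; dev++) {
		if (bits & (UINT64_C(1) << dev)) {
			fn(dev, ctx);
			serviced++;
		}
	}
	return serviced;
}
```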



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Eric Dumazet
Andrea Arcangeli wrote:
> +
> +static int mm_lock_cmp(const void *a, const void *b)
> +{
> + cond_resched();
> + if ((unsigned long)*(spinlock_t **)a <
> + (unsigned long)*(spinlock_t **)b)
> + return -1;
> + else if (a == b)
> + return 0;
> + else
> + return 1;
> +}
> +
This compare function looks unusual...
It should work, but sort() could be faster if the
if (a == b) test had a chance to be true eventually...

static int mm_lock_cmp(const void *a, const void *b)
{
	unsigned long la = (unsigned long)*(spinlock_t **)a;
	unsigned long lb = (unsigned long)*(spinlock_t **)b;

	cond_resched();
	if (la < lb)
		return -1;
	if (la > lb)
		return 1;
	return 0;
}








[kvm-devel] Odd hang in the Ubuntu installer

2008-04-22 Thread Soren Hansen
Hi guys.

I'm trying to figure out what's going on with this bug:

   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/217815

The short version of the problem is that it seems that if the console is
left alone for an extended period of time, everything seems to stall
until something (moving the mouse around, pressing a key, whatever)
awakens it again.  It usually shows itself when you choose the
"Encrypted LVM" option in our installer (this process wipes the drive,
which is a rather lengthy process), since that's probably the only place
where you'd leave the console alone for a while, while still getting
some UI feedback (and a sudden lack of feedback, obviously).

It started when I backported this to the kvm version in our archive:

commit d2668b3fd41f88c18a7f9c4f1d024f0e5d9f64cf
Author: Marcelo Tosatti <[EMAIL PROTECTED]>
Date:   Wed Apr 2 20:20:14 2008 -0300
Subject: kvm: qemu: separate thread for IO handling


While trying to solve this problem, I noticed that that commit was just
one of a set of three patches. Applying those two:

commit 1743ef816b6cd22d100ccb80e542b8ca19c75392
Author: Marcelo Tosatti <[EMAIL PROTECTED]>
Date:   Wed Apr 2 20:20:15 2008 -0300
Subject: kvm: qemu: add function to handle signals

commit d84f71afaafec49e0ab3aa7a33518df04c14f38a
Author: Marcelo Tosatti <[EMAIL PROTECTED]>
Date:   Wed Apr 2 20:20:16 2008 -0300
Subject: kvm: qemu: notify IO thread of pending bhs

...makes it take a bit longer before it happens, but it's still very
much reproducible. Reverting those changes fixes it completely.

We've tried with kvm 66, which also exhibits this behaviour, so I'm
fairly confident I didn't mess up the patch while backporting it. In
case you're interested, the backported patch is here:

   http://people.ubuntu.com/~soren/virtio_hang.patch

The latter two commits applied without changes (with a bit of fuzz,
though).

I'm hoping one of you guys could give me a hint (or perhaps even a
patch)?

-- 
Soren Hansen   | 
Virtualisation specialist  | Ubuntu Server Team
Canonical Ltd. | http://www.ubuntu.com/



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 05:37:38PM +0200, Eric Dumazet wrote:
> I am saying your intent was probably to test
>
> else if ((unsigned long)*(spinlock_t **)a ==
>   (unsigned long)*(spinlock_t **)b)
>   return 0;

Indeed...

> Hum, it's not a micro-optimization, but a bug fix. :)

The good thing is that even if this bug led to a system crash, it
would still be zero risk for everybody who isn't actively using
KVM/GRU with mmu notifiers. The important thing is that this patch
has zero risk of introducing regressions into the kernel, both when
enabled and disabled; it's like a new driver. I'll shortly resend 1/12
and likely 12/12 for theoretical correctness. For now you can go ahead
testing with this patch, as it'll work fine despite the bug (if that
weren't the case I would have noticed already ;).



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Eric Dumazet
Andrea Arcangeli a écrit :
> On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote:
>   
>> Andrea Arcangeli a écrit :
>> 
>>> +
>>> +static int mm_lock_cmp(const void *a, const void *b)
>>> +{
>>> +   cond_resched();
>>> +   if ((unsigned long)*(spinlock_t **)a <
>>> +   (unsigned long)*(spinlock_t **)b)
>>> +   return -1;
>>> +   else if (a == b)
>>> +   return 0;
>>> +   else
>>> +   return 1;
>>> +}
>>> +
>>>   
>> This compare function looks unusual...
>> It should work, but sort() could be faster if the
>> if (a == b) test had a chance to be true eventually...
>> 
>
> Hmm, are you saying my mm_lock_cmp won't return 0 if a==b?
>   
I am saying your intent was probably to test

        else if ((unsigned long)*(spinlock_t **)a ==
                 (unsigned long)*(spinlock_t **)b)
                return 0;


Because a and b are pointers to the data you want to compare. You need 
to dereference them.
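
The same point can be shown outside the kernel.  What follows is only a
userspace sketch using qsort() instead of the kernel's sort(); struct
fake_lock and the function names are made up for illustration and are
not from the patch.  The comparator receives pointers to array slots,
so comparing a and b directly compares slot addresses, which are never
equal for two different slots even when both slots hold the same lock.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for spinlock_t; only its address matters here. */
struct fake_lock { int unused; };

/*
 * Comparator in the spirit of the corrected mm_lock_cmp(): a and b point
 * at array slots, so both are dereferenced before comparing.
 */
static int cmp_lock_addr(const void *a, const void *b)
{
	uintptr_t la = (uintptr_t)*(struct fake_lock *const *)a;
	uintptr_t lb = (uintptr_t)*(struct fake_lock *const *)b;

	if (la < lb)
		return -1;
	if (la > lb)
		return 1;
	return 0;
}

/* Buggy variant: the equality test compares the slot addresses. */
static int cmp_lock_addr_buggy(const void *a, const void *b)
{
	if ((uintptr_t)*(struct fake_lock *const *)a <
	    (uintptr_t)*(struct fake_lock *const *)b)
		return -1;
	else if (a == b)
		return 0;
	else
		return 1;
}

/* Sort an array of lock pointers by lock address. */
static void sort_locks(struct fake_lock **locks, size_t n)
{
	qsort(locks, n, sizeof(*locks), cmp_lock_addr);
}
```

With two slots that both point at the same lock, cmp_lock_addr()
returns 0 while the buggy variant returns 1, so duplicates are never
reported as equal to the sort.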


>> static int mm_lock_cmp(const void *a, const void *b)
>> {
>>         unsigned long la = (unsigned long)*(spinlock_t **)a;
>>         unsigned long lb = (unsigned long)*(spinlock_t **)b;
>>
>>         cond_resched();
>>         if (la < lb)
>>                 return -1;
>>         if (la > lb)
>>                 return 1;
>>         return 0;
>> }
>> 
>
> If your intent is to use the assumption that there are going to be few
> equal entries, you should have used likely(la > lb) to signal it's
> rarely going to return zero or gcc is likely free to do whatever it
> wants with the above. Overall that function is such a slow path that
> this is going to be lost in the noise. My suggestion would be to defer
> microoptimizations like this until after 1/12 is applied to mainline.
>
> Thanks!
>
>   
Hum, it's not a micro-optimization, but a bug fix. :)

Sorry if it was not clear







Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Anthony Liguori
Avi Kivity wrote:
> Anthony Liguori wrote:
>>
>> I think we need to decide what we want to target in terms of upper 
>> limits.
>>
>> With a bridge or two, we can probably easily do 128.
>>
>> If we really want to push things, I think we should do a PCI based 
>> virtio controller.  I doubt a large number of PCI devices is ever 
>> going to perform very well b/c of interrupt sharing and some of the 
>> assumptions in virtio_pci.
>>
>> If we implement a controller, we can use a single interrupt, but 
>> multiplex multiple notifications on that single interrupt.  We can 
>> also be more aggressive about using shared memory instead of PCI 
>> config space which would reduce the overall number of exits.
>>
>> We could easily support a very large number of devices this way.  But 
>> again, what do we want to target for now? 
>
> I think that for networking we should keep things as is.  I don't see 
> anybody using 100 virtual NICs.
>
> For mass storage, we should follow the SCSI model with a single device 
> serving multiple disks, similar to what you suggest.  Not sure if the 
> device should have a single queue or one queue per disk.

My latest thought is to do a virtio-based virtio controller.

We could avoid creating one in QEMU unless we detect an abnormally large 
number of disks or something.

Regards,

Anthony Liguori




Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Avi Kivity
Anthony Liguori wrote:
>
> I think we need to decide what we want to target in terms of upper 
> limits.
>
> With a bridge or two, we can probably easily do 128.
>
> If we really want to push things, I think we should do a PCI based 
> virtio controller.  I doubt a large number of PCI devices is ever 
> going to perform very well b/c of interrupt sharing and some of the 
> assumptions in virtio_pci.
>
> If we implement a controller, we can use a single interrupt, but 
> multiplex multiple notifications on that single interrupt.  We can 
> also be more aggressive about using shared memory instead of PCI 
> config space which would reduce the overall number of exits.
>
> We could easily support a very large number of devices this way.  But 
> again, what do we want to target for now? 

I think that for networking we should keep things as is.  I don't see 
anybody using 100 virtual NICs.

For mass storage, we should follow the SCSI model with a single device 
serving multiple disks, similar to what you suggest.  Not sure if the 
device should have a single queue or one queue per disk.

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [RFC] linuxboot Option ROM for Linux kernel booting

2008-04-22 Thread H. Peter Anvin
Nguyen Anh Quynh wrote:
> Hi,
> 
> I am thinking about combining this ROM with the extboot. Both ROMs
> are about "booting", so I think that is reasonable. So we will have
> only 1 ROM that supports both external boot and Linux boot.
> 
> Is that desirable or not?
> 

Does it make the code simpler and easier to understand?  If not, then I 
would say no.

-hpa



Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.

2008-04-22 Thread Hollis Blanchard
On Tuesday 22 April 2008 06:22:48 Avi Kivity wrote:
> Rusty Russell wrote:
> > [Christian, Hollis, how much is this ABI breakage going to hurt you?]
> >
> > A recent proposed feature addition to the virtio block driver revealed
> > some flaws in the API, in particular how easy it is to break big
> > endian machines.
> >
> > The virtio config space was originally chosen to be little-endian,
> > because we thought the config might be part of the PCI config space
> > for virtio_pci.  It's actually a separate mmio region, so that
> > argument holds little water; as only x86 is currently using the virtio
> > mechanism, we can change this (but must do so now, before the
> > impending s390 and ppc merges).
> 
> This will probably annoy Hollis, who has guests that can go both ways.

Rusty and I have discussed it. Ultimately, this just takes us from a 
cross-architecture endianness definition to a per-architecture definition. 
Anyways, we've already fallen into this situation with the virtio ring data 
itself, so we're really saying "same endianness as the ring".
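
As a rough illustration of what changes for a driver, here is a sketch
of the two ways to read a 32-bit config field.  The function names and
the flat byte buffer standing in for the config region are assumptions
for the example, not the virtio API.

```c
#include <stdint.h>
#include <string.h>

/*
 * Old definition: the field is little-endian on every architecture, so
 * the reader assembles it byte by byte (a swap on big-endian guests).
 */
static uint32_t config_read32_le(const uint8_t *cfg, size_t off)
{
	return (uint32_t)cfg[off] |
	       ((uint32_t)cfg[off + 1] << 8) |
	       ((uint32_t)cfg[off + 2] << 16) |
	       ((uint32_t)cfg[off + 3] << 24);
}

/*
 * Proposed definition: the field is in guest byte order, same as the
 * ring, so the guest copies it out without any conversion.
 */
static uint32_t config_read32_guest(const uint8_t *cfg, size_t off)
{
	uint32_t v;

	memcpy(&v, cfg + off, sizeof(v));
	return v;
}
```

The cost of the guest-endian definition moves to the host side: a host
serving guests of either endianness has to know which one it is talking
to before it fills in the config region.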

-- 
Hollis Blanchard
IBM Linux Technology Center



Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Anthony Liguori
Marcelo Tosatti wrote:
>> Maybe require explicit device/function assignment on the command line?  
>> It will be managed anyway.
>> 
>
> ACPI does support hotplugging of individual functions inside slots,
> not sure how well Linux (and other OSes) support that.. should be
> transparent though.
>   

I think we need to decide what we want to target in terms of upper limits.

With a bridge or two, we can probably easily do 128.

If we really want to push things, I think we should do a PCI based 
virtio controller.  I doubt a large number of PCI devices is ever going 
to perform very well b/c of interrupt sharing and some of the 
assumptions in virtio_pci.

If we implement a controller, we can use a single interrupt, but 
multiplex multiple notifications on that single interrupt.  We can also 
be more aggressive about using shared memory instead of PCI config space 
which would reduce the overall number of exits.
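
To make the single-interrupt idea a bit more concrete, here is a
minimal sketch.  Everything in it (struct ctrl, the 64-device limit,
the function names) is hypothetical and only meant to show the shape:
a pending bitmap lives in shared memory, the host sets a bit per device
and raises the one interrupt only when the bitmap was empty, and the
guest handler fans out to the per-device callbacks.  It is
single-threaded on purpose; a real version would need atomic updates of
the bitmap.

```c
#include <stdint.h>

#define CTRL_MAX_DEVS 64	/* hypothetical: one bit per child device */

struct ctrl {
	uint64_t pending;	/* shared memory: bit set = device has work */
	void (*notify[CTRL_MAX_DEVS])(unsigned int dev, void *opaque);
	void *opaque[CTRL_MAX_DEVS];
};

/* Host side: mark a device pending; returns 1 if the irq must be raised. */
static int ctrl_mark_pending(struct ctrl *c, unsigned int dev)
{
	int was_idle = (c->pending == 0);

	c->pending |= (uint64_t)1 << dev;
	return was_idle;
}

/* Guest side: the single interrupt handler; returns devices dispatched. */
static unsigned int ctrl_irq(struct ctrl *c)
{
	uint64_t bits = c->pending;
	unsigned int dev, handled = 0;

	c->pending = 0;
	for (dev = 0; dev < CTRL_MAX_DEVS; dev++) {
		if (!(bits & ((uint64_t)1 << dev)))
			continue;
		if (c->notify[dev])
			c->notify[dev](dev, c->opaque[dev]);
		handled++;
	}
	return handled;
}

/* Example callback: counts notifications in the int opaque points at. */
static void count_notification(unsigned int dev, void *opaque)
{
	(void)dev;
	++*(int *)opaque;
}
```

Reading the bitmap from shared memory does not exit to the host, which
is where the reduction in exits would come from.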

We could easily support a very large number of devices this way.  But 
again, what do we want to target for now?

Regards,

Anthony Liguori





Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-22 Thread Javier Guerra
On Tue, Apr 22, 2008 at 3:10 AM, Avi Kivity <[EMAIL PROTECTED]> wrote:
>  I'm rooting for btrfs myself.

but could btrfs (when stable) work for migration?  i'm curious about
OCFS2 performance on this kind of load...

when i manage to sell the idea of a KVM cluster i'd like to know if i
should try first EVMS-HA (cluster LV's) or OCFS (cluster FS)

-- 
Javier



Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-22 Thread Jamie Lokier
Avi Kivity wrote:
> >And video streaming on some embedded devices with no MMU!  (Due to the
> >page cache heuristics working poorly with no MMU, sustained reliable
> >streaming is managed with O_DIRECT and the app managing cache itself
> >(like a database), and that needs AIO to keep the request queue busy.
> >At least, that's the theory.)
> 
> Could use threads as well, no?

Perhaps.  This raises another point about AIO vs. threads:

If I submit sequential O_DIRECT reads with aio_read(), will they enter
the device read queue in the same order, and reach the disk in that
order (allowing for reordering when worthwhile by the elevator)?

With threads this isn't guaranteed and scheduling makes it quite
likely to issue the parallel synchronous reads out of order, and for
them to reach the disk out of order because the elevator doesn't see
them simultaneously.

With AIO (non-Glibc! (and non-kthreads)) it might be better at
keeping the intended issue order, I'm not sure.

It is highly desirable: O_DIRECT streaming performance depends on
avoiding seeks (no reordering) and on keeping the request queue
non-empty (no gap).

I read a man page for some other unix, describing AIO as better than
threaded parallel reads for reading tape drives because of this (tape
seeks are very expensive).  But the rest of the man page didn't say
anything more.  Unfortunately I don't remember where I read it.  I
have no idea whether AIO submission order is nearly always preserved
in general, or expected to be.

> It's me at fault here.  I just assumed that because it's easy to do aio 
> in a thread pool efficiently, that's what glibc does.
> 
> Unfortunately the code does some ridiculous things like not service 
> multiple requests on a single fd in parallel.  I see absolutely no 
> reason for it (the code says "fight for resources").

Ouch.  Perhaps that relates to my thought above, about multiple
requests to the same file causing seek storms when thread scheduling
is unlucky?

> So my comments only apply to linux-aio vs a sane thread pool.  Sorry for 
> spreading confusion.

Thanks.  I thought you'd measured it :-)

> It could and should.  It probably doesn't.
> 
> A simple thread pool implementation could come within 10% of Linux aio 
> for most workloads.  It will never be "exactly", but for small numbers 
> of disks, close enough.

I would wait for benchmark results for I/O patterns like sequential
reading and writing, because of potential for seeks caused by request
reordering, before being confident of that.

> >Hmm.  Thanks.  I may consider switching to XFS now
> 
> I'm rooting for btrfs myself.

In the unlikely event they backport btrfs to kernel 2.4.26-uc0, I'll
be happy to give it a try! :-)

-- Jamie



Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-22 Thread Jamie Lokier
Avi Kivity wrote:
> >Perhaps.  This raises another point about AIO vs. threads:
> >
> >If I submit sequential O_DIRECT reads with aio_read(), will they enter
> >the device read queue in the same order, and reach the disk in that
> >order (allowing for reordering when worthwhile by the elevator)?
> 
> Yes, unless the implementation in the kernel (or glibc) is threaded.

> 
> >With threads this isn't guaranteed and scheduling makes it quite
> >likely to issue the parallel synchronous reads out of order, and for
> >them to reach the disk out of order because the elevator doesn't see
> >them simultaneously.
> 
> If the disk is busy, it doesn't matter.  The requests will queue and the 
> elevator will sort them out.  So it's just the first few requests that 
> may get to disk out of order.

There's two cases where it matters to a read-streaming app:

1. Disk isn't busy with anything else, maximum streaming
   performance is desired.

2. Disk is busy with unrelated things, but you're using I/O
   priorities to give the streaming app near-absolute priority.
   Then you need to maintain overlapped streaming requests,
   otherwise disk is given to a lower priority I/O.  If that
   happens often, you lose, priority is ineffective.  Because one
   of the streaming requests is usually being serviced, elevator
   has similar limitations as for a disk which is not busy with
   anything else.
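
For reference, case 2 would look roughly like this on Linux.  This is a
sketch only: glibc has no wrapper for ioprio_set(), so it goes through
syscall(), and the SK_ constants are local copies of the kernel's
IOPRIO_* values rather than names from a header.  Setting the realtime
class normally needs privileges, and whether it has any effect depends
on the I/O scheduler in use.

```c
#include <sys/syscall.h>
#include <unistd.h>

/* Local copies of the kernel's ioprio encoding (assumed stable ABI). */
#define SK_IOPRIO_CLASS_SHIFT	13
#define SK_IOPRIO_CLASS_RT	1
#define SK_IOPRIO_WHO_PROCESS	1

/* Pack a scheduling class and a level (0 = highest) into one value. */
static int sk_ioprio_value(int cls, int level)
{
	return (cls << SK_IOPRIO_CLASS_SHIFT) | level;
}

/* Ask for realtime I/O priority for the caller; 0 on success, -1 on error. */
static int sk_set_rt_ioprio(int level)
{
#ifdef SYS_ioprio_set
	return (int)syscall(SYS_ioprio_set, SK_IOPRIO_WHO_PROCESS, 0,
			    sk_ioprio_value(SK_IOPRIO_CLASS_RT, level));
#else
	(void)level;
	return -1;	/* no ioprio_set() on this platform */
#endif
}
```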

> I haven't considered tape, but this is a good point indeed.  I expect it 
> doesn't make much of a difference for a loaded disk.

Yes, as long as it's loaded with unrelated requests at the same I/O
priority, the elevator has time to sort requests and hide thread
scheduling artifacts.

Btw, regarding QEMU: QEMU gets requests _after_ sorting by the guest's
elevator, then submits them to the host's elevator.  If the guest and
host elevators are both configured 'anticipatory', do the anticipatory
delays add up?

-- Jamie



Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Marcelo Tosatti
On Tue, Apr 22, 2008 at 05:51:51PM +0300, Avi Kivity wrote:
> Anthony Liguori wrote:
> > Avi Kivity wrote:
> >> Anthony Liguori wrote:
> >>  
> >>> This patch changes virtio devices to be multi-function devices whenever
> >>> possible.  This increases the number of virtio devices we can 
> >>> support now by
> >>> a factor of 8.
> >>>
> >>> With this patch, I've been able to launch a guest with either 220 
> >>> disks or 220
> >>> network adapters.
> >>>
> >>>   
> >>
> >> Does this play well with hotplug?  Perhaps we need to allocate a new 
> >> device on hotplug.
> >>   
> >
> > Probably not.  I imagine you can only hotplug devices, not individual 
> > functions?
> >
> 
> It sounds reasonable to expect so.  ACPI has objects for devices, not 
> functions (IIRC).

So what I dislike about multifunction devices is the fact that a single
slot shares an IRQ, and that special code is required in the QEMU
drivers (virtio guest capability might not always be present).

I don't see any need for using them if we can extend PCI slots...

> Maybe require explicit device/function assignment on the command line?  
> It will be managed anyway.

ACPI does support hotplugging of individual functions inside slots,
not sure how well Linux (and other OSes) support that.. should be
transparent though.





Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12

2008-04-22 Thread Robin Holt
Andrew, Could we get direction/guidance from you as regards
the invalidate_page() callout of Andrea's patch set versus the
invalidate_range_start/invalidate_range_end callout pairs of Christoph's
patchset?  This is only in the context of the __xip_unmap, do_wp_page,
page_mkclean_one, and try_to_unmap_one call sites.

On Tue, Apr 22, 2008 at 03:48:47PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 08:36:04AM -0500, Robin Holt wrote:
> > I am a little confused about the value of the seq_lock versus a simple
> > atomic, but I assumed there is a reason and left it at that.
> 
> There's no value for anything but get_user_pages (get_user_pages takes
> its own lock internally though). I preferred to explain it as a
> seqlock because it was simpler for reading, but I totally agree in the
> final implementation it shouldn't be a seqlock. My code was meant to
> be pseudo-code only. It doesn't even need to be atomic ;).

Unless there is additional locking in your fault path, I think it does
need to be atomic.

> > I don't know what you mean by "it'd" run slower and what you mean by
> > "armed and disarmed".
> 
> 1) when armed the time-window where the kvm-page-fault would be
> blocked would be a bit larger without invalidate_page for no good
> reason

But that is a distinction without a difference.  In the _start/_end
case, kvm's fault handler will not have any _DIRECT_ blocking, but
get_user_pages() had certainly better block waiting for some other lock
to prevent the process's pages being refaulted.

I am no VM expert, but that seems like it is critical to having a
consistent virtual address space.  Effectively, you have a delay on the
kvm fault handler beginning when either invalidate_page() is entered
or invalidate_range_start() is entered until when the _CALLER_ of the
invalidate* method has unlocked.  That time will remain essentially
identical for either case.  I would argue you would be hard pressed to
even measure the difference.

> 2) if you were to remove invalidate_page when disarmed the VM
> would need two branches instead of one in various places

Those branches are conditional upon there being list entries.  That check
should be extremely cheap.  The vast majority of cases will have no
registered notifiers.  The second check for the _end callout will be
from cpu cache.
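
Roughly what I have in mind, as a standalone sketch.  The structure and
function names here are invented for illustration and are not the ones
in either patch set; the point is only that each callout starts with a
test of the list head and returns immediately when nothing is
registered.

```c
#include <stddef.h>

/* Hypothetical notifier; the field names are illustrative only. */
struct sketch_notifier {
	void (*range_start)(struct sketch_notifier *n,
			    unsigned long start, unsigned long end);
	void (*range_end)(struct sketch_notifier *n,
			  unsigned long start, unsigned long end);
	struct sketch_notifier *next;
	int calls;	/* bookkeeping for the example callback below */
};

struct sketch_mm {
	struct sketch_notifier *notifiers;	/* NULL: nothing registered */
};

static void sketch_range_start(struct sketch_mm *mm,
			       unsigned long start, unsigned long end)
{
	struct sketch_notifier *n;

	if (!mm->notifiers)	/* common case: one cheap branch */
		return;
	for (n = mm->notifiers; n; n = n->next)
		if (n->range_start)
			n->range_start(n, start, end);
}

static void sketch_range_end(struct sketch_mm *mm,
			     unsigned long start, unsigned long end)
{
	struct sketch_notifier *n;

	if (!mm->notifiers)	/* same word as above, likely still cached */
		return;
	for (n = mm->notifiers; n; n = n->next)
		if (n->range_end)
			n->range_end(n, start, end);
}

/* Example callback: just counts how often it was invoked. */
static void sketch_count_call(struct sketch_notifier *n,
			      unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	n->calls++;
}
```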

> I don't want to waste cycles if not wasting them improves performance
> both when armed and disarmed.

In summary, I think we have narrowed down the cost in the case of no
registered notifiers to being infinitesimal, and the case of registered
notifiers to being a distinction without a difference.

> > When I was discussing this difference with Jack, he reminded me that
> > the GRU, due to its hardware, does not have any race issues with the
> > invalidate_page callout simply doing the tlb shootdown and not modifying
> > any of its internal structures.  He then put a caveat on the discussion
> > that _either_ method was acceptable as far as he was concerned.  The real
> > issue is getting a patch in that satisfies all needs and not whether
> there is a separate invalidate_page callout.
> 
> Sure, we have that patch now, I'll send it out in a minute, I was just
> trying to explain why it makes sense to have an invalidate_page too
> (which remains the only difference by now), removing it would be a
> regression on all sides, even if a minor one.

I think GRU is the only compelling case I have heard for having the
invalidate_page separate.  In the case of the GRU, the hardware enforces a
lifetime of the invalidate which covers all in-progress faults including
ones where the hardware is informed after the flush of a PTE.  In all
cases, once the GRU invalidate instruction is issued, all active requests
are invalidated.  Future faults will be blocked in get_user_pages().
Without that special feature of the hardware, I don't think any code
simplification exists.  I, of course, reserve the right to be wrong.

I believe the argument against a separate invalidate_page() callout was
Christoph's interpretation of Andrew's comments.  I am not certain Andrew
was aware of these special aspects of the GRU hardware and whether that
had been factored into the discussion at that point in time.


Thanks,
Robin



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Avi Kivity
Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote:
>   
>> Andrea Arcangeli a écrit :
>> 
>>> +
>>> +static int mm_lock_cmp(const void *a, const void *b)
>>> +{
>>> +   cond_resched();
>>> +   if ((unsigned long)*(spinlock_t **)a <
>>> +   (unsigned long)*(spinlock_t **)b)
>>> +   return -1;
>>> +   else if (a == b)
>>> +   return 0;
>>> +   else
>>> +   return 1;
>>> +}
>>> +
>>>   
>> This compare function looks unusual...
>> It should work, but sort() could be faster if the
>> if (a == b) test had a chance to be true eventually...
>> 
>
> Hmm, are you saying my mm_lock_cmp won't return 0 if a==b?
>
>   

You need to compare *a to *b (at least, that's what you're doing for the 
< case).

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-22 Thread Jamie Lokier
Avi Kivity wrote:
> Anthony Liguori wrote:
> >>If I submit sequential O_DIRECT reads with aio_read(), will they enter
> >>the device read queue in the same order, and reach the disk in that
> >>order (allowing for reordering when worthwhile by the elevator)?
> >>  
> >There's no guarantee that any sort of order will be preserved by AIO 
> >requests.  The same is true with writes.  This is what fdsync is for, 
> >to guarantee ordering.
> 
> I believe he'd like a hint to get good scheduling, not a guarantee.  
> With a thread pool if the threads are scheduled out of order, so are 
> your requests.

> If the elevator doesn't plug the queue, the first few requests may
> not be optimally sorted.

That's right.  Then they tend to settle to a good order.  But any
delay in scheduling one of the threads, or a signal received by one of
them, can make it lose order briefly, making the streaming stutter as
the disk performs a few local seeks until it settles to good order
again.

You can mitigate the disruption in various ways.

  1. If all threads share an "offset" variable, and each thread reads
     and increments it atomically just prior to calling pread(), that
     helps especially at the start.  (If threaded I/O is used for QEMU
     disk emulation, I would suggest doing that, in the more general
     form of popping a request from QEMU's internal shared queue at the
     last moment.)

  2. Using more threads helps keep it sustained, at the cost of more
     wasted I/O when there's a cancellation (changed mind), and more
     memory.
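
The first mitigation could be sketched like this, using C11 atomics.
struct stream_sketch and the function names are made up for the
example, and thread creation, error handling and O_DIRECT buffer
alignment are all left out.  The only point is that the offset is
claimed immediately before pread(), so whichever thread gets there
first takes the lowest remaining offset.

```c
#include <stdatomic.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical shared state for a pool of reader threads. */
struct stream_sketch {
	int fd;				/* file being streamed */
	size_t chunk;			/* bytes per request */
	_Atomic long long next_off;	/* next unclaimed file offset */
};

/* Atomically claim the next sequential offset; each caller gets its own. */
static off_t stream_claim(struct stream_sketch *s)
{
	return (off_t)atomic_fetch_add(&s->next_off, (long long)s->chunk);
}

/*
 * One worker step: claim at the last moment, then issue the read.
 * The claimed offset is reported through off_out so the caller knows
 * where the data in buf belongs.
 */
static ssize_t stream_read_next(struct stream_sketch *s, void *buf,
				off_t *off_out)
{
	off_t off = stream_claim(s);

	*off_out = off;
	return pread(s->fd, buf, s->chunk, off);
}
```

A thread can still be preempted between the claim and the pread(), so
this narrows the window for reordering rather than closing it.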

However, AIO, in principle (if not implementations...) could be better
at keeping the suggested I/O order than threads, without special tricks.

-- Jamie



Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Marcelo Tosatti
On Tue, Apr 22, 2008 at 05:32:45PM +0300, Avi Kivity wrote:
> Anthony Liguori wrote:
> > This patch changes virtio devices to be multi-function devices whenever
> > possible.  This increases the number of virtio devices we can support now by
> > a factor of 8.
> >
> > With this patch, I've been able to launch a guest with either 220 disks or 
> > 220
> > network adapters.
> >
> >   
> 
> Does this play well with hotplug?  Perhaps we need to allocate a new 
> device on hotplug.
> 
> (certainly if we have a device with one function, which then gets 
> converted to a multifunction device)

Would have to change the hotplug code to handle functions...

It sounds less hacky to just extend the PCI slots instead of (ab)using
multiple functions per-slot.
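
For anyone following along, the factor of 8 comes from how a device is
addressed on a PCI bus: 5 bits of device (slot) number and 3 bits of
function number packed into one byte.  A small sketch of that encoding,
equivalent to what the kernel's PCI_DEVFN/PCI_SLOT/PCI_FUNC macros do
(the lowercase helper names here are just for the example):

```c
#include <stdint.h>

#define SLOTS_PER_BUS	32	/* 5-bit device number */
#define FUNCS_PER_SLOT	8	/* 3-bit function number */

/* Pack slot and function into the 8-bit devfn used on the bus. */
static uint8_t pci_devfn(unsigned int slot, unsigned int func)
{
	return (uint8_t)(((slot & 0x1f) << 3) | (func & 0x07));
}

/* Recover the slot (device) number from a devfn. */
static unsigned int pci_slot(uint8_t devfn)
{
	return devfn >> 3;
}

/* Recover the function number from a devfn. */
static unsigned int pci_func(uint8_t devfn)
{
	return devfn & 0x07;
}
```

So one bus tops out at 32 slots, or 256 devfn values if every slot is
populated as a multifunction device.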



Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Anthony Liguori
Ryan Harper wrote:
> * Anthony Liguori <[EMAIL PROTECTED]> [2008-04-22 09:16]:
>   
>> This patch changes virtio devices to be multi-function devices whenever
>> possible.  This increases the number of virtio devices we can support now by
>> a factor of 8.
>>
>> With this patch, I've been able to launch a guest with either 220 disks or 
>> 220
>> network adapters.
>> 
>
> Have you confirmed that the network devices show up?  I was playing
> around with some of the limits last night and while it is easy to get
> QEMU to create the adapters, so far I've only had a guest see 29 pci
> nics (e1000).
>   

Yup, I had an eth219

Regards,

Anthony Liguori





Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote:
> Andrea Arcangeli a écrit :
>> +
>> +static int mm_lock_cmp(const void *a, const void *b)
>> +{
>> +        cond_resched();
>> +        if ((unsigned long)*(spinlock_t **)a <
>> +            (unsigned long)*(spinlock_t **)b)
>> +                return -1;
>> +        else if (a == b)
>> +                return 0;
>> +        else
>> +                return 1;
>> +}
>> +
> This compare function looks unusual...
> It should work, but sort() could be faster if the
> if (a == b) test had a chance to be true eventually...

Hmm, are you saying my mm_lock_cmp won't return 0 if a==b?

> static int mm_lock_cmp(const void *a, const void *b)
> {
>         unsigned long la = (unsigned long)*(spinlock_t **)a;
>         unsigned long lb = (unsigned long)*(spinlock_t **)b;
>
>         cond_resched();
>         if (la < lb)
>                 return -1;
>         if (la > lb)
>                 return 1;
>         return 0;
> }

If your intent is to use the assumption that there are going to be few
equal entries, you should have used likely(la > lb) to signal it's
rarely going to return zero or gcc is likely free to do whatever it
wants with the above. Overall that function is such a slow path that
this is going to be lost in the noise. My suggestion would be to defer
microoptimizations like this until after 1/12 is applied to mainline.

Thanks!




Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Ryan Harper
* Anthony Liguori <[EMAIL PROTECTED]> [2008-04-22 09:16]:
> This patch changes virtio devices to be multi-function devices whenever
> possible.  This increases the number of virtio devices we can support now by
> a factor of 8.
> 
> With this patch, I've been able to launch a guest with either 220 disks or 220
> network adapters.

Have you confirmed that the network devices show up?  I was playing
around with some of the limits last night and while it is easy to get
QEMU to create the adapters, so far I've only had a guest see 29 pci
nics (e1000).


-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
[EMAIL PROTECTED]



Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-22 Thread Jamie Lokier
Anthony Liguori wrote:
> >Perhaps.  This raises another point about AIO vs. threads:
> >
> >If I submit sequential O_DIRECT reads with aio_read(), will they enter
> >the device read queue in the same order, and reach the disk in that
> >order (allowing for reordering when worthwhile by the elevator)?
> 
> There's no guarantee that any sort of order will be preserved by AIO 
> requests.  The same is true with writes.  This is what fdsync is for, to 
> guarantee ordering.

You misunderstand.  I'm not talking about guarantees, I'm talking
about expectations for the performance effect.

Basically, to do performant streaming read with O_DIRECT you need two
things:

   1. Overlap at least 2 requests, so the device is kept busy.

   2. Send requests to the disk in a good order, which is usually
      (but not always) sequential offset order.

The kernel does this itself with buffered reads, doing readahead.
It works very well, unless you have other problems caused by readahead.

With O_DIRECT, an application has to do the equivalent of readahead
itself to get performant streaming.

If the app uses two threads calling pread(), it's hard to ensure the
kernel even _sees_ the first two calls in sequential offset order.
You spawn two threads, and then both threads call pread() with
non-deterministic scheduling.  The problem starts before even entering
the kernel.

Then, depending on I/O scheduling in the kernel, it might send the
less good pread() to the disk immediately, then later a backward head
seek and the other one.  The elevator cannot fix this: it doesn't have
enough information, unless it adds artificial delays.  But artificial
delays may harm too; it's not optimal.

After that, the two threads tend to call pread() in the best order
provided there's no scheduling conflicts, but are easily disrupted by
other tasks, especially on SMP (one reading thread per CPU, so when
one of them is descheduled, the other continues and issues a request
in the 'wrong' order.)

With AIO, even though you can't be sure what the kernel does, you can
be sure the kernel receives aio_read() calls in the exact order which
is most likely to perform well.  Application knowledge of its access
pattern is passed along better.
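
A minimal sketch of that submission pattern with POSIX AIO, assuming
nothing about how the implementation orders things internally.
read_ahead_in_order() and NREQ are hypothetical names. O_DIRECT is left
out here because it needs suitably aligned buffers and offsets; a real
streaming reader would open the file with O_DIRECT and allocate aligned
memory. On older glibc the AIO functions live in librt, so linking may
need -lrt.

```c
#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

#define NREQ 2  /* overlap at least two requests so the device stays busy */

/*
 * Read NREQ consecutive chunks of `chunk` bytes starting at `start` into buf.
 * Requests are submitted in ascending offset order; what happens after
 * submission is up to the AIO implementation and the I/O scheduler.
 * Returns 0 on success, -1 on error or short read.
 */
int read_ahead_in_order(int fd, off_t start, char *buf, size_t chunk)
{
    struct aiocb cbs[NREQ];
    int submitted = 0;
    int ok = 1;
    int i;

    memset(cbs, 0, sizeof(cbs));

    /* Hand the requests to the AIO layer in ascending offset order. */
    for (i = 0; i < NREQ; i++) {
        cbs[i].aio_fildes = fd;
        cbs[i].aio_offset = start + (off_t)i * (off_t)chunk;
        cbs[i].aio_buf = buf + (size_t)i * chunk;
        cbs[i].aio_nbytes = chunk;
        if (aio_read(&cbs[i]) != 0) {
            ok = 0;
            break;
        }
        submitted++;
    }

    /* Wait for everything that was submitted; the aiocbs live on this stack. */
    for (i = 0; i < submitted; i++) {
        const struct aiocb *const wait_list[1] = { &cbs[i] };
        int err;
        ssize_t got;

        while ((err = aio_error(&cbs[i])) == EINPROGRESS)
            aio_suspend(wait_list, 1, NULL);
        got = aio_return(&cbs[i]);
        if (err != 0 || got != (ssize_t)chunk)
            ok = 0;
    }
    return ok ? 0 : -1;
}
```

With two threads calling pread() the same two offsets would be issued,
but the order in which the kernel sees them would depend on scheduling.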

As I've said, I saw a man page which described why this makes AIO
superior to using threads for reading tapes on that OS.  So it's not a
completely spurious point.

This has nothing to do with guarantees.

-- Jamie



Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Luca Tettamanti
On Tue, Apr 22, 2008 at 4:15 PM, Anthony Liguori <[EMAIL PROTECTED]> wrote:
> This patch changes virtio devices to be multi-function devices whenever
>  possible.  This increases the number of virtio devices we can support now by
>  a factor of 8.
[...]
>  diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c
>  index 9100bb1..9ea14d3 100644
>  --- a/qemu/hw/virtio.c
>  +++ b/qemu/hw/virtio.c
>  @@ -405,9 +405,18 @@ VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,
>  PCIDevice *pci_dev;
>  uint8_t *config;
>  uint32_t size;
>  +static int devfn = 7;
>  +
>  +if ((devfn % 8) == 7)
>  +   devfn = -1;
>  +else
>  +   devfn++;

This code looks strange... devfn should be passed to virtio_init_pci by
virtio-{net,blk} init functions, no?
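
A rough sketch of the alternative, keeping the counter with the caller
instead of in a function-local static. virtio_next_devfn() and DEVFN_AUTO
are hypothetical names; the macros are local definitions of the usual PCI
slot/function packing, and -1 is assumed to mean "let the bus pick a free
slot", as in the quoted pci_register_device() call.

```c
/* PCI packs slot and function into one byte: bits 7..3 slot, bits 2..0 function. */
#define PCI_DEVFN(slot, func) ((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define PCI_SLOT(devfn)       (((devfn) >> 3) & 0x1f)
#define PCI_FUNC(devfn)       ((devfn) & 0x07)

#define DEVFN_AUTO (-1)  /* assumed: ask the bus to choose a free slot */

/*
 * Hypothetical helper: given the devfn of the previously registered virtio
 * device, return the devfn to request for the next one. Mirrors the quoted
 * logic without hidden static state.
 */
int virtio_next_devfn(int prev_devfn)
{
    /* Nothing allocated yet, or the slot's eight functions are used up. */
    if (prev_devfn == DEVFN_AUTO || PCI_FUNC(prev_devfn) == 7)
        return DEVFN_AUTO;
    /* Otherwise take the next function in the same slot. */
    return prev_devfn + 1;
}
```

The virtio-net and virtio-blk init functions could then carry the last
devfn themselves and pass the result down explicitly.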

Luca



Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-22 Thread Avi Kivity
Anthony Liguori wrote:
>>
>> If I submit sequential O_DIRECT reads with aio_read(), will they enter
>> the device read queue in the same order, and reach the disk in that
>> order (allowing for reordering when worthwhile by the elevator)?
>>   
>
> There's no guarantee that any sort of order will be preserved by AIO 
> requests.  The same is true with writes.  This is what fdsync is for, 
> to guarantee ordering.

I believe he'd like a hint to get good scheduling, not a guarantee.  
With a thread pool if the threads are scheduled out of order, so are 
your requests.  If the elevator doesn't plug the queue, the first few 
requests may not be optimally sorted.

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [RFC] linuxboot Option ROM for Linux kernel booting

2008-04-22 Thread Anthony Liguori
Nguyen Anh Quynh wrote:
> Hi,
>
> This should be submitted to upstream (but not to kvm-devel list), but
> this is only the test code that I want to quickly send out for
> comments. In case it looks OK, I will send it to upstream later.
>
> Inspired by extboot and conversations with Anthony and HPA, this
> linuxboot option ROM is a simple option ROM that intercepts int19 in
> order to execute linux setup code. This approach eliminates the need
> to manipulate the boot sector for this purpose.
>
> To test it, just load linux kernel with your KVM/QEMU image using
> -kernel option in normal way.
>
> I successfully compiled and tested it with kvm-66 on Ubuntu 7.10, guest
> Ubuntu 8.04.
>   

For the next rounds, could you actually rebase against upstream QEMU and 
submit to qemu-devel?  One of Paul Brook's objections to extboot had 
historically been that it wasn't easily sharable with other 
architectures.  With a C version, it seems more reasonable now to do that.

Make sure you remove all the old linux boot code too within QEMU along 
with the -hda checks.

Regards,

Anthony Liguori

> Thanks,
> Quynh
>
>
> # diffstat linuxboot1.diff
>  Makefile             |   13 -
>  linuxboot/Makefile   |   40 +++
>  linuxboot/boot.S     |   54 +
>  linuxboot/farvar.h   |  130 +++
>  linuxboot/rom.c      |  104
>  linuxboot/signrom    |binary
>  linuxboot/signrom.c  |  128 ++
>  linuxboot/util.h     |   69 +++
>  qemu/Makefile        |    3 -
>  qemu/Makefile.target |    2
>  qemu/hw/linuxboot.c  |   39 +++
>  qemu/hw/pc.c         |   22 +++-
>  qemu/hw/pc.h         |    5 +
>  13 files changed, 600 insertions(+), 9 deletions(-)
>   




Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-22 Thread Avi Kivity
Jamie Lokier wrote:
> Avi Kivity wrote:
>   
>>> And video streaming on some embedded devices with no MMU!  (Due to the
>>> page cache heuristics working poorly with no MMU, sustained reliable
>>> streaming is managed with O_DIRECT and the app managing cache itself
>>> (like a database), and that needs AIO to keep the request queue busy.
>>> At least, that's the theory.)
>>>   
>> Could use threads as well, no?
>> 
>
> Perhaps.  This raises another point about AIO vs. threads:
>
> If I submit sequential O_DIRECT reads with aio_read(), will they enter
> the device read queue in the same order, and reach the disk in that
> order (allowing for reordering when worthwhile by the elevator)?
>   

Yes, unless the implementation in the kernel (or glibc) is threaded.

> With threads this isn't guaranteed and scheduling makes it quite
> likely to issue the parallel synchronous reads out of order, and for
> them to reach the disk out of order because the elevator doesn't see
> them simultaneously.
>   

If the disk is busy, it doesn't matter.  The requests will queue and the 
elevator will sort them out.  So it's just the first few requests that 
may get to disk out of order.

> With AIO (non-Glibc! (and non-kthreads)) it might be better at
> keeping the intended issue order, I'm not sure.
>
> It is highly desirable: O_DIRECT streaming performance depends on
> avoiding seeks (no reordering) and on keeping the request queue
> non-empty (no gap).
>
> I read a man page for some other unix, describing AIO as better than
> threaded parallel reads for reading tape drives because of this (tape
> seeks are very expensive).  But the rest of the man page didn't say
> anything more.  Unfortunately I don't remember where I read it.  I
> have no idea whether AIO submission order is nearly always preserved
> in general, or expected to be.
>   

I haven't considered tape, but this is a good point indeed.  I expect it 
doesn't make much of a difference for a loaded disk.

>   
>> It's me at fault here.  I just assumed that because it's easy to do aio 
>> in a thread pool efficiently, that's what glibc does.
>>
>> Unfortunately the code does some ridiculous things like not service 
>> multiple requests on a single fd in parallel.  I see absolutely no 
>> reason for it (the code says "fight for resources").
>> 
>
> Ouch.  Perhaps that relates to my thought above, about multiple
> requests to the same file causing seek storms when thread scheduling
> is unlucky?
>   

My first thought on seeing this is that it relates to a deficiency on 
older kernels servicing multiple requests on a single fd (i.e. a 
per-file lock).  I don't know if such a deficiency ever existed, though.

>   
>> It could and should.  It probably doesn't.
>>
>> A simple thread pool implementation could come within 10% of Linux aio 
>> for most workloads.  It will never be "exactly", but for small numbers 
>> of disks, close enough.
>> 
>
> I would wait for benchmark results for I/O patterns like sequential
> reading and writing, because of potential for seeks caused by request
> reordering, before being confident of that.
>
>   

I did have measurements (and a test rig) at a previous job (where I did 
a lot of I/O work); IIRC the performance of a tuned thread pool was not 
far behind aio, both for seeks and sequential.  It was a while back though.


-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [Qemu-devel] Re: [RFC] linuxboot Option ROM for Linux kernel booting

2008-04-22 Thread Laurent Vivier

On Tuesday, 22 April 2008 at 08:50 -0500, Anthony Liguori wrote:
> Nguyen Anh Quynh wrote:
> > Hi,
> >
> > This should be submitted to upstream (but not to kvm-devel list), but
> > this is only the test code that I want to quickly send out for
> > comments. In case it looks OK, I will send it to upstream later.
> >
> > Inspired by extboot and conversations with Anthony and HPA, this
> > linuxboot option ROM is a simple option ROM that intercepts int19 in
> > order to execute linux setup code. This approach eliminates the need
> > to manipulate the boot sector for this purpose.
> >
> > To test it, just load linux kernel with your KVM/QEMU image using
> > -kernel option in normal way.
> >
> > I successfully compiled and tested it with kvm-66 on Ubuntu 7.10, guest
> > Ubuntu 8.04.
> >   
> 
> For the next rounds, could you actually rebase against upstream QEMU and 
> submit to qemu-devel?  One of Paul Brook's objections to extboot had 
> historically been that it wasn't easily sharable with other 
> architectures.  With a C version, it seems more reasonable now to do that.

Moreover, add a binary version of the ROM to the pc-bios directory: that
avoids needing a cross-compiler to build the ROM on non-x86 architectures.

Regards,
Laurent

> Make sure you remove all the old linux boot code too within QEMU along 
> with the -hda checks.
> 
> Regards,
> 
> Anthony Liguori
> 
> > Thanks,
> > Quynh
> >
> >
> > # diffstat linuxboot1.diff
> >  Makefile             |   13 -
> >  linuxboot/Makefile   |   40 +++
> >  linuxboot/boot.S     |   54 +
> >  linuxboot/farvar.h   |  130 +++
> >  linuxboot/rom.c      |  104
> >  linuxboot/signrom    |binary
> >  linuxboot/signrom.c  |  128 ++
> >  linuxboot/util.h     |   69 +++
> >  qemu/Makefile        |    3 -
> >  qemu/Makefile.target |    2
> >  qemu/hw/linuxboot.c  |   39 +++
> >  qemu/hw/pc.c         |   22 +++-
> >  qemu/hw/pc.h         |    5 +
> >  13 files changed, 600 insertions(+), 9 deletions(-)
> >   
> 
> 
> 
> 
-- 
- [EMAIL PROTECTED] ---
"The best way to predict the future is to invent it."
- Alan Kay




Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-22 Thread Anthony Liguori
Jamie Lokier wrote:
> Avi Kivity wrote:
>   
>>> And video streaming on some embedded devices with no MMU!  (Due to the
>>> page cache heuristics working poorly with no MMU, sustained reliable
>>> streaming is managed with O_DIRECT and the app managing cache itself
>>> (like a database), and that needs AIO to keep the request queue busy.
>>> At least, that's the theory.)
>>>   
>> Could use threads as well, no?
>> 
>
> Perhaps.  This raises another point about AIO vs. threads:
>
> If I submit sequential O_DIRECT reads with aio_read(), will they enter
> the device read queue in the same order, and reach the disk in that
> order (allowing for reordering when worthwhile by the elevator)?
>   

There's no guarantee that any sort of order will be preserved by AIO 
requests.  The same is true with writes.  This is what fdsync is for, to 
guarantee ordering.

Regards,

Anthony Liguori



Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Avi Kivity
Anthony Liguori wrote:
> Avi Kivity wrote:
>> Anthony Liguori wrote:
>>  
>>> This patch changes virtio devices to be multi-function devices whenever
>>> possible.  This increases the number of virtio devices we can 
>>> support now by
>>> a factor of 8.
>>>
>>> With this patch, I've been able to launch a guest with either 220 
>>> disks or 220
>>> network adapters.
>>>
>>>   
>>
>> Does this play well with hotplug?  Perhaps we need to allocate a new 
>> device on hotplug.
>>   
>
> Probably not.  I imagine you can only hotplug devices, not individual 
> functions?
>

It sounds reasonable to expect so.  ACPI has objects for devices, not 
functions (IIRC).

Maybe require explicit device/function assignment on the command line?  
It will be managed anyway.

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Anthony Liguori
Avi Kivity wrote:
> Anthony Liguori wrote:
>   
>> This patch changes virtio devices to be multi-function devices whenever
>> possible.  This increases the number of virtio devices we can support now by
>> a factor of 8.
>>
>> With this patch, I've been able to launch a guest with either 220 disks or 
>> 220
>> network adapters.
>>
>>   
>> 
>
> Does this play well with hotplug?  Perhaps we need to allocate a new 
> device on hotplug.
>   

Probably not.  I imagine you can only hotplug devices, not individual 
functions?

Regards,

Anthony Liguori

> (certainly if we have a device with one function, which then gets 
> converted to a multifunction device)
>
>   




Re: [kvm-devel] KVM: PIT: make last_injected_time per-guest

2008-04-22 Thread Avi Kivity
Marcelo Tosatti wrote:
> Otherwise multiple guests use the same variable and boom.
>
> Also use kvm_vcpu_kick() to make sure that if a timer triggers on 
> a different CPU the event won't be missed.
>
>   

Applied, thanks.

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.

2008-04-22 Thread Rusty Russell
On Tuesday 22 April 2008 17:44:08 Christian Borntraeger wrote:
> On Tuesday, 22 April 2008, Rusty Russell wrote:
> > [Christian, Hollis, how much is this ABI breakage going to hurt you?]
>
> It is ok for s390 at the moment. We are still working on making userspace
> ready and I plan to change the guest<->host for s390 anyway. I try to make
> these changes for drivers/s390/kvm/kvm_virtio.c before 2.6.26. The main
> reason is, that we are currently limited to around 80 devices. I am not
> sure, if I should change the allocation of the virtqueues and descriptors
> to guest memory as well.

Large rings require contiguous memory, which makes guest allocation 
problematic.  512 elems at 4k pages == 5 pages.
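
The arithmetic behind that figure can be sketched as below. The struct
layouts are assumed to match the virtio ring ABI of this era (descriptor
table, then available ring, then the used ring on its own aligned
boundary, with no event-index fields); vring_bytes() and vring_pages()
are hypothetical helper names.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed ring element layouts: 16-byte descriptors, 8-byte used elements. */
struct vring_desc      { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };
struct vring_used_elem { uint32_t id; uint32_t len; };

/*
 * Bytes needed for a ring of `num` entries when the used ring must start
 * on an `align`-byte boundary (align must be a power of two).
 */
size_t vring_bytes(size_t num, size_t align)
{
    size_t desc  = sizeof(struct vring_desc) * num;
    size_t avail = sizeof(uint16_t) * (2 + num);          /* flags, idx, ring[num] */
    size_t used  = sizeof(uint16_t) * 2                   /* flags, idx */
                 + sizeof(struct vring_used_elem) * num;  /* ring[num] */

    return ((desc + avail + align - 1) & ~(align - 1)) + used;
}

/* Number of physically contiguous pages the ring occupies. */
size_t vring_pages(size_t num, size_t page_size)
{
    return (vring_bytes(num, page_size) + page_size - 1) / page_size;
}
```

For num = 512 and 4k pages: 8192 + 1028 bytes round up to 12288, plus
4100 for the used ring gives 16388 bytes, which just spills into a fifth
page.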

> Back to your patch:
> I have still some ideas about virtio between little endian and big endian
> systems, but it requires more and different marshalling anyway - even on
> driver level. No idea yet how to solve that properly.

So far we've pushed such considerations onto the host.  This does mean that 
you can't virtio connect two guests directly without understanding the 
contents of the buffers so you can endian correct (eg. direct inter-guest 
networking).  inter-guest virtio is currently a party trick anyway, so I'm 
not sure it's a real issue.

> > +   vb->vdev->config->get(vb->vdev,
> > + offsetof(struct virtio_balloon_config, num_pages),
> > + &v);
>
> this is missing a sizeof(v), no?

Ah... sure enough, I fixed that in a follow-on patch.  Well-spotted, thanks!
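
For reference, a userspace sketch of an accessor that takes the
destination length explicitly, which is the argument the quoted hunk left
out. This is not the kernel interface: config_get(),
balloon_read_num_pages() and struct balloon_config are hypothetical
stand-ins, with only the num_pages field name taken from the hunk.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the balloon config layout (hypothetical, for illustration). */
struct balloon_config {
    uint32_t num_pages;
    uint32_t actual;
};

/*
 * Hypothetical accessor: copy `len` bytes at `offset` out of a config blob.
 * Returns 0 on success, -1 if the request runs past the end of the blob.
 */
int config_get(const void *cfg, size_t cfg_len, size_t offset,
               void *buf, size_t len)
{
    if (offset > cfg_len || len > cfg_len - offset)
        return -1;
    memcpy(buf, (const unsigned char *)cfg + offset, len);
    return 0;
}

/* Read num_pages, passing the size of the destination explicitly. */
int balloon_read_num_pages(const struct balloon_config *cfg, uint32_t *out)
{
    return config_get(cfg, sizeof(*cfg),
                      offsetof(struct balloon_config, num_pages),
                      out, sizeof(*out));
}
```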

Cheers,
Rusty.



Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.

2008-04-22 Thread Rusty Russell
On Tuesday 22 April 2008 21:22:48 Avi Kivity wrote:
> > The virtio config space was originally chosen to be little-endian,
> > because we thought the config might be part of the PCI config space
> > for virtio_pci.  It's actually a separate mmio region, so that
> > argument holds little water; as only x86 is currently using the virtio
> > mechanism, we can change this (but must do so now, before the
> > impending s390 and ppc merges).
>
> This will probably annoy Hollis which has guests that can go both ways.

Yes, I discussed this with Hollis.  But the virtio rings themselves already 
have this issue: we don't do any endian conversion on them and assume 
they're "our" endian in the guest.

We may still regret not doing *everything* little-endian, but this doesn't 
make it worse.
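
To make the difference concrete, a small sketch of the two ways a 32-bit
config field could be read. read_le32(), read_native32() and
host_is_little_endian() are hypothetical helpers, not virtio code.

```c
#include <stdint.h>
#include <string.h>

/* Fixed little-endian decode: same result whatever the reader's byte order. */
uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* "Guest endian" read: the bytes are interpreted in the reader's own order. */
uint32_t read_native32(const unsigned char *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof(v));
    return v;
}

/* Returns 1 on a little-endian machine, 0 on a big-endian one. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;

    memcpy(&first, &probe, 1);
    return first == 1;
}
```

With guest-endian config, whoever fills in the bytes has to know the
guest's byte order, which is the same assumption the rings already make.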

Thanks,
Rusty.




Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Avi Kivity
Anthony Liguori wrote:
> This patch changes virtio devices to be multi-function devices whenever
> possible.  This increases the number of virtio devices we can support now by
> a factor of 8.
>
> With this patch, I've been able to launch a guest with either 220 disks or 220
> network adapters.
>
>   

Does this play well with hotplug?  Perhaps we need to allocate a new 
device on hotplug.

(certainly if we have a device with one function, which then gets 
converted to a multifunction device)

-- 
error compiling committee.c: too many arguments to function




[kvm-devel] KVM: PIT: make last_injected_time per-guest

2008-04-22 Thread Marcelo Tosatti

Otherwise multiple guests use the same variable and boom.

Also use kvm_vcpu_kick() to make sure that if a timer triggers on 
a different CPU the event won't be missed.

Signed-off-by: Marcelo Tosatti <[EMAIL PROTECTED]>
Tested-and-Acked-by: Alex Davis <[EMAIL PROTECTED]>

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 2852dd1..5697ad2 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -200,10 +200,8 @@ int __pit_timer_fn(struct kvm_kpit_state *ps)
 
atomic_inc(&pt->pending);
smp_mb__after_atomic_inc();
-   if (vcpu0 && waitqueue_active(&vcpu0->wq)) {
-   vcpu0->arch.mp_state = KVM_MP_STATE_RUNNABLE;
-   wake_up_interruptible(&vcpu0->wq);
-   }
+   if (vcpu0)
+   kvm_vcpu_kick(vcpu0);
 
pt->timer.expires = ktime_add_ns(pt->timer.expires, pt->period);
pt->scheduled = ktime_to_ns(pt->timer.expires);
@@ -572,7 +570,6 @@ void kvm_inject_pit_timer_irqs(struct kvm_vcpu *vcpu)
struct kvm_pit *pit = vcpu->kvm->arch.vpit;
struct kvm *kvm = vcpu->kvm;
struct kvm_kpit_state *ps;
-   static unsigned long last_injected_time;
 
if (vcpu && pit) {
ps = &pit->pit_state;
@@ -582,11 +579,11 @@ void kvm_inject_pit_timer_irqs(struct kvm_vcpu *vcpu)
 * 2. Last interrupt was accepted or waited for too long time*/
if (atomic_read(&ps->pit_timer.pending) &&
(ps->inject_pending ||
-   (jiffies - last_injected_time
+   (jiffies - ps->last_injected_time
>= KVM_MAX_PIT_INTR_INTERVAL))) {
ps->inject_pending = 0;
__inject_pit_timer_intr(kvm);
-   last_injected_time = jiffies;
+   ps->last_injected_time = jiffies;
}
}
 }
diff --git a/arch/x86/kvm/i8254.h b/arch/x86/kvm/i8254.h
index e63ef38..db25c2a 100644
--- a/arch/x86/kvm/i8254.h
+++ b/arch/x86/kvm/i8254.h
@@ -35,6 +35,7 @@ struct kvm_kpit_state {
struct mutex lock;
struct kvm_pit *pit;
bool inject_pending; /* if inject pending interrupts */
+   unsigned long last_injected_time;
 };
 
 struct kvm_pit {



[kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Anthony Liguori
This patch changes virtio devices to be multi-function devices whenever
possible.  This increases the number of virtio devices we can support now by
a factor of 8.

With this patch, I've been able to launch a guest with either 220 disks or 220
network adapters.

I haven't tested the Windows virtio drivers.

Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]>

diff --git a/qemu/hw/pci.h b/qemu/hw/pci.h
index 60e4094..df3a878 100644
--- a/qemu/hw/pci.h
+++ b/qemu/hw/pci.h
@@ -33,7 +33,7 @@ typedef struct PCIIORegion {
 #define PCI_ROM_SLOT 6
 #define PCI_NUM_REGIONS 7
 
-#define PCI_DEVICES_MAX 64
+#define PCI_DEVICES_MAX 256
 
 #define PCI_VENDOR_ID  0x00/* 16 bits */
 #define PCI_DEVICE_ID  0x02/* 16 bits */
diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c
index 9100bb1..9ea14d3 100644
--- a/qemu/hw/virtio.c
+++ b/qemu/hw/virtio.c
@@ -405,9 +405,18 @@ VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,
 PCIDevice *pci_dev;
 uint8_t *config;
 uint32_t size;
+static int devfn = 7;
+
+if ((devfn % 8) == 7)
+   devfn = -1;
+else
+   devfn++;
 
 pci_dev = pci_register_device(bus, name, struct_size,
- -1, NULL, NULL);
+ devfn, NULL, NULL);
+
+devfn = pci_dev->devfn;
+
 vdev = to_virtio_device(pci_dev);
 
 vdev->status = 0;
@@ -435,6 +444,10 @@ VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,
 
 config[0x3d] = 1;
 
+/* Mark device as multi-function */
+if ((devfn % 8) == 0)
+   config[0x0e] |= 0x80;
+
 vdev->name = name;
 vdev->config_len = config_size;
 if (vdev->config_len)
diff --git a/qemu/net.h b/qemu/net.h
index 13daa27..3bada75 100644
--- a/qemu/net.h
+++ b/qemu/net.h
@@ -42,7 +42,7 @@ void net_client_uninit(NICInfo *nd);
 
 /* NIC info */
 
-#define MAX_NICS 8
+#define MAX_NICS 256
 
 struct NICInfo {
 uint8_t macaddr[6];
diff --git a/qemu/sysemu.h b/qemu/sysemu.h
index b645fb7..7992a77 100644
--- a/qemu/sysemu.h
+++ b/qemu/sysemu.h
@@ -151,7 +151,7 @@ typedef struct DriveInfo {
 
 #define MAX_IDE_DEVS   2
 #define MAX_SCSI_DEVS  7
-#define MAX_DRIVES 32
+#define MAX_DRIVES 256
 
 int nb_drives;
 DriveInfo drives_table[MAX_DRIVES+1];
diff --git a/qemu/vl.c b/qemu/vl.c
index 7dd0094..e203a4d 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -8754,7 +8754,7 @@ static BOOL WINAPI qemu_ctrl_handler(DWORD type)
 }
 #endif
 
-#define MAX_NET_CLIENTS 32
+#define MAX_NET_CLIENTS 512
 
 static int saved_argc;
 static char **saved_argv;



[kvm-devel] [PATCH 08 of 12] The conversion to a rwsem allows notifier callbacks during rmap traversal

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli <[EMAIL PROTECTED]>
# Date 1208872187 -7200
# Node ID 6e04df1f4284689b1c46e57a67559abe49ecf292
# Parent  8965539f4d174c79bd37e58e8b037d5db906e219
The conversion to a rwsem allows notifier callbacks during rmap traversal
for files. A rw style lock also allows concurrent walking of the
reverse map so that multiple processors can expire pages in the same memory
area of the same process. So it increases the potential concurrency.

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

diff --git a/Documentation/vm/locking b/Documentation/vm/locking
--- a/Documentation/vm/locking
+++ b/Documentation/vm/locking
@@ -66,7 +66,7 @@
 expand_stack(), it is hard to come up with a destructive scenario without 
 having the vmlist protection in this case.
 
-The page_table_lock nests with the inode i_mmap_lock and the kmem cache
+The page_table_lock nests with the inode i_mmap_sem and the kmem cache
 c_spinlock spinlocks.  This is okay, since the kmem code asks for pages after
 dropping c_spinlock.  The page_table_lock also nests with pagecache_lock and
 pagemap_lru_lock spinlocks, and no code asks for memory with these locks
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -69,7 +69,7 @@
if (!vma_shareable(vma, addr))
return;
 
-   spin_lock(&mapping->i_mmap_lock);
+   down_read(&mapping->i_mmap_sem);
vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
if (svma == vma)
continue;
@@ -94,7 +94,7 @@
put_page(virt_to_page(spte));
spin_unlock(&mm->page_table_lock);
 out:
-   spin_unlock(&mapping->i_mmap_lock);
+   up_read(&mapping->i_mmap_sem);
 }
 
 /*
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -454,10 +454,10 @@
pgoff = offset >> PAGE_SHIFT;
 
i_size_write(inode, offset);
-   spin_lock(&mapping->i_mmap_lock);
+   down_read(&mapping->i_mmap_sem);
if (!prio_tree_empty(&mapping->i_mmap))
hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
-   spin_unlock(&mapping->i_mmap_lock);
+   up_read(&mapping->i_mmap_sem);
truncate_hugepages(inode, offset);
return 0;
 }
diff --git a/fs/inode.c b/fs/inode.c
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -210,7 +210,7 @@
INIT_LIST_HEAD(&inode->i_devices);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
rwlock_init(&inode->i_data.tree_lock);
-   spin_lock_init(&inode->i_data.i_mmap_lock);
+   init_rwsem(&inode->i_data.i_mmap_sem);
INIT_LIST_HEAD(&inode->i_data.private_list);
spin_lock_init(&inode->i_data.private_lock);
INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
diff --git a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -503,7 +503,7 @@
unsigned int i_mmap_writable;/* count VM_SHARED mappings */
struct prio_tree_root   i_mmap; /* tree of private and shared mappings */
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
-   spinlock_t  i_mmap_lock;/* protect tree, count, list */
+   struct rw_semaphore i_mmap_sem; /* protect tree, count, list */
unsigned int truncate_count; /* Cover race condition with truncate */
unsigned long   nrpages;/* number of total pages */
pgoff_t writeback_index;/* writeback starts here */
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -716,7 +716,7 @@
struct address_space *check_mapping;/* Check page->mapping if set */
pgoff_t first_index;/* Lowest page->index to unmap */
pgoff_t last_index; /* Highest page->index to unmap */
-   spinlock_t *i_mmap_lock;/* For unmap_mapping_range: */
+   struct rw_semaphore *i_mmap_sem;/* For unmap_mapping_range: */
unsigned long truncate_count;   /* Compare vm_truncate_count */
 };
 
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -274,12 +274,12 @@
atomic_dec(&inode->i_writecount);
 
/* insert tmp into the share list, just after mpnt */
-   spin_lock(&file->f_mapping->i_mmap_lock);
+   down_write(&file->f_mapping->i_mmap_sem);
tmp->vm_truncate_count = mpnt->vm_truncate_count;
flush_dcache_mmap_lock(file->f_mapping);
vma_prio_tree_add(tmp, mpnt);
flush_dcache_mmap_unlock(file->f_mapping);
-   spin_unlock(&fi

[kvm-devel] [PATCH 09 of 12] Convert the anon_vma spinlock to a rw semaphore. This allows concurrent

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli <[EMAIL PROTECTED]>
# Date 1208872187 -7200
# Node ID bdb3d928a0ba91cdce2b61bd40a2f80bddbe4ff2
# Parent  6e04df1f4284689b1c46e57a67559abe49ecf292
Convert the anon_vma spinlock to a rw semaphore. This allows concurrent
traversal of reverse maps for try_to_unmap() and page_mkclean(). It also
allows the calling of sleeping functions from reverse map traversal as
needed for the notifier callbacks. It increases possible concurrency.

RCU is used in some contexts to guarantee the presence of the anon_vma
(try_to_unmap) while we acquire the anon_vma lock. We cannot take a
semaphore within an RCU critical section. Add a refcount to the anon_vma
structure which allows us to give an existence guarantee for the anon_vma
structure independent of the spinlock or the list contents.

The refcount can then be taken within the RCU section. If it has been
taken successfully then the refcount guarantees the existence of the
anon_vma. The refcount in anon_vma also allows us to fix a nasty
issue in page migration where we fudged by using rcu for a long code
path to guarantee the existence of the anon_vma. I think this is a bug
because the anon_vma may become empty and get scheduled to be freed
but then we increase the refcount again when the migration entries are
removed.

The refcount in general allows a shortening of RCU critical sections since
we can do a rcu_unlock after taking the refcount. This is particularly
relevant if the anon_vma chains contain hundreds of entries.

However:
- Atomic overhead increases in situations where a new reference
  to the anon_vma has to be established or removed. Overhead also increases
  when a speculative reference is used (try_to_unmap,
  page_mkclean, page migration).
- There is the potential for more frequent processor change due to up_xxx
  letting waiting tasks run first. This causes, for example, the Aim9 brk
  performance test to go down by 10-15%.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
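
A minimal userspace sketch of the get/put lifetime rule described above, using
C11 atomics rather than the kernel's atomic_t. The names toy_anon_vma,
toy_alloc, toy_get, toy_put and toy_freed are hypothetical and exist only for
illustration; the RCU side is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for anon_vma: only the refcount is modelled. */
struct toy_anon_vma {
    atomic_int refcount;
};

static int toy_freed;   /* counts frees so the behaviour can be observed */

static struct toy_anon_vma *toy_alloc(void)
{
    struct toy_anon_vma *av = malloc(sizeof(*av));

    if (av)
        atomic_init(&av->refcount, 1);   /* creator holds one reference */
    return av;
}

static inline void toy_get(struct toy_anon_vma *av)
{
    atomic_fetch_add(&av->refcount, 1);
}

static inline void toy_put(struct toy_anon_vma *av)
{
    /* fetch_sub returns the previous value; 1 means the last ref is gone */
    if (atomic_fetch_sub(&av->refcount, 1) == 1) {
        free(av);
        toy_freed++;
    }
}
```

The object stays valid between toy_get() and the matching toy_put(), which is
the property that allows a sleeping semaphore to be taken after the RCU
section has been left.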

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -25,7 +25,8 @@
  * pointing to this anon_vma once its vma list is empty.
  */
 struct anon_vma {
-   spinlock_t lock;/* Serialize access to vma list */
+   atomic_t refcount;  /* vmas on the list */
+   struct rw_semaphore sem;/* Serialize access to vma list */
struct list_head head;  /* List of private "related" vmas */
 };
 
@@ -43,18 +44,31 @@
kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
+struct anon_vma *grab_anon_vma(struct page *page);
+
+static inline void get_anon_vma(struct anon_vma *anon_vma)
+{
+   atomic_inc(&anon_vma->refcount);
+}
+
+static inline void put_anon_vma(struct anon_vma *anon_vma)
+{
+   if (atomic_dec_and_test(&anon_vma->refcount))
+   anon_vma_free(anon_vma);
+}
+
 static inline void anon_vma_lock(struct vm_area_struct *vma)
 {
struct anon_vma *anon_vma = vma->anon_vma;
if (anon_vma)
-   spin_lock(&anon_vma->lock);
+   down_write(&anon_vma->sem);
 }
 
 static inline void anon_vma_unlock(struct vm_area_struct *vma)
 {
struct anon_vma *anon_vma = vma->anon_vma;
if (anon_vma)
-   spin_unlock(&anon_vma->lock);
+   up_write(&anon_vma->sem);
 }
 
 /*
diff --git a/mm/migrate.c b/mm/migrate.c
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -235,15 +235,16 @@
return;
 
/*
-* We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
+* We hold either the mmap_sem lock or a reference on the
+* anon_vma. So no need to call page_lock_anon_vma.
 */
anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
-   spin_lock(&anon_vma->lock);
+   down_read(&anon_vma->sem);
 
list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
remove_migration_pte(vma, old, new);
 
-   spin_unlock(&anon_vma->lock);
+   up_read(&anon_vma->sem);
 }
 
 /*
@@ -623,7 +624,7 @@
int rc = 0;
int *result = NULL;
struct page *newpage = get_new_page(page, private, &result);
-   int rcu_locked = 0;
+   struct anon_vma *anon_vma = NULL;
int charge = 0;
 
if (!newpage)
@@ -647,16 +648,14 @@
}
/*
 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
-* we cannot notice that anon_vma is freed while we migrates a page.
+* we cannot notice that anon_vma is freed while we migrate a page.
 * This rcu_read_lock() delays freeing anon_vma pointer until the end
 * of migration. File cache pages are no problem because of page_lock()
 * File Caches may use write_page() or lock_page() in migration, then,
 * just care Anon page here.
 */
-   if (PageAnon(page)) {
-   rcu_read_lock();
-   rcu_locked = 1;
-   }
+   if (PageAnon(page))
+   

[kvm-devel] [PATCH 03 of 12] get_task_mm should not succeed if mmput() is running and has reduced

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli <[EMAIL PROTECTED]>
# Date 1208872186 -7200
# Node ID a6672bdeead0d41b2ebd6846f731d43a611645b7
# Parent  3c804dca25b15017b22008647783d6f5f3801fa9
get_task_mm should not succeed if mmput() is running and has reduced
the mm_users count to zero. This can occur if a processor follows
a task's pointer to an mm struct because that pointer is only cleared
after the mmput().

If get_task_mm() succeeds after mmput() reduced the mm_users to zero then
we have the lovely situation that one portion of the kernel is doing
all the teardown work for an mm while another portion is happily using
it.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -442,7 +442,8 @@
if (task->flags & PF_BORROWED_MM)
mm = NULL;
else
-   atomic_inc(&mm->mm_users);
+   if (!atomic_inc_not_zero(&mm->mm_users))
+   mm = NULL;
}
task_unlock(task);
return mm;
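
The effect of the atomic_inc_not_zero() change can be sketched in userspace
with C11 atomics. This is only an illustration under assumptions: toy_mm,
toy_inc_not_zero and toy_get_mm are made-up names, and the compare-and-swap
loop is one common way to express inc-not-zero, not necessarily how the kernel
implements it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for mm_struct: only the user count matters here. */
struct toy_mm {
    atomic_int users;
};

/*
 * Take a reference only if the count is still non-zero.
 * Returns true on success, false if teardown has already begun.
 */
static bool toy_inc_not_zero(atomic_int *v)
{
    int old = atomic_load(v);

    while (old != 0) {
        /* On failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(v, &old, old + 1))
            return true;
    }
    return false;
}

/* Mirrors the patched logic: hand back NULL once users has hit zero. */
static struct toy_mm *toy_get_mm(struct toy_mm *mm)
{
    if (mm && !toy_inc_not_zero(&mm->users))
        mm = NULL;
    return mm;
}
```

A plain increment would turn a count of zero back into one and hand out an mm
that is already being torn down; the conditional increment refuses instead.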



[kvm-devel] [PATCH 00 of 12] mmu notifier #v13

2008-04-22 Thread Andrea Arcangeli
Hello,

This is the latest and greatest version of the mmu notifier patch #v13.

Changes are mainly in the mm_lock that uses sort() suggested by Christoph.
This reduces the complexity from O(N**2) to O(N*log(N)).
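
As a rough illustration of why sorting helps (a sketch, not the code from
1/12): once lock pointers are sorted by address, duplicates are adjacent, so
one linear pass can take each distinct lock exactly once and always in
ascending order. toy_lock, toy_lock_cmp and toy_lock_all are hypothetical
names, and userspace qsort() stands in for the kernel's sort().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical lock type; 'held' stands in for real lock state. */
struct toy_lock {
    int held;
};

/* Order lock pointers by address. */
static int toy_lock_cmp(const void *a, const void *b)
{
    uintptr_t x = (uintptr_t)*(struct toy_lock *const *)a;
    uintptr_t y = (uintptr_t)*(struct toy_lock *const *)b;

    return (x > y) - (x < y);
}

/*
 * Sort the array, then take each distinct lock once.
 * Returns the number of distinct locks taken.
 */
static size_t toy_lock_all(struct toy_lock **locks, size_t nr)
{
    struct toy_lock *last = NULL;
    size_t taken = 0;

    qsort(locks, nr, sizeof(*locks), toy_lock_cmp);
    for (size_t i = 0; i < nr; i++) {
        if (locks[i] == last)
            continue;            /* several vmas may share one lock */
        last = locks[i];
        last->held++;            /* stands in for actually locking */
        taken++;
    }
    return taken;
}
```

Without the sort, detecting that a lock was already taken needs a search over
the locks seen so far for every entry, which is where a quadratic cost would
come from.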

I folded the mm_lock functionality together with the mmu-notifier-core 1/12
patch to make it self-contained. I recommend merging 1/12 into -mm/mainline
ASAP. Lack of mmu notifiers is holding off KVM development. We are going to
rework the way the pages are mapped and unmapped to work with pure pfn for pci
passthrough without the use of page pinning, and we can't without mmu
notifiers. This is not just a performance matter.

KVM/GRU and AFAICT Quadrics are all covered by applying the single 1/12 patch
that shall be shipped with 2.6.26. The risk of breakage from applying 1/12 is
zero, both when MMU_NOTIFIER=y and when it's =n, so it shouldn't be delayed
further.

XPMEM support comes with the later patches 2-12. The risk for those patches
is >0, which is why the mmu-notifier-core is numbered 1/12 and not 12/12. Some
are simple and can go in immediately, but not all are so simple.

2-12/12 are posted as usual for review by the VM developers and so Robin can
keep testing them on XPMEM and they can be merged later without any downside
(they're mostly orthogonal with 1/12).



[kvm-devel] [PATCH 07 of 12] Add a function to rw_semaphores to check if there are any processes

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli <[EMAIL PROTECTED]>
# Date 1208872187 -7200
# Node ID 8965539f4d174c79bd37e58e8b037d5db906e219
# Parent  fbce3fecb033eb3fba1d9c2398ac74401ce0ecb5
Add a function to rw_semaphores to check if there are any processes
waiting for the semaphore. Add rwsem_needbreak to sched.h that works
in the same way as spinlock_needbreak().

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -59,6 +59,8 @@
  */
 extern void downgrade_write(struct rw_semaphore *sem);
 
+extern int rwsem_is_contended(struct rw_semaphore *sem);
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 /*
  * nested locking. NOTE: rwsems are not allowed to recurse
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1984,6 +1984,15 @@
 #endif
 }
 
+static inline int rwsem_needbreak(struct rw_semaphore *sem)
+{
+#ifdef CONFIG_PREEMPT
+   return rwsem_is_contended(sem);
+#else
+   return 0;
+#endif
+}
+
 /*
  * Reevaluate whether the task has signals pending delivery.
  * Wake the task if so.
diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c
--- a/lib/rwsem-spinlock.c
+++ b/lib/rwsem-spinlock.c
@@ -305,6 +305,18 @@
spin_unlock_irqrestore(&sem->wait_lock, flags);
 }
 
+int rwsem_is_contended(struct rw_semaphore *sem)
+{
+   /*
+* Racy check for an empty list. False positives or negatives
+* would be okay. False positive may cause a useless dropping of
+* locks. False negatives may cause locks to be held a bit
+* longer until the next check.
+*/
+   return !list_empty(&sem->wait_list);
+}
+
+EXPORT_SYMBOL(rwsem_is_contended);
 EXPORT_SYMBOL(__init_rwsem);
 EXPORT_SYMBOL(__down_read);
 EXPORT_SYMBOL(__down_read_trylock);
diff --git a/lib/rwsem.c b/lib/rwsem.c
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -251,6 +251,18 @@
return sem;
 }
 
+int rwsem_is_contended(struct rw_semaphore *sem)
+{
+   /*
+* Racy check for an empty list. False positives or negatives
+* would be okay. False positive may cause a useless dropping of
+* locks. False negatives may cause locks to be held a bit
+* longer until the next check.
+*/
+   return !list_empty(&sem->wait_list);
+}
+
+EXPORT_SYMBOL(rwsem_is_contended);
 EXPORT_SYMBOL(rwsem_down_read_failed);
 EXPORT_SYMBOL(rwsem_down_write_failed);
 EXPORT_SYMBOL(rwsem_wake);
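
A sketch of how a long-running lock holder might use a check like
rwsem_needbreak(): do work in small steps and stop early once someone is
waiting. Everything here is hypothetical (toy_rwsem, toy_is_contended,
toy_process, and the simulated waiter arrival); it only models the control
flow, not a real semaphore.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical semaphore stand-in: only the waiter count is modelled. */
struct toy_rwsem {
    int nr_waiters;
};

/* Unsynchronised peek, in the spirit of the racy list_empty() check. */
static inline bool toy_is_contended(const struct toy_rwsem *sem)
{
    return sem->nr_waiters != 0;
}

/*
 * Process up to 'nr' items, stopping early once a waiter shows up.
 * Returns the number of items processed before the break.
 * 'arrive_at' simulates a waiter queueing after that many items.
 */
static size_t toy_process(struct toy_rwsem *sem, size_t nr, size_t arrive_at)
{
    size_t done = 0;

    while (done < nr) {
        if (done == arrive_at)
            sem->nr_waiters++;   /* simulated contention */
        if (toy_is_contended(sem))
            break;               /* caller would drop the lock and retry */
        done++;
    }
    return done;
}
```

Because the check is only a hint, a stale answer costs at most one unnecessary
lock drop or one extra step of work, which matches the reasoning in the
comment added by the patch.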



[kvm-devel] [PATCH 06 of 12] Move the tlb flushing inside of unmap vmas. This saves us from passing

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli <[EMAIL PROTECTED]>
# Date 1208872186 -7200
# Node ID fbce3fecb033eb3fba1d9c2398ac74401ce0ecb5
# Parent  ee8c0644d5f67c1ef59142cce91b0bb6f34a53e0
Move the tlb flushing inside of unmap vmas. This saves us from passing
a pointer to the TLB structure around and simplifies the callers.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -723,8 +723,7 @@
 struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t);
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *);
-unsigned long unmap_vmas(struct mmu_gather **tlb,
-   struct vm_area_struct *start_vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *start_vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *);
 
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -804,7 +804,6 @@
 
 /**
  * unmap_vmas - unmap a range of memory covered by a list of vma's
- * @tlbp: address of the caller's struct mmu_gather
  * @vma: the starting vma
  * @start_addr: virtual address at which to start unmapping
  * @end_addr: virtual address at which to end unmapping
@@ -816,20 +815,13 @@
  * Unmap all pages in the vma list.
  *
  * We aim to not hold locks for too long (for scheduling latency reasons).
- * So zap pages in ZAP_BLOCK_SIZE bytecounts.  This means we need to
- * return the ending mmu_gather to the caller.
+ * So zap pages in ZAP_BLOCK_SIZE bytecounts.
  *
  * Only addresses between `start' and `end' will be unmapped.
  *
  * The VMA list must be sorted in ascending virtual address order.
- *
- * unmap_vmas() assumes that the caller will flush the whole unmapped address
- * range after unmap_vmas() returns.  So the only responsibility here is to
- * ensure that any thus-far unmapped pages are flushed before unmap_vmas()
- * drops the lock and schedules.
  */
-unsigned long unmap_vmas(struct mmu_gather **tlbp,
-   struct vm_area_struct *vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *details)
 {
@@ -838,9 +830,14 @@
int tlb_start_valid = 0;
unsigned long start = start_addr;
spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
-   int fullmm = (*tlbp)->fullmm;
+   int fullmm;
+   struct mmu_gather *tlb;
struct mm_struct *mm = vma->vm_mm;
 
+   lru_add_drain();
+   tlb = tlb_gather_mmu(mm, 0);
+   update_hiwater_rss(mm);
+   fullmm = tlb->fullmm;
mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
unsigned long end;
@@ -867,7 +864,7 @@
(HPAGE_SIZE / PAGE_SIZE);
start = end;
} else
-   start = unmap_page_range(*tlbp, vma,
+   start = unmap_page_range(tlb, vma,
start, end, &zap_work, details);
 
if (zap_work > 0) {
@@ -875,22 +872,23 @@
break;
}
 
-   tlb_finish_mmu(*tlbp, tlb_start, start);
+   tlb_finish_mmu(tlb, tlb_start, start);
 
if (need_resched() ||
(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
if (i_mmap_lock) {
-   *tlbp = NULL;
+   tlb = NULL;
goto out;
}
cond_resched();
}
 
-   *tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
+   tlb = tlb_gather_mmu(vma->vm_mm, fullmm);
tlb_start_valid = 0;
zap_work = ZAP_BLOCK_SIZE;
}
}
+   tlb_finish_mmu(tlb, start_addr, end_addr);
 out:
mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
return start;   /* which is now the end (or restart) address */
@@ -906,18 +904,10 @@
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *details)
 {
-   struct mm_struct *mm = vma->vm_mm;
-   struct mmu_gather *tlb;
unsigned long end = address + size;
unsigned long nr_accounted = 0;
 
-   lru_add_drain();
-   tlb = tlb_gather_mmu(mm, 0);

[kvm-devel] [PATCH 11 of 12] XPMEM would have used sys_madvise() except that madvise_dontneed()

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli <[EMAIL PROTECTED]>
# Date 1208872187 -7200
# Node ID 128d705f38c8a774ac11559db445787ce6e91c77
# Parent  f8210c45f1c6f8b38d15e5dfebbc5f7c1f890c93
XPMEM would have used sys_madvise() except that madvise_dontneed()
returns an -EINVAL if VM_PFNMAP is set, which is always true for the pages
XPMEM imports from other partitions and is also true for uncached pages
allocated locally via the mspec allocator.  XPMEM needs zap_page_range()
functionality for these types of pages as well as 'normal' pages.

Signed-off-by: Dean Nelson <[EMAIL PROTECTED]>

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -909,6 +909,7 @@
 
return unmap_vmas(vma, address, end, &nr_accounted, details);
 }
+EXPORT_SYMBOL_GPL(zap_page_range);
 
 /*
  * Do a quick page-table lookup for a single page.



[kvm-devel] [PATCH 12 of 12] This patch adds a lock ordering rule to avoid a potential deadlock when

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli <[EMAIL PROTECTED]>
# Date 1208872187 -7200
# Node ID e847039ee2e815088661933b7195584847dc7540
# Parent  128d705f38c8a774ac11559db445787ce6e91c77
This patch adds a lock ordering rule to avoid a potential deadlock when
multiple mmap_sems need to be locked.

Signed-off-by: Dean Nelson <[EMAIL PROTECTED]>

diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -79,6 +79,9 @@
  *
  *  ->i_mutex  (generic_file_buffered_write)
  *->mmap_sem   (fault_in_pages_readable->do_page_fault)
+ *
+ *When taking multiple mmap_sems, one should lock the lowest-addressed
+ *one first proceeding on up to the highest-addressed one.
  *
  *  ->i_mutex
  *->i_alloc_sem (various)
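
The ordering rule added above can be sketched for the two-lock case. This is
an illustration only: toy_sem, toy_down and toy_lock_two are hypothetical
names, and toy_down() merely records the acquisition order instead of
blocking.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical semaphore stand-in that records when it was taken. */
struct toy_sem {
    int held;
    int order;   /* 1 = taken first, 2 = taken second, ... */
};

static int toy_seq;   /* global acquisition counter */

static void toy_down(struct toy_sem *s)
{
    s->held = 1;
    s->order = ++toy_seq;
}

/* Take both semaphores, lowest address first, so all callers agree. */
static void toy_lock_two(struct toy_sem *a, struct toy_sem *b)
{
    if (a == b) {
        toy_down(a);             /* same object: take it only once */
        return;
    }
    if ((uintptr_t)a > (uintptr_t)b) {
        struct toy_sem *tmp = a;

        a = b;
        b = tmp;
    }
    toy_down(a);
    toy_down(b);
}
```

If every path uses the same address order, two tasks locking the same pair can
never each hold one lock while waiting for the other.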



[kvm-devel] [PATCH 05 of 12] Move the tlb flushing into free_pgtables. The conversion of the locks

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli <[EMAIL PROTECTED]>
# Date 1208872186 -7200
# Node ID ee8c0644d5f67c1ef59142cce91b0bb6f34a53e0
# Parent  ac9bb1fb3de2aa5d27210a28edf24f6577094076
Move the tlb flushing into free_pgtables. The conversion of the locks
taken for reverse map scanning would require taking sleeping locks
in free_pgtables() and we cannot sleep while gathering pages for a tlb
flush.

Move the tlb_gather/tlb_finish call to free_pgtables() to be done
for each vma. This may add a number of tlb flushes depending on the
number of vmas that cannot be coalesced into one.

The first pointer argument to free_pgtables() can then be dropped.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -751,8 +751,8 @@
void *private);
 void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
unsigned long end, unsigned long floor, unsigned long ceiling);
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
-   unsigned long floor, unsigned long ceiling);
+void free_pgtables(struct vm_area_struct *start_vma, unsigned long floor,
+   unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
struct vm_area_struct *vma);
 void unmap_mapping_range(struct address_space *mapping,
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -272,9 +272,11 @@
} while (pgd++, addr = next, addr != end);
 }
 
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
-   unsigned long floor, unsigned long ceiling)
+void free_pgtables(struct vm_area_struct *vma, unsigned long floor,
+   unsigned long ceiling)
 {
+   struct mmu_gather *tlb;
+
while (vma) {
struct vm_area_struct *next = vma->vm_next;
unsigned long addr = vma->vm_start;
@@ -286,7 +288,8 @@
unlink_file_vma(vma);
 
if (is_vm_hugetlb_page(vma)) {
-   hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
+   tlb = tlb_gather_mmu(vma->vm_mm, 0);
+   hugetlb_free_pgd_range(&tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
} else {
/*
@@ -299,9 +302,11 @@
anon_vma_unlink(vma);
unlink_file_vma(vma);
}
-   free_pgd_range(tlb, addr, vma->vm_end,
+   tlb = tlb_gather_mmu(vma->vm_mm, 0);
+   free_pgd_range(&tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
}
+   tlb_finish_mmu(tlb, addr, vma->vm_end);
vma = next;
}
 }
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1752,9 +1752,9 @@
update_hiwater_rss(mm);
unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
-   free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
+   tlb_finish_mmu(tlb, start, end);
+   free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 next? next->vm_start: 0);
-   tlb_finish_mmu(tlb, start, end);
 }
 
 /*
@@ -2050,8 +2050,8 @@
/* Use -1 here to ensure all VMAs in the mm are unmapped */
end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
-   free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
tlb_finish_mmu(tlb, 0, end);
+   free_pgtables(vma, FIRST_USER_ADDRESS, 0);
 
/*
 * Walk the list again, actually closing and freeing it,



[kvm-devel] [PATCH 10 of 12] Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli <[EMAIL PROTECTED]>
# Date 1208872187 -7200
# Node ID f8210c45f1c6f8b38d15e5dfebbc5f7c1f890c93
# Parent  bdb3d928a0ba91cdce2b61bd40a2f80bddbe4ff2
Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock
conversion.

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1062,10 +1062,10 @@
  * mm_lock and mm_unlock are expensive operations that may take a long time.
  */
 struct mm_lock_data {
-   spinlock_t **i_mmap_locks;
-   spinlock_t **anon_vma_locks;
-   size_t nr_i_mmap_locks;
-   size_t nr_anon_vma_locks;
+   struct rw_semaphore **i_mmap_sems;
+   struct rw_semaphore **anon_vma_sems;
+   size_t nr_i_mmap_sems;
+   size_t nr_anon_vma_sems;
 };
 extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data);
 extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2243,8 +2243,8 @@
 static int mm_lock_cmp(const void *a, const void *b)
 {
cond_resched();
-   if ((unsigned long)*(spinlock_t **)a <
-   (unsigned long)*(spinlock_t **)b)
+   if ((unsigned long)*(struct rw_semaphore **)a <
+   (unsigned long)*(struct rw_semaphore **)b)
return -1;
else if (a == b)
return 0;
@@ -2252,7 +2252,7 @@
return 1;
 }
 
-static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks,
+static unsigned long mm_lock_sort(struct mm_struct *mm, struct rw_semaphore **sems,
  int anon)
 {
struct vm_area_struct *vma;
@@ -2261,59 +2261,59 @@
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (anon) {
if (vma->anon_vma)
-   locks[i++] = &vma->anon_vma->lock;
+   sems[i++] = &vma->anon_vma->sem;
} else {
if (vma->vm_file && vma->vm_file->f_mapping)
-   locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock;
+   sems[i++] = &vma->vm_file->f_mapping->i_mmap_sem;
}
}
 
if (!i)
goto out;
 
-   sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL);
+   sort(sems, i, sizeof(struct rw_semaphore *), mm_lock_cmp, NULL);
 
 out:
return i;
 }
 
 static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm,
- spinlock_t **locks)
+ struct rw_semaphore **sems)
 {
-   return mm_lock_sort(mm, locks, 1);
+   return mm_lock_sort(mm, sems, 1);
 }
 
 static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm,
-   spinlock_t **locks)
+   struct rw_semaphore **sems)
 {
-   return mm_lock_sort(mm, locks, 0);
+   return mm_lock_sort(mm, sems, 0);
 }
 
-static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock)
+static void mm_lock_unlock(struct rw_semaphore **sems, size_t nr, int lock)
 {
-   spinlock_t *last = NULL;
+   struct rw_semaphore *last = NULL;
size_t i;
 
for (i = 0; i < nr; i++)
/*  Multiple vmas may use the same lock. */
-   if (locks[i] != last) {
-   BUG_ON((unsigned long) last > (unsigned long) locks[i]);
-   last = locks[i];
+   if (sems[i] != last) {
+   BUG_ON((unsigned long) last > (unsigned long) sems[i]);
+   last = sems[i];
if (lock)
-   spin_lock(last);
+   down_write(last);
else
-   spin_unlock(last);
+   up_write(last);
}
 }
 
-static inline void __mm_lock(spinlock_t **locks, size_t nr)
+static inline void __mm_lock(struct rw_semaphore **sems, size_t nr)
 {
-   mm_lock_unlock(locks, nr, 1);
+   mm_lock_unlock(sems, nr, 1);
 }
 
-static inline void __mm_unlock(spinlock_t **locks, size_t nr)
+static inline void __mm_unlock(struct rw_semaphore **sems, size_t nr)
 {
-   mm_lock_unlock(locks, nr, 0);
+   mm_lock_unlock(sems, nr, 0);
 }
 
 /*
@@ -2325,57 +2325,57 @@
  */
 int mm_lock(struct mm_struct *mm, struct mm_lock_data *data)
 {
-   spinlock_t **anon_vma_locks, **i_mmap_locks;
+   struct rw_semaphore **anon_vma_sems, **i_mmap_sems;
 
down_write(&mm->mmap_sem);
if (mm->map_count) {
-   anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count);
-   if (unlikely(!anon_vma_locks)) {
+   anon_vma_sems = vmalloc(sizeof(struct rw_semaph
