Re: [kvm-devel] [PATCH] Make virtio devices multi-function
* Anthony Liguori ([EMAIL PROTECTED]) wrote:
> Logically speaking, virtio is a bus. virtio supports all of the features
> of a bus (discover, hot add, hot remove).
>
> Right now, we map virtio devices directly onto the PCI bus.
>
> The problem we're trying to address is limitations of the PCI bus. We have
> a couple options:

First question: do we have a real limitation with multiple buses?

> 1) add a virtio device that supports multiple disks. we need to reinvent
> hotplug within this device.
>
> 2) add a new PCI virtio transport that supports multiple virtio-blk devices
> within a single PCI slot
>
> 3) add a generic PCI virtio transport that supports multiple virtio devices
> within a single PCI slot

Compare and contrast the above with an HBA and disks (which makes the most
sense from my point of view). For 2 and 3, the only difference is whether you
want to be able to support NICs, balloons, and block devices on the same PCI
slot (at which point it's a bridge, so how is it different from 4?).

> 4) add a generic virtio "bridge" that supports multiple virtio devices
> within a single virtio device.
>
> #4 may seem strange, but it's no different from a PCI-to-PCI bridge.
>
> I like #4 the most, but #2 is probably the most practical.

Also, your current patch does not work for disk hotplug.

thanks,
-chris

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference
Don't miss this year's exciting event. There's still time to save $100.
Use priority code J8TL2D2.
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
Anthony Liguori wrote:
> BTW, I've never been that convinced that hotplugging devices is as
> useful as people make it out to be. I also think that's particularly
> true when it comes to hot adding/removing very large numbers of disks.

On the contrary, the more disks you have, the more likely one is to fail, so
you'd need to hot-replace it (think of a setup with redundancy, like ZFS).

--
Any sufficiently difficult bug is indistinguishable from a feature.
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
Ian Kirk wrote:
> Avi Kivity wrote:
>
>> For mass storage, we should follow the SCSI model with a single device
>> serving multiple disks, similar to what you suggest. Not sure if the
>> device should have a single queue or one queue per disk.
>
> Don't you just end up re-implementing SCSI then, at which point you might
> as well stick with a 'fake' SCSI device in the guest?

A virtio-scsi controller is indeed useful as it can control tapes, media
changers, and other fancy stuff in addition to ordinary disks. For disks,
I'd like to avoid the overhead of SCSI command generation and parsing.

--
Any sufficiently difficult bug is indistinguishable from a feature.
Re: [kvm-devel] WARNING: at /usr/src/modules/kvm/mmu.c:390 account_shadowed()
Thomas Cataldo wrote:
> On Mon, Apr 21, 2008 at 9:57 PM, Thomas Cataldo
> <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> I am running kvm-66 on top of a debian sid host with 2.6.24 (intel 32bit
>> host).
>>
>> Got the following in my logs today :
>>
>> Apr 21 17:55:01 buffy kernel: WARNING: at /usr/src/modules/kvm/mmu.c:390 account_shadowed()
>> Apr 21 17:55:01 buffy kernel: Pid: 21416, comm: kvm Tainted: P 2.6.24-1-686 #1
>> Apr 21 17:55:01 buffy kernel: [] kvm_mmu_get_page+0x42d/0x447 [kvm]
>> Apr 21 17:55:01 buffy kernel: [] kvm_mmu_load+0xdf/0x15c [kvm]
>> Apr 21 17:55:01 buffy kernel: [] vmx_queue_exception+0x0/0x33 [kvm_intel]
>> Apr 21 17:55:01 buffy kernel: [] kvm_arch_vcpu_ioctl_run+0x233/0x5a9 [kvm]
>> Apr 21 17:55:01 buffy kernel: [] kvm_vcpu_ioctl+0xe4/0x34c [kvm]
>> Apr 21 17:55:01 buffy kernel: [] delayacct_end+0x70/0x77
>> Apr 21 17:55:01 buffy kernel: [] sync_page+0x0/0x3b
>> Apr 21 17:55:01 buffy kernel: [] __delayacct_blkio_end+0x5b/0x5f
>> Apr 21 17:55:01 buffy kernel: [] io_schedule+0x64/0x80
>> Apr 21 17:55:01 buffy kernel: [] enqueue_entity+0x2b/0x3d
>> Apr 21 17:55:01 buffy kernel: [] apic_wait_icr_idle+0xe/0x15
>> Apr 21 17:55:01 buffy kernel: [] enqueue_task_fair+0x16/0x24
>> Apr 21 17:55:01 buffy kernel: [] enqueue_task+0x52/0x5d
>> Apr 21 17:55:01 buffy kernel: [] resched_task+0x52/0x54
>> Apr 21 17:55:01 buffy kernel: [] try_to_wake_up+0x2b8/0x2c2
>> Apr 21 17:55:01 buffy kernel: [] __wake_up_common+0x32/0x5c
>> Apr 21 17:55:01 buffy kernel: [] __wake_up+0x32/0x42
>> Apr 21 17:55:01 buffy kernel: [] wake_futex+0x3b/0x45
>> Apr 21 17:55:01 buffy kernel: [] futex_wake+0x81/0xb0
>> Apr 21 17:55:01 buffy kernel: [] do_futex+0x77/0x983
>> Apr 21 17:55:01 buffy kernel: [] update_curr+0x62/0xef
>> Apr 21 17:55:01 buffy kernel: [] __switch_to+0x9d/0x11d
>> Apr 21 17:55:01 buffy kernel: [] kvm_vcpu_ioctl+0x0/0x34c [kvm]
>> Apr 21 17:55:01 buffy kernel: [] do_ioctl+0x1f/0x62
>> Apr 21 17:55:01 buffy kernel: [] vfs_ioctl+0x237/0x249
>> Apr 21 17:55:01 buffy kernel: [] sys_ioctl+0x45/0x5d
>> Apr 21 17:55:01 buffy kernel: [] sysenter_past_esp+0x6b/0xa1
>>
>> Regards,
>> Thomas.
>
> as I got no reply, I guess it is a bad setup on my part. If that might
> help, this happened while I was doing a "make -j" on webkit svn tree
> (ie. heavy c++ compilation workload) .

No, this is not a bad setup. No amount of bad setup should give this warning.

You didn't get a reply because no one knows what to make of it, and because
it's much more fun to debate endianness or contemplate guests with eighty
thousand disks than to fix those impossible bugs.

If you can give clear instructions on how to reproduce this, we will try it
out. Please be sure to state the OS name and versions for the guest as well
as the host.

--
Any sufficiently difficult bug is indistinguishable from a feature.
Re: [kvm-devel] Stupid Newbee Questions...
Stuart Sheldon wrote:
> 2) When I started writing the management scripts to start and stop the
> guests from the command line, I was using KVM-63 which allowed me to
> send a "system_powerdown" to the console, this would send a PWR to the
> guest's acpid that would bring the guest down gracefully. This stopped
> working in kvm-64 and is still not working in kvm-66. Is there a better
> way to do this? How can the host bring down a guest without messing with
> its open programs?

This is not a question, it's a bug report. It's a clear regression that
needs to be fixed, not worked around.

Please send timely reports of such issues when you encounter them.

--
Any sufficiently difficult bug is indistinguishable from a feature.
Re: [kvm-devel] pv clock: kvm is incompatible with xen :-(
Glauber Costa wrote:
> Gerd Hoffmann wrote:
>> Jeremy Fitzhardinge wrote:
>>> Xen could change the parameters in the instant after
>>> get_time_values(). That change could be as a result of
>>> suspend-resume, so the parameters
>>> and the tsc could be wildly different.
>>
>> Ah, ok, forgot the rdtsc in the picture. With that in mind I fully
>> agree that the loop is needed. I think kvm guests can even hit that one
>> with the vcpu migrating to a different physical cpu, so we better handle
>> it correctly ;)
>
> It's probably not needed for kvm, since we update everything every time
> we get scheduled in the host side, which would cover the case for
> migration between physical cpus.

No, it wouldn't. The corner case we must catch is: the guest reads the time
info, kvm reschedules the guest to another pcpu, and then the guest reads
the tsc. The time info the guest uses for the tsc delta is stale at that
point: it belongs to the previous pcpu.

cheers,
  Gerd

--
http://kraxel.fedorapeople.org/xenner/
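The retry loop being discussed can be sketched roughly as follows. This is an illustration of the idea only, not the kvm or Xen code: the struct layout, the field names, the simplified `ns_per_tick` scaling, and the `fake_tsc_with_migration` helper are all assumptions made for the example, and the read barriers that real code needs are only mentioned in a comment. The key point is that the tsc read sits inside the loop, so a parameter update between "read time info" and "read tsc" forces a retry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared time-info page; not the real kvm/Xen ABI. */
struct pv_time_info {
    volatile uint32_t version;  /* bumped by the host; odd while updating */
    uint64_t tsc_timestamp;     /* tsc value when system_time_ns was taken */
    uint64_t system_time_ns;    /* system time at tsc_timestamp */
    uint32_t ns_per_tick;       /* simplified scaling factor (assumption) */
};

typedef uint64_t (*read_tsc_fn)(void *ctx);

/*
 * Read the clock. Parameters and tsc are sampled together and the
 * version is re-checked afterwards; if the host changed anything in
 * between (for example after moving the vcpu to another pcpu), retry.
 * Real code also needs read barriers around the version checks.
 */
uint64_t pv_clock_read(const struct pv_time_info *info,
                       read_tsc_fn read_tsc, void *ctx)
{
    uint32_t version;
    uint64_t tsc_timestamp, system_time_ns, ns_per_tick, tsc;

    assert(info != NULL && read_tsc != NULL);
    do {
        version = info->version;
        tsc_timestamp = info->tsc_timestamp;
        system_time_ns = info->system_time_ns;
        ns_per_tick = info->ns_per_tick;
        tsc = read_tsc(ctx);    /* must be inside the loop */
    } while ((version & 1) || version != info->version);

    return system_time_ns + (tsc - tsc_timestamp) * ns_per_tick;
}

/* Test helper: the first tsc read coincides with a simulated migration. */
struct fake_cpu {
    struct pv_time_info *info;
    int calls;
};

uint64_t fake_tsc_with_migration(void *ctx)
{
    struct fake_cpu *cpu = ctx;

    if (cpu->calls++ == 0) {
        /* Host publishes parameters for the new pcpu. */
        cpu->info->version += 2;
        cpu->info->tsc_timestamp = 5000;
        cpu->info->system_time_ns = 2000;
        return 5010;            /* tsc of the new pcpu */
    }
    return 5020;
}
```

Without the version re-check, the first pass would combine the old pcpu's timestamp with the new pcpu's tsc and return a wildly wrong time, which is the stale-info case described above.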
Re: [kvm-devel] Stupid Newbee Questions...
* On Wednesday 23 Apr 2008 05:20:03 Stuart Sheldon wrote:
> I've looked around but can't seem to find these answers.
>
> I'm using KVM to run multiple servers on the same hardware, but it seems
> that most of the documentation written is for desktop use.
>
> I'm currently running KVM-66 on a 2.6.24.4 kernel and using the KVM
> provided modules. This is sitting on a Debian Lenny install with Intel
> hardware.
>
> Here are my questions:
>
> 1) Is there a way to give one virtual host priority over another? Would
> I just add -cpu 2 on the host I want to give priority?

Renicing the guest you want to favor is a straightforward way of boosting
its priority. This works because each guest is just a process on the host.

> 2) When I started writing the management scripts to start and stop the
> guests from the command line, I was using KVM-63 which allowed me to
> send a "system_powerdown" to the console, this would send a PWR to the
> guest's acpid that would bring the guest down gracefully. This stopped
> working in kvm-64 and is still not working in kvm-66. Is there a better
> way to do this? How can the host bring down a guest without messing with
> its open programs?

I've no clue about this one; will let someone else answer it.

> 3) Are there any statistics available to the host os that can be
> monitored, such as something in /sys or /proc?

Can you mention what kinds of stats you're looking for? Since each guest is
a normal process, a lot of information can already be examined with ps, top
and /proc//...

There is also kvm_stat, which comes with kvm-userspace and lets you monitor
guest exits into the host. Those numbers are mainly for debugging purposes
though.
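Since a guest is an ordinary host process, the renice suggestion can also be applied from a management program with the standard `setpriority()`/`getpriority()` calls. The sketch below is only an illustration: the helper names `set_guest_nice` and `get_guest_nice` are made up for the example, and finding the guest's pid is left to the caller. Lowering the nice value (raising priority) normally requires root or CAP_SYS_NICE.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/types.h>

/*
 * Set the nice value of the process running a guest.
 * pid 0 means the calling process. Returns 0 on success, -1 on error.
 */
int set_guest_nice(pid_t pid, int nice_value)
{
    assert(nice_value >= -20 && nice_value <= 19);
    if (setpriority(PRIO_PROCESS, (id_t)pid, nice_value) != 0)
        return -1;
    return 0;
}

/*
 * Read back the nice value. getpriority() can legitimately return -1,
 * so errno has to be cleared first and checked afterwards.
 */
int get_guest_nice(pid_t pid, int *nice_value)
{
    int prio;

    assert(nice_value != NULL);
    errno = 0;
    prio = getpriority(PRIO_PROCESS, (id_t)pid);
    if (prio == -1 && errno != 0)
        return -1;
    *nice_value = prio;
    return 0;
}
```

A script would typically call `set_guest_nice()` with the pid of the favored guest's process and a negative value, and give the other guests a positive one.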
Re: [kvm-devel] Bug 1895893 inquiry (KVM-60+ halts, when using SCSI)
Marcelo Tosatti wrote:
> On Tue, Apr 22, 2008 at 12:39:57PM -0600, Alberto Treviño wrote:
>
>> Thanks for all those who work on KVM. It is a wonderful product and I
>> have been very impressed with its features, performance, and the level
>> of activity in this project.
>>
>> Back in February a bug was filed. I've been hit by this bug as well,
>> but there hasn't been much activity with it in the last little bit. I
>> wanted to know if anyone had a fix for it, or a workaround (other than
>> using IDE), or whether it was on someone's radar. Here is a link to
>> the bug:
>>
>> http://sourceforge.net/tracker/index.php?func=detail&aid=1895893&group_id=180599&atid=893831
>
> http://article.gmane.org/gmane.comp.emulators.qemu/24192
>
> BTW, Avi, this patch should be included in kvm-userspace.

I've tried this out about a week ago and didn't get very good results.

--
Any sufficiently difficult bug is indistinguishable from a feature.
Re: [kvm-devel] Bug 1895893 inquiry (KVM-60+ halts, when using SCSI)
On Tue, Apr 22, 2008 at 12:39:57PM -0600, Alberto Treviño wrote:
> Thanks for all those who work on KVM. It is a wonderful product and I
> have been very impressed with its features, performance, and the level
> of activity in this project.
>
> Back in February a bug was filed. I've been hit by this bug as well,
> but there hasn't been much activity with it in the last little bit. I
> wanted to know if anyone had a fix for it, or a workaround (other than
> using IDE), or whether it was on someone's radar. Here is a link to
> the bug:
>
> http://sourceforge.net/tracker/index.php?func=detail&aid=1895893&group_id=180599&atid=893831

http://article.gmane.org/gmane.comp.emulators.qemu/24192

BTW, Avi, this patch should be included in kvm-userspace.
Re: [kvm-devel] [PATCH 00 of 12] mmu notifier #v13
On Tue, Apr 22, 2008 at 03:51:16PM +0200, Andrea Arcangeli wrote:
> Hello,
>
> This is the latest and greatest version of the mmu notifier patch #v13.

FWIW, I have updated the GRU driver to use this patch (plus the fixups).
No problems. AFAICT, everything works.

--- jack
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Tue, Apr 22, 2008 at 06:07:27PM -0500, Robin Holt wrote:
> > The only other change I did has been to move mmu_notifier_unregister
> > at the end of the patchset after getting more questions about its
> > reliability and I documented a bit the rmmod requirements for
> > ->release. we'll think later if it makes sense to add it, nobody's
> > using it anyway.
>
> XPMEM is using it. GRU will be as well (probably already does).

Yeppp.

The GRU driver unregisters the notifier when all GRU mappings are unmapped.
I could make it work either way - either with or without an unregister
function. However, unregister is the most logical action to take when all
mappings have been destroyed.

--- jack
Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.
On Tuesday 22 April 2008 17:13:01 Anthony Liguori wrote:
> Hollis Blanchard wrote:
> > On Tuesday 22 April 2008 16:05:38 Rusty Russell wrote:
> >
> >> On Wednesday 23 April 2008 06:29:14 Hollis Blanchard wrote:
> >>
> >>> On Tuesday 22 April 2008 09:31:35 Rusty Russell wrote:
> >>>
> >>>> We may still regret not doing *everything* little-endian, but this
> >>>> doesn't make it worse.
> >>>
> >>> Hmm, why *don't* we just do everything LE, including the ring?
> >>
> >> Mainly because when requirements are in doubt, simplicity wins, I think.
> >
> > Well, I think the definition of simplicity is up for debate in this
> > case... "LE everywhere" is much simpler than "it depends", IMHO.
>
> You couldn't use the vringfd direct ring mapping optimization in KVM for
> PPC without teaching the kernel to access a vring in LE format. I'm
> pretty sure the latter would get rejected on LKML anyway for vringfd as a
> generic mechanism.

(Since the IA64 guys have already implemented BE guests on LE hosts, they
should be aware of this discussion too, which is why I've CCed them.)

After a short but torturous whiteboard session, followed by a much longer
but less painful discussion, I'm fine with the virtio device config space
being BE for PowerPC and LE for x86.

In the future, we can use a feature bit to indicate that PCI config space
contains an explicit endianness flag. (This will be set to BE or LE, *not*
to "opposite of normal", because "normal" is also too vague.)

--
Hollis Blanchard
IBM Linux Technology Center
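To make the "explicit endianness flag" idea concrete, here is a small sketch of how a reader of a config space could decode a 32-bit field once it knows whether the space is BE or LE. It is not taken from virtio: the `config_endian` enum and the `config_read*` helpers are hypothetical names for this example. Decoding byte by byte gives the same result whatever the byte order of the CPU running the code, which is what makes an explicit BE/LE flag unambiguous where "opposite of normal" is not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag value, set explicitly to LE or BE. */
enum config_endian { CONFIG_LE, CONFIG_BE };

/* Decode 4 bytes stored least-significant byte first. */
uint32_t config_read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] |
           ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
}

/* Decode 4 bytes stored most-significant byte first. */
uint32_t config_read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) |
           (uint32_t)p[3];
}

/* Read a 32-bit field at `offset`, honoring the declared byte order. */
uint32_t config_read32(const uint8_t *cfg, size_t offset,
                       enum config_endian endian)
{
    assert(cfg != NULL);
    return endian == CONFIG_BE ? config_read_be32(cfg + offset)
                               : config_read_le32(cfg + offset);
}
```

The same four bytes decode to different values under the two settings, so both sides have to agree on the flag before any field is interpreted.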
[kvm-devel] Stupid Newbee Questions...
I've looked around but can't seem to find these answers.

I'm using KVM to run multiple servers on the same hardware, but it seems
that most of the documentation written is for desktop use.

I'm currently running KVM-66 on a 2.6.24.4 kernel and using the KVM
provided modules. This is sitting on a Debian Lenny install with Intel
hardware.

Here are my questions:

1) Is there a way to give one virtual host priority over another? Would
I just add -cpu 2 on the host I want to give priority?

2) When I started writing the management scripts to start and stop the
guests from the command line, I was using KVM-63 which allowed me to
send a "system_powerdown" to the console, this would send a PWR to the
guest's acpid that would bring the guest down gracefully. This stopped
working in kvm-64 and is still not working in kvm-66. Is there a better
way to do this? How can the host bring down a guest without messing with
its open programs?

3) Are there any statistics available to the host os that can be
monitored, such as something in /sys or /proc?

Thanks in advance to all that answer!

Stu

--
Open up the window
Let some air into this room
I think I'm almost chokin'
From the smell of stale perfume
And that cigarette you're smoking
'Bout scared me half to death
Open up the window, sucker
Let me catch my breath
-- Three Dog Night - "Mama Told Me Not to Come - Lyrics"
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:
> I'll send an update in any case to Andrew way before Saturday so
> hopefully we'll finally get mmu-notifiers-core merged before next
> week. Also I'm not updating my mmu-notifier-core patch anymore except
> for strict bugfixes so don't worry about any more cosmetic bugs
> being introduced while optimizing the code like it happened this time.

I guess I have to prepare another patchset then?
Re: [kvm-devel] [PATCH 10 of 12] Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:
> The right patch ordering isn't necessarily the one that reduces the
> total number of lines in the patchsets. The mmu-notifier-core is
> already converged and can go in. The rest isn't converged at
> all... nearly nobody commented on the other part (the few comments so
> far were negative), so there's no good reason to delay indefinitely
> what is already converged, given it's already feature complete for
> certain users of the code. My patch ordering looks more natural to
> me. What is finished goes in, the rest is orthogonal anyway.

I would not want to review code that is later reverted or essentially
changed in later patches. I only review your patches because we have a high
interest in the patch. I suspect that others will be more willing to review
this material if it would be done the right way.

If you cannot produce an easily reviewable and properly formatted patchset
that follows conventions then I will have to do it because we really need
to get this merged.
Re: [kvm-devel] [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 01:24:21PM -0700, Christoph Lameter wrote:
> > Reverts a part of an earlier patch. Why isn't this merged into 1 of 12?
>
> To give zero regression risk to 1/12 when MMU_NOTIFIER=y or =n and the
> mmu notifiers aren't registered by GRU or KVM. Keep in mind that the
> whole point of my proposed patch ordering from day 0, is to keep as
> 1/N, the absolutely minimum change that fully satisfy GRU and KVM
> requirements. 4/12 isn't required by GRU/KVM so I keep it in a later
> patch. I now moved mmu_notifier_unregister in a later patch too for
> the same reason.

We want a full solution, and this kind of patching makes the patches
difficult to review because later patches revert earlier ones.
Re: [kvm-devel] [PATCH 03 of 12] get_task_mm should not succeed if mmput() is running and has reduced
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 01:23:16PM -0700, Christoph Lameter wrote:
> > Missing signoff by you.
>
> I thought I had to signoff if I contributed with anything that could
> resemble copyright? Given I only merged that patch, I can add an
> Acked-by if you like, but merging this in my patchset was already an
> implicit ack ;-).

No, you have to include a signoff if the patch goes through your custody
chain. This one did. Also add a

From: Christoph Lameter <[EMAIL PROTECTED]>

somewhere if you want to signify that the patch came from me.
Re: [kvm-devel] [PATCH 02 of 12] Fix ia64 compilation failure because of common code include bug
On Wed, Apr 23, 2008 at 12:43:52AM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 01:22:55PM -0700, Christoph Lameter wrote:
> > Looks like this is not complete. There are numerous .h files missing which
> > means that various structs are undefined (fs.h and rmap.h are needed
> > f.e.) which leads to surprises when dereferencing fields of these struct.
> >
> > It seems that mm_types.h is expected to be included only in certain
> > contexts. Could you make sure to include all necessary .h files? Or add
> > some docs to clarify the situation here.
>
> Robin, what other changes did you need to compile? I only did that one
> because I didn't hear any more feedback from you after I sent that
> patch, so I assumed it was enough.

It was perfect. Nothing else was needed.

Thanks,
Robin
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
> The only other change I did has been to move mmu_notifier_unregister
> at the end of the patchset after getting more questions about its
> reliability and I documented a bit the rmmod requirements for
> ->release. we'll think later if it makes sense to add it, nobody's
> using it anyway.

XPMEM is using it. GRU will be as well (probably already does).
Re: [kvm-devel] [PATCH 10 of 12] Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock
On Tue, Apr 22, 2008 at 01:26:13PM -0700, Christoph Lameter wrote:
> Doing the right patch ordering would have avoided this patch and allow
> better review.

I didn't actually write this patch myself. This did it instead:

    s/anon_vma_lock/anon_vma_sem/
    s/i_mmap_lock/i_mmap_sem/
    s/locks/sems/
    s/spinlock_t/struct rw_semaphore/

so it didn't look a big deal to redo it indefinitely.

The right patch ordering isn't necessarily the one that reduces the
total number of lines in the patchsets. The mmu-notifier-core is
already converged and can go in. The rest isn't converged at
all... nearly nobody commented on the other part (the few comments so
far were negative), so there's no good reason to delay indefinitely
what is already converged, given it's already feature complete for
certain users of the code. My patch ordering looks more natural to
me. What is finished goes in, the rest is orthogonal anyway.
Re: [kvm-devel] [PATCH 02 of 12] Fix ia64 compilation failure because of common code include bug
On Tue, Apr 22, 2008 at 01:22:55PM -0700, Christoph Lameter wrote:
> Looks like this is not complete. There are numerous .h files missing which
> means that various structs are undefined (fs.h and rmap.h are needed
> f.e.) which leads to surprises when dereferencing fields of these struct.
>
> It seems that mm_types.h is expected to be included only in certain
> contexts. Could you make sure to include all necessary .h files? Or add
> some docs to clarify the situation here.

Robin, what other changes did you need to compile? I only did that one
because I didn't hear any more feedback from you after I sent that
patch, so I assumed it was enough.
Re: [kvm-devel] [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last
On Tue, Apr 22, 2008 at 01:24:21PM -0700, Christoph Lameter wrote: > Reverts a part of an earlier patch. Why isn't this merged into 1 of 12? To give zero regression risk to 1/12 when MMU_NOTIFIER=y or =n and the mmu notifiers aren't registered by GRU or KVM. Keep in mind that the whole point of my proposed patch ordering, from day 0, is to keep as 1/N the absolute minimum change that fully satisfies the GRU and KVM requirements. 4/12 isn't required by GRU/KVM, so I keep it in a later patch. I have now moved mmu_notifier_unregister to a later patch too, for the same reason.
Re: [kvm-devel] [PATCH 03 of 12] get_task_mm should not succeed if mmput() is running and has reduced
On Tue, Apr 22, 2008 at 01:23:16PM -0700, Christoph Lameter wrote: > Missing signoff by you. I thought I had to sign off if I contributed anything that could resemble copyright? Given that I only merged that patch, I can add an Acked-by if you like, but merging this in my patchset was already an implicit ack ;-).
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Tue, Apr 22, 2008 at 01:19:29PM -0700, Christoph Lameter wrote: > 3. As noted by Eric and also contained in private post from yesterday by > me: The cmp function needs to retrieve the value before > doing comparisons which is not done for the == of a and b. I retrieved the value, which is why mm_lock works perfectly on #v13 as well as #v12. It's not mandatory to ever return 0, so it won't produce any runtime error (there is a bugcheck for wrong sort ordering in my patch just in case it would generate any runtime error, and it never did, or I would have noticed before submission), which is why I didn't need to release any hotfix yet, and I'm waiting a while longer for more comments before sending an update to clean up that bit. Mentioning this as the third and last point, I guess, shows how strong your arguments against merging my mmu-notifier-core now are, so in the end making that cosmetic error paid off somehow. I'll send an update in any case to Andrew way before Saturday, so hopefully we'll finally get mmu-notifier-core merged before next week. Also, I'm not updating my mmu-notifier-core patch anymore except for strict bugfixes, so don't worry about any more cosmetic bugs being introduced while optimizing the code like it happened this time. The only other change I made has been to move mmu_notifier_unregister to the end of the patchset after getting more questions about its reliability, and I documented the rmmod requirements for ->release a bit. We'll think later about whether it makes sense to add it; nobody's using it anyway.
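A minimal sketch of the comparator point discussed above, under stated assumptions. When an array of pointers is sorted, the comparator is handed pointers to the array slots, so it has to load the stored pointer before comparing, including for the equality case. The code below is a user-space illustration built on the C library's qsort(); the names cmp_ptr_addr and sort_by_address are hypothetical, and this is not the mm_lock code from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical comparator for an array of pointers.
 * a and b point at array slots, not at the objects themselves,
 * so the stored pointers are loaded first and then compared. */
static int cmp_ptr_addr(const void *a, const void *b)
{
    uintptr_t x = (uintptr_t)*(void *const *)a;
    uintptr_t y = (uintptr_t)*(void *const *)b;

    if (x < y)
        return -1;
    if (x > y)
        return 1;
    return 0;   /* same stored pointer */
}

/* Sort n pointers into ascending address order. */
static void sort_by_address(void **v, size_t n)
{
    assert(v != NULL || n == 0);
    qsort(v, n, sizeof(v[0]), cmp_ptr_addr);
}
```

Comparing the slot addresses themselves for equality would never report two distinct slots as equal, which is consistent with the remark that the original comparator never returned 0 yet still produced a valid ordering.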
Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.
On Tuesday 22 April 2008 17:13:01 Anthony Liguori wrote: > Hollis Blanchard wrote: > > On Tuesday 22 April 2008 16:05:38 Rusty Russell wrote: > > > >> On Wednesday 23 April 2008 06:29:14 Hollis Blanchard wrote: > >> > >>> On Tuesday 22 April 2008 09:31:35 Rusty Russell wrote: > >>> > We may still regret not doing *everything* little-endian, but this > doesn't make it worse. > > >>> Hmm, why *don't* we just do everything LE, including the ring? > >>> > >> Mainly because when requirements are in doubt, simplicity wins, I think. > >> > > > > Well, I think the definition of simplicity is up for debate in this > > case... "LE everywhere" is much simpler than "it depends", IMHO. > > You couldn't use the vringfd direct ring mapping optimization in KVM for > PPC without teaching the kernel to access a vring in LE format. I'm > pretty sure the latter would get rejected on LKML anyway for vringfd as a > generic mechanism. You mean vringfd for use cases other than virtual IO drivers? I have a poor imagination; can you give some examples? Even then, it should be possible to have VIO drivers use a different set of accessors, just like there are swapping and non-swapping accessors for real IO, so I still don't see the problem. -- Hollis Blanchard IBM Linux Technology Center
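The "different set of accessors" idea mentioned above can be illustrated with a small sketch. This is only an illustration under assumptions: the helper names (ring_read16_le, ring_write16_le, ring_read16_native) are hypothetical and are not kernel or virtio API. One pair always treats the field as little-endian regardless of host byte order; the other reads it in whatever order the host uses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fixed-format accessor: the two bytes are always interpreted as
 * little-endian, whatever the host byte order is. */
static uint16_t ring_read16_le(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Store a 16-bit value in little-endian byte order. */
static void ring_write16_le(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)(v >> 8);
}

/* Non-swapping accessor: reads the field in the host's native order,
 * which is what a "guest endian" layout amounts to. */
static uint16_t ring_read16_native(const uint8_t *p)
{
    uint16_t v;

    assert(p != NULL);
    memcpy(&v, p, sizeof(v));
    return v;
}
```

With a fixed little-endian layout, a big-endian host pays for the byte shuffling in the first pair; with a native layout, both sides must already agree on byte order, which is the trade-off being debated in this thread.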
Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.
Hollis Blanchard wrote: > On Tuesday 22 April 2008 16:05:38 Rusty Russell wrote: > >> On Wednesday 23 April 2008 06:29:14 Hollis Blanchard wrote: >> >>> On Tuesday 22 April 2008 09:31:35 Rusty Russell wrote: >>> We may still regret not doing *everything* little-endian, but this doesn't make it worse. >>> Hmm, why *don't* we just do everything LE, including the ring? >>> >> Mainly because when requirements are in doubt, simplicity wins, I think. >> > > Well, I think the definition of simplicity is up for debate in this > case... "LE everywhere" is much simpler than "it depends", IMHO. > You couldn't use the vringfd direct ring mapping optimization in KVM for PPC without teaching the kernel to access a vring in LE format. I'm pretty sure the latter would get rejected on LKML anyway for vringfd as a generic mechanism. Regards, Anthony Liguori
Re: [kvm-devel] WARNING: at /usr/src/modules/kvm/mmu.c:390 account_shadowed()
On Mon, Apr 21, 2008 at 9:57 PM, Thomas Cataldo <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am running kvm-66 on top of a debian sid host with 2.6.24 (intel 32bit host).
>
> Got the following in my logs today:
>
> Apr 21 17:55:01 buffy kernel: WARNING: at /usr/src/modules/kvm/mmu.c:390 account_shadowed()
> Apr 21 17:55:01 buffy kernel: Pid: 21416, comm: kvm Tainted: P 2.6.24-1-686 #1
> Apr 21 17:55:01 buffy kernel: [] kvm_mmu_get_page+0x42d/0x447 [kvm]
> Apr 21 17:55:01 buffy kernel: [] kvm_mmu_load+0xdf/0x15c [kvm]
> Apr 21 17:55:01 buffy kernel: [] vmx_queue_exception+0x0/0x33 [kvm_intel]
> Apr 21 17:55:01 buffy kernel: [] kvm_arch_vcpu_ioctl_run+0x233/0x5a9 [kvm]
> Apr 21 17:55:01 buffy kernel: [] kvm_vcpu_ioctl+0xe4/0x34c [kvm]
> Apr 21 17:55:01 buffy kernel: [] delayacct_end+0x70/0x77
> Apr 21 17:55:01 buffy kernel: [] sync_page+0x0/0x3b
> Apr 21 17:55:01 buffy kernel: [] __delayacct_blkio_end+0x5b/0x5f
> Apr 21 17:55:01 buffy kernel: [] io_schedule+0x64/0x80
> Apr 21 17:55:01 buffy kernel: [] enqueue_entity+0x2b/0x3d
> Apr 21 17:55:01 buffy kernel: [] apic_wait_icr_idle+0xe/0x15
> Apr 21 17:55:01 buffy kernel: [] enqueue_task_fair+0x16/0x24
> Apr 21 17:55:01 buffy kernel: [] enqueue_task+0x52/0x5d
> Apr 21 17:55:01 buffy kernel: [] resched_task+0x52/0x54
> Apr 21 17:55:01 buffy kernel: [] try_to_wake_up+0x2b8/0x2c2
> Apr 21 17:55:01 buffy kernel: [] __wake_up_common+0x32/0x5c
> Apr 21 17:55:01 buffy kernel: [] __wake_up+0x32/0x42
> Apr 21 17:55:01 buffy kernel: [] wake_futex+0x3b/0x45
> Apr 21 17:55:01 buffy kernel: [] futex_wake+0x81/0xb0
> Apr 21 17:55:01 buffy kernel: [] do_futex+0x77/0x983
> Apr 21 17:55:01 buffy kernel: [] update_curr+0x62/0xef
> Apr 21 17:55:01 buffy kernel: [] __switch_to+0x9d/0x11d
> Apr 21 17:55:01 buffy kernel: [] kvm_vcpu_ioctl+0x0/0x34c [kvm]
> Apr 21 17:55:01 buffy kernel: [] do_ioctl+0x1f/0x62
> Apr 21 17:55:01 buffy kernel: [] vfs_ioctl+0x237/0x249
> Apr 21 17:55:01 buffy kernel: [] sys_ioctl+0x45/0x5d
> Apr 21 17:55:01 buffy kernel: [] sysenter_past_esp+0x6b/0xa1
>
> Regards,
> Thomas.

As I got no reply, I guess it is a bad setup on my part. If it helps, this happened while I was doing a "make -j" on the webkit svn tree (i.e. a heavy C++ compilation workload).
Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.
On Tuesday 22 April 2008 16:05:38 Rusty Russell wrote: > On Wednesday 23 April 2008 06:29:14 Hollis Blanchard wrote: > > On Tuesday 22 April 2008 09:31:35 Rusty Russell wrote: > > > We may still regret not doing *everything* little-endian, but this > > > doesn't make it worse. > > > > Hmm, why *don't* we just do everything LE, including the ring? > > Mainly because when requirements are in doubt, simplicity wins, I think. Well, I think the definition of simplicity is up for debate in this case... "LE everywhere" is much simpler than "it depends", IMHO. -- Hollis Blanchard IBM Linux Technology Center
Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.
On Wednesday 23 April 2008 06:29:14 Hollis Blanchard wrote: > On Tuesday 22 April 2008 09:31:35 Rusty Russell wrote: > > We may still regret not doing *everything* little-endian, but this > > doesn't make it worse. > > Hmm, why *don't* we just do everything LE, including the ring? Mainly because when requirements are in doubt, simplicity wins, I think. Cheers, Rusty.
[kvm-devel] [PATCH] Make virtio devices multi-function (v2)
This patch changes virtio devices to be multi-function devices whenever possible. This increases the number of virtio devices we can support now by a factor of 8. With this patch, I've been able to launch a guest with either 220 disks or 220 network adapters.

Since v1, I've changed the way virtio devices are allocated to be as follows:

1) Always use a slot as long as they are available. We can extend this to use a PCI when we get that working more reliably.

2) When PCI slots are exhausted, fall back to adding the device as an additional function on an existing slot.

This way, hotplug continues to work just as well as it does now. Once you exceed the number of PCI slots, you need an OS that can do hotplug of individual PCI functions if you care about doing hotplug. I think this is a pretty reasonable trade-off.

Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]>

diff --git a/qemu/hw/pci.c b/qemu/hw/pci.c
index a23a466..5d5d1a5 100644
--- a/qemu/hw/pci.c
+++ b/qemu/hw/pci.c
@@ -146,6 +146,41 @@ int pci_device_load(PCIDevice *s, QEMUFile *f)
     return 0;
 }
 
+/* Search the bus for a multifunction device with a free function that
+ * matches vendor_id_filter and device_id_filter. -1 can be passed as
+ * a filter value to accept any id.
+ */
+int pci_bus_find_device_function(PCIBus *bus, int vendor_id_filter,
+                                 int device_id_filter)
+{
+    int devfn;
+
+    for (devfn = bus->devfn_min; devfn < 256; devfn += 8) {
+        int vendor_id, device_id;
+        PCIDevice *pci_dev;
+
+        if (!bus->devices[devfn])
+            continue;
+
+        pci_dev = bus->devices[devfn];
+        vendor_id = pci_dev->config[0x01] << 8 | pci_dev->config[0x00];
+        device_id = pci_dev->config[0x03] << 8 | pci_dev->config[0x02];
+
+        if ((vendor_id_filter == -1 || vendor_id_filter == vendor_id) &&
+            (device_id_filter == -1 || device_id_filter == device_id) &&
+            ((pci_dev->config[0x0e] & 0x80) == 0x80)) {
+            int i;
+
+            for (i = 1; i < 8; i++) {
+                if (!bus->devices[devfn + i])
+                    return devfn + i;
+            }
+        }
+    }
+
+    return -1;
+}
+
 /* -1 for devfn means auto assign */
 PCIDevice *pci_register_device(PCIBus *bus, const char *name,
                                int instance_size, int devfn,
diff --git a/qemu/hw/pci.h b/qemu/hw/pci.h
index 60e4094..84d6a29 100644
--- a/qemu/hw/pci.h
+++ b/qemu/hw/pci.h
@@ -33,7 +33,7 @@ typedef struct PCIIORegion {
 #define PCI_ROM_SLOT 6
 #define PCI_NUM_REGIONS 7
 
-#define PCI_DEVICES_MAX 64
+#define PCI_DEVICES_MAX 256
 
 #define PCI_VENDOR_ID 0x00 /* 16 bits */
 #define PCI_DEVICE_ID 0x02 /* 16 bits */
@@ -105,6 +105,9 @@ void pci_info(void);
 PCIBus *pci_bridge_init(PCIBus *bus, int devfn, uint32_t id,
                         pci_map_irq_fn map_irq, const char *name);
 
+int pci_bus_find_device_function(PCIBus *bus, int vendor_id_filter,
+                                 int device_id_filter);
+
 /* lsi53c895a.c */
 #define LSI_MAX_DEVS 7
 void lsi_scsi_attach(void *opaque, BlockDriverState *bd, int id);
diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c
index 6a50001..361455d 100644
--- a/qemu/hw/virtio.c
+++ b/qemu/hw/virtio.c
@@ -405,12 +405,22 @@ VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,
     PCIDevice *pci_dev;
     uint8_t *config;
     uint32_t size;
+    int devfn = -1;
 
-    pci_dev = pci_register_device(bus, name, struct_size,
-                                  -1, NULL, NULL);
-    if (!pci_dev)
+    pci_dev = pci_register_device(bus, name, struct_size, -1, NULL, NULL);
+
+    if (pci_dev == NULL) {
+        devfn = pci_bus_find_device_function(bus, vendor, -1);
+        if (devfn != -1)
+            pci_dev = pci_register_device(bus, name, struct_size,
+                                          devfn, NULL, NULL);
+    }
+
+    if (pci_dev == NULL)
         return NULL;
 
+    devfn = pci_dev->devfn;
+
     vdev = to_virtio_device(pci_dev);
     vdev->status = 0;
 
@@ -438,6 +448,10 @@ VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,
 
     config[0x3d] = 1;
 
+    /* Mark device as multi-function */
+    if ((devfn % 8) == 0)
+        config[0x0e] |= 0x80;
+
     vdev->name = name;
     vdev->config_len = config_size;
     if (vdev->config_len)
diff --git a/qemu/net.h b/qemu/net.h
index 13daa27..3bada75 100644
--- a/qemu/net.h
+++ b/qemu/net.h
@@ -42,7 +42,7 @@ void net_client_uninit(NICInfo *nd);
 
 /* NIC info */
 
-#define MAX_NICS 8
+#define MAX_NICS 256
 
 struct NICInfo {
     uint8_t macaddr[6];
diff --git a/qemu/sysemu.h b/qemu/sysemu.h
index c60072d..4385802 100644
--- a/qemu/sysemu.h
+++ b/qemu/sysemu.h
@@ -149,7 +149,7 @@ typedef struct DriveInfo {
 
 #define MAX_IDE_DEVS 2
 #define MAX_SCSI_DEVS 7
-#define MAX_DRIVES 32
+#define MAX_DRIVES 256
 
 int nb_drives;
 DriveInfo drives_table[MAX_DRIVES+1];
diff --git a/qemu/vl.c b/qemu/vl.c
index 74be059..824e331 100644
--- a/qemu/vl.c
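The allocation policy described in the patch (whole slots first, then spare functions on existing slots) can be modelled outside QEMU. The following is a stand-alone sketch, not the QEMU code: struct toy_bus, toy_bus_init, toy_alloc, toy_slot and toy_function are hypothetical names, and the vendor/device ID filter is left out. It relies on the standard PCI encoding in which devfn = slot * 8 + function, which is also why the patch steps devfn by 8 and tests (devfn % 8) == 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_DEVFN_MAX 256   /* 32 slots x 8 functions on one bus */

/* Hypothetical stand-in for PCIBus: records which devfn values are in
 * use and which function-0 devices advertise the multi-function bit. */
struct toy_bus {
    bool used[TOY_DEVFN_MAX];
    bool multifunction[TOY_DEVFN_MAX];
    int devfn_min;          /* first devfn available for auto-assignment */
};

static int toy_slot(int devfn)     { return devfn >> 3; }
static int toy_function(int devfn) { return devfn & 7; }

static void toy_bus_init(struct toy_bus *bus, int devfn_min)
{
    memset(bus, 0, sizeof(*bus));
    bus->devfn_min = devfn_min;
}

/* Returns the allocated devfn, or -1 when the bus is full. */
static int toy_alloc(struct toy_bus *bus)
{
    int devfn, fn;

    assert(bus->devfn_min % 8 == 0);

    /* Step 1: prefer a free slot, i.e. a free function 0. */
    for (devfn = bus->devfn_min; devfn < TOY_DEVFN_MAX; devfn += 8) {
        if (!bus->used[devfn]) {
            bus->used[devfn] = true;
            bus->multifunction[devfn] = true;
            return devfn;
        }
    }

    /* Step 2: slots exhausted, so take a free function 1..7 on a
     * device that is marked multi-function. */
    for (devfn = bus->devfn_min; devfn < TOY_DEVFN_MAX; devfn += 8) {
        if (!bus->used[devfn] || !bus->multifunction[devfn])
            continue;
        for (fn = 1; fn < 8; fn++) {
            if (!bus->used[devfn + fn]) {
                bus->used[devfn + fn] = true;
                return devfn + fn;
            }
        }
    }

    return -1;
}
```

In this model, a bus initialised with devfn_min = 8 hands out 31 whole slots (devfn 8, 16, ..., 248) before the first function-level allocation, which is devfn 9, i.e. slot 1, function 1. Only allocations from that point on depend on the guest being able to hotplug individual functions.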
Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.
On Tuesday 22 April 2008 09:31:35 Rusty Russell wrote: > On Tuesday 22 April 2008 21:22:48 Avi Kivity wrote: > > > The virtio config space was originally chosen to be little-endian, > > > because we thought the config might be part of the PCI config space > > > for virtio_pci. It's actually a separate mmio region, so that > > > argument holds little water; as only x86 is currently using the virtio > > > mechanism, we can change this (but must do so now, before the > > > impending s390 and ppc merges). > > > > This will probably annoy Hollis which has guests that can go both ways. > > Yes, I discussed this with Hollis. But the virtio rings themselves already > have this issue: we don't do any endian conversion on them and assume > they're "our" endian in the guest. > > We may still regret not doing *everything* little-endian, but this doesn't > make it worse. Hmm, why *don't* we just do everything LE, including the ring? -- Hollis Blanchard IBM Linux Technology Center
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Tue, Apr 22, 2008 at 01:19:29PM -0700, Christoph Lameter wrote: > Thanks for adding most of my enhancements. But > > 1. There is no real need for invalidate_page(). Can be done with > invalidate_start/end. Needlessly complicates the API. One > of the objections by Andrew was that there were multiple > callbacks that perform similar functions. While I agree with that reading of Andrew's email about invalidate_page, I think the GRU hardware makes a strong enough case to justify the two separate callouts. Due to the GRU hardware, we can ensure that invalidate_page terminates all pending GRU faults (that includes faults that are just beginning) and can therefore be completed without needing any locking. The invalidate_page() callout gets turned into a GRU flush instruction and we return. Because invalidate_range_start() leaves the page table information available, we cannot use a single-page _start to mimic that functionality. Therefore, there is a documented case justifying the separate callouts. I agree the case is fairly weak, but it does exist. Given Andrea's unwillingness to move and Jack's documented case, it is my opinion that the most likely compromise is to leave in the invalidate_page() callout. Thanks, Robin
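To make the API trade-off in this exchange concrete, here is a toy model of the two callout styles. Everything in it is an assumption made for illustration: struct toy_notifier_ops, toy_invalidate_one and the counting callbacks are hypothetical, and the fallback dispatch is not how the kernel patch wires the hooks. It only shows the bookkeeping difference: a dedicated single-page callout costs one call per page, while expressing the same event as a range costs a start/end pair.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified callback table named after the callouts
 * discussed in the thread. Not the kernel's actual structure. */
struct toy_notifier_ops {
    void (*invalidate_page)(void *ctx, unsigned long addr);
    void (*invalidate_range_start)(void *ctx, unsigned long start,
                                   unsigned long end);
    void (*invalidate_range_end)(void *ctx, unsigned long start,
                                 unsigned long end);
};

/* Driver-side state for the example: just counts callouts received. */
struct toy_counter {
    int callouts;
};

static void count_page(void *ctx, unsigned long addr)
{
    (void)addr;
    ((struct toy_counter *)ctx)->callouts++;
}

static void count_range(void *ctx, unsigned long start, unsigned long end)
{
    (void)start;
    (void)end;
    ((struct toy_counter *)ctx)->callouts++;
}

/* Invalidate a single page: use the one-shot callout if the driver
 * provides it, otherwise bracket the page with the range pair. */
static void toy_invalidate_one(const struct toy_notifier_ops *ops, void *ctx,
                               unsigned long addr, unsigned long page_size)
{
    assert(ops != NULL);

    if (ops->invalidate_page) {
        ops->invalidate_page(ctx, addr);
        return;
    }
    ops->invalidate_range_start(ctx, addr, addr + page_size);
    /* ... the primary page table update would happen here ... */
    ops->invalidate_range_end(ctx, addr, addr + page_size);
}
```

The sketch says nothing about locking or about what the secondary MMU may still observe between start and end, which is the substance of the GRU argument above.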
Re: [kvm-devel] [PATCH 00 of 12] mmu notifier #v13
On Tue, 22 Apr 2008, Robin Holt wrote: > putting it back into your patch/agreeing to it remaining in Andrea's > patch? If not, I think we can put this issue aside until Andrew gets > out of the merge window and can decide it. Either way, the patches > become much more similar with this in. One solution would be to separate the invalidate_page() callout into a patch at the very end that can be omitted. AFAICT there is no compelling reason to have this callback, and it complicates the API for the device driver writers. Not having this callback makes the way that mmu notifiers are called from the VM uniform, which is a desirable goal.
Re: [kvm-devel] [PATCH 00 of 12] mmu notifier #v13
On Tue, 22 Apr 2008, Andrea Arcangeli wrote: > My patch order and API backward compatible extension over the patchset > is done to allow 2.6.26 to fully support KVM/GRU and 2.6.27 to support > XPMEM as well. KVM/GRU won't notice any difference once the support > for XPMEM is added, but even if the API would completely change in > 2.6.27, that's still better than no functionality at all in 2.6.26. Please redo the patchset with the right order. To my knowledge there is no chance of this getting merged for 2.6.26.
Re: [kvm-devel] [PATCH 10 of 12] Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock
Doing the right patch ordering would have avoided this patch and allow better review.
Re: [kvm-devel] [PATCH 05 of 12] Move the tlb flushing into free_pgtables. The conversion of the locks
Why are the subjects all screwed up? They are the first line of the description instead of the subject line of my patches.
Re: [kvm-devel] [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last
Reverts a part of an earlier patch. Why isn't this merged into 1 of 12?
Re: [kvm-devel] [PATCH 03 of 12] get_task_mm should not succeed if mmput() is running and has reduced
Missing signoff by you.
Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
I added tracers to kvm_mmu_page_fault() that include collecting tsc cycles:

1. before vcpu->arch.mmu.page_fault()
2. after vcpu->arch.mmu.page_fault()
3. after mmu_topup_memory_caches()
4. after emulate_instruction()

So the deltas in the trace reports show:
- cycles required for arch.mmu.page_fault (tracer 2)
- cycles required for mmu_topup_memory_caches() (tracer 3)
- cycles required for emulate_instruction() (tracer 4)

I captured trace data for ~5 seconds during one of the usual events (again, this time it was due to kscand in the guest). I ran the formatted trace data through an awk script to summarize:

TSC cycles           tracer2   tracer3   tracer4
0 - 10,000:           295067    213251    115873
10,001 - 25,000:        7682      1004     98336
25,001 - 50,000:         201        15        36
50,001 - 100,000:     100655         0        10
> 100,000:               117         0        15

This means vcpu->arch.mmu.page_fault() was called 403,722 times in the roughly 5-second interval: 295,067 times it took < 10,000 cycles, but 100,772 times it took longer than 50,000 cycles. The page_fault function getting run is paging64_page_fault.

mmu_topup_memory_caches() and emulate_instruction() were both run 214,270 times, most of them relatively quickly.

Note: I bumped the scheduling priority of the qemu threads to RR 1 so that few host processes could interrupt it.

david

Avi Kivity wrote:
> David S. Ahern wrote:
>> I added the traces and captured data over another apparent lockup of the guest.
>> This seems to be representative of the sequence (pid/vcpu removed).
>>
>> (+4776) VMEXIT [ exitcode = 0x, rip = 0x c016127c ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x0003, virt = 0x c0009db4 ]
>> (+3632) VMENTRY
>> (+4552) VMEXIT [ exitcode = 0x, rip = 0x c016104a ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x000b, virt = 0x fffb61c8 ]
>> (+ 54928) VMENTRY
>
> Can you oprofile the host to see where the 54K cycles are spent?
>
>> (+4568) VMEXIT [ exitcode = 0x, rip = 0x c01610e7 ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x0003, virt = 0x c0009db4 ]
>> (+ 0) PTE_WRITE [ gpa = 0x 9db4 gpte = 0x 41c5d363 ]
>> (+8432) VMENTRY
>> (+3936) VMEXIT [ exitcode = 0x, rip = 0x c01610ee ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x0003, virt = 0x c0009db0 ]
>> (+ 0) PTE_WRITE [ gpa = 0x 9db0 gpte = 0x ]
>> (+ 13832) VMENTRY
>>
>>
>> (+5768) VMEXIT [ exitcode = 0x, rip = 0x c016127c ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x0003, virt = 0x c0009db4 ]
>> (+3712) VMENTRY
>> (+4576) VMEXIT [ exitcode = 0x, rip = 0x c016104a ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x000b, virt = 0x fffb61d0 ]
>> (+ 0) PTE_WRITE [ gpa = 0x 3d5981d0 gpte = 0x 3d55d047 ]
>
> This indeed has the accessed bit clear.
>
>> (+ 65216) VMENTRY
>> (+4232) VMEXIT [ exitcode = 0x, rip = 0x c01610e7 ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x0003, virt = 0x c0009db4 ]
>> (+ 0) PTE_WRITE [ gpa = 0x 9db4 gpte = 0x 3d598363 ]
>
> This has the accessed bit set and the user bit clear, and the pte
> pointing at the previous pte_write gpa. Looks like a kmap_atomic().
>
>> (+8640) VMENTRY
>> (+3936) VMEXIT [ exitcode = 0x, rip = 0x c01610ee ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x0003, virt = 0x c0009db0 ]
>> (+ 0) PTE_WRITE [ gpa = 0x 9db0 gpte = 0x ]
>> (+ 14160) VMENTRY
>>
>> I can forward a more complete time snippet if you'd like. vcpu0 + corresponding
>> vcpu1 files have 85000 total lines and compressed the files total ~500k.
>>
>> I did not see the FLOODED trace come out during this sample though I did bump
>> the count from 3 to 4 as you suggested.
>
> Bumping the count was supposed to remove the flooding...
>
>> Correlating rip addresses to the 2.4 kernel:
>>
>> c0160d00-c0161290 = page_referenced
>>
>> It looks like the event is kscand running through the pages. I suspected this
>> some time ago, and tried tweaking the kscand_work_percent sysctl variable. It
>> appeared to lower the peak of the spikes, but maybe I imagined it. I believe
>> lowering that value makes kscand wake up more often but do less work (page
>> scanning) each time it is awakened.
>
> What does 'top' in the guest show (perhaps sorted by total cpu time
> rather than instantaneous usage)?
>
> What host kernel are you running? How many host cpus?
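The cycle-count summary at the top of this message was produced with an awk script that is not shown. As a rough illustration of the same bucketing, here is a small C sketch; it is not the author's script, and the names bucket_limit, bucket_index and summarize are hypothetical. It assumes the bucket boundaries given in the table (inclusive upper bounds of 10,000, 25,000, 50,000 and 100,000 cycles, with everything larger in the last bucket).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 5

/* Inclusive upper bounds of the first four buckets, in TSC cycles.
 * Anything above the last bound falls into bucket NBUCKETS - 1. */
static const uint64_t bucket_limit[NBUCKETS - 1] = {
    10000, 25000, 50000, 100000
};

/* Map one cycle delta to its bucket index. */
static size_t bucket_index(uint64_t cycles)
{
    size_t i;

    for (i = 0; i < NBUCKETS - 1; i++) {
        if (cycles <= bucket_limit[i])
            return i;
    }
    return NBUCKETS - 1;
}

/* Count how many of the n deltas fall into each bucket. */
static void summarize(const uint64_t *deltas, size_t n,
                      uint64_t counts[NBUCKETS])
{
    size_t i;

    assert(deltas != NULL || n == 0);

    for (i = 0; i < NBUCKETS; i++)
        counts[i] = 0;
    for (i = 0; i < n; i++)
        counts[bucket_index(deltas[i])]++;
}
```

Running summarize() once per tracer over its per-event deltas would give one column of the table; the 54,928-cycle VMENTRY delta quoted above, for example, lands in the 50,001 - 100,000 bucket.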
Re: [kvm-devel] [PATCH 02 of 12] Fix ia64 compilation failure because of common code include bug
Looks like this is not complete. There are numerous .h files missing which means that various structs are undefined (fs.h and rmap.h are needed f.e.) which leads to surprises when dereferencing fields of these struct. It seems that mm_types.h is expected to be included only in certain contexts. Could you make sure to include all necessary .h files? Or add some docs to clarify the situation here.
[kvm-devel] Problems with MAC address with e1000 on Windows 2003
I was wondering if anyone could reproduce my problem. If it is reproducible, then I'll file a bug.

I am using e1000 ethernet adapters on Windows 2003 and Linux guests. The line to set it up is something like this:

-net nic,vlan=1,macaddr=00:ff:21:cf:91:01,model=e1000 \
-net tap,vlan=1,ifname=tap.br1.91.1

On Linux, this works just fine. However, on Windows 2003, the mac address for the device is reported as 00:ff:ff:ff:ff:ff and the packets carry this mac address as well. The corresponding tap device has the correct IP address, however.

This problem is definitely tied to using Windows 2003 with an e1000 device. If I use the rtl8139 device, Windows reports the correct mac address. When booting the same VM with a Linux bootable CD and the e1000 device, Linux reports the correct mac address as set in the qemu command. It's the combination of Windows 2003 and the e1000 device that causes the problem.

Has anyone else seen this problem? Thanks in advance.

-- Alberto Treviño [EMAIL PROTECTED]
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
Thanks for adding most of my enhancements. But

1. There is no real need for invalidate_page(). Can be done with invalidate_start/end. Needlessly complicates the API. One of the objections by Andrew was that there were multiple callbacks that perform similar functions.

2. The locks that are used are later changed to semaphores. This is f.e. true for mm_lock / mm_unlock. The diffs will be smaller if the lock conversion is done first and then mm_lock is introduced. The way the patches are structured means that reviewers cannot review the final version of mm_lock etc etc. The lock conversion needs to come first.

3. As noted by Eric and also contained in private post from yesterday by me: The cmp function needs to retrieve the value before doing comparisons which is not done for the == of a and b.
Re: [kvm-devel] [PATCH 00 of 12] mmu notifier #v13
On Tue, Apr 22, 2008 at 08:43:35PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 01:22:13PM -0500, Robin Holt wrote:
> > 1) invalidate_page: You retain an invalidate_page() callout. I believe
> > we have progressed that discussion to the point that it requires some
> > direction from Andrew, Linus, or somebody in authority. The basics
> > of the difference distill down to no expected significant performance
> > difference between the two. The invalidate_page() callout potentially
> > can simplify GRU code. It does provide a more complex api for the
> > users of mmu_notifier which, IIRC, Christoph had interpreted from one
> > of Andrew's earlier comments as being undesirable. I vaguely recall
> > that sentiment as having been expressed.
>
> invalidate_page as demonstrated in KVM pseudocode doesn't change the
> locking requirements, and it has the benefit of reducing the window of
> time the secondary page fault has to be masked and at the same time
> _halves_ the number of _hooks_ in the VM every time the VM deals with
> single pages (example: do_wp_page hot path). As long as we can't fully
> converge because of point 3, I'd rather keep invalidate_page, as it is
> better. But that's by far not a priority to keep.

Christoph, Jack and I just discussed invalidate_page(). I don't think the point Andrew was making is that compelling in this circumstance. The code has changed fairly remarkably. Would you have any objection to putting it back into your patch/agreeing to it remaining in Andrea's patch? If not, I think we can put this issue aside until Andrew gets out of the merge window and can decide it. Either way, the patches become much more similar with this in.

Thanks,
Robin
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
On Tue, Apr 22, 2008 at 02:26:45PM -0300, Marcelo Tosatti wrote:
> On Tue, Apr 22, 2008 at 11:31:11AM -0500, Anthony Liguori wrote:
> > Avi Kivity wrote:
> > >Anthony Liguori wrote:
> > >>
> > >>I think we need to decide what we want to target in terms of upper
> > >>limits.
> > >>
> > >>With a bridge or two, we can probably easily do 128.
> > >>
> > >>If we really want to push things, I think we should do a PCI based
> > >>virtio controller. I doubt a large number of PCI devices is ever
> > >>going to perform very well b/c of interrupt sharing and some of the
> > >>assumptions in virtio_pci.
> > >>
> > >>If we implement a controller, we can use a single interrupt, but
> > >>multiplex multiple notifications on that single interrupt. We can
> > >>also be more aggressive about using shared memory instead of PCI
> > >>config space which would reduce the overall number of exits.
>
> We should increase the number of interrupt lines, perhaps to 16.
>
> Using shared memory to avoid exits sounds like a very good idea.
>
> > >>We could easily support a very large number of devices this way. But
> > >>again, what do we want to target for now?
> > >
> > >I think that for networking we should keep things as is. I don't see
> > >anybody using 100 virtual NICs.
>
> The target was along the lines of 20 nics + 80 disks. Dan?

I've already had people ask for the ability to use as many as 64 disks and 32 nics with Xen, so to my mind, the more we support the better. 100's if possible.

Dan.
--
|: Red Hat, Engineering, Boston -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
Marcelo Tosatti wrote:
> On Tue, Apr 22, 2008 at 05:32:45PM +0300, Avi Kivity wrote:
>
>> Anthony Liguori wrote:
>>
>>> This patch changes virtio devices to be multi-function devices whenever
>>> possible. This increases the number of virtio devices we can support now by
>>> a factor of 8.
>>>
>>> With this patch, I've been able to launch a guest with either 220 disks or
>>> 220 network adapters.
>>>
>> Does this play well with hotplug? Perhaps we need to allocate a new
>> device on hotplug.
>>
>> (certainly if we have a device with one function, which then gets
>> converted to a multifunction device)
>>
>
> Would have to change the hotplug code to handle functions...
>

BTW, I've never been that convinced that hotplugging devices is as useful as people make it out to be. I also think that's particularly true when it comes to hot adding/removing very large numbers of disks.

I think if you created all virtio devices as multifunction devices, but didn't add additional functions until you ran out of PCI slots, it would be a pretty acceptable solution. Hotplug works just as it does today until you get much higher than 32 devices. Even then, hotplug still works with most of your devices (until you hit the absolute maximum number of devices of course).

Regards,

Anthony Liguori

> It sounds less hacky to just extend the PCI slots instead of (ab)using
> multiple functions per-slot.
>
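The "factor of 8" and the "fill slots first, add functions later" policy discussed above can be illustrated with a small sketch. The devfn encoding (5-bit slot, 3-bit function) is the standard PCI one; the allocation helper virtio_devfn_for_index() and its first_slot parameter are hypothetical names made up for this illustration and are not taken from the patch.

```c
#include <stdint.h>

#define PCI_SLOTS_PER_BUS  32   /* 5-bit device (slot) number */
#define PCI_FUNCS_PER_SLOT 8    /* 3-bit function number */

/* Standard PCI encoding: devfn = (slot << 3) | function. */
static inline uint8_t pci_devfn(unsigned slot, unsigned func)
{
        return (uint8_t)(((slot & 0x1f) << 3) | (func & 0x07));
}

static inline unsigned pci_slot(uint8_t devfn) { return devfn >> 3; }
static inline unsigned pci_func(uint8_t devfn) { return devfn & 0x07; }

/*
 * Hypothetical allocation policy: the n-th virtio device (0-based) gets
 * function 0 of its own slot while free slots remain. Only once every
 * usable slot holds one device are functions 1..7 handed out.
 * first_slot is the first slot not taken by other devices.
 * Returns the devfn, or -1 when the bus is full.
 */
static int virtio_devfn_for_index(unsigned n, unsigned first_slot)
{
        unsigned usable;

        if (first_slot >= PCI_SLOTS_PER_BUS)
                return -1;
        usable = PCI_SLOTS_PER_BUS - first_slot;
        if (n >= usable * PCI_FUNCS_PER_SLOT)
                return -1;
        return pci_devfn(first_slot + n % usable, n / usable);
}
```

With this policy, the first devices all land on function 0 and behave exactly like single-function devices, so slot-based hotplug is only affected for devices added after the slots run out.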
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
Avi Kivity wrote:
> For mass storage, we should follow the SCSI model with a single device
> serving multiple disks, similar to what you suggest. Not sure if the
> device should have a single queue or one queue per disk.

Don't you just end up re-implementing SCSI then, at which point you might as well stick with a 'fake' SCSI device in the guest?
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
>>> For mass storage, we should follow the SCSI model with a single device
>>> serving multiple disks, similar to what you suggest. Not sure if the
>>> device should have a single queue or one queue per disk.
>>>
>> My latest thought is to do a virtio-based virtio controller.
>>
>
> Why do you dislike multiple disks per virtio-blk controller? As
> mentioned this seems a natural way forward.
>

Logically speaking, virtio is a bus. virtio supports all of the features of a bus (discover, hot add, hot remove).

Right now, we map virtio devices directly onto the PCI bus.

The problem we're trying to address is limitations of the PCI bus. We have a couple of options:

1) add a virtio device that supports multiple disks. We need to reinvent hotplug within this device.

2) add a new PCI virtio transport that supports multiple virtio-blk devices within a single PCI slot

3) add a generic PCI virtio transport that supports multiple virtio devices within a single PCI slot

4) add a generic virtio "bridge" that supports multiple virtio devices within a single virtio device.

#4 may seem strange, but it's no different from a PCI-to-PCI bridge.

I like #4 the most, but #2 is probably the most practical.

Regards,

Anthony Liguori
Re: [kvm-devel] [PATCH 00 of 12] mmu notifier #v13
On Tue, Apr 22, 2008 at 01:22:13PM -0500, Robin Holt wrote:
> 1) invalidate_page: You retain an invalidate_page() callout. I believe
> we have progressed that discussion to the point that it requires some
> direction from Andrew, Linus, or somebody in authority. The basics
> of the difference distill down to no expected significant performance
> difference between the two. The invalidate_page() callout potentially
> can simplify GRU code. It does provide a more complex api for the
> users of mmu_notifier which, IIRC, Christoph had interpreted from one
> of Andrew's earlier comments as being undesirable. I vaguely recall
> that sentiment as having been expressed.

invalidate_page as demonstrated in KVM pseudocode doesn't change the locking requirements, and it has the benefit of reducing the window of time the secondary page fault has to be masked and at the same time _halves_ the number of _hooks_ in the VM every time the VM deals with single pages (example: do_wp_page hot path). As long as we can't fully converge because of point 3, I'd rather keep invalidate_page, as it is better. But that's by far not a priority to keep.

> 2) Range callout names: Your range callouts are invalidate_range_start
> and invalidate_range_end whereas Christoph's are start and end. I do not
> believe this has been discussed in great detail. I know I have expressed
> a preference for your names. I admit to having failed to follow up on
> this issue. I certainly believe we could come to an agreement quickly
> if pressed.

I think using ->start ->end is a mistake, think when we later add mprotect_range_start/end. Here too I keep the better names only because we can't converge on point 3 (the API will eventually change, like every other kernel internal API, even core things like __free_page have been mostly obsoleted).

> 3) The structure of the patch set: Christoph's upcoming release orders
> the patches so the prerequisite patches are separately reviewable
> and each file is only touched by a single patch. Additionally, that

Each file touched by a single patch? I doubt it... The split is about the same, the main difference is the merge ordering, I always had the zero risk part at the head, he moved it at the tail when he incorporated #v12 into his patchset.

> allows mmu_notifiers to be introduced as a single patch with sleeping
> functionality from its inception and an API which remains unchanged.
> Your patch set, however, introduces one API, then turns around and
> changes that API. Again, the desire to make it an unchanging API was
> expressed by, IIRC, Andrew. This does represent a risk to XPMEM as
> the non-sleeping API may become entrenched and make acceptance of the
> sleeping version less acceptable.
>
> Can we agree upon this list of issues?

This is a kernel internal API, so it will definitely change over time. It's nothing close to a syscall.

Also note: the API is obviously defined in mmu_notifier.h and none of the 2-12 patches touches mmu_notifier.h. So the extension of the method semantics is 100% backwards compatible.

My patch order and API backward compatible extension over the patchset is done to allow 2.6.26 to fully support KVM/GRU and 2.6.27 to support XPMEM as well. KVM/GRU won't notice any difference once the support for XPMEM is added, but even if the API would completely change in 2.6.27, that's still better than no functionality at all in 2.6.26.
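The "halves the number of hooks" argument can be made concrete with a small userspace sketch. Everything here is a simplified stand-in: struct notifier, struct notifier_ops, SKETCH_PAGE_SIZE and notify_single_page() are hypothetical names for illustration and are not the definitions from mmu_notifier.h. Only the callout names invalidate_page, invalidate_range_start and invalidate_range_end come from the thread.

```c
#include <stddef.h>

struct notifier;        /* stand-in for a secondary-MMU user such as KVM or GRU */

struct notifier_ops {
        /* Optional single-page callout; may be NULL. */
        void (*invalidate_page)(struct notifier *n, unsigned long addr);
        /* Range callouts bracketing a change to [start, end). */
        void (*invalidate_range_start)(struct notifier *n,
                                       unsigned long start, unsigned long end);
        void (*invalidate_range_end)(struct notifier *n,
                                     unsigned long start, unsigned long end);
};

struct notifier {
        const struct notifier_ops *ops;
        unsigned calls;         /* callouts received, for illustration only */
};

#define SKETCH_PAGE_SIZE 4096UL

/*
 * Invalidate one page. With invalidate_page available a single callout
 * is enough; without it the same event costs a start/end pair.
 * Returns the number of callouts made.
 */
static unsigned notify_single_page(struct notifier *n, unsigned long addr)
{
        if (n->ops->invalidate_page) {
                n->ops->invalidate_page(n, addr);
                return 1;
        }
        n->ops->invalidate_range_start(n, addr, addr + SKETCH_PAGE_SIZE);
        /* the core VM would modify the page table entry here */
        n->ops->invalidate_range_end(n, addr, addr + SKETCH_PAGE_SIZE);
        return 2;
}

/* Counting callbacks used to exercise the sketch. */
static void count_page(struct notifier *n, unsigned long addr)
{
        (void)addr;
        n->calls++;
}

static void count_range(struct notifier *n, unsigned long start,
                        unsigned long end)
{
        (void)start;
        (void)end;
        n->calls++;
}

static const struct notifier_ops ops_with_page = {
        count_page, count_range, count_range
};
static const struct notifier_ops ops_range_only = {
        NULL, count_range, count_range
};
```

A notifier built on ops_with_page sees one callout per single-page event, while one built on ops_range_only sees two, which is the difference being argued over for hot paths such as do_wp_page.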
[kvm-devel] Bug 1895893 inquiry (KVM-60+ halts, when using SCSI)
Thanks for all those who work on KVM. It is a wonderful product and I have been very impressed with its features, performance, and the level of activity in this project.

Back in February a bug was filed. I've been hit by this bug as well, but there hasn't been much activity with it in the last little bit. I wanted to know if anyone had a fix for it, or a workaround (other than using IDE), or whether it was on someone's radar. Here is a link to the bug:

http://sourceforge.net/tracker/index.php?func=detail&aid=1895893&group_id=180599&atid=893831

Thanks in advance.

--
Alberto Treviño [EMAIL PROTECTED]
Re: [kvm-devel] [PATCH 00 of 12] mmu notifier #v13
I believe the differences between your patch set and Christoph's need to be understood and a compromise approach agreed upon.

Those differences, as I understand them, are:

1) invalidate_page: You retain an invalidate_page() callout. I believe we have progressed that discussion to the point that it requires some direction from Andrew, Linus, or somebody in authority. The basics of the difference distill down to no expected significant performance difference between the two. The invalidate_page() callout potentially can simplify GRU code. It does provide a more complex api for the users of mmu_notifier which, IIRC, Christoph had interpreted from one of Andrew's earlier comments as being undesirable. I vaguely recall that sentiment as having been expressed.

2) Range callout names: Your range callouts are invalidate_range_start and invalidate_range_end whereas Christoph's are start and end. I do not believe this has been discussed in great detail. I know I have expressed a preference for your names. I admit to having failed to follow up on this issue. I certainly believe we could come to an agreement quickly if pressed.

3) The structure of the patch set: Christoph's upcoming release orders the patches so the prerequisite patches are separately reviewable and each file is only touched by a single patch. Additionally, that allows mmu_notifiers to be introduced as a single patch with sleeping functionality from its inception and an API which remains unchanged. Your patch set, however, introduces one API, then turns around and changes that API. Again, the desire to make it an unchanging API was expressed by, IIRC, Andrew. This does represent a risk to XPMEM as the non-sleeping API may become entrenched and make acceptance of the sleeping version less likely.

Can we agree upon this list of issues?

Thank you,
Robin Holt
Re: [kvm-devel] pv clock: kvm is incompatible with xen :-(
Gerd Hoffmann wrote:
> Jeremy Fitzhardinge wrote:
>> Xen could change the parameters in the instant after get_time_values().
>> That change could be as a result of suspend-resume, so the parameters
>> and the tsc could be wildly different.
>
> Ah, ok, forgot the rdtsc in the picture. With that in mind I fully
> agree that the loop is needed. I think kvm guests can even hit that one
> with the vcpu migrating to a different physical cpu, so we better handle
> it correctly ;)

It's probably not needed for kvm, since we update everything every time we get scheduled on the host side, which would cover the case of migration between physical cpus. But it's probably okay to do it to get a common denominator with xen, if needed.
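The loop under discussion is a version-checked retry around both the parameter read and the TSC read. Below is a minimal userspace sketch of that pattern. The structure layout, the field names, the scaling formula and the helper names pv_clock_read() and example_tsc() are simplified assumptions for illustration, not the actual Xen or KVM ABI; the only convention taken as given is that the version is odd while the hypervisor is updating the record.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical, simplified time record shared by the hypervisor. */
struct pv_time_info {
        _Atomic uint32_t version;       /* odd while an update is in progress */
        uint64_t tsc_timestamp;         /* TSC value at the last update */
        uint64_t system_time_ns;        /* system time at the last update */
        uint32_t tsc_to_ns_mul;         /* assumed scaling: (delta * mul) >> shift */
        uint32_t tsc_shift;
};

/*
 * Read the clock. The TSC is sampled inside the loop so that a parameter
 * change between reading the record and reading the TSC (for example
 * after suspend/resume or a move to another physical cpu) forces a retry
 * instead of mixing old parameters with a new TSC value.
 */
static uint64_t pv_clock_read(struct pv_time_info *info,
                              uint64_t (*read_tsc)(void))
{
        uint32_t version;
        uint64_t ns;

        do {
                version = atomic_load_explicit(&info->version,
                                               memory_order_acquire);
                uint64_t delta = read_tsc() - info->tsc_timestamp;
                ns = info->system_time_ns +
                     ((delta * info->tsc_to_ns_mul) >> info->tsc_shift);
                atomic_thread_fence(memory_order_acquire);
        } while ((version & 1) ||
                 version != atomic_load_explicit(&info->version,
                                                 memory_order_relaxed));
        return ns;
}

/* Stand-in TSC source for trying the sketch without rdtsc. */
static uint64_t example_tsc(void)
{
        return 1500;
}
```

If the host rewrites the record every time the vcpu is scheduled in, as described above for kvm, the retry path should rarely if ever trigger, but keeping the loop costs little and lets the same guest code serve both hypervisors.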
Re: [kvm-devel] [PATCH 2/2] KVM: Handle interrupts for PCI passthrough devices
* On Sunday 13 Apr 2008 14:06:27 Avi Kivity wrote:
> Amit Shah wrote:
> > Passthrough devices are host machine PCI devices which have
> > been handed off to the guest. Handle interrupts from these
> > devices and route them to the appropriate guest irq lines.
> > The userspace provides us with the necessary information
> > via the ioctls.
> >
> > The guest IRQ numbers can change dynamically, so we have an
> > additional ioctl that keeps track of those changes in userspace
> > and notifies us whenever that happens.
> >
> > It is expected the kernel driver for the passthrough device
> > is removed before passing it on to the guest.
> >
> >
> > +/*
> > + * Used to find a registered host PCI device (a "passthrough" device)
> > + * during interrupts or EOI
> > + */
> > +static struct kvm_pci_pt_dev_list *
> > +find_pci_pt_dev(struct list_head *head,
> > +                struct kvm_pci_pt_info *pv_pci_info, int irq, int source)
> > +{
> > +        struct list_head *ptr;
> > +        struct kvm_pci_pt_dev_list *match;
> > +
> > +        list_for_each(ptr, head) {
> > +                match = list_entry(ptr, struct kvm_pci_pt_dev_list, list);
> > +
> > +                switch (source) {
> > +                case KVM_PT_SOURCE_IRQ:
> > +                        /*
> > +                         * Used to find a registered host device
> > +                         * during interrupt context on host
> > +                         */
> > +                        if (match->pt_dev.host.irq == irq)
> > +                                return match;
> > +                        break;
> > +                case KVM_PT_SOURCE_IRQ_ACK:
> > +                        /*
> > +                         * Used to find a registered host device when
> > +                         * the guest acks an interrupt
> > +                         */
> > +                        if (match->pt_dev.guest.irq == irq)
> > +                                return match;
> > +                        break;
> > +                }
> > +        }
> > +        return NULL;
> > +}
>
> This would be better as two separate functions. Also, locking?

For pvdma, there will be two more cases. Very similar functions for essentially looking up an entry in the same list.

Locking will be supported soon.

> > +static irqreturn_t
> > +kvm_pci_pt_dev_intr(int irq, void *dev_id)
>
> Please don't split declarations unnecessarily.

Fixed.

> > +{
> > +        struct kvm_pci_pt_dev_list *match;
> > +        struct kvm *kvm = (struct kvm *) dev_id;
> > +
> > +        if (!test_bit(irq, pt_irq_handled))
> > +                return IRQ_NONE;
> > +
> > +        if (test_bit(irq, pt_irq_pending))
> > +                return IRQ_HANDLED;
>
> Will the interrupt not fire immediately after this returns?

Hmm. This is just an optimisation so that we don't have to look up the list each time to find out which assigned device it is and (re)injecting the interrupt. Also we avoid the (TODO) getting/releasing locks which will be needed for the list lookup.

Disabling interrupts for PCI devices isn't a good idea even if we don't support shared interrupts. Any other ideas to avoid this from happening?

> > +        match = find_pci_pt_dev(&kvm->arch.pci_pt_dev_head, NULL,
> > +                                irq, KVM_PT_SOURCE_IRQ);
> > +        if (!match)
> > +                return IRQ_NONE;
> > +
> > +        /* Not possible to detect if the guest uses the PIC or the
> > +         * IOAPIC. So set the bit in both. The guest will ignore
> > +         * writes to the unused one.
> > +         */
> > +        kvm_ioapic_set_irq(kvm->arch.vioapic, match->pt_dev.guest.irq, 1);
> > +        kvm_pic_set_irq(pic_irqchip(kvm), match->pt_dev.guest.irq, 1);
>
> A function that calls both the apic and the pic is better, as it will be
> easier to port.

Done.

> > +        set_bit(irq, pt_irq_pending);
> > +        return IRQ_HANDLED;
> > +}
> > +
> > +/* Ack the irq line for a passthrough device */
> > +void
> > +kvm_pci_pt_ack_irq(struct kvm *kvm, int vector)
> > +{
> > +        int irq;
> > +        struct kvm_pci_pt_dev_list *match;
> > +
> > +        irq = get_eoi_gsi(kvm->arch.vioapic, vector);
> > +        match = find_pci_pt_dev(&kvm->arch.pci_pt_dev_head, NULL,
> > +                                irq, KVM_PT_SOURCE_IRQ_ACK);
> > +        if (!match)
> > +                return;
> > +        if (test_bit(match->pt_dev.host.irq, pt_irq_pending)) {
> > +                kvm_ioapic_set_irq(kvm->arch.vioapic, irq, 0);
> > +                kvm_pic_set_irq(pic_irqchip(kvm), irq, 0);
>
> This is dangerous with smp guests, if we aren't careful with the
> ordering the interrupt may fire again and be forwarded to the other
> vcpu. We need to call this before we redeliver interrupts.

The 'pending' bitmap ensures we don't inject an interrupt that hasn't been ack'ed. Once the locking is in place, this shouldn't be a worry.

> > +                clear_bit(match->pt_dev.host.irq, pt_irq_pending);
> > +        }
> > +}

...

> > @@ -1671,6 +1836,30 @@ long kvm_arch_vm_ioctl(struct file *filp,
> >                  r = 0;
> >                  break;
> >          }
> > +        case KVM_ASSIGN_PCI_PT_DEV: {
> > +                struct kvm_pci_passthrough_dev pci_pt_dev;
> > +
> > +                r
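As an illustration of the review comment that the lookup "would be better as two separate functions", here is a minimal userspace sketch. struct pt_dev and both function names are hypothetical simplifications; the patch itself uses struct kvm_pci_pt_dev_list on a kernel list_head. Locking is left out here, just as it is still a TODO in the patch.

```c
#include <stddef.h>

/* Hypothetical, simplified entry for one assigned device. */
struct pt_dev {
        int host_irq;           /* irq line on the host */
        int guest_irq;          /* irq line the guest sees */
        struct pt_dev *next;
};

/* Lookup used from the host interrupt handler. */
static struct pt_dev *find_pt_dev_by_host_irq(struct pt_dev *head, int irq)
{
        for (struct pt_dev *d = head; d != NULL; d = d->next)
                if (d->host_irq == irq)
                        return d;
        return NULL;
}

/* Lookup used when the guest acks an interrupt. */
static struct pt_dev *find_pt_dev_by_guest_irq(struct pt_dev *head, int irq)
{
        for (struct pt_dev *d = head; d != NULL; d = d->next)
                if (d->guest_irq == irq)
                        return d;
        return NULL;
}
```

Two small functions drop the source argument and the switch statement, at the cost of a near-duplicate loop for each further lookup key (such as the two pvdma cases mentioned in the reply).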
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
On Tue, Apr 22, 2008 at 11:31:11AM -0500, Anthony Liguori wrote:
> Avi Kivity wrote:
> >Anthony Liguori wrote:
> >>
> >>I think we need to decide what we want to target in terms of upper
> >>limits.
> >>
> >>With a bridge or two, we can probably easily do 128.
> >>
> >>If we really want to push things, I think we should do a PCI based
> >>virtio controller. I doubt a large number of PCI devices is ever
> >>going to perform very well b/c of interrupt sharing and some of the
> >>assumptions in virtio_pci.
> >>
> >>If we implement a controller, we can use a single interrupt, but
> >>multiplex multiple notifications on that single interrupt. We can
> >>also be more aggressive about using shared memory instead of PCI
> >>config space which would reduce the overall number of exits.

We should increase the number of interrupt lines, perhaps to 16.

Using shared memory to avoid exits sounds like a very good idea.

> >>We could easily support a very large number of devices this way. But
> >>again, what do we want to target for now?
> >
> >I think that for networking we should keep things as is. I don't see
> >anybody using 100 virtual NICs.

The target was along the lines of 20 nics + 80 disks. Dan?

> >For mass storage, we should follow the SCSI model with a single device
> >serving multiple disks, similar to what you suggest. Not sure if the
> >device should have a single queue or one queue per disk.
>
> My latest thought it to do a virtio-based virtio controller.

Why do you dislike multiple disks per virtio-blk controller? As mentioned this seems a natural way forward.
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
Andrea Arcangeli wrote:
> +
> +static int mm_lock_cmp(const void *a, const void *b)
> +{
> +        cond_resched();
> +        if ((unsigned long)*(spinlock_t **)a <
> +            (unsigned long)*(spinlock_t **)b)
> +                return -1;
> +        else if (a == b)
> +                return 0;
> +        else
> +                return 1;
> +}
> +

This compare function looks unusual...

It should work, but sort() could be faster if the if (a == b) test had a chance to be true eventually...

static int mm_lock_cmp(const void *a, const void *b)
{
        unsigned long la = (unsigned long)*(spinlock_t **)a;
        unsigned long lb = (unsigned long)*(spinlock_t **)b;

        cond_resched();
        if (la < lb)
                return -1;
        if (la > lb)
                return 1;
        return 0;
}
[kvm-devel] Odd hang in the Ubuntu installer
Hi guys.

I'm trying to figure out what's going on with this bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/217815

The short version of the problem is that if the console is left alone for an extended period of time, everything seems to stall until something (moving the mouse around, pressing a key, whatever) awakens it again. It usually shows itself when you choose the "Encrypted LVM" option in our installer (this process wipes the drive, which is a rather lengthy process), since that's probably the only place where you'd leave the console alone for a while, while still getting some UI feedback (and a sudden lack of feedback, obviously).

It started when I backported this to the kvm version in our archive:

commit d2668b3fd41f88c18a7f9c4f1d024f0e5d9f64cf
Author: Marcelo Tosatti <[EMAIL PROTECTED]>
Date: Wed Apr 2 20:20:14 2008 -0300
Subject: kvm: qemu: separate thread for IO handling

While trying to solve this problem, I noticed that that commit was just one of a set of three patches. Applying the other two:

commit 1743ef816b6cd22d100ccb80e542b8ca19c75392
Author: Marcelo Tosatti <[EMAIL PROTECTED]>
Date: Wed Apr 2 20:20:15 2008 -0300
Subject: kvm: qemu: add function to handle signals

commit d84f71afaafec49e0ab3aa7a33518df04c14f38a
Author: Marcelo Tosatti <[EMAIL PROTECTED]>
Date: Wed Apr 2 20:20:16 2008 -0300
Subject: kvm: qemu: notify IO thread of pending bhs

...makes it take a bit longer before it happens, but it's still very much reproducible. Reverting those changes fixes it completely.

We've tried with kvm 66, which also exhibits this behaviour, so I'm fairly confident I didn't mess up the patch while backporting it. In case you're interested, the backported patch is here:

http://people.ubuntu.com/~soren/virtio_hang.patch

The latter two commits applied without changes (with a bit of fuzz, though).

I'm hoping one of you guys could give me a hint (or perhaps even a patch)?

--
Soren Hansen | Virtualisation specialist | Ubuntu Server Team
Canonical Ltd. | http://www.ubuntu.com/
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Tue, Apr 22, 2008 at 05:37:38PM +0200, Eric Dumazet wrote:
> I am saying your intent was probably to test
>
> else if ((unsigned long)*(spinlock_t **)a ==
>          (unsigned long)*(spinlock_t **)b)
>         return 0;

Indeed...

> Hum, it's not a micro-optimization, but a bug fix. :)

The good thing is that even if this bug would lead to a system crash, it would be still zero risk for everybody that isn't using KVM/GRU actively with mmu notifiers. The important thing is that this patch has zero risk to introduce regressions into the kernel, both when enabled and disabled, it's like a new driver. I'll shortly resend 1/12 and likely 12/12 for theoretical correctness. For now you can go ahead testing with this patch as it'll work fine despite the bug (if it wasn't the case I would have noticed already ;).
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote:
>
>> Andrea Arcangeli wrote:
>>
>>> +
>>> +static int mm_lock_cmp(const void *a, const void *b)
>>> +{
>>> +        cond_resched();
>>> +        if ((unsigned long)*(spinlock_t **)a <
>>> +            (unsigned long)*(spinlock_t **)b)
>>> +                return -1;
>>> +        else if (a == b)
>>> +                return 0;
>>> +        else
>>> +                return 1;
>>> +}
>>> +
>>>
>> This compare function looks unusual...
>> It should work, but sort() could be faster if the
>> if (a == b) test had a chance to be true eventually...
>>
>
> Hmm, are you saying my mm_lock_cmp won't return 0 if a==b?
>

I am saying your intent was probably to test

else if ((unsigned long)*(spinlock_t **)a ==
         (unsigned long)*(spinlock_t **)b)
        return 0;

Because a and b are pointers to the data you want to compare. You need to dereference them.

>> static int mm_lock_cmp(const void *a, const void *b)
>> {
>>         unsigned long la = (unsigned long)*(spinlock_t **)a;
>>         unsigned long lb = (unsigned long)*(spinlock_t **)b;
>>
>>         cond_resched();
>>         if (la < lb)
>>                 return -1;
>>         if (la > lb)
>>                 return 1;
>>         return 0;
>> }
>>
>
> If your intent is to use the assumption that there are going to be few
> equal entries, you should have used likely(la > lb) to signal it's
> rarely going to return zero or gcc is likely free to do whatever it
> wants with the above. Overall that function is such a slow path that
> this is going to be lost in the noise. My suggestion would be to defer
> microoptimizations like this after 1/12 will be applied to mainline.
>
> Thanks!
>
>

Hum, it's not a micro-optimization, but a bug fix. :)

Sorry if it was not clear
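The difference between comparing the element addresses (a == b) and comparing the pointers stored in the elements can be shown with a small userspace sketch. The names cmp_buggy() and cmp_fixed() are made up for this illustration, plain void pointers stand in for spinlock_t pointers, and cond_resched() is left out because it has no userspace equivalent.

```c
#include <stdint.h>

/*
 * Shape of the original comparator: ordering uses the stored pointers,
 * but equality is tested on the element addresses a and b. Those differ
 * for two distinct array slots even when both slots hold the same
 * pointer, so equal entries are reported as "greater".
 */
static int cmp_buggy(const void *a, const void *b)
{
        uintptr_t la = (uintptr_t)*(void *const *)a;
        uintptr_t lb = (uintptr_t)*(void *const *)b;

        if (la < lb)
                return -1;
        else if (a == b)
                return 0;
        else
                return 1;
}

/* Corrected shape: every decision is made on the dereferenced values. */
static int cmp_fixed(const void *a, const void *b)
{
        uintptr_t la = (uintptr_t)*(void *const *)a;
        uintptr_t lb = (uintptr_t)*(void *const *)b;

        if (la < lb)
                return -1;
        if (la > lb)
                return 1;
        return 0;
}
```

For two slots holding the same pointer, cmp_buggy() returns 1 in both argument orders, which is not a consistent ordering, while cmp_fixed() returns 0. That matters when the sorted array is later scanned for duplicate locks.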
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
Avi Kivity wrote:
> Anthony Liguori wrote:
>>
>> I think we need to decide what we want to target in terms of upper
>> limits.
>>
>> With a bridge or two, we can probably easily do 128.
>>
>> If we really want to push things, I think we should do a PCI based
>> virtio controller. I doubt a large number of PCI devices is ever
>> going to perform very well b/c of interrupt sharing and some of the
>> assumptions in virtio_pci.
>>
>> If we implement a controller, we can use a single interrupt, but
>> multiplex multiple notifications on that single interrupt. We can
>> also be more aggressive about using shared memory instead of PCI
>> config space which would reduce the overall number of exits.
>>
>> We could easily support a very large number of devices this way. But
>> again, what do we want to target for now?
>
> I think that for networking we should keep things as is. I don't see
> anybody using 100 virtual NICs.
>
> For mass storage, we should follow the SCSI model with a single device
> serving multiple disks, similar to what you suggest. Not sure if the
> device should have a single queue or one queue per disk.

My latest thought is to do a virtio-based virtio controller. We could avoid creating one in QEMU unless we detect an abnormally large number of disks or something.

Regards,

Anthony Liguori
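The idea of one interrupt carrying notifications for many devices, with the details kept in shared memory, could look roughly like the sketch below. This is only one possible reading of the proposal: struct ctrl_shared, the 64-device limit and all function names are hypothetical, and no such controller is described in detail in the thread.

```c
#include <stdatomic.h>
#include <stdint.h>

#define CTRL_MAX_DEVICES 64

/* Hypothetical page shared between host and guest. */
struct ctrl_shared {
        _Atomic uint64_t pending;       /* bit i set: device i has work */
};

/*
 * Host side: mark a device as pending. Returns 1 if the single shared
 * interrupt has to be raised (the bitmap was empty before), 0 if an
 * interrupt is already outstanding or dev is out of range.
 */
static int ctrl_notify(struct ctrl_shared *s, unsigned dev)
{
        uint64_t old;

        if (dev >= CTRL_MAX_DEVICES)
                return 0;
        old = atomic_fetch_or(&s->pending, UINT64_C(1) << dev);
        return old == 0;
}

/* Guest side: atomically take the whole set of pending devices. */
static uint64_t ctrl_take_pending(struct ctrl_shared *s)
{
        return atomic_exchange(&s->pending, 0);
}

/* Pop the lowest pending device number from a taken set, or -1 if empty. */
static int ctrl_next_device(uint64_t *bits)
{
        int dev = 0;

        if (*bits == 0)
                return -1;
        while (((*bits >> dev) & 1) == 0)
                dev++;
        *bits &= *bits - 1;     /* clear the lowest set bit */
        return dev;
}
```

In this arrangement the guest interrupt handler reads ordinary memory rather than PCI config space to find out which devices need service, which is where the reduction in exits would come from.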
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
Anthony Liguori wrote:
>
> I think we need to decide what we want to target in terms of upper
> limits.
>
> With a bridge or two, we can probably easily do 128.
>
> If we really want to push things, I think we should do a PCI based
> virtio controller. I doubt a large number of PCI devices is ever
> going to perform very well b/c of interrupt sharing and some of the
> assumptions in virtio_pci.
>
> If we implement a controller, we can use a single interrupt, but
> multiplex multiple notifications on that single interrupt. We can
> also be more aggressive about using shared memory instead of PCI
> config space which would reduce the overall number of exits.
>
> We could easily support a very large number of devices this way. But
> again, what do we want to target for now?

I think that for networking we should keep things as is. I don't see
anybody using 100 virtual NICs.

For mass storage, we should follow the SCSI model with a single device
serving multiple disks, similar to what you suggest. Not sure if the
device should have a single queue or one queue per disk.

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] [RFC] linuxboot Option ROM for Linux kernel booting
Nguyen Anh Quynh wrote:
> Hi,
>
> I am thinking about combining this ROM with the extboot. Both ROMs
> are about "booting", so I think that is reasonable. So we will have
> only 1 ROM that supports both external boot and Linux boot.
>
> Is that desirable or not?

Does it make the code simpler and easier to understand? If not, then I
would say no.

-hpa
Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.
On Tuesday 22 April 2008 06:22:48 Avi Kivity wrote:
> Rusty Russell wrote:
> > [Christian, Hollis, how much is this ABI breakage going to hurt you?]
> >
> > A recent proposed feature addition to the virtio block driver revealed
> > some flaws in the API, in particular how easy it is to break big
> > endian machines.
> >
> > The virtio config space was originally chosen to be little-endian,
> > because we thought the config might be part of the PCI config space
> > for virtio_pci. It's actually a separate mmio region, so that
> > argument holds little water; as only x86 is currently using the virtio
> > mechanism, we can change this (but must do so now, before the
> > impending s390 and ppc merges).
>
> This will probably annoy Hollis, who has guests that can go both ways.

Rusty and I have discussed it. Ultimately, this just takes us from a
cross-architecture endianness definition to a per-architecture
definition.

Anyways, we've already fallen into this situation with the virtio ring
data itself, so we're really saying "same endianness as the ring".

--
Hollis Blanchard
IBM Linux Technology Center
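A small sketch of what the two definitions mean for code that reads a 32-bit config field. The function names and the raw byte-array view of the config region are assumptions made for illustration; they are not the virtio API. Under the fixed little-endian definition the reader must assemble the value byte by byte (or byte-swap on big-endian guests); under the guest-endian definition a native-order copy is enough.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Fixed little-endian definition: build the value from individual
 * bytes, so the result is the same on little- and big-endian guests.
 */
static uint32_t cfg_read32_le(const uint8_t *cfg, size_t off)
{
    assert(cfg != NULL);
    return (uint32_t)cfg[off] |
           ((uint32_t)cfg[off + 1] << 8) |
           ((uint32_t)cfg[off + 2] << 16) |
           ((uint32_t)cfg[off + 3] << 24);
}

/*
 * Guest-endian definition: the bytes are already in the guest's native
 * order, so a plain copy yields the value with no conversion step.
 */
static uint32_t cfg_read32_native(const uint8_t *cfg, size_t off)
{
    uint32_t v;

    assert(cfg != NULL);
    memcpy(&v, cfg + off, sizeof(v));
    return v;
}
```

The second form is the one that is easy to get right on a big-endian guest, which is the flaw in the original definition that the proposal refers to.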
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
Marcelo Tosatti wrote:
>> Maybe require explicit device/function assignment on the command line?
>> It will be managed anyway.
>
> ACPI does support hotplugging of individual functions inside slots,
> not sure how well Linux (and other OSes) supports that.. should be
> transparent though.

I think we need to decide what we want to target in terms of upper
limits.

With a bridge or two, we can probably easily do 128.

If we really want to push things, I think we should do a PCI based
virtio controller. I doubt a large number of PCI devices is ever going
to perform very well b/c of interrupt sharing and some of the
assumptions in virtio_pci.

If we implement a controller, we can use a single interrupt, but
multiplex multiple notifications on that single interrupt. We can also
be more aggressive about using shared memory instead of PCI config
space which would reduce the overall number of exits.

We could easily support a very large number of devices this way. But
again, what do we want to target for now?

Regards,

Anthony Liguori
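One way to picture "a single interrupt, but multiplex multiple notifications" is a pending bitmap kept in memory shared between host and guest. The sketch below is purely illustrative: struct ctrl, the 64-device limit, and both functions are assumptions, not anything proposed in the thread or present in virtio_pci. The host sets a bit per device and raises the one interrupt only when nothing was pending; the guest's single handler collects all bits at once and dispatches.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define CTRL_MAX_DEVS 64   /* hypothetical limit: one bit per device */

struct ctrl {
    _Atomic uint64_t pending;   /* bit i set => device i has work */
    void (*handler[CTRL_MAX_DEVS])(unsigned int dev, void *opaque);
    void *opaque[CTRL_MAX_DEVS];
};

/*
 * Host side: mark a device as pending. Returns 1 if the single shared
 * interrupt should be raised (nothing was pending before), 0 if an
 * earlier, not yet serviced notification already covers it.
 */
static int ctrl_notify(struct ctrl *c, unsigned int dev)
{
    assert(dev < CTRL_MAX_DEVS);
    uint64_t old = atomic_fetch_or(&c->pending, UINT64_C(1) << dev);
    return old == 0;
}

/*
 * Guest side: the one interrupt handler. Takes and clears every pending
 * bit in a single atomic step, then calls each device's handler.
 * Returns the number of devices that had work.
 */
static unsigned int ctrl_irq(struct ctrl *c)
{
    uint64_t bits = atomic_exchange(&c->pending, 0);
    unsigned int serviced = 0;

    for (unsigned int dev = 0; dev < CTRL_MAX_DEVS; dev++) {
        if (!(bits & (UINT64_C(1) << dev)))
            continue;
        if (c->handler[dev])
            c->handler[dev](dev, c->opaque[dev]);
        serviced++;
    }
    return serviced;
}
```

Because the bitmap lives in ordinary memory, checking it costs no exit, which is the same motivation given above for preferring shared memory over PCI config space.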
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
On Tue, Apr 22, 2008 at 3:10 AM, Avi Kivity <[EMAIL PROTECTED]> wrote:
> I'm rooting for btrfs myself.

But could btrfs (when stable) work for migration?

I'm curious about OCFS2 performance on this kind of load... When I
manage to sell the idea of a KVM cluster, I'd like to know whether I
should first try EVMS-HA (cluster LVs) or OCFS (cluster FS).

--
Javier
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Avi Kivity wrote:
> >And video streaming on some embedded devices with no MMU! (Due to the
> >page cache heuristics working poorly with no MMU, sustained reliable
> >streaming is managed with O_DIRECT and the app managing cache itself
> >(like a database), and that needs AIO to keep the request queue busy.
> >At least, that's the theory.)
>
> Could use threads as well, no?

Perhaps. This raises another point about AIO vs. threads:

If I submit sequential O_DIRECT reads with aio_read(), will they enter
the device read queue in the same order, and reach the disk in that
order (allowing for reordering when worthwhile by the elevator)?

With threads this isn't guaranteed and scheduling makes it quite
likely to issue the parallel synchronous reads out of order, and for
them to reach the disk out of order because the elevator doesn't see
them simultaneously.

With AIO (non-Glibc! (and non-kthreads)) it might be better at
keeping the intended issue order, I'm not sure.

It is highly desirable: O_DIRECT streaming performance depends on
avoiding seeks (no reordering) and on keeping the request queue
non-empty (no gap).

I read a man page for some other unix, describing AIO as better than
threaded parallel reads for reading tape drives because of this (tape
seeks are very expensive). But the rest of the man page didn't say
anything more. Unfortunately I don't remember where I read it. I
have no idea whether AIO submission order is nearly always preserved
in general, or expected to be.

> It's me at fault here. I just assumed that because it's easy to do aio
> in a thread pool efficiently, that's what glibc does.
>
> Unfortunately the code does some ridiculous things like not service
> multiple requests on a single fd in parallel. I see absolutely no
> reason for it (the code says "fight for resources").

Ouch. Perhaps that relates to my thought above, about multiple
requests to the same file causing seek storms when thread scheduling
is unlucky?

> So my comments only apply to linux-aio vs a sane thread pool. Sorry for
> spreading confusion.

Thanks. I thought you'd measured it :-)

> It could and should. It probably doesn't.
>
> A simple thread pool implementation could come within 10% of Linux aio
> for most workloads. It will never be "exactly", but for small numbers
> of disks, close enough.

I would wait for benchmark results for I/O patterns like sequential
reading and writing, because of potential for seeks caused by request
reordering, before being confident of that.

> >Hmm. Thanks. I may consider switching to XFS now
>
> I'm rooting for btrfs myself.

In the unlikely event they backport btrfs to kernel 2.4.26-uc0, I'll
be happy to give it a try! :-)

-- Jamie
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Avi Kivity wrote:
> >Perhaps. This raises another point about AIO vs. threads:
> >
> >If I submit sequential O_DIRECT reads with aio_read(), will they enter
> >the device read queue in the same order, and reach the disk in that
> >order (allowing for reordering when worthwhile by the elevator)?
>
> Yes, unless the implementation in the kernel (or glibc) is threaded.
>
> >With threads this isn't guaranteed and scheduling makes it quite
> >likely to issue the parallel synchronous reads out of order, and for
> >them to reach the disk out of order because the elevator doesn't see
> >them simultaneously.
>
> If the disk is busy, it doesn't matter. The requests will queue and the
> elevator will sort them out. So it's just the first few requests that
> may get to disk out of order.

There are two cases where it matters to a read-streaming app:

1. Disk isn't busy with anything else, maximum streaming
   performance is desired.

2. Disk is busy with unrelated things, but you're using I/O
   priorities to give the streaming app near-absolute priority.
   Then you need to maintain overlapped streaming requests,
   otherwise disk is given to a lower priority I/O. If that
   happens often, you lose, priority is ineffective. Because one
   of the streaming requests is usually being serviced, the
   elevator has similar limitations as for a disk which is not
   busy with anything else.

> I haven't considered tape, but this is a good point indeed. I expect it
> doesn't make much of a difference for a loaded disk.

Yes, as long as it's loaded with unrelated requests at the same I/O
priority, the elevator has time to sort requests and hide thread
scheduling artifacts.

Btw, regarding QEMU: QEMU gets requests _after_ sorting by the guest's
elevator, then submits them to the host's elevator. If the guest and
host elevators are both configured 'anticipatory', do the anticipatory
delays add up?

-- Jamie
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
On Tue, Apr 22, 2008 at 05:51:51PM +0300, Avi Kivity wrote:
> Anthony Liguori wrote:
> > Avi Kivity wrote:
> >> Anthony Liguori wrote:
> >>
> >>> This patch changes virtio devices to be multi-function devices whenever
> >>> possible. This increases the number of virtio devices we can
> >>> support now by a factor of 8.
> >>>
> >>> With this patch, I've been able to launch a guest with either 220
> >>> disks or 220 network adapters.
> >>
> >> Does this play well with hotplug? Perhaps we need to allocate a new
> >> device on hotplug.
> >
> > Probably not. I imagine you can only hotplug devices, not individual
> > functions?
>
> It sounds reasonable to expect so. ACPI has objects for devices, not
> functions (IIRC).

So what I dislike about multifunction devices is the fact that a
single slot shares an IRQ, and that special code is required in the
QEMU drivers (virtio guest capability might not always be present).

I don't see any need for using them if we can extend PCI slots...

> Maybe require explicit device/function assignment on the command line?
> It will be managed anyway.

ACPI does support hotplugging of individual functions inside slots,
not sure how well Linux (and other OSes) supports that.. should be
transparent though.
Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12
Andrew,

Could we get direction/guidance from you as regards the
invalidate_page() callout of Andrea's patch set versus the
invalidate_range_start/invalidate_range_end callout pairs of
Christoph's patchset? This is only in the context of the __xip_unmap,
do_wp_page, page_mkclean_one, and try_to_unmap_one call sites.

On Tue, Apr 22, 2008 at 03:48:47PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 08:36:04AM -0500, Robin Holt wrote:
> > I am a little confused about the value of the seq_lock versus a simple
> > atomic, but I assumed there is a reason and left it at that.
>
> There's no value for anything but get_user_pages (get_user_pages takes
> its own lock internally though). I preferred to explain it as a
> seqlock because it was simpler for reading, but I totally agree in the
> final implementation it shouldn't be a seqlock. My code was meant to
> be pseudo-code only. It doesn't even need to be atomic ;).

Unless there is additional locking in your fault path, I think it does
need to be atomic.

> > I don't know what you mean by "it'd" run slower and what you mean by
> > "armed and disarmed".
>
> 1) when armed the time-window where the kvm-page-fault would be
> blocked would be a bit larger without invalidate_page for no good
> reason

But that is a distinction without a difference. In the _start/_end
case, kvm's fault handler will not have any _DIRECT_ blocking, but
get_user_pages() had certainly better block waiting for some other lock
to prevent the process's pages being refaulted.

I am no VM expert, but that seems like it is critical to having a
consistent virtual address space. Effectively, you have a delay on the
kvm fault handler beginning when either invalidate_page() is entered
or invalidate_range_start() is entered until when the _CALLER_ of the
invalidate* method has unlocked. That time will remain essentially
identical for either case. I would argue you would be hard pressed to
even measure the difference.

> 2) if you were to remove invalidate_page when disarmed the VM
> would need two branches instead of one in various places

Those branches are conditional upon there being list entries. That
check should be extremely cheap. The vast majority of cases will have
no registered notifiers. The second check for the _end callout will
be from cpu cache.

> I don't want to waste cycles if not wasting them improves performance
> both when armed and disarmed.

In summary, I think we have narrowed down the case of no registered
notifiers to being infinitesimal. The case of registered notifiers
is a distinction without a difference.

> > When I was discussing this difference with Jack, he reminded me that
> > the GRU, due to its hardware, does not have any race issues with the
> > invalidate_page callout simply doing the tlb shootdown and not modifying
> > any of its internal structures. He then put a caveat on the discussion
> > that _either_ method was acceptable as far as he was concerned. The real
> > issue is getting a patch in that satisfies all needs and not whether
> > there is a separate invalidate_page callout.
>
> Sure, we have that patch now, I'll send it out in a minute, I was just
> trying to explain why it makes sense to have an invalidate_page too
> (which remains the only difference by now), removing it would be a
> regression on all sides, even if a minor one.

I think GRU is the only compelling case I have heard for having the
invalidate_page separate. In the case of the GRU, the hardware enforces
a lifetime of the invalidate which covers all in-progress faults
including ones where the hardware is informed after the flush of a PTE.
In all cases, once the GRU invalidate instruction is issued, all active
requests are invalidated. Future faults will be blocked in
get_user_pages(). Without that special feature of the hardware, I don't
think any code simplification exists. I, of course, reserve the right
to be wrong.

I believe the argument against a separate invalidate_page() callout was
Christoph's interpretation of Andrew's comments. I am not certain
Andrew was aware of these special aspects of the GRU hardware and
whether that had been factored into the discussion at that point in
time.

Thanks,
Robin
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote:
>> Andrea Arcangeli wrote:
>>> +
>>> +static int mm_lock_cmp(const void *a, const void *b)
>>> +{
>>> +        cond_resched();
>>> +        if ((unsigned long)*(spinlock_t **)a <
>>> +            (unsigned long)*(spinlock_t **)b)
>>> +                return -1;
>>> +        else if (a == b)
>>> +                return 0;
>>> +        else
>>> +                return 1;
>>> +}
>>> +
>>
>> This compare function looks unusual...
>> It should work, but sort() could be faster if the
>> if (a == b) test had a chance to be true eventually...
>
> Hmm, are you saying my mm_lock_cmp won't return 0 if a==b?

You need to compare *a to *b (at least, that's what you're doing for
the < case).

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Avi Kivity wrote:
> Anthony Liguori wrote:
> >>If I submit sequential O_DIRECT reads with aio_read(), will they enter
> >>the device read queue in the same order, and reach the disk in that
> >>order (allowing for reordering when worthwhile by the elevator)?
> >
> >There's no guarantee that any sort of order will be preserved by AIO
> >requests. The same is true with writes. This is what fdsync is for,
> >to guarantee ordering.
>
> I believe he'd like a hint to get good scheduling, not a guarantee.
> With a thread pool if the threads are scheduled out of order, so are
> your requests.
> If the elevator doesn't plug the queue, the first few requests may
> not be optimally sorted.

That's right. Then they tend to settle to a good order. But any delay
in scheduling one of the threads, or a signal received by one of them,
can make it lose order briefly, making the streaming stutter as the
disk performs a few local seeks until it settles to good order again.

You can mitigate the disruption in various ways.

1. If all threads share an "offset" variable, and each reads and
   increments it atomically just prior to calling pread(), that
   helps especially at the start. (If threaded I/O is used for QEMU
   disk emulation, I would suggest doing that, in the more general
   form of popping a request from QEMU's internal shared queue at
   the last moment.)

2. Using more threads helps keep it sustained, at the cost of more
   wasted I/O when there's a cancellation (changed mind), and more
   memory.

However, AIO, in principle (if not implementations...) could be better
at keeping the suggested I/O order than threads, without special
tricks.

-- Jamie
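Mitigation 1 above can be sketched as follows. This is a minimal illustration, assuming C11 atomics and POSIX pread(); struct stream and both function names are hypothetical. A real O_DIRECT reader would also need the O_DIRECT open flag and suitably aligned buffers, which are left out to keep the sketch short. Each reader thread would call stream_read_next() in a loop: the offset is claimed at the last possible moment, so whichever thread is scheduled first always issues the lowest outstanding offset.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdatomic.h>
#include <sys/types.h>
#include <unistd.h>

struct stream {
    int fd;
    size_t chunk;        /* bytes per request */
    atomic_long next;    /* next unclaimed file offset (long for brevity) */
};

/* Open the file read-only and start streaming from offset 0. */
static int stream_open(struct stream *s, const char *path, size_t chunk)
{
    assert(chunk > 0);
    s->fd = open(path, O_RDONLY);
    if (s->fd < 0)
        return -1;
    s->chunk = chunk;
    atomic_init(&s->next, 0);
    return 0;
}

/*
 * Claim the next chunk atomically, then read it immediately. The claimed
 * offset is reported through *claimed; the return value is pread()'s
 * (bytes read, 0 at end of file, -1 on error).
 */
static ssize_t stream_read_next(struct stream *s, void *buf, off_t *claimed)
{
    long off = atomic_fetch_add(&s->next, (long)s->chunk);

    *claimed = (off_t)off;
    return pread(s->fd, buf, s->chunk, (off_t)off);
}
```

This narrows the window in which two threads can issue reads out of order to the gap between the atomic add and the pread() call; it does not close it, which matches the "helps especially at the start" wording above.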
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
On Tue, Apr 22, 2008 at 05:32:45PM +0300, Avi Kivity wrote:
> Anthony Liguori wrote:
> > This patch changes virtio devices to be multi-function devices whenever
> > possible. This increases the number of virtio devices we can support now by
> > a factor of 8.
> >
> > With this patch, I've been able to launch a guest with either 220 disks or
> > 220 network adapters.
>
> Does this play well with hotplug? Perhaps we need to allocate a new
> device on hotplug.
>
> (certainly if we have a device with one function, which then gets
> converted to a multifunction device)

Would have to change the hotplug code to handle functions...

It sounds less hacky to just extend the PCI slots instead of (ab)using
multiple functions per-slot.
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
Ryan Harper wrote:
> * Anthony Liguori <[EMAIL PROTECTED]> [2008-04-22 09:16]:
>
>> This patch changes virtio devices to be multi-function devices whenever
>> possible. This increases the number of virtio devices we can support now by
>> a factor of 8.
>>
>> With this patch, I've been able to launch a guest with either 220 disks or
>> 220 network adapters.
>
> Have you confirmed that the network devices show up? I was playing
> around with some of the limits last night and while it is easy to get
> QEMU to create the adapters, so far I've only had a guest see 29 pci
> nics (e1000).

Yup, I had an eth219

Regards,

Anthony Liguori
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote:
> Andrea Arcangeli wrote:
>> +
>> +static int mm_lock_cmp(const void *a, const void *b)
>> +{
>> +        cond_resched();
>> +        if ((unsigned long)*(spinlock_t **)a <
>> +            (unsigned long)*(spinlock_t **)b)
>> +                return -1;
>> +        else if (a == b)
>> +                return 0;
>> +        else
>> +                return 1;
>> +}
>> +
> This compare function looks unusual...
> It should work, but sort() could be faster if the
> if (a == b) test had a chance to be true eventually...

Hmm, are you saying my mm_lock_cmp won't return 0 if a==b?

> static int mm_lock_cmp(const void *a, const void *b)
> {
>         unsigned long la = (unsigned long)*(spinlock_t **)a;
>         unsigned long lb = (unsigned long)*(spinlock_t **)b;
>
>         cond_resched();
>         if (la < lb)
>                 return -1;
>         if (la > lb)
>                 return 1;
>         return 0;
> }

If your intent is to use the assumption that there are going to be few
equal entries, you should have used likely(la > lb) to signal it's
rarely going to return zero or gcc is likely free to do whatever it
wants with the above. Overall that function is such a slow path that
this is going to be lost in the noise. My suggestion would be to defer
microoptimizations like this until after 1/12 has been applied to
mainline.

Thanks!
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
* Anthony Liguori <[EMAIL PROTECTED]> [2008-04-22 09:16]:
> This patch changes virtio devices to be multi-function devices whenever
> possible. This increases the number of virtio devices we can support now by
> a factor of 8.
>
> With this patch, I've been able to launch a guest with either 220 disks or 220
> network adapters.

Have you confirmed that the network devices show up? I was playing
around with some of the limits last night and while it is easy to get
QEMU to create the adapters, so far I've only had a guest see 29 pci
nics (e1000).

--
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253 T/L: 678-9253
[EMAIL PROTECTED]
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Anthony Liguori wrote:
> >Perhaps. This raises another point about AIO vs. threads:
> >
> >If I submit sequential O_DIRECT reads with aio_read(), will they enter
> >the device read queue in the same order, and reach the disk in that
> >order (allowing for reordering when worthwhile by the elevator)?
>
> There's no guarantee that any sort of order will be preserved by AIO
> requests. The same is true with writes. This is what fdsync is for, to
> guarantee ordering.

You misunderstand. I'm not talking about guarantees, I'm talking
about expectations for the performance effect.

Basically, to do performant streaming read with O_DIRECT you need two
things:

1. Overlap at least 2 requests, so the device is kept busy.

2. Requests be sent to the disk in a good order, which is usually
   (but not always) sequential offset order.

The kernel does this itself with buffered reads, doing readahead.
It works very well, unless you have other problems caused by readahead.

With O_DIRECT, an application has to do the equivalent of readahead
itself to get performant streaming.

If the app uses two threads calling pread(), it's hard to ensure the
kernel even _sees_ the first two calls in sequential offset order.
You spawn two threads, and then both threads call pread() with
non-deterministic scheduling. The problem starts before even entering
the kernel.

Then, depending on I/O scheduling in the kernel, it might send the
less good pread() to the disk immediately, then later a backward head
seek and the other one. The elevator cannot fix this: it doesn't have
enough information, unless it adds artificial delays. But artificial
delays may harm too; it's not optimal.

After that, the two threads tend to call pread() in the best order
provided there's no scheduling conflicts, but are easily disrupted by
other tasks, especially on SMP (one reading thread per CPU, so when
one of them is descheduled, the other continues and issues a request
in the 'wrong' order.)

With AIO, even though you can't be sure what the kernel does, you can
be sure the kernel receives aio_read() calls in the exact order which
is most likely to perform well. Application knowledge of its access
pattern is passed along better.

As I've said, I saw a man page which described why this makes AIO
superior to using threads for reading tapes on that OS. So it's not a
completely spurious point.

This has nothing to do with guarantees.

-- Jamie
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
On Tue, Apr 22, 2008 at 4:15 PM, Anthony Liguori <[EMAIL PROTECTED]> wrote:
> This patch changes virtio devices to be multi-function devices whenever
> possible. This increases the number of virtio devices we can support now by
> a factor of 8.
[...]
> diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c
> index 9100bb1..9ea14d3 100644
> --- a/qemu/hw/virtio.c
> +++ b/qemu/hw/virtio.c
> @@ -405,9 +405,18 @@ VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,
>      PCIDevice *pci_dev;
>      uint8_t *config;
>      uint32_t size;
> +    static int devfn = 7;
> +
> +    if ((devfn % 8) == 7)
> +        devfn = -1;
> +    else
> +        devfn++;

This code looks strange...

devfn should be passed to virtio_init_pci by virtio-{net,blk} init
functions, no?

Luca
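For readers unfamiliar with devfn: it conventionally packs a 5-bit slot (device) number and a 3-bit function number into one byte, which is where the "factor of 8" comes from. The sketch below shows that encoding plus a caller-owned allocator of the kind Luca's comment points toward, in place of a function-local static. The helper names and struct devfn_alloc are hypothetical and are not QEMU's API.

```c
#include <assert.h>

/* Conventional PCI encoding: slot in bits 7..3, function in bits 2..0. */
static int devfn_make(unsigned int slot, unsigned int fn)
{
    assert(slot < 32 && fn < 8);
    return (int)((slot << 3) | fn);
}

static unsigned int devfn_slot(int devfn)
{
    return ((unsigned int)devfn >> 3) & 0x1f;
}

static unsigned int devfn_func(int devfn)
{
    return (unsigned int)devfn & 0x07;
}

/*
 * Hypothetical allocator state owned by the caller (for example, the
 * code that creates the virtio devices), so that the chosen devfn can
 * be passed down explicitly.
 */
struct devfn_alloc {
    int next;   /* next devfn to hand out */
    int last;   /* last devfn in the range, inclusive */
};

/* Returns the next devfn in the range, or -1 when the range is used up. */
static int devfn_alloc_next(struct devfn_alloc *a)
{
    if (a->next > a->last)
        return -1;
    return a->next++;
}
```

A caller could, for instance, initialise the range to all eight functions of one slot and pass each returned value to the device init function, which keeps the placement decision visible at the call site.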
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Anthony Liguori wrote:
>>
>> If I submit sequential O_DIRECT reads with aio_read(), will they enter
>> the device read queue in the same order, and reach the disk in that
>> order (allowing for reordering when worthwhile by the elevator)?
>
> There's no guarantee that any sort of order will be preserved by AIO
> requests. The same is true with writes. This is what fdsync is for,
> to guarantee ordering.

I believe he'd like a hint to get good scheduling, not a guarantee.
With a thread pool if the threads are scheduled out of order, so are
your requests.

If the elevator doesn't plug the queue, the first few requests may not
be optimally sorted.

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] [RFC] linuxboot Option ROM for Linux kernel booting
Nguyen Anh Quynh wrote:
> Hi,
>
> This should be submitted to upstream (but not to kvm-devel list), but
> this is only the test code that I want to quickly send out for
> comments. In case it looks OK, I will send it to upstream later.
>
> Inspired by extboot and conversations with Anthony and HPA, this
> linuxboot option ROM is a simple option ROM that intercepts int19 in
> order to execute linux setup code. This approach eliminates the need
> to manipulate the boot sector for this purpose.
>
> To test it, just load linux kernel with your KVM/QEMU image using
> -kernel option in normal way.
>
> I successfully compiled and tested it with kvm-66 on Ubuntu 7.10, guest
> Ubuntu 8.04.

For the next rounds, could you actually rebase against upstream QEMU
and submit to qemu-devel? One of Paul Brook's objections to extboot
had historically been that it was not easily sharable with other
architectures. With a C version, it seems more reasonable now to do
that.

Make sure you remove all the old linux boot code too within QEMU along
with the -hda checks.

Regards,

Anthony Liguori

> Thanks,
> Quynh
>
>
> # diffstat linuxboot1.diff
>  Makefile             |  13 -
>  linuxboot/Makefile   |  40 +++
>  linuxboot/boot.S     |  54 +
>  linuxboot/farvar.h   | 130 +++
>  linuxboot/rom.c      | 104
>  linuxboot/signrom    |binary
>  linuxboot/signrom.c  | 128 ++
>  linuxboot/util.h     |  69 +++
>  qemu/Makefile        |   3 -
>  qemu/Makefile.target |   2
>  qemu/hw/linuxboot.c  |  39 +++
>  qemu/hw/pc.c         |  22 +++-
>  qemu/hw/pc.h         |   5 +
>  13 files changed, 600 insertions(+), 9 deletions(-)
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Jamie Lokier wrote: > Avi Kivity wrote: > >>> And video streaming on some embedded devices with no MMU! (Due to the >>> page cache heuristics working poorly with no MMU, sustained reliable >>> streaming is managed with O_DIRECT and the app managing cache itself >>> (like a database), and that needs AIO to keep the request queue busy. >>> At least, that's the theory.) >>> >> Could use threads as well, no? >> > > Perhaps. This raises another point about AIO vs. threads: > > If I submit sequential O_DIRECT reads with aio_read(), will they enter > the device read queue in the same order, and reach the disk in that > order (allowing for reordering when worthwhile by the elevator)? > Yes, unless the implementation in the kernel (or glibc) is threaded. > With threads this isn't guaranteed and scheduling makes it quite > likely to issue the parallel synchronous reads out of order, and for > them to reach the disk out of order because the elevator doesn't see > them simultaneously. > If the disk is busy, it doesn't matter. The requests will queue and the elevator will sort them out. So it's just the first few requests that may get to disk out of order. > With AIO (non-Glibc! (and non-kthreads)) it might be better at > keeping the intended issue order, I'm not sure. > > It is highly desirable: O_DIRECT streaming performance depends on > avoiding seeks (no reordering) and on keeping the request queue > non-empty (no gap). > > I read a man page for some other unix, describing AIO as better than > threaded parallel reads for reading tape drives because of this (tape > seeks are very expensive). But the rest of the man page didn't say > anything more. Unfortunately I don't remember where I read it. I > have no idea whether AIO submission order is nearly always preserved > in general, or expected to be. > I haven't considered tape, but this is a good point indeed. I expect it doesn't make much of a difference for a loaded disk. > >> It's me at fault here. 
I just assumed that because it's easy to do aio >> in a thread pool efficiently, that's what glibc does. >> >> Unfortunately the code does some ridiculous things like not service >> multiple requests on a single fd in parallel. I see absolutely no >> reason for it (the code says "fight for resources"). >> > > Ouch. Perhaps that relates to my thought above, about multiple > requests to the same file causing seek storms when thread scheduling > is unlucky? > My first thought on seeing this is that it relates to a deficiency on older kernels servicing multiple requests on a single fd (i.e. a per-file lock). I don't know if such a deficiency ever existed, though. > >> It could and should. It probably doesn't. >> >> A simple thread pool implementation could come within 10% of Linux aio >> for most workloads. It will never be "exactly", but for small numbers >> of disks, close enough. >> > > I would wait for benchmark results for I/O patterns like sequential > reading and writing, because of potential for seeks caused by request > reordering, before being confident of that. > > I did have measurements (and a test rig) at a previous job (where I did a lot of I/O work); IIRC the performance of a tuned thread pool was not far behind aio, both for seeks and sequential. It was a while back though. -- error compiling committee.c: too many arguments to function - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
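The point about plugging can be illustrated with a toy model. The sketch below is not kernel code and does not use any real block-layer API; `dispatch`, `backward_seeks` and the batch-based "elevator" are hypothetical names invented for illustration. It assumes requests arrive out of order (as they might from a thread pool) and compares a queue that dispatches each request immediately with one that sees several requests at once and sorts them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Order two request offsets for qsort(). */
static int cmp_offset(const void *a, const void *b)
{
    long x = *(const long *)a;
    long y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Count how often the head would have to move backwards. */
static int backward_seeks(const long *off, size_t n)
{
    int seeks = 0;
    for (size_t i = 1; i < n; i++)
        if (off[i] < off[i - 1])
            seeks++;
    return seeks;
}

/*
 * Hypothetical toy elevator: requests arrive in the order given in `off`
 * and are drained `batch` at a time, each batch sorted by offset.
 * batch == 1 models an unplugged queue that dispatches every request as
 * it arrives; batch == n models a queue that saw all requests together.
 */
static int dispatch(long *off, size_t n, size_t batch)
{
    assert(batch > 0);
    for (size_t i = 0; i < n; i += batch) {
        size_t len = (n - i < batch) ? n - i : batch;
        qsort(off + i, len, sizeof(*off), cmp_offset);
    }
    return backward_seeks(off, n);
}
```

With the arrival order 2, 0, 1, 3, 5, 4, the unplugged case (`batch` of 1) leaves two backward seeks, while a batch covering all six requests leaves none. That is only the shape of the argument in the thread: once requests queue up, ordering at submission matters much less.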
Re: [kvm-devel] [Qemu-devel] Re: [RFC] linuxboot Option ROM for Linux kernel booting
On Tuesday 22 April 2008 at 08:50 -0500, Anthony Liguori wrote:
> Nguyen Anh Quynh wrote:
> > Hi,
> >
> > This should be submitted to upstream (but not to kvm-devel list), but
> > this is only the test code that I want to quickly send out for
> > comments. In case it looks OK, I will send it to upstream later.
> >
> > Inspired by extboot and conversations with Anthony and HPA, this
> > linuxboot option ROM is a simple option ROM that intercepts int19 in
> > order to execute linux setup code. This approach eliminates the need
> > to manipulate the boot sector for this purpose.
> >
> > To test it, just load linux kernel with your KVM/QEMU image using
> > -kernel option in normal way.
> >
> > I successfully compiled and tested it with kvm-66 on Ubuntu 7.10, guest
> > Ubuntu 8.04.
>
> For the next rounds, could you actually rebase against upstream QEMU and
> submit to qemu-devel?  One of Paul Brook's objections to extboot had
> historically been that it wasn't easily sharable with other
> architectures.  With a C version, it seems more reasonable now to do that.

Moreover, add a binary version of the ROM in the pc-bios directory: it
avoids needing a cross-compiler to build the ROM on non-x86 architectures.

Regards,
Laurent

> Make sure you remove all the old linux boot code too within QEMU along
> with the -hda checks.
>
> Regards,
>
> Anthony Liguori
>
> > Thanks,
> > Quynh
> >
> >
> > # diffstat linuxboot1.diff
> >  Makefile             |   13 -
> >  linuxboot/Makefile   |   40 +++
> >  linuxboot/boot.S     |   54 +
> >  linuxboot/farvar.h   |  130 +++
> >  linuxboot/rom.c      |  104
> >  linuxboot/signrom    |binary
> >  linuxboot/signrom.c  |  128 ++
> >  linuxboot/util.h     |   69 +++
> >  qemu/Makefile        |    3 -
> >  qemu/Makefile.target |    2
> >  qemu/hw/linuxboot.c  |   39 +++
> >  qemu/hw/pc.c         |   22 +++-
> >  qemu/hw/pc.h         |    5 +
> >  13 files changed, 600 insertions(+), 9 deletions(-)

--
[EMAIL PROTECTED]
"The best way to predict the future is to invent it."
- Alan Kay
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Jamie Lokier wrote:
> Avi Kivity wrote:
>>> And video streaming on some embedded devices with no MMU!  (Due to the
>>> page cache heuristics working poorly with no MMU, sustained reliable
>>> streaming is managed with O_DIRECT and the app managing cache itself
>>> (like a database), and that needs AIO to keep the request queue busy.
>>> At least, that's the theory.)
>>
>> Could use threads as well, no?
>
> Perhaps.  This raises another point about AIO vs. threads:
>
> If I submit sequential O_DIRECT reads with aio_read(), will they enter
> the device read queue in the same order, and reach the disk in that
> order (allowing for reordering when worthwhile by the elevator)?

There's no guarantee that any sort of order will be preserved by AIO
requests.  The same is true with writes.  This is what fdsync is for,
to guarantee ordering.

Regards,

Anthony Liguori
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
Anthony Liguori wrote:
> Avi Kivity wrote:
>> Anthony Liguori wrote:
>>
>>> This patch changes virtio devices to be multi-function devices whenever
>>> possible.  This increases the number of virtio devices we can
>>> support now by a factor of 8.
>>>
>>> With this patch, I've been able to launch a guest with either 220
>>> disks or 220 network adapters.
>>
>> Does this play well with hotplug?  Perhaps we need to allocate a new
>> device on hotplug.
>
> Probably not.  I imagine you can only hotplug devices, not individual
> functions?

It sounds reasonable to expect so.  ACPI has objects for devices, not
functions (IIRC).

Maybe require explicit device/function assignment on the command line?
It will be managed anyway.

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
Avi Kivity wrote:
> Anthony Liguori wrote:
>
>> This patch changes virtio devices to be multi-function devices whenever
>> possible.  This increases the number of virtio devices we can support now by
>> a factor of 8.
>>
>> With this patch, I've been able to launch a guest with either 220 disks or
>> 220 network adapters.
>
> Does this play well with hotplug?  Perhaps we need to allocate a new
> device on hotplug.

Probably not.  I imagine you can only hotplug devices, not individual
functions?

Regards,

Anthony Liguori

> (certainly if we have a device with one function, which then gets
> converted to a multifunction device)
Re: [kvm-devel] KVM: PIT: make last_injected_time per-guest
Marcelo Tosatti wrote:
> Otherwise multiple guests use the same variable and boom.
>
> Also use kvm_vcpu_kick() to make sure that if a timer triggers on
> a different CPU the event won't be missed.

Applied, thanks.

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.
On Tuesday 22 April 2008 17:44:08 Christian Borntraeger wrote:
> On Tuesday, 22 April 2008, Rusty Russell wrote:
> > [Christian, Hollis, how much is this ABI breakage going to hurt you?]
>
> It is ok for s390 at the moment. We are still working on making userspace
> ready and I plan to change the guest<->host for s390 anyway. I try to make
> these changes for drivers/s390/kvm/kvm_virtio.c before 2.6.26. The main
> reason is, that we are currently limited to around 80 devices. I am not
> sure, if I should change the allocation of the virtqueues and descriptors
> to guest memory as well.

Large rings require contiguous memory, which makes guest allocation
problematic.  512 elems at 4k pages == 5 pages.

> Back to your patch:
> I have still some ideas about virtio between little endian and big endian
> systems, but it requires more and different marshalling anyway - even on
> driver level. No idea yet how to solve that properly.

So far we've pushed such considerations onto the host.  This does mean that
you can't virtio connect two guests directly without understanding the
contents of the buffers so you can endian correct (eg. direct inter-guest
networking).

Inter-guest virtio is currently a party trick anyway, so I'm not sure it's a
real issue.

> > +	vb->vdev->config->get(vb->vdev,
> > +			      offsetof(struct virtio_balloon_config, num_pages),
> > +			      &v);
>
> this is missing a sizeof(v), no?

Ah... sure enough, I fixed that in a follow-on patch.  Well-spotted, thanks!

Cheers,
Rusty.
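The "512 elems at 4k pages == 5 pages" figure can be checked with a little arithmetic. The sketch below assumes the legacy split-ring layout (16-byte descriptors, 8-byte used elements, an available ring of two 16-bit header fields plus one 16-bit entry per element, and the used ring starting on the next page boundary). Those sizes are assumptions stated here, not something quoted in the thread, and `vring_bytes` and `vring_pages` are illustrative names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Element sizes assumed for the legacy split ring. */
enum {
    DESC_SIZE      = 16, /* u64 addr, u32 len, u16 flags, u16 next */
    USED_ELEM_SIZE = 8,  /* u32 id, u32 len */
};

/* Bytes needed for a ring of `num` entries; the used ring starts at the
 * next multiple of `align`, which must be a power of two. */
static size_t vring_bytes(size_t num, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    size_t desc  = (size_t)DESC_SIZE * num;
    size_t avail = sizeof(uint16_t) * (2 + num);  /* flags, idx, ring[num] */
    size_t used  = sizeof(uint16_t) * 2 + (size_t)USED_ELEM_SIZE * num;
    size_t first = (desc + avail + align - 1) & ~(align - 1);

    return first + used;
}

/* Number of contiguous pages that ring occupies. */
static size_t vring_pages(size_t num, size_t page_size)
{
    return (vring_bytes(num, page_size) + page_size - 1) / page_size;
}
```

Under these assumptions a 512-entry ring needs 8192 + 1028 bytes for the descriptor table and available ring, rounded up to 12288, plus 4100 bytes for the used ring: 16388 bytes, which is four bytes over four pages and therefore five contiguous 4k pages.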
Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.
On Tuesday 22 April 2008 21:22:48 Avi Kivity wrote:
> > The virtio config space was originally chosen to be little-endian,
> > because we thought the config might be part of the PCI config space
> > for virtio_pci.  It's actually a separate mmio region, so that
> > argument holds little water; as only x86 is currently using the virtio
> > mechanism, we can change this (but must do so now, before the
> > impending s390 and ppc merges).
>
> This will probably annoy Hollis, who has guests that can go both ways.

Yes, I discussed this with Hollis.  But the virtio rings themselves already
have this issue: we don't do any endian conversion on them and assume
they're "our" endian in the guest.

We may still regret not doing *everything* little-endian, but this doesn't
make it worse.

Thanks,
Rusty.
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
Anthony Liguori wrote:
> This patch changes virtio devices to be multi-function devices whenever
> possible.  This increases the number of virtio devices we can support now by
> a factor of 8.
>
> With this patch, I've been able to launch a guest with either 220 disks or 220
> network adapters.

Does this play well with hotplug?  Perhaps we need to allocate a new
device on hotplug.

(certainly if we have a device with one function, which then gets
converted to a multifunction device)

--
error compiling committee.c: too many arguments to function
[kvm-devel] KVM: PIT: make last_injected_time per-guest
Otherwise multiple guests use the same variable and boom.

Also use kvm_vcpu_kick() to make sure that if a timer triggers on
a different CPU the event won't be missed.

Signed-off-by: Marcelo Tosatti <[EMAIL PROTECTED]>
Tested-and-Acked-by: Alex Davis <[EMAIL PROTECTED]>

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 2852dd1..5697ad2 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -200,10 +200,8 @@ int __pit_timer_fn(struct kvm_kpit_state *ps)
 
 	atomic_inc(&pt->pending);
 	smp_mb__after_atomic_inc();
-	if (vcpu0 && waitqueue_active(&vcpu0->wq)) {
-		vcpu0->arch.mp_state = KVM_MP_STATE_RUNNABLE;
-		wake_up_interruptible(&vcpu0->wq);
-	}
+	if (vcpu0)
+		kvm_vcpu_kick(vcpu0);
 
 	pt->timer.expires = ktime_add_ns(pt->timer.expires, pt->period);
 	pt->scheduled = ktime_to_ns(pt->timer.expires);
@@ -572,7 +570,6 @@ void kvm_inject_pit_timer_irqs(struct kvm_vcpu *vcpu)
 	struct kvm_pit *pit = vcpu->kvm->arch.vpit;
 	struct kvm *kvm = vcpu->kvm;
 	struct kvm_kpit_state *ps;
-	static unsigned long last_injected_time;
 
 	if (vcpu && pit) {
 		ps = &pit->pit_state;
@@ -582,11 +579,11 @@ void kvm_inject_pit_timer_irqs(struct kvm_vcpu *vcpu)
 		 * 2. Last interrupt was accepted or waited for too long time*/
 		if (atomic_read(&ps->pit_timer.pending) &&
 		    (ps->inject_pending ||
-		    (jiffies - last_injected_time
+		    (jiffies - ps->last_injected_time
 				>= KVM_MAX_PIT_INTR_INTERVAL))) {
 			ps->inject_pending = 0;
 			__inject_pit_timer_intr(kvm);
-			last_injected_time = jiffies;
+			ps->last_injected_time = jiffies;
 		}
 	}
 }
diff --git a/arch/x86/kvm/i8254.h b/arch/x86/kvm/i8254.h
index e63ef38..db25c2a 100644
--- a/arch/x86/kvm/i8254.h
+++ b/arch/x86/kvm/i8254.h
@@ -35,6 +35,7 @@ struct kvm_kpit_state {
 	struct mutex lock;
 	struct kvm_pit *pit;
 	bool inject_pending; /* if inject pending interrupts */
+	unsigned long last_injected_time;
 };
 
 struct kvm_pit {
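The "multiple guests use the same variable and boom" problem is easy to reproduce outside the kernel. The following is a minimal user-space sketch, not the KVM code: `struct pit_state`, `should_inject_shared`, `should_inject` and `MAX_INTERVAL` are hypothetical stand-ins, and the timestamps are plain integers rather than jiffies. It contrasts a function-local `static` timestamp, which every caller shares, with a timestamp stored in per-guest state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_INTERVAL 10UL /* hypothetical stand-in for the real interval */

/* Per-guest timer state (illustrative subset). */
struct pit_state {
    bool inject_pending;
    unsigned long last_injected_time;
};

/* Problematic shape: the static local is shared by every guest. */
static bool should_inject_shared(unsigned long now)
{
    static unsigned long last_injected_time;

    if (now - last_injected_time >= MAX_INTERVAL) {
        last_injected_time = now;
        return true;
    }
    return false;
}

/* Fixed shape: the timestamp lives in the guest's own state. */
static bool should_inject(struct pit_state *ps, unsigned long now)
{
    assert(ps != NULL);

    if (ps->inject_pending ||
        now - ps->last_injected_time >= MAX_INTERVAL) {
        ps->inject_pending = false;
        ps->last_injected_time = now;
        return true;
    }
    return false;
}
```

In the shared variant, one guest injecting at time 100 suppresses an injection for a different guest at time 101, because both read the same timestamp. With per-guest state the two decisions are independent.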
[kvm-devel] [PATCH] Make virtio devices multi-function
This patch changes virtio devices to be multi-function devices whenever
possible.  This increases the number of virtio devices we can support now by
a factor of 8.

With this patch, I've been able to launch a guest with either 220 disks or 220
network adapters.

I haven't tested the Windows virtio drivers.

Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]>

diff --git a/qemu/hw/pci.h b/qemu/hw/pci.h
index 60e4094..df3a878 100644
--- a/qemu/hw/pci.h
+++ b/qemu/hw/pci.h
@@ -33,7 +33,7 @@ typedef struct PCIIORegion {
 #define PCI_ROM_SLOT 6
 #define PCI_NUM_REGIONS 7
 
-#define PCI_DEVICES_MAX 64
+#define PCI_DEVICES_MAX 256
 
 #define PCI_VENDOR_ID		0x00	/* 16 bits */
 #define PCI_DEVICE_ID		0x02	/* 16 bits */
diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c
index 9100bb1..9ea14d3 100644
--- a/qemu/hw/virtio.c
+++ b/qemu/hw/virtio.c
@@ -405,9 +405,18 @@ VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,
     PCIDevice *pci_dev;
     uint8_t *config;
     uint32_t size;
+    static int devfn = 7;
+
+    if ((devfn % 8) == 7)
+	devfn = -1;
+    else
+	devfn++;
 
     pci_dev = pci_register_device(bus, name, struct_size,
-				  -1, NULL, NULL);
+				  devfn, NULL, NULL);
+
+    devfn = pci_dev->devfn;
+
     vdev = to_virtio_device(pci_dev);
 
     vdev->status = 0;
@@ -435,6 +444,10 @@ VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,
 
     config[0x3d] = 1;
 
+    /* Mark device as multi-function */
+    if ((devfn % 8) == 0)
+	config[0x0e] |= 0x80;
+
     vdev->name = name;
     vdev->config_len = config_size;
     if (vdev->config_len)
diff --git a/qemu/net.h b/qemu/net.h
index 13daa27..3bada75 100644
--- a/qemu/net.h
+++ b/qemu/net.h
@@ -42,7 +42,7 @@ void net_client_uninit(NICInfo *nd);
 
 /* NIC info */
 
-#define MAX_NICS 8
+#define MAX_NICS 256
 
 struct NICInfo {
     uint8_t macaddr[6];
diff --git a/qemu/sysemu.h b/qemu/sysemu.h
index b645fb7..7992a77 100644
--- a/qemu/sysemu.h
+++ b/qemu/sysemu.h
@@ -151,7 +151,7 @@ typedef struct DriveInfo {
 
 #define MAX_IDE_DEVS	2
 #define MAX_SCSI_DEVS	7
-#define MAX_DRIVES 32
+#define MAX_DRIVES 256
 
 int nb_drives;
 DriveInfo drives_table[MAX_DRIVES+1];
diff --git a/qemu/vl.c b/qemu/vl.c
index 7dd0094..e203a4d 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -8754,7 +8754,7 @@ static BOOL WINAPI qemu_ctrl_handler(DWORD type)
 }
 #endif
 
-#define MAX_NET_CLIENTS 32
+#define MAX_NET_CLIENTS 512
 
 static int saved_argc;
 static char **saved_argv;
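For readers less familiar with PCI addressing, the "factor of 8" comes from the 3-bit function number packed into the low bits of `devfn`. The sketch below restates the patch's allocation rule in isolation. The helper names (`slot_of`, `func_of`, `next_devfn_request`, `mark_multifunction`) are hypothetical and are not QEMU functions; only the arithmetic and the `config[0x0e] |= 0x80` step are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

enum {
    FUNCS_PER_SLOT    = 8,    /* 3-bit function number per slot */
    HEADER_TYPE_OFF   = 0x0e, /* config offset the patch ORs 0x80 into */
    MULTIFUNCTION_BIT = 0x80,
};

/* Slot (device) number: upper 5 bits of an 8-bit devfn. */
static int slot_of(int devfn)
{
    assert(devfn >= 0 && devfn < 256);
    return devfn >> 3;
}

/* Function number: lower 3 bits of devfn. */
static int func_of(int devfn)
{
    assert(devfn >= 0 && devfn < 256);
    return devfn & 7;
}

/* Same rule as the patch: after function 7 ask for a fresh slot (-1),
 * otherwise request the next function in the current slot. */
static int next_devfn_request(int last_devfn)
{
    if ((last_devfn % FUNCS_PER_SLOT) == FUNCS_PER_SLOT - 1)
        return -1;
    return last_devfn + 1;
}

/* Function 0 advertises that the slot has more functions behind it. */
static void mark_multifunction(uint8_t *config, int devfn)
{
    if (func_of(devfn) == 0)
        config[HEADER_TYPE_OFF] |= MULTIFUNCTION_BIT;
}
```

Starting from the patch's initial value of 7, the first request is -1 (let the bus pick a slot). If the bus hands back devfn 24, that is slot 3 function 0, and the next seven virtio devices would request 25 through 31 before a new slot is asked for. Because eight devices then share one slot, hot add or remove at slot granularity would affect all of them, which is the concern raised in the replies.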
[kvm-devel] [PATCH 08 of 12] The conversion to a rwsem allows notifier callbacks during rmap traversal
# HG changeset patch # User Andrea Arcangeli <[EMAIL PROTECTED]> # Date 1208872187 -7200 # Node ID 6e04df1f4284689b1c46e57a67559abe49ecf292 # Parent 8965539f4d174c79bd37e58e8b037d5db906e219 The conversion to a rwsem allows notifier callbacks during rmap traversal for files. A rw style lock also allows concurrent walking of the reverse map so that multiple processors can expire pages in the same memory area of the same process. So it increases the potential concurrency. Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> diff --git a/Documentation/vm/locking b/Documentation/vm/locking --- a/Documentation/vm/locking +++ b/Documentation/vm/locking @@ -66,7 +66,7 @@ expand_stack(), it is hard to come up with a destructive scenario without having the vmlist protection in this case. -The page_table_lock nests with the inode i_mmap_lock and the kmem cache +The page_table_lock nests with the inode i_mmap_sem and the kmem cache c_spinlock spinlocks. This is okay, since the kmem code asks for pages after dropping c_spinlock. 
The page_table_lock also nests with pagecache_lock and pagemap_lru_lock spinlocks, and no code asks for memory with these locks diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c --- a/arch/x86/mm/hugetlbpage.c +++ b/arch/x86/mm/hugetlbpage.c @@ -69,7 +69,7 @@ if (!vma_shareable(vma, addr)) return; - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) { if (svma == vma) continue; @@ -94,7 +94,7 @@ put_page(virt_to_page(spte)); spin_unlock(&mm->page_table_lock); out: - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); } /* diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -454,10 +454,10 @@ pgoff = offset >> PAGE_SHIFT; i_size_write(inode, offset); - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); if (!prio_tree_empty(&mapping->i_mmap)) hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff); - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); truncate_hugepages(inode, offset); return 0; } diff --git a/fs/inode.c b/fs/inode.c --- a/fs/inode.c +++ b/fs/inode.c @@ -210,7 +210,7 @@ INIT_LIST_HEAD(&inode->i_devices); INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC); rwlock_init(&inode->i_data.tree_lock); - spin_lock_init(&inode->i_data.i_mmap_lock); + init_rwsem(&inode->i_data.i_mmap_sem); INIT_LIST_HEAD(&inode->i_data.private_list); spin_lock_init(&inode->i_data.private_lock); INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap); diff --git a/include/linux/fs.h b/include/linux/fs.h --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -503,7 +503,7 @@ unsigned inti_mmap_writable;/* count VM_SHARED mappings */ struct prio_tree_root i_mmap; /* tree of private and shared mappings */ struct list_headi_mmap_nonlinear;/*list VM_NONLINEAR mappings */ - spinlock_t i_mmap_lock;/* protect tree, count, list */ + struct rw_semaphore i_mmap_sem; /* protect tree, count, list 
*/ unsigned inttruncate_count; /* Cover race condition with truncate */ unsigned long nrpages;/* number of total pages */ pgoff_t writeback_index;/* writeback starts here */ diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -716,7 +716,7 @@ struct address_space *check_mapping;/* Check page->mapping if set */ pgoff_t first_index;/* Lowest page->index to unmap */ pgoff_t last_index; /* Highest page->index to unmap */ - spinlock_t *i_mmap_lock;/* For unmap_mapping_range: */ + struct rw_semaphore *i_mmap_sem;/* For unmap_mapping_range: */ unsigned long truncate_count; /* Compare vm_truncate_count */ }; diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -274,12 +274,12 @@ atomic_dec(&inode->i_writecount); /* insert tmp into the share list, just after mpnt */ - spin_lock(&file->f_mapping->i_mmap_lock); + down_write(&file->f_mapping->i_mmap_sem); tmp->vm_truncate_count = mpnt->vm_truncate_count; flush_dcache_mmap_lock(file->f_mapping); vma_prio_tree_add(tmp, mpnt); flush_dcache_mmap_unlock(file->f_mapping); - spin_unlock(&fi
[kvm-devel] [PATCH 09 of 12] Convert the anon_vma spinlock to a rw semaphore. This allows concurrent
# HG changeset patch
# User Andrea Arcangeli <[EMAIL PROTECTED]>
# Date 1208872187 -7200
# Node ID bdb3d928a0ba91cdce2b61bd40a2f80bddbe4ff2
# Parent  6e04df1f4284689b1c46e57a67559abe49ecf292
Convert the anon_vma spinlock to a rw semaphore. This allows concurrent
traversal of reverse maps for try_to_unmap() and page_mkclean(). It also
allows the calling of sleeping functions from reverse map traversal as
needed for the notifier callbacks. It increases possible concurrency.

Rcu is used in some context to guarantee the presence of the anon_vma
(try_to_unmap) while we acquire the anon_vma lock. We cannot take a
semaphore within an rcu critical section. Add a refcount to the anon_vma
structure which allows us to give an existence guarantee for the anon_vma
structure independent of the spinlock or the list contents.

The refcount can then be taken within the RCU section. If it has been
taken successfully then the refcount guarantees the existence of the
anon_vma. The refcount in anon_vma also allows us to fix a nasty
issue in page migration where we fudged by using rcu for a long code
path to guarantee the existence of the anon_vma. I think this is a bug
because the anon_vma may become empty and get scheduled to be freed
but then we increase the refcount again when the migration entries are
removed.

The refcount in general allows a shortening of RCU critical sections since
we can do a rcu_unlock after taking the refcount. This is particularly
relevant if the anon_vma chains contain hundreds of entries.

However:
- Atomic overhead increases in situations where a new reference
  to the anon_vma has to be established or removed. Overhead also increases
  when a speculative reference is used (try_to_unmap,
  page_mkclean, page migration).
- There is the potential for more frequent processor change due to up_xxx
  letting waiting tasks run first. This results in e.g. the Aim9 brk
  performance test going down by 10-15%.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> diff --git a/include/linux/rmap.h b/include/linux/rmap.h --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -25,7 +25,8 @@ * pointing to this anon_vma once its vma list is empty. */ struct anon_vma { - spinlock_t lock;/* Serialize access to vma list */ + atomic_t refcount; /* vmas on the list */ + struct rw_semaphore sem;/* Serialize access to vma list */ struct list_head head; /* List of private "related" vmas */ }; @@ -43,18 +44,31 @@ kmem_cache_free(anon_vma_cachep, anon_vma); } +struct anon_vma *grab_anon_vma(struct page *page); + +static inline void get_anon_vma(struct anon_vma *anon_vma) +{ + atomic_inc(&anon_vma->refcount); +} + +static inline void put_anon_vma(struct anon_vma *anon_vma) +{ + if (atomic_dec_and_test(&anon_vma->refcount)) + anon_vma_free(anon_vma); +} + static inline void anon_vma_lock(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) - spin_lock(&anon_vma->lock); + down_write(&anon_vma->sem); } static inline void anon_vma_unlock(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) - spin_unlock(&anon_vma->lock); + up_write(&anon_vma->sem); } /* diff --git a/mm/migrate.c b/mm/migrate.c --- a/mm/migrate.c +++ b/mm/migrate.c @@ -235,15 +235,16 @@ return; /* -* We hold the mmap_sem lock. So no need to call page_lock_anon_vma. +* We hold either the mmap_sem lock or a reference on the +* anon_vma. So no need to call page_lock_anon_vma. 
*/ anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON); - spin_lock(&anon_vma->lock); + down_read(&anon_vma->sem); list_for_each_entry(vma, &anon_vma->head, anon_vma_node) remove_migration_pte(vma, old, new); - spin_unlock(&anon_vma->lock); + up_read(&anon_vma->sem); } /* @@ -623,7 +624,7 @@ int rc = 0; int *result = NULL; struct page *newpage = get_new_page(page, private, &result); - int rcu_locked = 0; + struct anon_vma *anon_vma = NULL; int charge = 0; if (!newpage) @@ -647,16 +648,14 @@ } /* * By try_to_unmap(), page->mapcount goes down to 0 here. In this case, -* we cannot notice that anon_vma is freed while we migrates a page. +* we cannot notice that anon_vma is freed while we migrate a page. * This rcu_read_lock() delays freeing anon_vma pointer until the end * of migration. File cache pages are no problem because of page_lock() * File Caches may use write_page() or lock_page() in migration, then, * just care Anon page here. */ - if (PageAnon(page)) { - rcu_read_lock(); - rcu_locked = 1; - } + if (PageAnon(page)) +
[kvm-devel] [PATCH 03 of 12] get_task_mm should not succeed if mmput() is running and has reduced
# HG changeset patch
# User Andrea Arcangeli <[EMAIL PROTECTED]>
# Date 1208872186 -7200
# Node ID a6672bdeead0d41b2ebd6846f731d43a611645b7
# Parent  3c804dca25b15017b22008647783d6f5f3801fa9
get_task_mm should not succeed if mmput() is running and has reduced
the mm_users count to zero. This can occur if a processor follows
a task's pointer to an mm struct because that pointer is only cleared
after the mmput().

If get_task_mm() succeeds after mmput() reduced the mm_users to zero then
we have the lovely situation that one portion of the kernel is doing
all the teardown work for an mm while another portion is happily using
it.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -442,7 +442,8 @@
 		if (task->flags & PF_BORROWED_MM)
 			mm = NULL;
 		else
-			atomic_inc(&mm->mm_users);
+			if (!atomic_inc_not_zero(&mm->mm_users))
+				mm = NULL;
 	}
 	task_unlock(task);
 	return mm;
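The "increment unless already zero" primitive the patch relies on can be sketched in portable C11. This is an illustration of the technique, not the kernel's `atomic_inc_not_zero()` implementation; `ref_inc_not_zero` is a hypothetical name, and the use of `<stdatomic.h>` is an assumption made to keep the example self-contained.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Take a reference only if the count has not already dropped to zero.
 * A plain increment would "resurrect" an object whose teardown has
 * started; the compare-and-swap loop refuses to do that.
 */
static bool ref_inc_not_zero(atomic_int *count)
{
    assert(count != NULL);

    int old = atomic_load(count);
    while (old != 0) {
        /* On failure `old` is refreshed with the current value and
         * the loop re-checks it against zero before retrying. */
        if (atomic_compare_exchange_weak(count, &old, old + 1))
            return true;
    }
    return false;
}
```

A caller in the position of get_task_mm() would treat a `false` return the same way the patch does, by giving up and reporting that there is no usable object.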
[kvm-devel] [PATCH 00 of 12] mmu notifier #v13
Hello,

This is the latest and greatest version of the mmu notifier patch #v13.

Changes are mainly in the mm_lock that uses sort() suggested by Christoph.
This reduces the complexity from O(N**2) to O(N*log(N)).

I folded the mm_lock functionality together with the mmu-notifier-core 1/12
patch to make it self-contained. I recommend merging 1/12 into -mm/mainline
ASAP. Lack of mmu notifiers is holding off KVM development. We are going to
rework the way the pages are mapped and unmapped to work with pure pfn for pci
passthrough without the use of page pinning, and we can't without mmu
notifiers. This is not just a performance matter.

KVM/GRU and AFAICT Quadrics are all covered by applying the single 1/12 patch
that shall be shipped with 2.6.26. The risk of breakage by applying 1/12 is
zero, both when MMU_NOTIFIER=y and when it's =n, so it shouldn't be delayed
further.

XPMEM support comes with the later patches 2-12, risk for those patches is >0
and this is why the mmu-notifier-core is numbered 1/12 and not 12/12. Some are
simple and can go in immediately but not all are so simple.

2-12/12 are posted as usual for review by the VM developers and so Robin can
keep testing them on XPMEM and they can be merged later without any downside
(they're mostly orthogonal with 1/12).
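The cover letter only says that mm_lock now uses sort() and that this brings the cost down from O(N**2) to O(N*log(N)). One plausible reading, sketched below as an assumption rather than a description of the actual patch, is that a set of lock pointers which may contain duplicates is sorted by address so that duplicates become adjacent and can be skipped in a single pass, instead of comparing every lock against every other. `sort_unique_locks` and `cmp_ptr` are hypothetical names, and the C library's `qsort()` stands in for the kernel's `sort()`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Compare two lock pointers by address for qsort(). */
static int cmp_ptr(const void *a, const void *b)
{
    uintptr_t x = (uintptr_t)*(void *const *)a;
    uintptr_t y = (uintptr_t)*(void *const *)b;
    return (x > y) - (x < y);
}

/*
 * Sort lock pointers by address and drop duplicates in place.
 * Returns the number of distinct locks left at the front of the array.
 * Sorting costs O(N log N); the duplicate scan afterwards is O(N).
 */
static size_t sort_unique_locks(void **locks, size_t n)
{
    assert(locks != NULL || n == 0);

    size_t out = 0;
    qsort(locks, n, sizeof(*locks), cmp_ptr);
    for (size_t i = 0; i < n; i++) {
        if (out == 0 || locks[out - 1] != locks[i])
            locks[out++] = locks[i];
    }
    return out;
}
```

Taking the resulting locks in ascending address order also gives every caller the same acquisition order, which is a common way to avoid lock-order inversions when many locks of the same class must be held at once.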
[kvm-devel] [PATCH 07 of 12] Add a function to rw_semaphores to check if there are any processes
# HG changeset patch # User Andrea Arcangeli <[EMAIL PROTECTED]> # Date 1208872187 -7200 # Node ID 8965539f4d174c79bd37e58e8b037d5db906e219 # Parent fbce3fecb033eb3fba1d9c2398ac74401ce0ecb5 Add a function to rw_semaphores to check if there are any processes waiting for the semaphore. Add rwsem_needbreak to sched.h that works in the same way as spinlock_needbreak(). Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h --- a/include/linux/rwsem.h +++ b/include/linux/rwsem.h @@ -59,6 +59,8 @@ */ extern void downgrade_write(struct rw_semaphore *sem); +extern int rwsem_is_contended(struct rw_semaphore *sem); + #ifdef CONFIG_DEBUG_LOCK_ALLOC /* * nested locking. NOTE: rwsems are not allowed to recurse diff --git a/include/linux/sched.h b/include/linux/sched.h --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1984,6 +1984,15 @@ #endif } +static inline int rwsem_needbreak(struct rw_semaphore *sem) +{ +#ifdef CONFIG_PREEMPT + return rwsem_is_contended(sem); +#else + return 0; +#endif +} + /* * Reevaluate whether the task has signals pending delivery. * Wake the task if so. diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c --- a/lib/rwsem-spinlock.c +++ b/lib/rwsem-spinlock.c @@ -305,6 +305,18 @@ spin_unlock_irqrestore(&sem->wait_lock, flags); } +int rwsem_is_contended(struct rw_semaphore *sem) +{ + /* +* Racy check for an empty list. False positives or negatives +* would be okay. False positive may cause a useless dropping of +* locks. False negatives may cause locks to be held a bit +* longer until the next check. +*/ + return !list_empty(&sem->wait_list); +} + +EXPORT_SYMBOL(rwsem_is_contended); EXPORT_SYMBOL(__init_rwsem); EXPORT_SYMBOL(__down_read); EXPORT_SYMBOL(__down_read_trylock); diff --git a/lib/rwsem.c b/lib/rwsem.c --- a/lib/rwsem.c +++ b/lib/rwsem.c @@ -251,6 +251,18 @@ return sem; } +int rwsem_is_contended(struct rw_semaphore *sem) +{ + /* +* Racy check for an empty list. 
False positives or negatives +* would be okay. False positive may cause a useless dropping of +* locks. False negatives may cause locks to be held a bit +* longer until the next check. +*/ + return !list_empty(&sem->wait_list); +} + +EXPORT_SYMBOL(rwsem_is_contended); EXPORT_SYMBOL(rwsem_down_read_failed); EXPORT_SYMBOL(rwsem_down_write_failed); EXPORT_SYMBOL(rwsem_wake);
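A rough userspace sketch of how a caller might use a contention check like rwsem_is_contended() to bound lock hold times in a long loop. Nothing here is kernel code: struct model_rwsem, model_down(), model_up() and process_with_breaks() are hypothetical stand-ins invented for illustration, and contention is simulated with a counter rather than a real wait list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rw_semaphore. Only the number of
 * queued waiters is modeled; nothing actually blocks. */
struct model_rwsem {
    int held;     /* 1 while the "lock" is held */
    int waiters;  /* models !list_empty(&sem->wait_list) */
};

/* Same idea as the racy check in the patch: contended if anyone is queued. */
static int model_rwsem_is_contended(const struct model_rwsem *sem)
{
    return sem->waiters > 0;
}

static void model_down(struct model_rwsem *sem)
{
    assert(!sem->held);
    sem->held = 1;
}

static void model_up(struct model_rwsem *sem)
{
    assert(sem->held);
    sem->held = 0;
}

/* Process n items under the lock, dropping and retaking it whenever a
 * waiter is seen. waiter_at is the item index at which a simulated
 * waiter arrives (-1 for never). Returns how often the lock was dropped. */
static int process_with_breaks(struct model_rwsem *sem, int n, int waiter_at)
{
    int breaks = 0;

    model_down(sem);
    for (int i = 0; i < n; i++) {
        if (i == waiter_at)
            sem->waiters++;       /* simulate another task queueing up */

        /* ... one unit of work would go here ... */

        if (model_rwsem_is_contended(sem)) {
            model_up(sem);
            sem->waiters--;       /* the simulated waiter takes its turn */
            breaks++;
            model_down(sem);
        }
    }
    model_up(sem);
    return breaks;
}
```

As the comment in the patch notes, the check is racy by design: a stale answer only costs one unnecessary drop or a slightly longer hold, so no extra locking is needed around it.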
[kvm-devel] [PATCH 06 of 12] Move the tlb flushing inside of unmap vmas. This saves us from passing
# HG changeset patch # User Andrea Arcangeli <[EMAIL PROTECTED]> # Date 1208872186 -7200 # Node ID fbce3fecb033eb3fba1d9c2398ac74401ce0ecb5 # Parent ee8c0644d5f67c1ef59142cce91b0bb6f34a53e0 Move the tlb flushing inside of unmap vmas. This saves us from passing a pointer to the TLB structure around and simplifies the callers. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -723,8 +723,7 @@ struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t); unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *); -unsigned long unmap_vmas(struct mmu_gather **tlb, - struct vm_area_struct *start_vma, unsigned long start_addr, +unsigned long unmap_vmas(struct vm_area_struct *start_vma, unsigned long start_addr, unsigned long end_addr, unsigned long *nr_accounted, struct zap_details *); diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -804,7 +804,6 @@ /** * unmap_vmas - unmap a range of memory covered by a list of vma's - * @tlbp: address of the caller's struct mmu_gather * @vma: the starting vma * @start_addr: virtual address at which to start unmapping * @end_addr: virtual address at which to end unmapping @@ -816,20 +815,13 @@ * Unmap all pages in the vma list. * * We aim to not hold locks for too long (for scheduling latency reasons). - * So zap pages in ZAP_BLOCK_SIZE bytecounts. This means we need to - * return the ending mmu_gather to the caller. + * So zap pages in ZAP_BLOCK_SIZE bytecounts. * * Only addresses between `start' and `end' will be unmapped. * * The VMA list must be sorted in ascending virtual address order. - * - * unmap_vmas() assumes that the caller will flush the whole unmapped address - * range after unmap_vmas() returns. 
So the only responsibility here is to - * ensure that any thus-far unmapped pages are flushed before unmap_vmas() - * drops the lock and schedules. */ -unsigned long unmap_vmas(struct mmu_gather **tlbp, - struct vm_area_struct *vma, unsigned long start_addr, +unsigned long unmap_vmas(struct vm_area_struct *vma, unsigned long start_addr, unsigned long end_addr, unsigned long *nr_accounted, struct zap_details *details) { @@ -838,9 +830,14 @@ int tlb_start_valid = 0; unsigned long start = start_addr; spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL; - int fullmm = (*tlbp)->fullmm; + int fullmm; + struct mmu_gather *tlb; struct mm_struct *mm = vma->vm_mm; + lru_add_drain(); + tlb = tlb_gather_mmu(mm, 0); + update_hiwater_rss(mm); + fullmm = tlb->fullmm; mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) { unsigned long end; @@ -867,7 +864,7 @@ (HPAGE_SIZE / PAGE_SIZE); start = end; } else - start = unmap_page_range(*tlbp, vma, + start = unmap_page_range(tlb, vma, start, end, &zap_work, details); if (zap_work > 0) { @@ -875,22 +872,23 @@ break; } - tlb_finish_mmu(*tlbp, tlb_start, start); + tlb_finish_mmu(tlb, tlb_start, start); if (need_resched() || (i_mmap_lock && spin_needbreak(i_mmap_lock))) { if (i_mmap_lock) { - *tlbp = NULL; + tlb = NULL; goto out; } cond_resched(); } - *tlbp = tlb_gather_mmu(vma->vm_mm, fullmm); + tlb = tlb_gather_mmu(vma->vm_mm, fullmm); tlb_start_valid = 0; zap_work = ZAP_BLOCK_SIZE; } } + tlb_finish_mmu(tlb, start_addr, end_addr); out: mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); return start; /* which is now the end (or restart) address */ @@ -906,18 +904,10 @@ unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *details) { - struct mm_struct *mm = vma->vm_mm; - struct mmu_gather *tlb; unsigned long end = address + size; unsigned long nr_accounted = 0; - lru_add_drain(); - 
tlb = tlb_gather_mmu(mm, 0);
[kvm-devel] [PATCH 11 of 12] XPMEM would have used sys_madvise() except that madvise_dontneed()
# HG changeset patch # User Andrea Arcangeli <[EMAIL PROTECTED]> # Date 1208872187 -7200 # Node ID 128d705f38c8a774ac11559db445787ce6e91c77 # Parent f8210c45f1c6f8b38d15e5dfebbc5f7c1f890c93 XPMEM would have used sys_madvise() except that madvise_dontneed() returns an -EINVAL if VM_PFNMAP is set, which is always true for the pages XPMEM imports from other partitions and is also true for uncached pages allocated locally via the mspec allocator. XPMEM needs zap_page_range() functionality for these types of pages as well as 'normal' pages. Signed-off-by: Dean Nelson <[EMAIL PROTECTED]> diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -909,6 +909,7 @@ return unmap_vmas(vma, address, end, &nr_accounted, details); } +EXPORT_SYMBOL_GPL(zap_page_range); /* * Do a quick page-table lookup for a single page.
[kvm-devel] [PATCH 12 of 12] This patch adds a lock ordering rule to avoid a potential deadlock when
# HG changeset patch # User Andrea Arcangeli <[EMAIL PROTECTED]> # Date 1208872187 -7200 # Node ID e847039ee2e815088661933b7195584847dc7540 # Parent 128d705f38c8a774ac11559db445787ce6e91c77 This patch adds a lock ordering rule to avoid a potential deadlock when multiple mmap_sems need to be locked. Signed-off-by: Dean Nelson <[EMAIL PROTECTED]> diff --git a/mm/filemap.c b/mm/filemap.c --- a/mm/filemap.c +++ b/mm/filemap.c @@ -79,6 +79,9 @@ * * ->i_mutex (generic_file_buffered_write) *->mmap_sem (fault_in_pages_readable->do_page_fault) + * + *When taking multiple mmap_sems, one should lock the lowest-addressed + *one first proceeding on up to the highest-addressed one. * * ->i_mutex *->i_alloc_sem (various)
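The ordering rule above (lowest-addressed lock first) can be sketched in plain userspace C. This is only an illustration under stated assumptions: struct model_lock, cmp_lock_addr() and lock_all_ordered() are hypothetical names, the "lock" is a flag rather than a real mmap_sem, and skipping duplicate pointers is an extra step added here so that the same lock is never taken twice.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a lock; "held" is just a flag. */
struct model_lock {
    int held;
};

/* Order two array slots by the address of the lock each points to. */
static int cmp_lock_addr(const void *a, const void *b)
{
    uintptr_t x = (uintptr_t)*(struct model_lock *const *)a;
    uintptr_t y = (uintptr_t)*(struct model_lock *const *)b;

    return (x > y) - (x < y);
}

/* Sort the array by address, then take each distinct lock exactly once,
 * lowest address first. Returns the number of distinct locks taken. */
static size_t lock_all_ordered(struct model_lock **locks, size_t nr)
{
    struct model_lock *last = NULL;
    size_t taken = 0;

    qsort(locks, nr, sizeof(*locks), cmp_lock_addr);

    for (size_t i = 0; i < nr; i++) {
        if (locks[i] == last)
            continue;             /* same lock listed more than once */
        /* Sorted input means addresses only ever increase. */
        assert((uintptr_t)last < (uintptr_t)locks[i]);
        assert(!locks[i]->held);
        locks[i]->held = 1;
        last = locks[i];
        taken++;
    }
    return taken;
}
```

If every path that needs several of these locks goes through the same sort-then-lock routine, no two paths can take a pair of locks in opposite orders, which is the deadlock the rule is meant to rule out.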
[kvm-devel] [PATCH 05 of 12] Move the tlb flushing into free_pgtables. The conversion of the locks
# HG changeset patch # User Andrea Arcangeli <[EMAIL PROTECTED]> # Date 1208872186 -7200 # Node ID ee8c0644d5f67c1ef59142cce91b0bb6f34a53e0 # Parent ac9bb1fb3de2aa5d27210a28edf24f6577094076 Move the tlb flushing into free_pgtables. The conversion of the locks taken for reverse map scanning would require taking sleeping locks in free_pgtables() and we cannot sleep while gathering pages for a tlb flush. Move the tlb_gather/tlb_finish call to free_pgtables() to be done for each vma. This may add a number of tlb flushes depending on the number of vmas that cannot be coalesced into one. The first pointer argument to free_pgtables() can then be dropped. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -751,8 +751,8 @@ void *private); void free_pgd_range(struct mmu_gather **tlb, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling); -void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma, - unsigned long floor, unsigned long ceiling); +void free_pgtables(struct vm_area_struct *start_vma, unsigned long floor, + unsigned long ceiling); int copy_page_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma); void unmap_mapping_range(struct address_space *mapping, diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -272,9 +272,11 @@ } while (pgd++, addr = next, addr != end); } -void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma, - unsigned long floor, unsigned long ceiling) +void free_pgtables(struct vm_area_struct *vma, unsigned long floor, + unsigned long ceiling) { + struct mmu_gather *tlb; + while (vma) { struct vm_area_struct *next = vma->vm_next; unsigned long addr = vma->vm_start; @@ -286,7 +288,8 @@ unlink_file_vma(vma); if (is_vm_hugetlb_page(vma)) { - hugetlb_free_pgd_range(tlb, addr, vma->vm_end, + tlb = tlb_gather_mmu(vma->vm_mm, 0); + 
hugetlb_free_pgd_range(&tlb, addr, vma->vm_end, floor, next? next->vm_start: ceiling); } else { /* @@ -299,9 +302,11 @@ anon_vma_unlink(vma); unlink_file_vma(vma); } - free_pgd_range(tlb, addr, vma->vm_end, + tlb = tlb_gather_mmu(vma->vm_mm, 0); + free_pgd_range(&tlb, addr, vma->vm_end, floor, next? next->vm_start: ceiling); } + tlb_finish_mmu(tlb, addr, vma->vm_end); vma = next; } } diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1752,9 +1752,9 @@ update_hiwater_rss(mm); unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); - free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS, + tlb_finish_mmu(tlb, start, end); + free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS, next? next->vm_start: 0); - tlb_finish_mmu(tlb, start, end); } /* @@ -2050,8 +2050,8 @@ /* Use -1 here to ensure all VMAs in the mm are unmapped */ end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); - free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0); tlb_finish_mmu(tlb, 0, end); + free_pgtables(vma, FIRST_USER_ADDRESS, 0); /* * Walk the list again, actually closing and freeing it,
[kvm-devel] [PATCH 10 of 12] Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock
# HG changeset patch # User Andrea Arcangeli <[EMAIL PROTECTED]> # Date 1208872187 -7200 # Node ID f8210c45f1c6f8b38d15e5dfebbc5f7c1f890c93 # Parent bdb3d928a0ba91cdce2b61bd40a2f80bddbe4ff2 Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock conversion. Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]> diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1062,10 +1062,10 @@ * mm_lock and mm_unlock are expensive operations that may take a long time. */ struct mm_lock_data { - spinlock_t **i_mmap_locks; - spinlock_t **anon_vma_locks; - size_t nr_i_mmap_locks; - size_t nr_anon_vma_locks; + struct rw_semaphore **i_mmap_sems; + struct rw_semaphore **anon_vma_sems; + size_t nr_i_mmap_sems; + size_t nr_anon_vma_sems; }; extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data); extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2243,8 +2243,8 @@ static int mm_lock_cmp(const void *a, const void *b) { cond_resched(); - if ((unsigned long)*(spinlock_t **)a < - (unsigned long)*(spinlock_t **)b) + if ((unsigned long)*(struct rw_semaphore **)a < + (unsigned long)*(struct rw_semaphore **)b) return -1; else if (a == b) return 0; @@ -2252,7 +2252,7 @@ return 1; } -static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks, +static unsigned long mm_lock_sort(struct mm_struct *mm, struct rw_semaphore **sems, int anon) { struct vm_area_struct *vma; @@ -2261,59 +2261,59 @@ for (vma = mm->mmap; vma; vma = vma->vm_next) { if (anon) { if (vma->anon_vma) - locks[i++] = &vma->anon_vma->lock; + sems[i++] = &vma->anon_vma->sem; } else { if (vma->vm_file && vma->vm_file->f_mapping) - locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock; + sems[i++] = &vma->vm_file->f_mapping->i_mmap_sem; } } if (!i) goto out; - sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL); + sort(sems, i, sizeof(struct 
rw_semaphore *), mm_lock_cmp, NULL); out: return i; } static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm, - spinlock_t **locks) + struct rw_semaphore **sems) { - return mm_lock_sort(mm, locks, 1); + return mm_lock_sort(mm, sems, 1); } static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm, - spinlock_t **locks) + struct rw_semaphore **sems) { - return mm_lock_sort(mm, locks, 0); + return mm_lock_sort(mm, sems, 0); } -static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock) +static void mm_lock_unlock(struct rw_semaphore **sems, size_t nr, int lock) { - spinlock_t *last = NULL; + struct rw_semaphore *last = NULL; size_t i; for (i = 0; i < nr; i++) /* Multiple vmas may use the same lock. */ - if (locks[i] != last) { - BUG_ON((unsigned long) last > (unsigned long) locks[i]); - last = locks[i]; + if (sems[i] != last) { + BUG_ON((unsigned long) last > (unsigned long) sems[i]); + last = sems[i]; if (lock) - spin_lock(last); + down_write(last); else - spin_unlock(last); + up_write(last); } } -static inline void __mm_lock(spinlock_t **locks, size_t nr) +static inline void __mm_lock(struct rw_semaphore **sems, size_t nr) { - mm_lock_unlock(locks, nr, 1); + mm_lock_unlock(sems, nr, 1); } -static inline void __mm_unlock(spinlock_t **locks, size_t nr) +static inline void __mm_unlock(struct rw_semaphore **sems, size_t nr) { - mm_lock_unlock(locks, nr, 0); + mm_lock_unlock(sems, nr, 0); } /* @@ -2325,57 +2325,57 @@ */ int mm_lock(struct mm_struct *mm, struct mm_lock_data *data) { - spinlock_t **anon_vma_locks, **i_mmap_locks; + struct rw_semaphore **anon_vma_sems, **i_mmap_sems; down_write(&mm->mmap_sem); if (mm->map_count) { - anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); - if (unlikely(!anon_vma_locks)) { + anon_vma_sems = vmalloc(sizeof(struct rw_semaph