[PATCH] Use kvm_on_each_cpu in preempt.c

2008-11-07 Thread Alexander Graf
Compat modules need to use kvm_on_each_cpu instead of on_each_cpu
in order to be compatible across various kernel versions.

Unfortunately preempt.c uses on_each_cpu. This patch changes preempt.c
to also use the kvm version, so it is compatible with the rest of the
external module build.

This fixes building on newer kernels without preempt notifiers.
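
For context, a wrapper along the following lines (a minimal sketch, not the
actual kvm external-module code) is the kind of compat helper this refers to:
on_each_cpu() dropped its "retry" argument around 2.6.27, and kvm_on_each_cpu()
hides that difference from callers such as preempt.c.

#include <linux/version.h>
#include <linux/smp.h>

static inline int kvm_on_each_cpu(void (*func)(void *info), void *info,
				  int wait)
{
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2, 6, 27)
	/* newer kernels: on_each_cpu(func, info, wait) */
	return on_each_cpu(func, info, wait);
#else
	/* older kernels still take the (unused here) "retry" argument */
	return on_each_cpu(func, info, 0 /* retry */, wait);
#endif
}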

Signed-off-by: Alexander Graf <[EMAIL PROTECTED]>
---
 kernel/x86/preempt.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/x86/preempt.c b/kernel/x86/preempt.c
index a6b69ec..9e4bd2c 100644
--- a/kernel/x86/preempt.c
+++ b/kernel/x86/preempt.c
@@ -251,7 +251,7 @@ void preempt_notifier_sys_exit(void)
struct idt_desc idt_desc;
 
dprintk("\n");
-   on_each_cpu(do_disable, NULL, 1, 1);
+   kvm_on_each_cpu(do_disable, NULL, 1);
asm ("sidt %0" : "=m"(idt_desc));
idt_desc.gates[1] = orig_int1_gate;
 }
-- 
1.5.3.1



Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

2008-11-07 Thread Yu Zhao

Greg KH wrote:

On Sat, Nov 08, 2008 at 01:00:29PM +0800, Yu Zhao wrote:

Greg KH wrote:

On Fri, Nov 07, 2008 at 04:35:47PM +0800, Zhao, Yu wrote:

Greg KH wrote:

On Fri, Nov 07, 2008 at 04:17:02PM +0800, Zhao, Yu wrote:

Well, to do it "correctly" you are going to have to tell the driver to
shut itself down, and reinitialize itself.
Turns out, that doesn't really work for disk and network devices without
dropping the connection (well, network devices should be fine probably).

So you just can't do this, sorry.  That's why the BIOS handles all of
these issues in a PCI hotplug system.
How does the hardware people think we are going to handle this in the
OS?  It's not something that any operating system can do, is it part of
the IOV PCI spec somewhere?

No, it's not part of the PCI IOV spec.

I just want the IOV (and the whole PCI subsystem) to have more flexibility on 
various BIOSes. So can we reconsider resource rebalancing as a boot 
option, or should we forget about this idea?

As you have proposed it, the boot option will not work at all, so I
think we need to forget about it.  Especially if it is not really
needed.
I guess at least one thing would work if people don't want to boot twice: 
give the bus number 0 as rebalance starting point, then all system 
resources would be reshuffled :-)

Hm, but don't we do that today with our basic resource reservation logic
at boot time?  What would be different about this kind of proposal?
The generic PCI core can do this, but the feature is effectively disabled by 
the low-level PCI code on x86. The low-level code tries to reserve resources 
according to the configuration from the BIOS. If the BIOS is wrong, the allocation 
fails and the generic PCI core can't repair it, because the bridge 
resources may already have been allocated by the low-level PCI code and the PCI core 
can't expand them to find enough resources for the subordinates.


Yes, we do this on purpose.

The proposal is to stop the x86 low-level PCI code from allocating resources 
according to the BIOS, so the PCI core can fully control resource allocation. 
The PCI core then takes all the resources from the BARs it knows about into 
account and configures the resource windows on the bridges according to its 
own calculation.


Ah, so you mean we should revert back to the way we used to do x86 PCI
resource allocation from about a year and a half ago to about 8 years
ago?

Hint, there was a reason why we switched over to using the BIOS instead
of doing it ourselves.  Turns out we have to trust the BIOS here, as
that is exactly what other operating systems do.  Trying to do it on our
own was too fragile and resulted in too many problems over time.

Go look at the archives for when this all was switched, you'll see the
reasons why.

So no, we will not be going back to the way we used to do things, we
changed for a reason :)


So it's really a long story, and I'm glad to see the reason.

Actually there was no such thing in the early SR-IOV patches, but months ago 
I heard some complaints that pushed me to do this kind of reversal. Looks 
like I'll have to redirect these complaints to the BIOS people from now on :-)


Regards,
Yu


Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

2008-11-07 Thread Greg KH
On Sat, Nov 08, 2008 at 01:50:20PM +0800, freevanx wrote:
> Dear all,
> 
> I'm glad to hear this. In fact, I develop for the BIOS area. This feature
> is very useful when your system has one or more PCI/PCIe hotplug slots.
> Generally, the BIOS reserves an amount of resources for empty hotplug slots by
> default, but it is not always enough for every device. We have many kinds of
> Express Modules which consume different amounts of resources; generally we
> reserve a small amount of resources for this, so sometimes an Express
> Module hotplugged into a slot that had no card installed at boot time will not be usable.

Then fix the BIOS :)

Seriously, that is what the PCI hotplug spec says to do, right?

> Then, Microsoft say they implemented PCI Multi-Level Resource Rebalancing in
> Vista and Server 2008; you can refer to
> http://www.microsoft.com/whdc/archive/multilevel-rebal.mspx
> http://www.microsoft.com/whdc/connect/pci/PCI-rsc.mspx

But they did not implement this for Vista, and pulled it before it
shipped, right?  That is what the driver development documentation for
Vista said that I read.

Do you know if they are going to add it back for Windows 7?  If so, then
we should probably look into this, otherwise, no need to, as the BIOSes
will be fixed properly.

> They use an ACPI method to tell the OS that it can ignore the resource
> allocation of PCI devices below the bridge. I think this is more useful than
> specifying the bus number to ignore resource allocation, because the bus number
> often changes due to BIOS needs or new PCI/PCIe devices added to the
> system. Users generally do not know the system architecture and cannot
> specify the bus number of the root bridge, while specifying the _DSM
> method, as Microsoft does, on the root bridge of the hotplug slot is a much
> easier thing for BIOS writers to achieve.

Yes, push the burden of getting this right onto the OS developers,
instead of doing it properly in the BIOS, how fun :(

Seriously, it isn't that hard to reserve enough space on most machines
in the BIOS to get this correct.  It only gets messy when you have
hundreds of hotplug PCI slots and bridges.  Even then, the BIOS writers
have been able to resolve this for a while due to this kind of hardware
shipping successfully with Linux for many years now.

> PS:
> Since my mail address was blocked by the mailing list, this mail may not reach
> people who are only on the linux-kernel mailing list.

It is being blocked because you are sending out html email.

Please reconfigure your gmail client to not do that, and your mail will
go through just fine.

thanks,

greg k-h


Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

2008-11-07 Thread Greg KH
On Sat, Nov 08, 2008 at 01:00:29PM +0800, Yu Zhao wrote:
> Greg KH wrote:
>> On Fri, Nov 07, 2008 at 04:35:47PM +0800, Zhao, Yu wrote:
>>> Greg KH wrote:
 On Fri, Nov 07, 2008 at 04:17:02PM +0800, Zhao, Yu wrote:
>> Well, to do it "correctly" you are going to have to tell the driver to
>> shut itself down, and reinitialize itself.
>> Turns out, that doesn't really work for disk and network devices 
>> without
>> dropping the connection (well, network devices should be fine 
>> probably).
>> So you just can't do this, sorry.  That's why the BIOS handles all of
>> these issues in a PCI hotplug system.
>> How does the hardware people think we are going to handle this in the
>> OS?  It's not something that any operating system can do, is it part 
>> of
>> the IOV PCI spec somewhere?
> No, it's not part of the PCI IOV spec.
>
> I just want the IOV (and whole PCI subsystem) have more flexibility on 
> various BIOSes. So can we reconsider about resource rebalance as boot 
> option, or should we forget about this idea?
 As you have proposed it, the boot option will not work at all, so I
 think we need to forget about it.  Especially if it is not really
 needed.
>>> I guess at least one thing would work if people don't want to boot twice: 
>>> give the bus number 0 as rebalance starting point, then all system 
>>> resources would be reshuffled :-)
>> Hm, but don't we do that today with our basic resource reservation logic
>> at boot time?  What would be different about this kind of proposal?
>
> The generic PCI core can do this but this feature is kind of disabled by 
> low level PCI code in x86. The low level code tries to reserve resource 
> according to configuration from BIOS. If the BIOS is wrong, the allocation 
> would fail and the generic PCI core couldn't repair it because the bridge 
> resources may have been allocated by the PCI low level and the PCI core 
> can't expand them to find enough resource for the subordinates.

Yes, we do this on purpose.

> The proposal is to disable x86 PCI low level to allocation resources 
> according to BIOS so PCI core can fully control the resource allocation. 
> The PCI core takes all resources from BARs it knows into account and 
> configure the resource windows on the bridges according to its own 
> calculation.

Ah, so you mean we should revert back to the way we used to do x86 PCI
resource allocation from about a year and a half ago to about 8 years
ago?

Hint, there was a reason why we switched over to using the BIOS instead
of doing it ourselves.  Turns out we have to trust the BIOS here, as
that is exactly what other operating systems do.  Trying to do it on our
own was too fragile and resulted in too many problems over time.

Go look at the archives for when this all was switched, you'll see the
reasons why.

So no, we will not be going back to the way we used to do things, we
changed for a reason :)

thanks,

greg k-h


Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

2008-11-07 Thread Yu Zhao

Greg KH wrote:

On Fri, Nov 07, 2008 at 04:35:47PM +0800, Zhao, Yu wrote:

Greg KH wrote:

On Fri, Nov 07, 2008 at 04:17:02PM +0800, Zhao, Yu wrote:

Well, to do it "correctly" you are going to have to tell the driver to
shut itself down, and reinitialize itself.
Turns out, that doesn't really work for disk and network devices without
dropping the connection (well, network devices should be fine probably).
So you just can't do this, sorry.  That's why the BIOS handles all of
these issues in a PCI hotplug system.
How does the hardware people think we are going to handle this in the
OS?  It's not something that any operating system can do, is it part of
the IOV PCI spec somewhere?

No, it's not part of the PCI IOV spec.

I just want the IOV (and whole PCI subsystem) have more flexibility on 
various BIOSes. So can we reconsider about resource rebalance as boot 
option, or should we forget about this idea?

As you have proposed it, the boot option will not work at all, so I
think we need to forget about it.  Especially if it is not really
needed.
I guess at least one thing would work if people don't want to boot twice: 
give the bus number 0 as rebalance starting point, then all system 
resources would be reshuffled :-)


Hm, but don't we do that today with our basic resource reservation logic
at boot time?  What would be different about this kind of proposal?


The generic PCI core can do this, but the feature is effectively disabled by 
the low-level PCI code on x86. The low-level code tries to reserve resources 
according to the configuration from the BIOS. If the BIOS is wrong, the 
allocation fails and the generic PCI core can't repair it, 
because the bridge resources may already have been allocated by the low-level 
PCI code and the PCI core can't expand them to find enough resources for the 
subordinates.


The proposal is to stop the x86 low-level PCI code from allocating resources 
according to the BIOS, so the PCI core can fully control resource allocation. 
The PCI core then takes all the resources from the BARs it knows about into 
account and configures the resource windows on the bridges according to its 
own calculation.


Regards,
Yu


Re: [PATCH 0/4] Fix vmalloc regression

2008-11-07 Thread Nick Piggin
On Saturday 08 November 2008 13:13, Glauber Costa wrote:
> On Sat, Nov 08, 2008 at 01:58:32AM +0100, Nick Piggin wrote:
> > On Fri, Nov 07, 2008 at 08:35:50PM -0200, Glauber Costa wrote:
> > > Nick,
> > >
> > > This is the whole set of patches I was talking about.
> > > Patch 3 is the one that in fact fixes the problem
> > > Patches 1 and 2 are debugging aids I made use of, and could be possibly
> > > useful to others
> > > Patch 4 removes guard pages entirely for non-debug kernels, as we have
> > > already previously discussed.
> > >
> > > Hope it's all fine.
> >
> > OK, these all look good, but I may only push 3/4 for Linus in this round,
> > along with some of the changes from my patch that you tested as well.
>
> Makes total sense.

OK, sent. Thanks again.


> > With the DEBUG_PAGEALLOC case, I have been thinking that we perhaps
> > should turn off the lazy unmapping optimisation as well, so it catches
> > use after free similarly to the page allocator... but probably it is a
> > good idea at least to avoid the double-guard page for 2.6.28?
>
> Makes sense. Maybe poisoning after free would also be useful?

It's a problem because we're only dealing with virtual addresses, rather
than real memory. So we don't really have anything to poison (we don't
know what the caller will do with the memory). I guess it would be
possible to poison in the page allocator or in vfree, but probably
not worthwhile (after the immediate-unmap debug option).
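
As a rough illustration of what poisoning on free means here (a hedged
userspace sketch, not kernel code; the helper name is made up):

#include <stdlib.h>
#include <string.h>

#define POISON_BYTE 0x6b	/* same pattern SLUB uses for freed objects */

/* Fill an object with a recognizable pattern before freeing it, so a
 * use-after-free reads back 0x6b6b... instead of silently "working". */
static void poison_and_free(void *ptr, size_t size)
{
	memset(ptr, POISON_BYTE, size);
	free(ptr);
}

int main(void)
{
	char *p = malloc(64);

	if (p)
		poison_and_free(p, 64);
	return 0;
}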


Re: [PATCH 0/4] Fix vmalloc regression

2008-11-07 Thread Glauber Costa
On Sat, Nov 08, 2008 at 01:58:32AM +0100, Nick Piggin wrote:
> On Fri, Nov 07, 2008 at 08:35:50PM -0200, Glauber Costa wrote:
> > Nick,
> > 
> > This is the whole set of patches I was talking about.
> > Patch 3 is the one that in fact fixes the problem
> > Patches 1 and 2 are debugging aids I made use of, and could be possibly
> > useful to others
> > Patch 4 removes guard pages entirely for non-debug kernels, as we have 
> > already
> > previously discussed.
> > 
> > Hope it's all fine.
> 
> OK, these all look good, but I may only push 3/4 for Linus in this round,
> along with some of the changes from my patch that you tested as well.

Makes total sense.
> 
> With the DEBUG_PAGEALLOC case, I have been thinking that we perhaps should
> turn off the lazy unmapping optimisation as well, so it catches use
> after free similarly to the page allocator... but probably it is a good
> idea at least to avoid the double-guard page for 2.6.28?

Makes sense. Maybe poisoning after free would also be useful?


Re: [PATCH 0/4] Fix vmalloc regression

2008-11-07 Thread Nick Piggin
On Fri, Nov 07, 2008 at 08:35:50PM -0200, Glauber Costa wrote:
> Nick,
> 
> This is the whole set of patches I was talking about.
> Patch 3 is the one that in fact fixes the problem
> Patches 1 and 2 are debugging aids I made use of, and could be possibly
> useful to others
> Patch 4 removes guard pages entirely for non-debug kernels, as we have already
> previously discussed.
> 
> Hope it's all fine.

OK, these all look good, but I may only push 3/4 for Linus in this round,
along with some of the changes from my patch that you tested as well.

With the DEBUG_PAGEALLOC case, I have been thinking that we perhaps should
turn off the lazy unmapping optimisation as well, so it catches use
after free similarly to the page allocator... but probably it is a good
idea at least to avoid the double-guard page for 2.6.28?

Anyway thanks for these, I'll send them up to Andrew/Linus and cc you.



[ kvm-Bugs-2219447 ] kvm_run: Cannot allocate memory

2008-11-07 Thread SourceForge.net
Bugs item #2219447, was opened at 2008-11-03 20:32
Message generated for change (Comment added) made by glommer
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2219447&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: intel
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Torsten Wohlfarth (towo2099)
Assigned to: Nobody/Anonymous (nobody)
Summary: kvm_run: Cannot allocate memory

Initial Comment:
Since I tested kernels 2.6.28-rc2 and -rc3, kvm quit running with the following 
error:

kvm -hda vdisk_xp.qcow -m 1024 -smp 2 -soundhw ac97
kvm_run: Cannot allocate memory
kvm_run returned -12

It does not matter if the guest is Linux or XP.

kvm version is kvm-78
system is Linux Defiant 2.6.28-rc3-towo-1 #1 SMP PREEMPT Mon Nov 3 16:46:41 CET 2008 i686 GNU/Linux
cpu is Intel Core2 Quad Q6700 @ 4096 KB cache, flags ( sse3 nx lm vmx )
the no-kvm switches do not help

Booting kernel 2.6.27.4 lets kvm run fine.

--

Comment By: Glauber de Oliveira Costa (glommer)
Date: 2008-11-08 00:37

Message:
No, using that parameter won't work.

I just got the fix for it today. It's not in any tree yet, but since
you're using git, I'll assume you're comfortable with applying a patch
in your tree ;-)

You can download it from
http://glommer.net/0003-restart-search-at-beggining-of-vmalloc-address.patch

--

Comment By: walt (w41ter)
Date: 2008-11-07 21:07

Message:
FWIW this seems to be a 32-bit-only problem for me. I track Linus.git and
kvm-userspace.git in 32-bit and 64-bit linux installations on one machine,
and the 64-bit kvm works perfectly.  Only the 32-bit kvm has this error.

Can I work around the problem with vmalloc=?  If yes, where do I use
that vmalloc flag?

Thanks.


--

Comment By: Torsten Wohlfarth (towo2099)
Date: 2008-11-03 21:13

Message:
Yeah, there are many messages about vmalloc:
http://rafb.net/p/4RvqBX81.html


--

Comment By: Glauber de Oliveira Costa (glommer)
Date: 2008-11-03 21:08

Message:
check your dmesg in the host.

If there's any message about vmalloc failing, then this is a known issue.
I have a band-aid patch that helps it, but we're still not sure what the
proper fix is.
I'm working on it right now.

--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2219447&group_id=180599


[PATCH] bios: resolve memory device roll over reporting issues with >32G guests

2008-11-07 Thread Bill Rieske
The size field within the Memory Device (type 17) structure is only a word, with 
the MSB used to indicate MB/KB.  As a result, a guest with 32G or more would 
report incorrect memory device information, rolling over to 0.

This patch presents more than one memory device and the associated memory 
structures if the memory is larger than 16G.
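
For illustration only (not part of the patch), the rollover is easy to see
with the mask the old type 17 code applied:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t memory_size_mb = 32 * 1024;	/* a 32G guest */
	/* old code: truncate to 16 bits and clear the MSB (the MB/KB flag) */
	uint16_t reported = (uint16_t)(memory_size_mb & 0x7fff);

	printf("guest has %u MB, type 17 reports %u MB\n",
	       memory_size_mb, reported);	/* prints 0 */
	return 0;
}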

Signed-off-by: Bill Rieske 

diff --git a/bios/rombios32.c b/bios/rombios32.c
index a91b155..bc69945 100755
--- a/bios/rombios32.c
+++ b/bios/rombios32.c
@@ -173,6 +173,26 @@ static inline int isdigit(int c)
 return c >= '0' && c <= '9';
 }
 
+char *itoa(char *a, unsigned int i)
+{
+unsigned int _i = i, x = 0;
+
+do {
+x++;
+_i /= 10;
+} while ( _i != 0 );
+
+a += x;
+*a-- = '\0';
+
+do {
+*a-- = (i % 10) + '0';
+i /= 10;
+} while ( i != 0 );
+
+return a + 1;
+}
+
 void *memset(void *d1, int val, size_t len)
 {
 uint8_t *d = d1;
@@ -220,6 +240,16 @@ size_t strlen(const char *s)
 return s1 - s;
 }
 
+char *
+strcpy(char *dest, const char *src)
+{
+char *p = dest;
+while ( *src )
+*p++ = *src++;
+*p = 0;
+return dest;
+}
+
 /* from BSD ppp sources */
 int vsnprintf(char *buf, int buflen, const char *fmt, va_list args)
 {
@@ -1914,7 +1944,7 @@ smbios_type_4_init(void *start, unsigned int cpu_number)
 
 /* Type 16 -- Physical Memory Array */
 static void *
-smbios_type_16_init(void *start, uint32_t memsize)
+smbios_type_16_init(void *start, uint32_t memsize, int nr_mem_devs)
 {
 struct smbios_type_16 *p = (struct smbios_type_16*)start;
 
@@ -1927,7 +1957,7 @@ smbios_type_16_init(void *start, uint32_t memsize)
 p->error_correction = 0x01; /* other */
 p->maximum_capacity = memsize * 1024;
 p->memory_error_information_handle = 0xfffe; /* none provided */
-p->number_of_memory_devices = 1;
+p->number_of_memory_devices = nr_mem_devs;
 
 start += sizeof(struct smbios_type_16);
 *((uint16_t *)start) = 0;
@@ -1937,20 +1967,20 @@ smbios_type_16_init(void *start, uint32_t memsize)
 
 /* Type 17 -- Memory Device */
 static void *
-smbios_type_17_init(void *start, uint32_t memory_size_mb)
+smbios_type_17_init(void *start, uint32_t memory_size_mb, int instance)
 {
+char buf[16];
 struct smbios_type_17 *p = (struct smbios_type_17 *)start;
 
 p->header.type = 17;
 p->header.length = sizeof(struct smbios_type_17);
-p->header.handle = 0x1100;
+p->header.handle = 0x1100 + instance;
 
 p->physical_memory_array_handle = 0x1000;
 p->total_width = 64;
 p->data_width = 64;
-/* truncate memory_size_mb to 16 bits and clear most significant
-   bit [indicates size in MB] */
-p->size = (uint16_t) memory_size_mb & 0x7fff;
+/* TODO: should assert in case something is wrong   ASSERT((memory_size_mb & ~0x7fff) == 0); */
+p->size = memory_size_mb;
 p->form_factor = 0x09; /* DIMM */
 p->device_set = 0;
 p->device_locator_str = 1;
@@ -1959,8 +1989,11 @@ smbios_type_17_init(void *start, uint32_t memory_size_mb)
 p->type_detail = 0;
 
 start += sizeof(struct smbios_type_17);
-memcpy((char *)start, "DIMM 1", 7);
-start += 7;
+memcpy((char *)start, "DIMM ", 6);
+start += strlen("DIMM ");
+itoa(buf, instance);
+strcpy(start, buf);
+start += strlen(buf) + 1;
 *((uint8_t *)start) = 0;
 
 return start+1;
@@ -1968,16 +2001,16 @@ smbios_type_17_init(void *start, uint32_t memory_size_mb)
 
 /* Type 19 -- Memory Array Mapped Address */
 static void *
-smbios_type_19_init(void *start, uint32_t memory_size_mb)
+smbios_type_19_init(void *start, uint32_t memory_size_mb, int instance)
 {
 struct smbios_type_19 *p = (struct smbios_type_19 *)start;
 
 p->header.type = 19;
 p->header.length = sizeof(struct smbios_type_19);
-p->header.handle = 0x1300;
+p->header.handle = 0x1300 + instance;
 
-p->starting_address = 0;
-p->ending_address = (memory_size_mb * 1024) - 1;
+p->starting_address = instance << 24;
+p->ending_address = p->starting_address + (memory_size_mb << 10) - 1;
 p->memory_array_handle = 0x1000;
 p->partition_width = 1;
 
@@ -1989,18 +2022,18 @@ smbios_type_19_init(void *start, uint32_t memory_size_mb)
 
 /* Type 20 -- Memory Device Mapped Address */
 static void *
-smbios_type_20_init(void *start, uint32_t memory_size_mb)
+smbios_type_20_init(void *start, uint32_t memory_size_mb, int instance)
 {
 struct smbios_type_20 *p = (struct smbios_type_20 *)start;
 
 p->header.type = 20;
 p->header.length = sizeof(struct smbios_type_20);
-p->header.handle = 0x1400;
+p->header.handle = 0x1400 + instance;
 
-p->starting_address = 0;
-p->ending_address = (memory_size_mb * 1024) - 1;
-p->memory_device_handle = 0x1100;
-p->memory_array_mapped_address_handle = 0x1300;
+p->starting_address = instance << 24;
+p->ending_address = p->starting_address + (memory_size_mb << 10) - 1;
+p->memory_device_handle = 0x1100 + instance;
+ 

Re: [PATCH 0/4] Fix vmalloc regression

2008-11-07 Thread walt

Glauber Costa wrote:

Nick,

This is the whole set of patches I was talking about.
Patch 3 is the one that in fact fixes the problem...


Yep, patch 3 works for me, thanks.  Only the 32-bit kernel
seems to need the patch, FWIW.




[ kvm-Bugs-2219447 ] kvm_run: Cannot allocate memory

2008-11-07 Thread SourceForge.net
Bugs item #2219447, was opened at 2008-11-03 20:32
Message generated for change (Comment added) made by w41ter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2219447&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: intel
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Torsten Wohlfarth (towo2099)
Assigned to: Nobody/Anonymous (nobody)
Summary: kvm_run: Cannot allocate memory

Initial Comment:
Since I tested kernels 2.6.28-rc2 and -rc3, kvm quit running with the following 
error:

kvm -hda vdisk_xp.qcow -m 1024 -smp 2 -soundhw ac97
kvm_run: Cannot allocate memory
kvm_run returned -12

It does not matter if the guest is Linux or XP.

kvm version is kvm-78
system is Linux Defiant 2.6.28-rc3-towo-1 #1 SMP PREEMPT Mon Nov 3 16:46:41 CET 2008 i686 GNU/Linux
cpu is Intel Core2 Quad Q6700 @ 4096 KB cache, flags ( sse3 nx lm vmx )
the no-kvm switches do not help

Booting kernel 2.6.27.4 lets kvm run fine.

--

Comment By: walt (w41ter)
Date: 2008-11-07 21:07

Message:
FWIW this seems to be a 32-bit-only problem for me. I track Linus.git and
kvm-userspace.git in 32-bit and 64-bit linux installations on one machine,
and the 64-bit kvm works perfectly.  Only the 32-bit kvm has this error.

Can I work around the problem with vmalloc=?  If yes, where do I use
that vmalloc flag?

Thanks.


--

Comment By: Torsten Wohlfarth (towo2099)
Date: 2008-11-03 21:13

Message:
Yeah, there are many messages about vmalloc:
http://rafb.net/p/4RvqBX81.html


--

Comment By: Glauber de Oliveira Costa (glommer)
Date: 2008-11-03 21:08

Message:
check your dmesg in the host.

If there's any message about vmalloc failing, then this is a known issue.
I have a band-aid patch that helps it, but we're still not sure what the
proper fix is.
I'm working on it right now.

--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2219447&group_id=180599


[PATCH 4/4] Do not use guard pages in non-debug kernels

2008-11-07 Thread Glauber Costa
In mm/vmalloc.c, make the use of guard pages dependent
on CONFIG_DEBUG_PAGEALLOC.

Signed-off-by: Glauber Costa <[EMAIL PROTECTED]>
---
 mm/vmalloc.c |   25 +++--
 1 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6fe2003..ed73c6f 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -28,6 +28,11 @@
 #include 
 #include 
 
+#ifdef CONFIG_DEBUG_PAGEALLOC
+#define GUARD_PAGE_SIZE PAGE_SIZE
+#else
+#define GUARD_PAGE_SIZE 0
+#endif
 
 /*** Page table manipulation functions ***/
 
@@ -363,7 +368,7 @@ retry:
}
 
while (addr + size >= first->va_start && addr + size <= vend) {
-   addr = ALIGN(first->va_end + PAGE_SIZE, align);
+   addr = ALIGN(first->va_end, align);
 
n = rb_next(&first->rb_node);
if (n)
@@ -954,7 +959,7 @@ void unmap_kernel_range(unsigned long addr, unsigned long size)
 int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page ***pages)
 {
unsigned long addr = (unsigned long)area->addr;
-   unsigned long end = addr + area->size - PAGE_SIZE;
+   unsigned long end = addr + area->size - GUARD_PAGE_SIZE;
int err;
 
err = vmap_page_range(addr, end, prot, *pages);
@@ -1003,7 +1008,7 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
/*
 * We always allocate a guard page.
 */
-   size += PAGE_SIZE;
+   size += GUARD_PAGE_SIZE;
 
va = alloc_vmap_area(size, align, start, end, node, gfp_mask);
if (IS_ERR(va)) {
@@ -1098,7 +1103,7 @@ struct vm_struct *remove_vm_area(const void *addr)
struct vm_struct *vm = va->private;
struct vm_struct *tmp, **p;
free_unmap_vmap_area(va);
-   vm->size -= PAGE_SIZE;
+   vm->size -= GUARD_PAGE_SIZE;
 
write_lock(&vmlist_lock);
for (p = &vmlist; (tmp = *p) != vm; p = &tmp->next)
@@ -1226,7 +1231,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
struct page **pages;
unsigned int nr_pages, array_size, i;
 
-   nr_pages = (area->size - PAGE_SIZE) >> PAGE_SHIFT;
+   nr_pages = (area->size - GUARD_PAGE_SIZE) >> PAGE_SHIFT;
array_size = (nr_pages * sizeof(struct page *));
 
area->nr_pages = nr_pages;
@@ -1451,7 +1456,7 @@ long vread(char *buf, char *addr, unsigned long count)
read_lock(&vmlist_lock);
for (tmp = vmlist; tmp; tmp = tmp->next) {
vaddr = (char *) tmp->addr;
-   if (addr >= vaddr + tmp->size - PAGE_SIZE)
+   if (addr >= vaddr + tmp->size - GUARD_PAGE_SIZE)
continue;
while (addr < vaddr) {
if (count == 0)
@@ -1461,7 +1466,7 @@ long vread(char *buf, char *addr, unsigned long count)
addr++;
count--;
}
-   n = vaddr + tmp->size - PAGE_SIZE - addr;
+   n = vaddr + tmp->size - GUARD_PAGE_SIZE - addr;
do {
if (count == 0)
goto finished;
@@ -1489,7 +1494,7 @@ long vwrite(char *buf, char *addr, unsigned long count)
read_lock(&vmlist_lock);
for (tmp = vmlist; tmp; tmp = tmp->next) {
vaddr = (char *) tmp->addr;
-   if (addr >= vaddr + tmp->size - PAGE_SIZE)
+   if (addr >= vaddr + tmp->size - GUARD_PAGE_SIZE)
continue;
while (addr < vaddr) {
if (count == 0)
@@ -1498,7 +1503,7 @@ long vwrite(char *buf, char *addr, unsigned long count)
addr++;
count--;
}
-   n = vaddr + tmp->size - PAGE_SIZE - addr;
+   n = vaddr + tmp->size - GUARD_PAGE_SIZE - addr;
do {
if (count == 0)
goto finished;
@@ -1544,7 +1549,7 @@ int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
if (!(area->flags & VM_USERMAP))
return -EINVAL;
 
-   if (usize + (pgoff << PAGE_SHIFT) > area->size - PAGE_SIZE)
+   if (usize + (pgoff << PAGE_SHIFT) > area->size - GUARD_PAGE_SIZE)
return -EINVAL;
 
addr += pgoff << PAGE_SHIFT;
-- 
1.5.6.5



[PATCH 3/4] restart search at beggining of vmalloc address

2008-11-07 Thread Glauber Costa
Currently, vmalloc restarts the search for a free area if it
can't find one. The reason is that there are areas which are lazily
freed and could possibly be freed now. However, the current implementation
starts searching the tree from the last failing address, which is
pretty much by definition at the end of the address space. So, we fail.

The proposal of this patch is to restart the search from the beginning
of the requested vstart address. This fixes the regression in running
KVM virtual machines for me, described in
http://lkml.org/lkml/2008/10/28/349, caused by commit
db64fe02258f1507e13fe5212a989922323685ce.

Signed-off-by: Glauber Costa <[EMAIL PROTECTED]>
CC: Nick Piggin <[EMAIL PROTECTED]>
---
 mm/vmalloc.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 7db493d..6fe2003 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -378,6 +378,7 @@ found:
if (!purged) {
purge_vmap_area_lazy();
purged = 1;
+   addr = ALIGN(vstart, align);
goto retry;
}
if (printk_ratelimit())
-- 
1.5.6.5



[PATCH 2/4] show size of failing allocation

2008-11-07 Thread Glauber Costa
If we can't service a vmalloc allocation, show the size of
the allocation that actually failed. Useful for
debugging.

Signed-off-by: Glauber Costa <[EMAIL PROTECTED]>
---
 mm/vmalloc.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 95856d1..7db493d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -381,8 +381,8 @@ found:
goto retry;
}
if (printk_ratelimit())
-   printk(KERN_WARNING "vmap allocation failed: "
-"use vmalloc= to increase size.\n");
+   printk(KERN_WARNING "vmap allocation for size %d 
failed: "
+"use vmalloc= to increase size.\n", 
size);
return ERR_PTR(-EBUSY);
}
 
-- 
1.5.6.5



[PATCH 1/4] don't call __vmalloc from other vmap internal functions

2008-11-07 Thread Glauber Costa
If we do that, the output of files like /proc/vmallocinfo
will show things like "vmalloc_32", "vmalloc_user", or
whichever wrapper made the call, as the caller. This info is not
as useful as the real caller of the allocation.

So, the proposal is to call __vmalloc_node directly, with
matching parameters to preserve the caller information.
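
For reference, a hedged userspace sketch (not the kernel code) of how
__builtin_return_address(0) captures the immediate caller; this is what the
patch forwards to __vmalloc_node so /proc/vmallocinfo shows the real
allocation site instead of the wrapper:

#include <stdio.h>

/* Returns the address of the instruction that called this function. */
static void *record_caller(void)
{
	return __builtin_return_address(0);
}

static void *wrapper(void)
{
	/* The caller recorded here is always wrapper(), not whoever called
	 * wrapper() -- the same problem vmalloc_32() etc. had before the
	 * patch, since they went through __vmalloc(). */
	return record_caller();
}

int main(void)
{
	printf("recorded caller: %p\n", wrapper());
	return 0;
}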

Signed-off-by: Glauber Costa <[EMAIL PROTECTED]>
---
 mm/vmalloc.c |   12 
 1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 0365369..95856d1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1343,7 +1343,8 @@ void *vmalloc_user(unsigned long size)
struct vm_struct *area;
void *ret;
 
-   ret = __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
+   ret = __vmalloc_node(size, GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO,
+PAGE_KERNEL, -1, __builtin_return_address(0));
if (ret) {
area = find_vm_area(ret);
area->flags |= VM_USERMAP;
@@ -1388,7 +1389,8 @@ EXPORT_SYMBOL(vmalloc_node);
 
 void *vmalloc_exec(unsigned long size)
 {
-   return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL_EXEC);
+   return __vmalloc_node(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL_EXEC,
+ -1, __builtin_return_address(0));
 }
 
 #if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
@@ -1408,7 +1410,8 @@ void *vmalloc_exec(unsigned long size)
  */
 void *vmalloc_32(unsigned long size)
 {
-   return __vmalloc(size, GFP_VMALLOC32, PAGE_KERNEL);
+   return __vmalloc_node(size, GFP_VMALLOC32, PAGE_KERNEL,
+ -1, __builtin_return_address(0));
 }
 EXPORT_SYMBOL(vmalloc_32);
 
@@ -1424,7 +1427,8 @@ void *vmalloc_32_user(unsigned long size)
struct vm_struct *area;
void *ret;
 
-   ret = __vmalloc(size, GFP_VMALLOC32 | __GFP_ZERO, PAGE_KERNEL);
+   ret = __vmalloc_node(size, GFP_VMALLOC32 | __GFP_ZERO, PAGE_KERNEL,
+-1, __builtin_return_address(0));
if (ret) {
area = find_vm_area(ret);
area->flags |= VM_USERMAP;
-- 
1.5.6.5



[PATCH 0/4] Fix vmalloc regression

2008-11-07 Thread Glauber Costa
Nick,

This is the whole set of patches I was talking about.
Patch 3 is the one that in fact fixes the problem.
Patches 1 and 2 are debugging aids I made use of, and could possibly be
useful to others.
Patch 4 removes guard pages entirely for non-debug kernels, as we have already
previously discussed.

Hope it's all fine.




Re: [PATCH] regression: vmalloc easily fail.

2008-11-07 Thread Glauber Costa
On Thu, Oct 30, 2008 at 05:49:41AM +0100, Nick Piggin wrote:
> On Wed, Oct 29, 2008 at 08:07:37PM -0200, Glauber Costa wrote:
> > On Wed, Oct 29, 2008 at 11:43:33AM +0100, Nick Piggin wrote:
> > > On Wed, Oct 29, 2008 at 12:29:40PM +0200, Avi Kivity wrote:
> > > > Nick Piggin wrote:
> > > > >Hmm, spanning <30MB of memory... how much vmalloc space do you have?
> > > > >
> > > > >  
> > > > 
> > > > From the original report:
> > > > 
> > > > >VmallocTotal: 122880 kB
> > > > >VmallocUsed:   15184 kB
> > > > >VmallocChunk:  83764 kB
> > > > 
> > > > So it seems there's quite a bit of free space.
> > > > 
> > > > Chunk is the largest free contiguous region, right?  If so, it seems 
> > > > the 
> > > 
> > > Yes.
> > > 
> > > 
> > > > problem is unrelated to guard pages, instead the search isn't finding a 
> > > > 1-page area (with two guard pages) for some reason, even though lots of 
> > > > free space is available.
> > > 
> > > Hmm. The free area search could be buggy...
> > Do you want me to grab any specific info of it? Or should I just hack myself
> > randomly into it? I'll probably have some time for that tomorrow.
> 
> I took a bit of a look. Does this help you at all?
> 
> I still think we should get rid of the guard pages in non-debug kernels
> completely, but hopefully this will fix your problems?
> --
> 
> - Fix off by one bug in the KVA allocator that can leave gaps 
> - An initial vmalloc failure should start off a synchronous flush of lazy
>   areas, in case someone is in progress flushing them already.
> - Purge lock can be a mutex so we can sleep while that's going on.
>  
> Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
  Tested-by: Glauber Costa <[EMAIL PROTECTED]>
> ---
> Index: linux-2.6/mm/vmalloc.c
> ===
> --- linux-2.6.orig/mm/vmalloc.c
> +++ linux-2.6/mm/vmalloc.c
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -362,7 +363,7 @@ retry:
>   goto found;
>   }
>  
> - while (addr + size >= first->va_start && addr + size <= vend) {
> + while (addr + size > first->va_start && addr + size <= vend) {
>   addr = ALIGN(first->va_end + PAGE_SIZE, align);
>  
>   n = rb_next(&first->rb_node);
> @@ -472,7 +473,7 @@ static atomic_t vmap_lazy_nr = ATOMIC_IN
>  static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
>   int sync, int force_flush)
>  {
> - static DEFINE_SPINLOCK(purge_lock);
> + static DEFINE_MUTEX(purge_lock);
>   LIST_HEAD(valist);
>   struct vmap_area *va;
>   int nr = 0;
> @@ -483,10 +484,10 @@ static void __purge_vmap_area_lazy(unsig
>* the case that isn't actually used at the moment anyway.
>*/
>   if (!sync && !force_flush) {
> - if (!spin_trylock(&purge_lock))
> + if (!mutex_trylock(&purge_lock))
>   return;
>   } else
> - spin_lock(&purge_lock);
> + mutex_lock(&purge_lock);
>  
>   rcu_read_lock();
>   list_for_each_entry_rcu(va, &vmap_area_list, list) {
> @@ -518,7 +519,18 @@ static void __purge_vmap_area_lazy(unsig
>   __free_vmap_area(va);
>   spin_unlock(&vmap_area_lock);
>   }
> - spin_unlock(&purge_lock);
> + mutex_unlock(&purge_lock);
> +}
> +
> +/*
> + * Kick off a purge of the outstanding lazy areas. Don't bother if somebody
> + * is already purging.
> + */
> +static void try_purge_vmap_area_lazy(void)
> +{
> + unsigned long start = ULONG_MAX, end = 0;
> +
> + __purge_vmap_area_lazy(&start, &end, 0, 0);
>  }
>  
>  /*
> @@ -528,7 +540,7 @@ static void purge_vmap_area_lazy(void)
>  {
>   unsigned long start = ULONG_MAX, end = 0;
>  
> - __purge_vmap_area_lazy(&start, &end, 0, 0);
> + __purge_vmap_area_lazy(&start, &end, 1, 0);
>  }
>  
>  /*
> @@ -539,7 +551,7 @@ static void free_unmap_vmap_area(struct 
>   va->flags |= VM_LAZY_FREE;
>   atomic_add((va->va_end - va->va_start) >> PAGE_SHIFT, &vmap_lazy_nr);
>   if (unlikely(atomic_read(&vmap_lazy_nr) > lazy_max_pages()))
> - purge_vmap_area_lazy();
> + try_purge_vmap_area_lazy();
>  }
>  
>  static struct vmap_area *find_vmap_area(unsigned long addr)


Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

2008-11-07 Thread Greg KH
On Fri, Nov 07, 2008 at 04:35:47PM +0800, Zhao, Yu wrote:
> Greg KH wrote:
>> On Fri, Nov 07, 2008 at 04:17:02PM +0800, Zhao, Yu wrote:
 Well, to do it "correctly" you are going to have to tell the driver to
 shut itself down, and reinitialize itself.
 Turns out, that doesn't really work for disk and network devices without
 dropping the connection (well, network devices should be fine probably).
 So you just can't do this, sorry.  That's why the BIOS handles all of
 these issues in a PCI hotplug system.
 How does the hardware people think we are going to handle this in the
 OS?  It's not something that any operating system can do, is it part of
 the IOV PCI spec somewhere?
>>> No, it's not part of the PCI IOV spec.
>>>
>>> I just want the IOV (and whole PCI subsystem) have more flexibility on 
>>> various BIOSes. So can we reconsider about resource rebalance as boot 
>>> option, or should we forget about this idea?
>> As you have proposed it, the boot option will not work at all, so I
>> think we need to forget about it.  Especially if it is not really
>> needed.
>
> I guess at least one thing would work if people don't want to boot twice: 
> give the bus number 0 as rebalance starting point, then all system 
> resources would be reshuffled :-)

Hm, but don't we do that today with our basic resource reservation logic
at boot time?  What would be different about this kind of proposal?

thanks,

greg k-h


Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

2008-11-07 Thread Greg KH
On Fri, Nov 07, 2008 at 11:17:40PM +0800, Yu Zhao wrote:
> While we are arguing what the software model the SR-IOV should be, let me 
> ask two simple questions first:
>
> 1, What does the SR-IOV looks like?
> 2, Why do we need to support it?

I don't think we need to worry about those questions, as we can see what
the SR-IOV interface looks like by looking at the PCI spec, and we know
Linux needs to support it, as Linux needs to support everything :)

(note, community members that can not see the PCI specs at this point in
time, please know that we are working on resolving these issues,
hopefully we will have some good news within a month or so.)

> As you know the Linux kernel is the base of various virtual machine 
> monitors such as KVM, Xen, OpenVZ and VServer. We need SR-IOV support in 
> the kernel because mostly it helps high-end users (IT departments, HPC, 
> etc.) to share limited hardware resources among hundreds or even thousands 
> virtual machines and hence reduce the cost. How can we make these virtual 
> machine monitors utilize the advantage of SR-IOV without spending too much 
> effort meanwhile remaining architectural correctness? I believe making VF 
> represent as much closer as a normal PCI device (struct pci_dev) is the 
> best way in current situation, because this is not only what the hardware 
> designers expect us to do but also the usage model that KVM, Xen and other 
> VMMs have already supported.

But would such an api really take advantage of the new IOV interfaces
that are exposed by the new device type?

> I agree that API in the SR-IOV pacth is arguable and the concerns such as 
> lack of PF driver, etc. are also valid. But I personally think these stuff 
> are not essential problems to me and other SR-IOV driver developers.

How can the lack of a PF driver not be a valid concern at this point in
time?  Without such a driver written, how can we know that the SR-IOV
interface as created is sufficient, or that it even works properly?

Here's what I see we need to have before we can evaluate if the IOV core
PCI patches are acceptable:
  - a driver that uses this interface
  - a PF driver that uses this interface.

Without those, we can't determine if the infrastructure provided by the
IOV core even is sufficient, right?

Rumor has it that there is both of the above things floating around, can
someone please post them to the linux-pci list so that we can see how
this all works together?

thanks,

greg k-h


Re: [PATCH] kvm: ppc: fix Kconfig constraints

2008-11-07 Thread Hollis Blanchard
On Fri, 2008-11-07 at 13:10 -0600, Hollis Blanchard wrote:
> Make sure that CONFIG_KVM cannot be selected without processor support
> (currently, 440 is the only processor implementation available).
> 
> Signed-off-by: Hollis Blanchard <[EMAIL PROTECTED]>
> 
> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> --- a/arch/powerpc/kvm/Kconfig
> +++ b/arch/powerpc/kvm/Kconfig
> @@ -15,24 +15,23 @@ if VIRTUALIZATION
>  if VIRTUALIZATION
> 
>  config KVM
> - bool "Kernel-based Virtual Machine (KVM) support"
> - depends on EXPERIMENTAL
> + bool
>   select PREEMPT_NOTIFIERS
>   select ANON_INODES
> + default n

The "default n" isn't needed. Updated patch below.

kvm: ppc: fix Kconfig constraints

Make sure that CONFIG_KVM cannot be selected without processor support
(currently, 440 is the only processor implementation available).

Signed-off-by: Hollis Blanchard <[EMAIL PROTECTED]>

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -15,24 +15,22 @@ if VIRTUALIZATION
 if VIRTUALIZATION
 
 config KVM
-   bool "Kernel-based Virtual Machine (KVM) support"
-   depends on EXPERIMENTAL
+   bool
select PREEMPT_NOTIFIERS
select ANON_INODES
+
+config KVM_440
+   bool "KVM support for PowerPC 440 processors"
+   depends on EXPERIMENTAL && 44x
+   select KVM
---help---
- Support hosting virtualized guest machines. You will also
- need to select one or more of the processor modules below.
+ Support running unmodified 440 guest kernels in virtual machines on
+ 440 host processors.
 
  This module provides access to the hardware capabilities through
  a character device node named /dev/kvm.
 
  If unsure, say N.
-
-config KVM_440
-   bool "KVM support for PowerPC 440 processors"
-   depends on KVM && 44x
-   ---help---
- KVM can run unmodified 440 guest kernels on 440 host processors.
 
 config KVM_TRACE
bool "KVM trace support"


-- 
Hollis Blanchard
IBM Linux Technology Center



[ kvm-Bugs-2235570 ] 100% cpu usage with KVM-78

2008-11-07 Thread SourceForge.net
Bugs item #2235570, was opened at 2008-11-07 17:58
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2235570&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: James Bailey (dgym)
Assigned to: Nobody/Anonymous (nobody)
Summary: 100% cpu usage with KVM-78

Initial Comment:
When I start a guest it consumes 100% CPU on the host, even after it has booted 
and is sitting idle at a login prompt.

The odd thing is that if I then live migrate the guest to another (identical) 
machine, the problem goes away. The guest continues to run just fine on the new 
host, and the new host's CPU usage is normal.

I have tried the obvious: starting on the other machine and migrating to the 
first, and even multiple migrations. It is always the same, the 
qemu-system-x86_64 process sits at 100% unless it was started with -incoming ...

Migrating machines every time you start up is not a very convenient workaround, 
so it would be nice to find out what is different between the normal 
start-up and the -incoming start-up and fix the former.

Versions and settings:
KVM: 78
Host Kernel: Vanilla 2.6.25.2
Compiled with: gcc version 4.1.2
CPU: AMD Phenom

Guest OS: Linux (have tried a few distros)
Guest Kernels: Debian etch, and an OpenVZ 2.6.18

Command line:
qemu-system-x86_64 -m 128 -smp 1 -drive file=/dev/drbd0 -vnc :1

Things I have tried which have not worked:
Using -nographics.
Using SDL graphics.
Using -snapshot, and doing a savevm and loadvm.

Things I have tried which have worked:
Using -no-kvm.

I have attached gdb and found the busy thread, here is its backtrace:
#0  0x7f06f017ea17 in ioctl () from /lib/libc.so.6
#1  0x0051b423 in kvm_run (kvm=0xa93040, vcpu=0) at libkvm.c:892
#2  0x004f1116 in kvm_cpu_exec (env=) at /opt/setup/kvm-78/qemu/qemu-kvm.c:230
#3  0x004f13e4 in ap_main_loop (_env=) at /opt/setup/kvm-78/qemu/qemu-kvm.c:432
#4  0x7f06f0565135 in start_thread () from /lib/libpthread.so.0
#5  0x7f06f01852ce in clone () from /lib/libc.so.6
#6  0x in ?? ()

Because this indicates the busy activity is within the kernel module, this is as far 
as I have got.

I will attempt to identify the last working version; I know I never had 
this problem with 68, but I haven't yet tried anything in between.

--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2235570&group_id=180599


Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

2008-11-07 Thread Yu Zhao

Anthony Liguori wrote:

Matthew Wilcox wrote:

[Anna, can you fix your word-wrapping please?  Your lines appear to be
infinitely long which is most unpleasant to reply to]

On Thu, Nov 06, 2008 at 05:38:16PM +, Fischer, Anna wrote:
 

Where would the VF drivers have to be associated?  On the "pci_dev"
level or on a higher one?
  

A VF appears to the Linux OS as a standard (full, additional) PCI
device. The driver is associated in the same way as for a normal PCI
device. Ideally, you would use SR-IOV devices on a virtualized system,
for example, using Xen. A VF can then be assigned to a guest domain as
a full PCI device.



It's not clear that's the right solution.  If the VF devices are _only_
going to be used by the guest, then arguably, we don't want to create
pci_devs for them in the host.  (I think it _is_ the right answer, but I
want to make it clear there's multiple opinions on this).
  


The VFs shouldn't be limited to being used by the guest.


Yes, VF driver running in the host is supported :-)



SR-IOV is actually an incredibly painful thing.  You need to have a VF 
driver in the guest, do hardware pass through, have a PV driver stub in 
the guest that's hypervisor specific (a VF is not usable on its own), 
have a device specific backend in the VMM, and if you want to do live 
migration, have another PV driver in the guest that you can do teaming 
with.  Just a mess.


Actually it's not such a mess. A VF driver can be a plain PCI device driver and 
doesn't require any backend in the VMM, or hypervisor-specific 
knowledge, if the hardware is properly designed. In this case the PF driver 
controls hardware resource allocation for the VFs, and a VF driver can work 
without any communication with the PF driver or the VMM.
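
To make that concrete, here is a hedged sketch of a VF driver registering as
an ordinary PCI driver (the vendor/device IDs and all names are placeholders,
not taken from any real driver or from the SR-IOV patch set):

#include <linux/module.h>
#include <linux/pci.h>

#define EXAMPLE_VENDOR_ID	0x1234	/* placeholder */
#define EXAMPLE_VF_DEVICE_ID	0xbeef	/* placeholder */

static const struct pci_device_id example_vf_ids[] = {
	{ PCI_DEVICE(EXAMPLE_VENDOR_ID, EXAMPLE_VF_DEVICE_ID) },
	{ }
};
MODULE_DEVICE_TABLE(pci, example_vf_ids);

static int example_vf_probe(struct pci_dev *pdev,
			    const struct pci_device_id *id)
{
	int err = pci_enable_device(pdev);

	if (err)
		return err;
	pci_set_master(pdev);
	/* map BARs, set up queues, register a netdev, etc. */
	return 0;
}

static void example_vf_remove(struct pci_dev *pdev)
{
	pci_disable_device(pdev);
}

static struct pci_driver example_vf_driver = {
	.name		= "example_vf",
	.id_table	= example_vf_ids,
	.probe		= example_vf_probe,
	.remove		= example_vf_remove,
};

static int __init example_vf_init(void)
{
	return pci_register_driver(&example_vf_driver);
}

static void __exit example_vf_exit(void)
{
	pci_unregister_driver(&example_vf_driver);
}

module_init(example_vf_init);
module_exit(example_vf_exit);
MODULE_LICENSE("GPL");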




What we would rather do in KVM, is have the VFs appear in the host as 
standard network devices.  We would then like to back our existing PV 
driver to this VF directly bypassing the host networking stack.  A key 
feature here is being able to fill the VF's receive queue with guest 
memory instead of host kernel memory so that you can get zero-copy 
receive traffic.  This will perform just as well as doing passthrough 
(at least) and avoid all that ugliness of dealing with SR-IOV in the guest.


If the hardware supports both SR-IOV and an IOMMU, I wouldn't suggest 
people do so, because they will get better performance by directly 
assigning a VF to the guest.


However, lots of low-end machines don't have SR-IOV and IOMMU support. 
They may have a multi-queue NIC, which uses a built-in L2 switch to dispatch 
packets to different DMA queues according to MAC address. They definitely 
can benefit a lot if there is software support for hooking a DMA queue up to 
the virtio-net backend as you suggested.




This eliminates all of the mess of various drivers in the guest and all 
the associated baggage of doing hardware passthrough.


So IMHO, having VFs be usable in the host is absolutely critical because 
I think it's the only reasonable usage model.


Please don't worry, we have taken this usage model as well as the container 
model into account when designing the SR-IOV framework for the kernel.




Regards,

Anthony Liguori




Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

2008-11-07 Thread Yu Zhao
While we are arguing about what the software model for SR-IOV should be, let 
me ask two simple questions first:


1. What does SR-IOV look like?
2. Why do we need to support it?

I'm sure people have different understandings from their own viewpoints. 
No one is wrong, but please don't make things complicated and 
don't ignore user requirements.


The PCI SIG and hardware vendors created this intending to make the 
hardware resources in one PCI device shareable by different software 
instances -- I guess all of us agree with this. No doubt the PF is a real 
function in the PCI device, but is the VF different? No, it also has its own 
Bus, Device and Function numbers, and PCI configuration space and Memory 
Space (MMIO). To be more detailed, it can respond to and initiate PCI 
Transaction Layer Protocol packets, which means it can do everything a 
PF can at the PCI level. From these obvious behaviors, we can conclude that 
the PCI SIG models a VF as a normal PCI device function, even though it's not 
standalone.


As you know, the Linux kernel is the base of various virtual machine 
monitors such as KVM, Xen, OpenVZ and VServer. We need SR-IOV support in 
the kernel mostly because it helps high-end users (IT departments, HPC, 
etc.) to share limited hardware resources among hundreds or even 
thousands of virtual machines and hence reduce the cost. How can we make 
these virtual machine monitors take advantage of SR-IOV without 
spending too much effort while maintaining architectural correctness? 
I believe making a VF appear as close as possible to a normal PCI device 
(struct pci_dev) is the best way in the current situation, because this is 
not only what the hardware designers expect us to do but also the usage 
model that KVM, Xen and other VMMs have already supported.


I agree that the API in the SR-IOV patch is arguable and that the concerns, 
such as the lack of a PF driver, etc., are also valid. But I personally think 
this stuff is not an essential problem for me and other SR-IOV driver 
developers. People can refine things, but they don't want to recreate things 
in a totally different way, especially when that way doesn't bring them 
obvious benefits.


As I can see, we are now reaching a point where a decision must be 
made. I know this is a difficult thing in an open and free community, 
but fortunately we have a lot of talented and experienced people here. 
So let's make it happen, and keep our loyal users happy!


Thanks,
Yu


Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

2008-11-07 Thread Andi Kleen
Anthony Liguori <[EMAIL PROTECTED]> writes:
>
> What we would rather do in KVM, is have the VFs appear in the host as
> standard network devices.  We would then like to back our existing PV
> driver to this VF directly bypassing the host networking stack.  A key
> feature here is being able to fill the VF's receive queue with guest
> memory instead of host kernel memory so that you can get zero-copy
> receive traffic.  This will perform just as well as doing passthrough
> (at least) and avoid all that ugliness of dealing with SR-IOV in the
> guest.

But you shift a lot of ugliness into the host network stack again.
Not sure that is a good trade off.

Also, it would always require context switches, and I believe one
of the reasons for the PV/VF model is very low latency IO; heavyweight
switches to the host and back would work against that.

-Andi

-- 
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: git repository for SR-IOV development?

2008-11-07 Thread Yu Zhao

Hello Lance,

Thanks for your interest in SR-IOV. As Greg said, we can't have a git
tree for the change, but you are welcome to ask any questions here, and I
will also keep you informed if there is any update on the SR-IOV patches.


Thanks,
Yu

Greg KH wrote:

On Thu, Nov 06, 2008 at 11:58:25AM -0800, H L wrote:

--- On Thu, 11/6/08, Greg KH <[EMAIL PROTECTED]> wrote:


On Thu, Nov 06, 2008 at 08:51:09AM -0800, H L wrote:

Has anyone initiated or given consideration to the creation of a git
repository (say, on kernel.org) for SR-IOV development?

Why?  It's only a few patches, right?  Why would it need a whole new git
tree?


So as to minimize the time and effort of patching a kernel, especially if
the tree (and/or hash level) against which the patches were created
fails to be specified on a mailing-list.  Plus, there appear to be
questions raised about how, precisely, the implementation should
ultimately be modeled, and given that, who knows at this point what
number of patches will ultimately be submitted?  I know I've built the
"7-patch" one (painfully, by the way), and I'm aware there's another
15-patch set out there which I've not yet examined.


It's a mere 7 or 15 patches, you don't need a whole git tree for
something small like that.

Especially as there only seems to be one developer doing real work...

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html




Re: [PATCH 2/2] KVM: Fix kvm_free_physmem_slot memory leak.

2008-11-07 Thread François Diakhate
On Thu, Nov 6, 2008 at 4:14 PM, Avi Kivity <[EMAIL PROTECTED]> wrote:
> What happens here if both free and dont have nonzero, different
> ->userspace_addr values?  Is it even possible?

I don't think it can happen in the current kvm code, but I put that test
in to respect the function's behaviour of freeing any memory allocation
pointed to by free and not by dont (as described in the comment).

> Also, the call chain is fishy.  set_memory_region calls free_physmem_slot
> which calls arch_set_memory_region.  This is turning into pasta.

I agree, that's why I thought it would be better to put this outside
kvm_free_physmem_slot in my first patch. AFAICT, kvm_free_physmem_slot
is called by kvm_set_memory_region in order to free the memory holding
information regarding the slot but not the actual memory region held
by the slot: precisely because it is the role of kvm_set_memory_region
to free it.

So here is an attempt at something cleaner:
1. Rename kvm_free_physmem_slot to kvm_free_physmem_slot_info to indicate
that it only frees the memory storing information about the slot and not
the memory region itself.

2. Make kvm_free_physmem free memory regions through
kvm_set_memory_region and let it free the slot info.


diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a87f45e..e59dc10 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -614,11 +614,24 @@ out:
return kvm;
 }

+static void kvm_free_physmem_slot(struct kvm *kvm,
+ struct kvm_memory_slot *slot)
+{
+   struct kvm_userspace_memory_region mem = {
+   .slot = memslot_id(kvm, slot),
+   .guest_phys_addr = slot->base_gfn << PAGE_SHIFT,
+   .memory_size = 0,
+   .flags = 0,
+   };
+
+   kvm_set_memory_region(kvm, &mem, slot->user_alloc);
+}
+
 /*
  * Free any memory in @free but not in @dont.
  */
-static void kvm_free_physmem_slot(struct kvm_memory_slot *free,
- struct kvm_memory_slot *dont)
+static void kvm_free_physmem_slot_info(struct kvm_memory_slot *free,
+  struct kvm_memory_slot *dont)
 {
if (!dont || free->rmap != dont->rmap)
vfree(free->rmap);
@@ -640,7 +653,7 @@ void kvm_free_physmem(struct kvm *kvm)
int i;

for (i = 0; i < kvm->nmemslots; ++i)
-   kvm_free_physmem_slot(&kvm->memslots[i], NULL);
+   kvm_free_physmem_slot(kvm, &kvm->memslots[i]);
 }

 static void kvm_destroy_vm(struct kvm *kvm)
@@ -745,10 +758,14 @@ int __kvm_set_memory_region(struct kvm *kvm,
goto out_free;
}

-   /* Free page dirty bitmap if unneeded */
+   /* Free any unneeded data */
if (!(new.flags & KVM_MEM_LOG_DIRTY_PAGES))
new.dirty_bitmap = NULL;

+   if (!npages) {
+   new.rmap = NULL;
+   new.lpage_info = NULL;
+   }
r = -ENOMEM;

/* Allocate if a slot is being created */
@@ -821,7 +838,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
goto out_free;
}

-   kvm_free_physmem_slot(&old, &new);
+   kvm_free_physmem_slot_info(&old, &new);
 #ifdef CONFIG_DMAR
/* map the pages in iommu page table */
r = kvm_iommu_map_pages(kvm, base_gfn, npages);
@@ -831,7 +848,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
return 0;

 out_free:
-   kvm_free_physmem_slot(&new, &old);
+   kvm_free_physmem_slot_info(&new, &old);
 out:
return r;

Also, I've been reading a bit more about the Linux mm and I now think
that to be able to use kvm->mm in arch_set_memory_region we need to
increase mm_users instead of mm_count. However, if we do that, then since
the memory maps won't be cleared when the process exits, the kvm fds
which are still mapped in userspace will not be released, so we will have
a bigger memory leak.
Any ideas on how to fix this properly?
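
For reference, here is an illustrative sketch of how I understand the
difference between the two counters (the helper names below are made up;
this is not code from the patch):

#include <linux/sched.h>

/*
 * Pin only the struct mm_struct itself: the memory maps can still be
 * torn down by exit_mmap() when the process exits. Paired with mmdrop().
 */
static void pin_mm_struct_only(struct mm_struct *mm)
{
        atomic_inc(&mm->mm_count);
}

/*
 * Pin the whole address space: exit_mmap() is deferred until the last
 * mmput(), so the userspace mappings (including the mapped kvm fds)
 * stay alive -- which is exactly the bigger leak described above.
 * Paired with mmput().
 */
static void pin_address_space(struct mm_struct *mm)
{
        atomic_inc(&mm->mm_users);
}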
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] restart search at beginning of vmalloc address

2008-11-07 Thread Glauber Costa
On Fri, Nov 07, 2008 at 01:27:42AM +0100, Nick Piggin wrote:
> Excellent, thank you! Good catch
> 
> If you agree my previous patch was also good in combination with this one,
> then I'll send them all to be merged.
I'll test it in conjunction with my patch to be sure it does not retrigger
the bug I had. But I'm sure it won't (famous last words).

I also have two more patches that I wrote to aid in debugging this, and
that I'd like to see included:
 * the first shows the size of the failing allocation: we can see current
   allocations with /proc/vmallocinfo, but it's hard to get a grasp of the
   size of the allocation that just failed, because it is not registered.
 * the second shows the real name of the callers in vmallocinfo, instead
   of things like "vmalloc_32", which is just an intermediate.

My plan was to send them today, after getting comments from you on this one.
If you think they are all reasonable, maybe we can send them all as a series.

> 
> Thanks,
> Nick
> 
> On Thu, Nov 06, 2008 at 06:58:26PM -0200, Glauber Costa wrote:
> > Currently vmalloc restarts the search for a free area in case we
> > can't find one. The reason is there are areas which are lazily
> > freed, and could possibly be freed now. However, the current implementation
> > starts searching the tree from the last failing address, which is
> > pretty much by definition at the end of the address space. So, we fail.
> > 
> > The proposal of this patch is to restart the search from the beginning
> > of the requested vstart address. This fixes the regression in running
> > KVM virtual machines for me, described in
> > http://lkml.org/lkml/2008/10/28/349, caused by commit
> > db64fe02258f1507e13fe5212a989922323685ce.
> > 
> > Signed-off-by: Glauber Costa <[EMAIL PROTECTED]>
> > CC: Nick Piggin <[EMAIL PROTECTED]>
> > ---
> >  mm/vmalloc.c |1 +
> >  1 files changed, 1 insertions(+), 0 deletions(-)
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 7db493d..6fe2003 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -378,6 +378,7 @@ found:
> > if (!purged) {
> > purge_vmap_area_lazy();
> > purged = 1;
> > +   addr = ALIGN(vstart, align);
> > goto retry;
> > }
> > if (printk_ratelimit())
> > -- 
> > 1.5.6.5
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Direct kernel boot without harddrive image

2008-11-07 Thread Daire Byrne
Hi,

Admittedly this looks more like a Qemu issue, but it affects KVM too, which is 
all I'm interested in. I have a kernel and initrd which contain a mini busybox 
root environment that I want to boot directly using -kernel and -initrd. This 
works fine on the first start, but if I reboot within the system, QEMU crashes 
out on the next run. I used "-hda /dev/zero" instead of a HD image. When qemu 
crashes after the reboot it dumps something like this:

[EMAIL PROTECTED] ~]# /usr/bin/qemu-kvm -m 1024 -smp 1 -name fedora3 -kernel 
vmlinuz-current -initrd initrd-diskless.img -append 'init=/init 
ramdisk_size=65536 root=/dev/ram0 rw' -hda /dev/zero
qemu: loading initrd (0x282bde bytes) at 0x1fd7d000
exception 13 (33)
rax b141 rbx 0100 rcx  rdx 
0100
rsi  rdi  rsp fff2 rbp 

r8  r9  r10  r11 

r12  r13  r14  r15 

rip 002c rflags 00033017
cs 1020 (00010200/ p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
ds  (/ p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
es 1000 (0001/ p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
ss 1000 (0001/ p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
fs 1000 (0001/ p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
gs  (/ p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
tr 0080 (fffbd000/2088 p 1 dpl 0 db 0 s 0 type b l 0 g 0 avl 0)
ldt  (/ p 1 dpl 0 db 0 s 0 type 2 l 0 g 0 avl 0)
gdt fb812/30
idt 0/3ff
cr0 10 cr2 0 cr3 0 cr4 0 cr8 0 efer 0
code: 00 d0 d7 1f de 2b 28 00 00 00 00 00 00 fe 00 00 00 00 02 00 --> ff ff ff 
1f e8 cd 0c eb 0b 90 90 90 90 90 90 90 90 90 90 90 00 00 00 00 00 00 00 00 00 00

Is direct kernel booting just not really supported properly, or is it just 
that Qemu forgets about the direct-boot kernel/initrd after a reboot?
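
In case it helps narrow it down, an untested variant of the same invocation
that backs -hda with a small scratch image instead of /dev/zero (the file name
and size below are arbitrary) would be:

  qemu-img create -f qcow2 scratch.img 64M
  /usr/bin/qemu-kvm -m 1024 -smp 1 -name fedora3 -kernel vmlinuz-current \
      -initrd initrd-diskless.img \
      -append 'init=/init ramdisk_size=65536 root=/dev/ram0 rw' \
      -hda scratch.img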

Daire
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM: Userspace: Make device-assignment work for kvm/ia64.

2008-11-07 Thread Zhang, Xiantao
From 45b40eecff85b9a7ae4caf4ae184905a79e5a139 Mon Sep 17 00:00:00 2001
From: Xiantao Zhang <[EMAIL PROTECTED]>
Date: Fri, 7 Nov 2008 18:13:13 +0800
Subject: [PATCH] KVM: Userspace: Make device-assignment work for kvm/ia64.

kvm/ia64 has supported VT-d since 2.6.28-rc1; this patch
enables the corresponding userspace support.

Signed-off-by: Xiantao Zhang <[EMAIL PROTECTED]>
---
 kernel/ia64/Kbuild  |4 
 qemu/Makefile.target|7 ---
 qemu/hw/device-assignment.c |4 
 qemu/hw/ipf.c   |   27 +--
 qemu/hw/pc.h|2 ++
 qemu/vl.c   |8 
 6 files changed, 43 insertions(+), 9 deletions(-)
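
For completeness: once this is applied, assigning a host device to an ia64
guest is requested from the command line the same way as on x86, for example
(the host address and device name below are only placeholders):

  qemu-kvm ... -pcidevice host=01:00.0,dma=none,name=assigned_dev

See the -pcidevice help text whose #if guard is extended at the end of this
patch for the full syntax.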

diff --git a/kernel/ia64/Kbuild b/kernel/ia64/Kbuild
index e9660ba..88eaa39 100644
--- a/kernel/ia64/Kbuild
+++ b/kernel/ia64/Kbuild
@@ -3,6 +3,10 @@ obj-m := kvm.o kvm-intel.o
 kvm-objs := kvm_main.o ioapic.o coalesced_mmio.o kvm-ia64.o kvm_fw.o \
irq_comm.o ../anon_inodes.o ../external-module-compat.o
 
+ifeq ($(CONFIG_DMAR),y)
+kvm-objs += vtd.o
+endif
+
 EXTRA_CFLAGS_vcpu.o += -mfixed-range=f2-f5,f12-f127
 kvm-intel-objs := vmm.o vmm_ivt.o trampoline.o vcpu.o optvfault.o mmio.o \
vtlb.o process.o memset.o memcpy.o
diff --git a/qemu/Makefile.target b/qemu/Makefile.target
index d504b75..229b1c6 100644
--- a/qemu/Makefile.target
+++ b/qemu/Makefile.target
OBJS += virtio.o virtio-net.o virtio-blk.o virtio-balloon.o
 
 OBJS += device-hotplug.o
 
+ifeq ($(USE_KVM_DEVICE_ASSIGNMENT), 1)
+OBJS+= device-assignment.o
+endif
+
 ifeq ($(TARGET_BASE_ARCH), i386)
 # Hardware support
 OBJS+= ide.o pckbd.o ps2.o vga.o $(SOUND_HW) dma.o
 OBJS+= fdc.o mc146818rtc.o serial.o i8259.o i8254.o pcspk.o pc.o
 OBJS+= cirrus_vga.o apic.o parallel.o acpi.o piix_pci.o
 OBJS+= usb-uhci.o vmmouse.o vmport.o vmware_vga.o extboot.o
-ifeq ($(USE_KVM_DEVICE_ASSIGNMENT), 1)
-OBJS+= device-assignment.o
-endif
 ifeq ($(USE_KVM_PIT), 1)
 OBJS+= i8254-kvm.o
 endif
diff --git a/qemu/hw/device-assignment.c b/qemu/hw/device-assignment.c
index 78b7e14..5cda7d9 100644
--- a/qemu/hw/device-assignment.c
+++ b/qemu/hw/device-assignment.c
@@ -457,6 +457,10 @@ void assigned_dev_update_irq(PCIDevice *d)
 irq = pci_map_irq(&assigned_dev->dev, assigned_dev->intpin);
 irq = piix_get_irq(irq);
 
+#ifdef TARGET_IA64
+   irq = ipf_map_irq(d, irq);
+#endif
+
 if (irq != assigned_dev->girq) {
 struct kvm_assigned_irq assigned_irq_data;
 
diff --git a/qemu/hw/ipf.c b/qemu/hw/ipf.c
index 337c854..f4a2853 100644
--- a/qemu/hw/ipf.c
+++ b/qemu/hw/ipf.c
@@ -38,6 +38,7 @@
 #include "firmware.h"
 #include "ia64intrin.h"
 #include 
+#include "device-assignment.h"
 
 #include "qemu-kvm.h"
 
@@ -450,7 +451,6 @@ static void ipf_init1(ram_addr_t ram_size, int vga_ram_size,
 
 /*Load firware to its proper position.*/
 if (kvm_enabled()) {
-int r;
 unsigned long  image_size;
 char *image = NULL;
 uint8_t *fw_image_start;
@@ -645,8 +645,26 @@ static void ipf_init1(ram_addr_t ram_size, int vga_ram_size,
unit_id++;
}
 }
-}
 
+#ifdef USE_KVM_DEVICE_ASSIGNMENT
+if (kvm_enabled()) {
+   int i;
+for (i = 0; i < assigned_devices_index; i++) {
+if (add_assigned_device(assigned_devices[i]) < 0) {
+fprintf(stderr, "Warning: could not add assigned device %s\n",
+assigned_devices[i]);
+}
+}
+
+   if (init_all_assigned_devices(pci_bus)) {
+   fprintf(stderr, "Failed to initialize assigned devices\n");
+   exit (1);
+   }
+}
+#endif /* USE_KVM_DEVICE_ASSIGNMENT */
+
+}
+ 
 static void ipf_init_pci(ram_addr_t ram_size, int vga_ram_size,
  const char *boot_device, DisplayState *ds,
  const char *kernel_filename,
@@ -695,3 +713,8 @@ void ioapic_set_irq(void *opaque, int irq_num, int level)
return;
 }
 }
+
+int ipf_map_irq(PCIDevice *pci_dev, int irq_num)
+{
+   return ioapic_map_irq(pci_dev->devfn, irq_num);
+}
diff --git a/qemu/hw/pc.h b/qemu/hw/pc.h
index fddfdfa..074195d 100644
--- a/qemu/hw/pc.h
+++ b/qemu/hw/pc.h
@@ -118,6 +118,8 @@ int piix4_init(PCIBus *bus, int devfn);
 
 int piix_get_irq(int pin);
 
+int ipf_map_irq(PCIDevice *pci_dev, int irq_num);
+
 /* vga.c */
 enum vga_retrace_method {
 VGA_RETRACE_DUMB,
diff --git a/qemu/vl.c b/qemu/vl.c
index 561e254..c383587 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -8694,7 +8694,7 @@ static void help(int exitcode)
 #endif
   "-no-kvm-irqchip disable KVM kernel mode PIC/IOAPIC/LAPIC\n"
   "-no-kvm-pit disable KVM kernel mode PIT\n"
-#if defined(TARGET_I386) || defined(TARGET_X86_64) || defined(__linux__)
+#if defined(TARGET_I386) || defined(TARGET_X86_64) || defined(TARGET_IA64) || defined(__linux__)
"-pcidevice host=bus:dev.func[,dma=none][,name=string]\n"
"expose a

Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

2008-11-07 Thread Zhao, Yu

Greg KH wrote:

On Fri, Nov 07, 2008 at 04:17:02PM +0800, Zhao, Yu wrote:

Well, to do it "correctly" you are going to have to tell the driver to
shut itself down, and reinitialize itself.
Turns out, that doesn't really work for disk and network devices without
dropping the connection (well, network devices should be fine probably).
So you just can't do this, sorry.  That's why the BIOS handles all of
these issues in a PCI hotplug system.
How does the hardware people think we are going to handle this in the
OS?  It's not something that any operating system can do, is it part of
the IOV PCI spec somewhere?

No, it's not part of the PCI IOV spec.

I just want the IOV (and whole PCI subsystem) have more flexibility on 
various BIOSes. So can we reconsider about resource rebalance as boot 
option, or should we forget about this idea?


As you have proposed it, the boot option will not work at all, so I
think we need to forget about it.  Especially if it is not really
needed.


I guess at least one thing would work if people don't want to boot 
twice: give the bus number 0 as rebalance starting point, then all 
system resources would be reshuffled :-)
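
(For illustration only: with the parameters documented in this series that
would amount to booting with something like "assign-mmio=0000:00
assign-pio=0000:00", treating bus 00 in domain 0000 as the starting point.
The exact syntax of course depends on the final form of the patch.)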


Thanks,
Yu
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

2008-11-07 Thread Greg KH
On Fri, Nov 07, 2008 at 04:17:02PM +0800, Zhao, Yu wrote:
>> Well, to do it "correctly" you are going to have to tell the driver to
>> shut itself down, and reinitialize itself.
>> Turns out, that doesn't really work for disk and network devices without
>> dropping the connection (well, network devices should be fine probably).
>> So you just can't do this, sorry.  That's why the BIOS handles all of
>> these issues in a PCI hotplug system.
>> How does the hardware people think we are going to handle this in the
>> OS?  It's not something that any operating system can do, is it part of
>> the IOV PCI spec somewhere?
>
> No, it's not part of the PCI IOV spec.
>
> I just want the IOV (and whole PCI subsystem) have more flexibility on 
> various BIOSes. So can we reconsider about resource rebalance as boot 
> option, or should we forget about this idea?

As you have proposed it, the boot option will not work at all, so I
think we need to forget about it.  Especially if it is not really
needed.

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

2008-11-07 Thread Zhao, Yu

Greg KH wrote:

On Fri, Nov 07, 2008 at 03:50:34PM +0800, Zhao, Yu wrote:

Greg KH wrote:

On Fri, Nov 07, 2008 at 11:40:21AM +0800, Zhao, Yu wrote:

Greg KH wrote:

On Fri, Nov 07, 2008 at 10:37:55AM +0800, Zhao, Yu wrote:

Greg KH wrote:

This seems like a big problem.  How are we going to know to add these
command line options for devices we haven't even seen/known about yet?
How do we know the bus ids aren't going to change between boots (hint,
they are, pci bus ids change all the time...)
We need to be able to do this kind of thing dynamically, not fixed at
boot time, which seems way to early to even know about this, right?
thanks,
greg k-h

Yes, I totally agree. Doing things dynamically is better.

The purpose of these parameters is to rebalance and align resources for 
device that has BARs encapsulated in various new capabilities (SR-IOV, 
etc.), because most of existing BIOSes don't take care of those BARs.

But how are you going to know what the proper device ids are going to
be before the machine boots?  I don't see how these options are ever
going to work properly for a "real" user.
If we do resource rebalance after system is up, do you think there is 
any side effect or impact to other subsystem other than PCI (e.g. 
MTRR)?

I don't think so.
I haven't had much thinking on the dynamical resource rebalance. If you 
have any idea about this, can you please suggest?

Yeah, it's going to be hard :)
We've thought about this in the past, and even Microsoft said it was
going to happen for Vista, but they realized in the end, like we did a
few years previously, that it would require full support of all PCI
drivers as well (if you rebalance stuff that is already bound to a
driver.)  So they dropped it.
When would you want to do this kind of rebalancing?  Before any PCI
driver is bound to any devices?  Or afterwards?
I guess if we want the rebalance dynamic, then we should have it full -- 
the rebalance would be functional even after the driver is loaded.


But in most cases, there will be problem when we unload driver from a 
hard disk controller, etc. We can mount root on a ramdisk and do the 
rebalance there, but it's complicated for a real user.


So looks like doing rebalancing before any driver is bound to any device 
is also a nice idea, if user can get a shell to do rebalance before 
built-in PCI driver grabs device.

That's not going to work, it needs to happen before any PCI device is
bound, which is before init runs.
I don't think it can work either. Then we have to do rebalance after the 
driver bounding. But what should we do if we can't unload the driver (hard 
disk controller, etc.)?


Well, to do it "correctly" you are going to have to tell the driver to
shut itself down, and reinitialize itself.

Turns out, that doesn't really work for disk and network devices without
dropping the connection (well, network devices should be fine probably).

So you just can't do this, sorry.  That's why the BIOS handles all of
these issues in a PCI hotplug system.

How does the hardware people think we are going to handle this in the
OS?  It's not something that any operating system can do, is it part of
the IOV PCI spec somewhere?


No, it's not part of the PCI IOV spec.

I just want the IOV (and whole PCI subsystem) have more flexibility on 
various BIOSes. So can we reconsider about resource rebalance as boot 
option, or should we forget about this idea?


Regards,
Yu
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

2008-11-07 Thread Greg KH
On Fri, Nov 07, 2008 at 03:50:34PM +0800, Zhao, Yu wrote:
> Greg KH wrote:
>> On Fri, Nov 07, 2008 at 11:40:21AM +0800, Zhao, Yu wrote:
>>> Greg KH wrote:
>>>> On Fri, Nov 07, 2008 at 10:37:55AM +0800, Zhao, Yu wrote:
>>>>> Greg KH wrote:
>>>>>> On Wed, Oct 22, 2008 at 04:45:31PM +0800, Yu Zhao wrote:
>>>>>>>  Documentation/kernel-parameters.txt |   10 ++
>>>>>>>  1 files changed, 10 insertions(+), 0 deletions(-)
>>>>>>>
>>>>>>> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
>>>>>>> index 53ba7c7..5482ae0 100644
>>>>>>> --- a/Documentation/kernel-parameters.txt
>>>>>>> +++ b/Documentation/kernel-parameters.txt
>>>>>>> @@ -1677,6 +1677,16 @@ and is between 256 and 4096 characters. It is defined in the file
>>>>>>>         cbmemsize=nn[KMG]   The fixed amount of bus space which is
>>>>>>>                 reserved for the CardBus bridge's memory
>>>>>>>                 window. The default value is 64 megabytes.
>>>>>>> +       assign-mmio=[<dddd>:]bb   [X86] reassign memory resources of all
>>>>>>> +               devices under bus [<dddd>:]bb (<dddd> is the domain
>>>>>>> +               number and bb is the bus number).
>>>>>>> +       assign-pio=[<dddd>:]bb    [X86] reassign io port resources of all
>>>>>>> +               devices under bus [<dddd>:]bb (<dddd> is the domain
>>>>>>> +               number and bb is the bus number).
>>>>>>> +       align-mmio=[<dddd>:]bb:dd.f  [X86] relocate memory resources of a
>>>>>>> +               device to minimum PAGE_SIZE alignment (<dddd> is
>>>>>>> +               the domain number and bb, dd and f is the bus,
>>>>>>> +               device and function number).
>>>>>> This seems like a big problem.  How are we going to know to add these
>>>>>> command line options for devices we haven't even seen/known about yet?
>>>>>> How do we know the bus ids aren't going to change between boots (hint,
>>>>>> they are, pci bus ids change all the time...)
>>>>>> We need to be able to do this kind of thing dynamically, not fixed at
>>>>>> boot time, which seems way to early to even know about this, right?
>>>>>> thanks,
>>>>>> greg k-h
>>>>> Yes, I totally agree. Doing things dynamically is better.
>>>>>
>>>>> The purpose of these parameters is to rebalance and align resources for 
>>>>> device that has BARs encapsulated in various new capabilities (SR-IOV, 
>>>>> etc.), because most of existing BIOSes don't take care of those BARs.
>>>> But how are you going to know what the proper device ids are going to
>>>> be before the machine boots?  I don't see how these options are ever
>>>> going to work properly for a "real" user.
>>>>> If we do resource rebalance after system is up, do you think there is 
>>>>> any side effect or impact to other subsystem other than PCI (e.g. 
>>>>> MTRR)?
>>>> I don't think so.
>>>>> I haven't had much thinking on the dynamical resource rebalance. If you 
>>>>> have any idea about this, can you please suggest?
>>>> Yeah, it's going to be hard :)
>>>> We've thought about this in the past, and even Microsoft said it was
>>>> going to happen for Vista, but they realized in the end, like we did a
>>>> few years previously, that it would require full support of all PCI
>>>> drivers as well (if you rebalance stuff that is already bound to a
>>>> driver.)  So they dropped it.
>>>> When would you want to do this kind of rebalancing?  Before any PCI
>>>> driver is bound to any devices?  Or afterwards?
>>> I guess if we want the rebalance dynamic, then we should have it full -- 
>>> the rebalance would be functional even after the driver is loaded.
>>>
>>> But in most cases, there will be problem when we unload driver from a 
>>> hard disk controller, etc. We can mount root on a ramdisk and do the 
>>> rebalance there, but it's complicated for a real user.
>>>
>>> So looks like doing rebalancing before any driver is bound to any device 
>>> is also a nice idea, if user can get a shell to do rebalance before 
>>> built-in PCI driver grabs device.
>> That's not going to work, it needs to happen before any PCI device is
>> bound, which is before init runs.
>
> I don't think it can work either. Then we have to do rebalance after the 
> driver bounding. But what should we do if we can't unload the driver (hard 
> disk controller, etc.)?

Well, to do it "correctly" you are going to have to tell the driver to
shut itself down, and reinitialize itself.

Turns out, that doesn't really work for disk and network devices without
dropping the connection (well, network devices should be fine probably).

So you just can't do this, sorry.  That's why the BIOS handles all of
these issues in a PCI hotplug system.

How does the hardware people think we are going to handle