Re: kvm-78 - kernel panic after using system_reset except when using -no-kvm-irqchip

2008-11-11 Thread Jan Kiszka
Charles Duffy wrote:
> This happens every other use of system_reset -- ie. resetting the system
>  using system_reset while in this panic'd state results in correct
> functionality. Following the suggestion and booting the guest with
> noapic results in other IRQ-related errors.
> 
> The guest kernel is CentOS 5 2.6.18-53.el5 on x86_64. The host is
> running 2.6.27.5.
> 
> Kernel command line: ro root=/dev/VolGroup01/LogVol00
> Initializing CPU#0
> PID hash table entries: 4096 (order: 12, 32768 bytes)
> Console: colour VGA+ 80x25
> Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes)
> Inode-cache hash table entries: 65536 (order: 7, 524288 bytes)
> Checking aperture...
> Memory: 746368k/76936k available (2434k kernel code, 21180k reserved,
> 1235k data, 192k init)
> Calibrating delay using timer specific routine.. 4793.50 BogoMIPS
> (lpj=2396750)
> Security Framework v1.0.0 initialized
> SELinux:  Initializing.
> selinux_register_security:  Registering secondary module capability
> Capability LSM initialized as secondary
> Mount-cache hash table entries: 256
> CPU: L1 I cache: 32K, L1 D cache: 32K
> CPU: L2 cache: 2048K
> SMP alternatives: switching to UP code
> ACPI: Core revision 20060707
> irq 25, desc: 803afc80, depth: 1, count: 0, unhandled: 0
> ->handle_irq():  800b54e3, handle_bad_irq+0x0/0x1f6
> ->chip(): 802ea700, 0x802ea700
> ->action(): 
>   IRQ_DISABLED set
> unexpected IRQ trap at vector 19
> ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> timer doesn't work through the IO-APIC - disabling NMI Watchdog!
> Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the
> 'noapic' kernel parameter

Still too early for me, so I didn't get yet if you can trigger this
guest panic reliably or only sporadically (like I can). In the former
case I would be very interested in the how!

So far my theory on this is that the guest happens to loose too many PIC
timer ticks during the test window where it checks the PIC IRQ routing
(it does so with 10 ticks, out of which at least 4 must be delivered).
The theory also says that this is hard to avoid, maybe just less likely
with Gleb's timer drift compensation patches.

Jan

PS: Booting with noapic leaves many real boxes useless as well.



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Christoph Lameter
On Wed, 12 Nov 2008, Andrea Arcangeli wrote:

> So are you checking if there's an unresolved reference only in the
> very place I just quoted in the previous email? If answer is yes: what
> should prevent get_user_pages from running in parallel from another
> thread? get_user_pages will trigger a minor fault and get the elevated
> reference just after you read page_count. To you it looks like there
> is no o_direct in progress when you proceed to the core of migration
> code, but in effect o_direct just started a moment after you read the
> page count.

get_user_pages() cannot get to it since the pagetables have already been
modified. If get_user_pages runs then the fault handling will occur
which will block the thread until migration is complete.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


kvm-78 - kernel panic after using system_reset except when using -no-kvm-irqchip

2008-11-11 Thread Charles Duffy
This happens every other use of system_reset -- ie. resetting the system 
 using system_reset while in this panic'd state results in correct 
functionality. Following the suggestion and booting the guest with 
noapic results in other IRQ-related errors.


The guest kernel is CentOS 5 2.6.18-53.el5 on x86_64. The host is 
running 2.6.27.5.


Kernel command line: ro root=/dev/VolGroup01/LogVol00
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 32768 bytes)
Console: colour VGA+ 80x25
Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes)
Inode-cache hash table entries: 65536 (order: 7, 524288 bytes)
Checking aperture...
Memory: 746368k/76936k available (2434k kernel code, 21180k reserved, 
1235k data, 192k init)
Calibrating delay using timer specific routine.. 4793.50 BogoMIPS 
(lpj=2396750)

Security Framework v1.0.0 initialized
SELinux:  Initializing.
selinux_register_security:  Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 256
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 2048K
SMP alternatives: switching to UP code
ACPI: Core revision 20060707
irq 25, desc: 803afc80, depth: 1, count: 0, unhandled: 0
->handle_irq():  800b54e3, handle_bad_irq+0x0/0x1f6
->chip(): 802ea700, 0x802ea700
->action(): 
  IRQ_DISABLED set
unexpected IRQ trap at vector 19
..MP-BIOS bug: 8254 timer not connected to IO-APIC
timer doesn't work through the IO-APIC - disabling NMI Watchdog!
Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the 
'noapic' kernel parameter


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Andrea Arcangeli
On Tue, Nov 11, 2008 at 06:27:09PM -0600, Christoph Lameter wrote:
> Then page migration will not occur because there is an unresolved
> reference.

So are you checking if there's an unresolved reference only in the
very place I just quoted in the previous email? If answer is yes: what
should prevent get_user_pages from running in parallel from another
thread? get_user_pages will trigger a minor fault and get the elevated
reference just after you read page_count. To you it looks like there
is no o_direct in progress when you proceed to the core of migration
code, but in effect o_direct just started a moment after you read the
page count.

What can protect you is PG lock or mmap_sem in _write_ mode (and
they've to be hold for the whole duration of the migration). I don't
see any of the two being hold while you read the page count... You
don't seem to be using stop_machine either (stop_machine pretty
expensive on the 4096 way I guess).

This wasn't reproduced in practice but it should be possible to
reproduce it by just writing a testcase with three threads, one forks
in a loop (child just quit) the other memset 0 the first 512bytes of a
page, and then o_direct read from a 0xff 512byte region and checks
that the first 512bytes are all non-zero in a loop, and the third
writes 1 byte to the last 512bytes of the page in a loop. Eventually
the comparison should show zero data in the page.

To reproduce with migration just start the thread that memset 0, reads
a 0xff region with o_direct, and checks it's all 0xff in a loop, and
then migrate the memory of this thread back and forth between two
nodes with the sys_move_pages (mpol is safe by luck because it
surrounds migrate_pages with the mmap_sem in write mode). Eventually
you should see zero bytes despite I/O is complete.

Reproducing this is normal life would take time and for the fork bug
it may not be reproducible depending of what the app is doing. Mixing
sys_move_pages with o_direct in the same process with on two different
threads, instead should eventually eventually reproduce it. And with
gup_fast is now unfixable until more infrastructure is added to
slowdown gup_fast a bit (unless Nick finds an RCU way of doing it).
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread KAMEZAWA Hiroyuki
On Tue, 11 Nov 2008 23:24:21 +0100
Andrea Arcangeli <[EMAIL PROTECTED]> wrote:

> On Tue, Nov 11, 2008 at 03:31:18PM -0600, Christoph Lameter wrote:
> > > ksm need the pte inside the vma to point from anonymous page into 
> > > filebacked
> > > page
> > > can migrate.c do it without changes?
> > 
> > So change anonymous to filebacked page?
> >
> > Currently page migration assumes that the page will continue to be part
> > of the existing file or anon vma.
> > 
> > What you want sounds like assigning a swap pte to an anonymous page? That
> > way a anon page gains membership in a file backed mapping.
> 
> KSM needs to convert anonymous pages to PageKSM, which means a page
> owned by ksm.c and only known by ksm.c. The Linux VM will free this
> page in munmap but that's about it, all we do is to match the number
> of anon-ptes pointing to the page with the page_count. So besides
> freeing the page when the last user exit()s or cows it, the VM will do
> nothing about it. Initially. Later it can swap it in a nonlinear way.
> 
Can I make a question ? (I'm working for memory cgroup.)

Now, we do charge to anonymous page when
  - charge(+1) when it's mapped firstly (mapcount 0->1)
  - uncharge(-1) it's fully unmapped (mapcount 1->0) vir page_remove_rmap().

My quesion is
 - PageKSM pages are not necessary to be tracked by memory cgroup ?
 - Can we know that "the page is just replaced and we don't necessary to do
   charge/uncharge".
 - annonymous page from KSM is worth to be tracked by memory cgroup ?
   (IOW, it's on LRU and can be swapped-out ?)

Thanks,
-Kame


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] KVM: Fix kernel allocated memory slot

2008-11-11 Thread Anthony Liguori

Sheng Yang wrote:

On Tuesday 11 November 2008 22:58:25 Hollis Blanchard wrote:
  

On Tue, 2008-11-11 at 15:30 +0800, Sheng Yang wrote:


Commit 7fd49de9773fdcb7b75e823b21c1c5dc1e218c14 "KVM: ensure that memslot
userspace addresses are page-aligned" broke kernel space allocated memory
slot, for the userspace_addr is invalid.

Signed-off-by: Sheng Yang <[EMAIL PROTECTED]>
---
 virt/kvm/kvm_main.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0a0a959..4727c08 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -715,7 +715,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
goto out;
if (mem->guest_phys_addr & (PAGE_SIZE - 1))
goto out;
-   if (mem->userspace_addr & (PAGE_SIZE - 1))
+   if (user_alloc && (mem->userspace_addr & (PAGE_SIZE - 1)))
goto out;
if (mem->slot >= KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS)
goto out;
  

Wow, I didn't realize we still had kernel-allocated guest memory. Why is
that?

Anyways, the above patch seems fine to me.
Acked-by: Hollis Blanchard <[EMAIL PROTECTED]>



it's for VMX EPT and APIC access page, as well as TSS pages. We are planning 
to change that in the future, but I think it's a quick fix for now.
  


It's also there to support older userspaces that still rely on the 
kernel allocating guest memory.  This is only applicable to x86 though.


Regards,

Anthony Liguori


Thanks.

  


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] KVM: Fix kernel allocated memory slot

2008-11-11 Thread Sheng Yang
On Tuesday 11 November 2008 22:58:25 Hollis Blanchard wrote:
> On Tue, 2008-11-11 at 15:30 +0800, Sheng Yang wrote:
> > Commit 7fd49de9773fdcb7b75e823b21c1c5dc1e218c14 "KVM: ensure that memslot
> > userspace addresses are page-aligned" broke kernel space allocated memory
> > slot, for the userspace_addr is invalid.
> >
> > Signed-off-by: Sheng Yang <[EMAIL PROTECTED]>
> > ---
> >  virt/kvm/kvm_main.c |2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 0a0a959..4727c08 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -715,7 +715,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > goto out;
> > if (mem->guest_phys_addr & (PAGE_SIZE - 1))
> > goto out;
> > -   if (mem->userspace_addr & (PAGE_SIZE - 1))
> > +   if (user_alloc && (mem->userspace_addr & (PAGE_SIZE - 1)))
> > goto out;
> > if (mem->slot >= KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS)
> > goto out;
>
> Wow, I didn't realize we still had kernel-allocated guest memory. Why is
> that?
>
> Anyways, the above patch seems fine to me.
> Acked-by: Hollis Blanchard <[EMAIL PROTECTED]>

it's for VMX EPT and APIC access page, as well as TSS pages. We are planning 
to change that in the future, but I think it's a quick fix for now.

Thanks.

-- 
regards
Yang, Sheng

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Christoph Lameter
On Wed, 12 Nov 2008, Andrea Arcangeli wrote:

> > O_DIRECT does not take a refcount on the page in order to prevent this?
>
> It definitely does, it's also the only thing it does.

Then page migration will not occur because there is an unresolved
reference.

> The whole point is that O_DIRECT can start the instruction after
> page_count returns as far as I can tell.

But there must still be reference for the bio and whatever may be going on
at the time in order to perform the I/O operation.

> If you check the three emails I linked in answer to Andrew on the
> topic, we agree the o_direct can't start under PT lock (or under
> mmap_sem in write mode but migrate.c rightefully takes the read
> mode). So the fix used in ksm page_wrprotect and in fork() is to check
> page_count vs page_mapcount inside PT lock before doing anything on
> the pte. If you just mark the page wprotect while O_DIRECT is in
> flight, that's enough for fork() to generate data corruption in the
> parent (not the child where the result would be undefined). But in the
> parent the result of the o-direct is defined and it'd never corrupt if
> this was a cached-I/O. The moment the parent pte is marked readonly, a thread
> in the parent could write to the last 512bytes of the page, leading to
> the first 512bytes coming with O_DIRECT from disk being lost (as the
> write will trigger a cow before I/O is complete and the dma will
> complete on the oldpage).

Have you actually seen corruption or this conjecture? AFACT the page
count is elevated while I/O is in progress and thus this is safe.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Andrea Arcangeli
On Wed, Nov 12, 2008 at 12:17:22AM +0100, Andrea Arcangeli wrote:
> We don't have to check the page_count vs mapcount later in
> replace_page because we know if anybody started an O_DIRECT read from
> disk, it would have triggered a cow, and the pte_same check that we
> have to do for other reasons would take care of bailing out the
> replace_page.

Ah, for completeness: above I didn't mention the case of O_DIRECT
writes to disk, because we never need to care about those. They're
never a problem. If the page is replaced and the cpu writes to the
page and by doing so triggers a cow that lead to the CPU write to go
in a different page (not the one under dma) it'll be like if the write
to disk completed before the cpu overwritten the page, so result is
undefined. I don't think we've to define the case of somebody doing a
direct read from a location where there's still an o_direct write in
flight either.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Andrea Arcangeli
On Tue, Nov 11, 2008 at 04:30:22PM -0600, Christoph Lameter wrote:
> On Tue, 11 Nov 2008, Andrea Arcangeli wrote:
> 
> > this page_count check done with only the tree_lock won't prevent a
> > task to start O_DIRECT after page_count has been read in the above line.
> >
> > If a thread starts O_DIRECT on the page, and the o_direct is still in
> > flight by the time you copy the page to the new page, the read will
> > not be represented fully in the newpage leading to userland data
> > corruption.
> 
> O_DIRECT does not take a refcount on the page in order to prevent this?

It definitely does, it's also the only thing it does.

The whole point is that O_DIRECT can start the instruction after
page_count returns as far as I can tell.

If you check the three emails I linked in answer to Andrew on the
topic, we agree the o_direct can't start under PT lock (or under
mmap_sem in write mode but migrate.c rightefully takes the read
mode). So the fix used in ksm page_wrprotect and in fork() is to check
page_count vs page_mapcount inside PT lock before doing anything on
the pte. If you just mark the page wprotect while O_DIRECT is in
flight, that's enough for fork() to generate data corruption in the
parent (not the child where the result would be undefined). But in the
parent the result of the o-direct is defined and it'd never corrupt if
this was a cached-I/O. The moment the parent pte is marked readonly, a thread
in the parent could write to the last 512bytes of the page, leading to
the first 512bytes coming with O_DIRECT from disk being lost (as the
write will trigger a cow before I/O is complete and the dma will
complete on the oldpage).

We do the check in page_wprotect because that's the _first_ place
where we mark a pte readonly. Same as fork. And we can't convert a pte
from writeable to wrprotected without doing this check inside PT lock
or we generate data corruption with o_direct (again same as the bug in
fork).

We don't have to check the page_count vs mapcount later in
replace_page because we know if anybody started an O_DIRECT read from
disk, it would have triggered a cow, and the pte_same check that we
have to do for other reasons would take care of bailing out the
replace_page.

> Oh they could be migrated if you had a callback to the devices method for
> giving up references. Same as slab defrag.

The oldpage is a regular anonymous page for us, not the 'strange'
object called PageKSM. And we need to migrate many anonymous pages
having destination a single PageKSM page already preallocated and not
being an anonymous page. We never migrate PageKSM to anything, that's
always the destination.

> Seems that we are tinkering around with the concept of what an anonymous
> page is? Doesnt shmem have some means of converting pages to file backed?
> Swizzling?

Anonymous page is anything with page->mapping pointing to an anon_vma
or swapcache_space instead of an address_space of a real inode backed
by the vfs.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] add ksm kernel shared memory driver

2008-11-11 Thread Izik Eidus

Jonathan Corbet wrote:


What about things like cache effects from scanning all those pages?  My
guess is that, if you're trying to run dozens of Windows guests, cache
usage is not at the top of your list of concerns, but I could be
wrong.  Usually am...
  


Ok, ksm does make the cache of the cpu dirty when scanning the pages
(but the scanning happen slowly slowly and cache usually get dirty much 
faster)
But infact it improve the cache by the fact that it make many ptes point 
to the same page
so if before we had 12 process touching 12 diffrent physical page they 
would dirty the page

but now they will touch just one...

so i guess it depend on how you see it...


Thanks,

jon
  


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] add ksm kernel shared memory driver

2008-11-11 Thread Andrea Arcangeli
Hi Jonathan,

On Tue, Nov 11, 2008 at 03:30:28PM -0700, Jonathan Corbet wrote:
> But it will fail in a totally silent and mysterious way.  Doesn't it
> seem better to verify the values when you can return a meaningful error
> code to the caller?

I think you're right, but just because find_extend_vma will have the
effect of growing the kernel stack down. We clearly don't set it on a
stack with KVM as there's nothing to share on the stack usually - we
only set it in the guest physical memory range. And things are safe
regardless as get_user_pages is verifying the values for us. Problem
is it's using find_extend_vma because it behaves like a page fault. We
must not behave like a pagefault, we're much closer to follow_page
only than a page fault. Not a big deal, but it can be improved by
avoiding to extend the stack somehow (likely simplest is to call
find_vma twice, first time externally, we hold mmap_sem externally so
all right).

> What about things like cache effects from scanning all those pages?  My
> guess is that, if you're trying to run dozens of Windows guests, cache
> usage is not at the top of your list of concerns, but I could be
> wrong.  Usually am...

Oh that's not an issue. This is all about trading some CPU for lots of
free memory. It pays off big as so many more VM can run. With desktop
virtualization and 1G systems, you reach a RAM bottleneck much quicker
than a CPU bottleneck. Perhaps not so quick on server virtualization
but the point is this is intentional. It may be possible to compute
the jhash (that's where the cpu is spent) with instructions that don't
pollute the cpu cache but I doubt it's going to make much an huge
difference.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] add ksm kernel shared memory driver

2008-11-11 Thread Avi Kivity

Izik Eidus wrote:

Any benchmarks on the runtime cost of having KSM running?
  


This one is problematic, ksm can take anything from 0% to 100% cpu
its all depend on how fast you run it.
it have 3 parameters:
number of pages to scan before it go to sleep
maximum number of pages to merge while we scanning the above pages 
(merging is expensive)
time to sleep (when runing from userspace using /dev/ksm, we actually 
do it there (userspace)


The scan process priority also has its effect.  One strategy would be to 
run it at idle priority as long as you have enough free memory, and 
increase the priority as memory starts depleting.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] add ksm kernel shared memory driver

2008-11-11 Thread Avi Kivity

Jonathan Corbet wrote:

+static struct list_head slots;



Some of these file-static variable names seem a little..terse...
  


While ksm was written to be independent of a certain TLA-named kernel 
subsystem developed two rooms away, they share some naming... this 
refers to kvm 'memory slots' which correspond to DIMM banks.


I guess it should be renamed to merge_ranges or something.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] add ksm kernel shared memory driver

2008-11-11 Thread Valdis . Kletnieks
On Tue, 11 Nov 2008 15:03:45 MST, Jonathan Corbet said:

> > +#define PAGECMP_OFFSET 128
> > +#define PAGEHASH_SIZE (PAGECMP_OFFSET ? PAGECMP_OFFSET : PAGE_SIZE)
> > +/* hash the page */
> > +static void page_hash(struct page *page, unsigned char *digest)
> 
> So is this really saying that you only hash the first 128 bytes, relying on
> full compares for the rest?  I assume there's a perfectly good reason for
> doing it that way, but it's not clear to me from reading the code.  Do you
> gain performance which is not subsequently lost in the (presumably) higher
> number of hash collisions?

Seems reasonably sane to me - only doing the first 128 bytes rather than
a full 4K page is some 32 times faster.  Yes, you'll have the *occasional*
case where two pages were identical for 128 bytes but then differed, which is
why there's buckets.  But the vast majority of the time, at least one bit
will be different in the first part.

In fact, I'd not be surprised if only going for 64 bytes works well...


pgpfDEA8at6Sv.pgp
Description: PGP signature


Re: [PATCH 3/4] add ksm kernel shared memory driver

2008-11-11 Thread Izik Eidus

Jonathan Corbet wrote:

[Let's see if I can get through the rest without premature sends...]

On Wed, 12 Nov 2008 00:17:39 +0200
Izik Eidus <[EMAIL PROTECTED]> wrote:

  

Actually, it occurs to me that there's no sanity checks on any of
the values passed in by ioctl().  What happens if the user tells
KSM to scan a bogus range of memory?

  

Well get_user_pages() run in context of the process, therefore it
should fail in "bogus range of memory"



But it will fail in a totally silent and mysterious way.  Doesn't it
seem better to verify the values when you can return a meaningful error
code to the caller?

  


Well I dont mind insert it (the above for sure is not a bug)
but even with that, the user can still free the memory that he gave to us
so this check if "nice to have check", we have nothing to do but to relay on
get_user_pages return value :)


The other ioctl() calls have the same issue; you can start the thread
with nonsensical values for the number of pages to scan and the sleep
time.
  


well about this i agree, here it make alot of logic to check the values!



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Avi Kivity

Christoph Lameter wrote:

On Tue, 11 Nov 2008, Avi Kivity wrote:

  

Christoph Lameter wrote:


page migration requires the page to be on the LRU. That could be changed
if you have a different means of isolating a page from its page tables.

  

Isn't rmap the means of isolating a page from its page tables?  I guess I'm
misunderstanding something.



In order to migrate a page one first has to make sure that userspace and
the kernel cannot access the page in any way. User space must be made to
generate page faults and all kernel references must be accounted for and
not be in use.

The user space portion involves changing the page tables so that faults
are generated.

The kernel portion isolates the page from the LRU (to exempt it from
kernel reclaim handling etc).

  


Thanks.


Only then can the page and its metadata be copied to a new location.

Guess you already have the LRU portion done. So you just need the user
space isolation portion?
  


We don't want to limit all faults, just writes.  So we write protect the 
page before merging.


What do you mean by page metadata?  obviously the dirty bit (Izik?), 
accessed bit and position in the LRU are desirable (the last is quite 
difficult for ksm -- the page occupied *two* positions in the LRU).


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] add ksm kernel shared memory driver

2008-11-11 Thread Izik Eidus

Jonathan Corbet wrote:

On Wed, 12 Nov 2008 00:17:39 +0200
Izik Eidus <[EMAIL PROTECTED]> wrote:

  

+static int ksm_dev_open(struct inode *inode, struct file *filp)
+{
+   try_module_get(THIS_MODULE);
+   return 0;
+}
+
+static int ksm_dev_release(struct inode *inode, struct file *filp)
+{
+   module_put(THIS_MODULE);
+   return 0;
+}
+
+static struct file_operations ksm_chardev_ops = {
+   .open   = ksm_dev_open,
+   .release= ksm_dev_release,
+   .unlocked_ioctl = ksm_dev_ioctl,
+   .compat_ioctl   = ksm_dev_ioctl,
+};
  


Why do you roll your own module reference counting?  Is there a
reason you don't just set .owner and let the VFS handle it?

  
Yes, I am taking get_task_mm() if the module will be removed before i 
free the mms, things will go wrong



But...if you set .owner, the VFS will do the try_module_get() *before*
calling into your module (as an added bonus, it will actually check the
return value too).  

Ohhh i see what you mean
you are right i had at least needed to check for the return value of 
try_module_get(),

anyway will check this issue for V2.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Christoph Lameter
On Tue, 11 Nov 2008, Andrea Arcangeli wrote:

> this page_count check done with only the tree_lock won't prevent a
> task to start O_DIRECT after page_count has been read in the above line.
>
> If a thread starts O_DIRECT on the page, and the o_direct is still in
> flight by the time you copy the page to the new page, the read will
> not be represented fully in the newpage leading to userland data
> corruption.

O_DIRECT does not take a refcount on the page in order to prevent this?

> > Define a regular VM page? A page on the LRU?
>
> Yes, pages owned, allocated and worked on by the VM. So they can be
> swapped, collected, migrated etc... You can't possibly migrate a
> device driver page for example and infact those device driver pages
> can't be migrated either.

Oh they could be migrated if you had a callback to the devices method for
giving up references. Same as slab defrag.

> The KSM page initially is a driver page, later we'd like to teach the
> VM how to swap it by introducing rmap methods and adding it to the
> LRU. As long as it's only anonymous memory that we're sharing/cloning,
> we won't have to patch pagecache radix tree and other stuff. BTW, if
> we ever decice to clone pagecache we could generate immense metadata
> ram overhead in the radix tree with just a single page of data. All
> issues that don't exist for anon ram.

Seems that we are tinkering around with the concept of what an anonymous
page is? Doesnt shmem have some means of converting pages to file backed?
Swizzling?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] add ksm kernel shared memory driver

2008-11-11 Thread Jonathan Corbet
[Let's see if I can get through the rest without premature sends...]

On Wed, 12 Nov 2008 00:17:39 +0200
Izik Eidus <[EMAIL PROTECTED]> wrote:

> > Actually, it occurs to me that there's no sanity checks on any of
> > the values passed in by ioctl().  What happens if the user tells
> > KSM to scan a bogus range of memory?
> > 
> 
> Well get_user_pages() run in context of the process, therefore it
> should fail in "bogus range of memory"

But it will fail in a totally silent and mysterious way.  Doesn't it
seem better to verify the values when you can return a meaningful error
code to the caller?

The other ioctl() calls have the same issue; you can start the thread
with nonsensical values for the number of pages to scan and the sleep
time.

> 
> > Any benchmarks on the runtime cost of having KSM running?
> > 
> 
> This one is problematic, ksm can take anything from 0% to 100% cpu
> its all depend on how fast you run it.
> it have 3 parameters:
> number of pages to scan before it go to sleep
> maximum number of pages to merge while we scanning the above pages 
> (merging is expensive)
> time to sleep (when runing from userspace using /dev/ksm, we actually
> do it there (userspace)

What about things like cache effects from scanning all those pages?  My
guess is that, if you're trying to run dozens of Windows guests, cache
usage is not at the top of your list of concerns, but I could be
wrong.  Usually am...

Thanks,

jon
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Boot Problem (and fix)

2008-11-11 Thread the uni
> 2.6.26.1 has the fix, so all you need to do is upgrade to that or a later
> version.

OK. Gentoo stable is still at 2.6.25, I'll report it there.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] add ksm kernel shared memory driver

2008-11-11 Thread Jonathan Corbet
On Wed, 12 Nov 2008 00:17:39 +0200
Izik Eidus <[EMAIL PROTECTED]> wrote:

> >> +static int ksm_dev_open(struct inode *inode, struct file *filp)
> >> +{
> >> +  try_module_get(THIS_MODULE);
> >> +  return 0;
> >> +}
> >> +
> >> +static int ksm_dev_release(struct inode *inode, struct file *filp)
> >> +{
> >> +  module_put(THIS_MODULE);
> >> +  return 0;
> >> +}
> >> +
> >> +static struct file_operations ksm_chardev_ops = {
> >> +  .open   = ksm_dev_open,
> >> +  .release= ksm_dev_release,
> >> +  .unlocked_ioctl = ksm_dev_ioctl,
> >> +  .compat_ioctl   = ksm_dev_ioctl,
> >> +};
> >>   
> >
> > Why do you roll your own module reference counting?  Is there a
> > reason you don't just set .owner and let the VFS handle it?
> > 
> 
> Yes, I am taking get_task_mm() if the module will be removed before i 
> free the mms, things will go wrong

But...if you set .owner, the VFS will do the try_module_get() *before*
calling into your module (as an added bonus, it will actually check the
return value too).  All you've succeeded in doing here is adding a
microscopic race to the module reference counting; otherwise the end
result is the same.

jon
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Andrea Arcangeli
On Tue, Nov 11, 2008 at 03:31:18PM -0600, Christoph Lameter wrote:
> > ksm need the pte inside the vma to point from anonymous page into filebacked
> > page
> > can migrate.c do it without changes?
> 
> So change anonymous to filebacked page?
>
> Currently page migration assumes that the page will continue to be part
> of the existing file or anon vma.
> 
> What you want sounds like assigning a swap pte to an anonymous page? That
> way a anon page gains membership in a file backed mapping.

KSM needs to convert anonymous pages to PageKSM, which means a page
owned by ksm.c and only known by ksm.c. The Linux VM will free this
page in munmap but that's about it, all we do is to match the number
of anon-ptes pointing to the page with the page_count. So besides
freeing the page when the last user exit()s or cows it, the VM will do
nothing about it. Initially. Later it can swap it in a nonlinear way.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] add ksm kernel shared memory driver

2008-11-11 Thread Izik Eidus

Jonathan Corbet wrote:

I don't claim to begin to really understand the deep VM side of this
patch, but I can certainly pick nits as I work through it...sorry for
the lack of anything more substantive.

  

+static struct list_head slots;



Some of these file-static variable names seem a little..terse...

  

+#define PAGECMP_OFFSET 128
+#define PAGEHASH_SIZE (PAGECMP_OFFSET ? PAGECMP_OFFSET : PAGE_SIZE)
+/* hash the page */
+static void page_hash(struct page *page, unsigned char *digest)



So is this really saying that you only hash the first 128 bytes, relying on
full compares for the rest?  I assume there's a perfectly good reason for
doing it that way, but it's not clear to me from reading the code.  Do you
gain performance which is not subsequently lost in the (presumably) higher
number of hash collisions?

  

+static int ksm_scan_start(struct ksm_scan *ksm_scan, int scan_npages,
+ int max_npages)
+{
+   struct ksm_mem_slot *slot;
+   struct page *page[1];
+   int val;
+   int ret = 0;
+
+   down_read(&slots_lock);
+
+   scan_update_old_index(ksm_scan);
+
+   while (scan_npages > 0 && max_npages > 0) {



Should this loop maybe check kthread_run too?  It seems like you could loop
for a long time after kthread_run has been set to zero.

  

+static int ksm_dev_open(struct inode *inode, struct file *filp)
+{
+   try_module_get(THIS_MODULE);
+   return 0;
+}
+
+static int ksm_dev_release(struct inode *inode, struct file *filp)
+{
+   module_put(THIS_MODULE);
+   return 0;
+}
+
+static struct file_operations ksm_chardev_ops = {
+   .open   = ksm_dev_open,
+   .release= ksm_dev_release,
+   .unlocked_ioctl = ksm_dev_ioctl,
+   .compat_ioctl   = ksm_dev_ioctl,
+};



Why do you roll your own module reference counting?  Is there a reason you
don't just set .owner and let the VFS handle it?
  


Yes, I am taking get_task_mm() if the module will be removed before i 
free the mms, things will go wrong



Given that the KSM_REGISTER_MEMORY_REGION ioctl() creates unswappable
memory, should there be some sort of capability check done there?  A check
for starting/stopping the thread might also make sense.  Or is that
expected to be handled via permissions on /dev/ksm?
  


Well, I think giving premmision to /dev/ksm default as root is enough.
No?


Actually, it occurs to me that there's no sanity checks on any of the
values passed in by ioctl().  What happens if the user tells KSM to scan a
bogus range of memory?
  


Well get_user_pages() run in context of the process, therefore it should 
fail in "bogus range of memory"



Any benchmarks on the runtime cost of having KSM running?
  


This one is problematic, ksm can take anything from 0% to 100% cpu
its all depend on how fast you run it.
it have 3 parameters:
number of pages to scan before it go to sleep
maximum number of pages to merge while we scanning the above pages 
(merging is expensive)
time to sleep (when runing from userspace using /dev/ksm, we actually do 
it there (userspace)



jon
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
  


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Andrea Arcangeli
On Tue, Nov 11, 2008 at 03:26:57PM -0600, Christoph Lameter wrote:
> On Tue, 11 Nov 2008, Andrea Arcangeli wrote:
> 
> > btw, page_migration likely is buggy w.r.t. o_direct too (and now
> > unfixable with gup_fast until the 2.4 brlock is added around it or
> > similar) if it does the same thing but without any page_mapcount vs
> > page_count check.
> 
> Details please?

  spin_lock_irq(&mapping->tree_lock);

  pslot = radix_tree_lookup_slot(&mapping->page_tree,
page_index(page));

expected_count = 2 + !!PagePrivate(page);
if (page_count(page) != expected_count ||

this page_count check done with only the tree_lock won't prevent a
task to start O_DIRECT after page_count has been read in the above line.

If a thread starts O_DIRECT on the page, and the o_direct is still in
flight by the time you copy the page to the new page, the read will
not be represented fully in the newpage leading to userland data
corruption.

> > page_migration does too much for us, so us calling into migrate.c may
> > not be ideal. It has to convert a fresh page to a VM page. In KSM we
> > don't convert the newpage to be a VM page, we just replace the anon
> > page with another page. The new page in the KSM case is not a page
> > known by the VM, not in the lru etc...
> 
> A VM page as opposed to pages not in the VM? ???

Yes, you migrate old VM pages to new VM pages. We replace VM pages
with unknown stuff called KSM pages. So in the inner function where you
replace the pte-migration-placeholder with a pte pointing to the
newpage, you also rightfully do unconditional stuff we can't be doing
like page_add_*_rmap. It's VM pages you're dealing with. Not for us.

> page migration requires the page to be on the LRU. That could be changed
> if you have a different means of isolating a page from its page tables.

Yes we'd need to change those bits to be able to use migrate.c.

> Define a regular VM page? A page on the LRU?

Yes, pages owned, allocated and worked on by the VM. So they can be
swapped, collected, migrated etc... You can't possibly migrate a
device driver page for example and infact those device driver pages
can't be migrated either.

The KSM page initially is a driver page, later we'd like to teach the
VM how to swap it by introducing rmap methods and adding it to the
LRU. As long as it's only anonymous memory that we're sharing/cloning,
we won't have to patch pagecache radix tree and other stuff. BTW, if
we ever decice to clone pagecache we could generate immense metadata
ram overhead in the radix tree with just a single page of data. All
issues that don't exist for anon ram.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] add ksm kernel shared memory driver

2008-11-11 Thread Jonathan Corbet
I don't claim to begin to really understand the deep VM side of this
patch, but I can certainly pick nits as I work through it...sorry for
the lack of anything more substantive.

> +static struct list_head slots;

Some of these file-static variable names seem a little..terse...

> +#define PAGECMP_OFFSET 128
> +#define PAGEHASH_SIZE (PAGECMP_OFFSET ? PAGECMP_OFFSET : PAGE_SIZE)
> +/* hash the page */
> +static void page_hash(struct page *page, unsigned char *digest)

So is this really saying that you only hash the first 128 bytes, relying on
full compares for the rest?  I assume there's a perfectly good reason for
doing it that way, but it's not clear to me from reading the code.  Do you
gain performance which is not subsequently lost in the (presumably) higher
number of hash collisions?

> +static int ksm_scan_start(struct ksm_scan *ksm_scan, int scan_npages,
> +   int max_npages)
> +{
> + struct ksm_mem_slot *slot;
> + struct page *page[1];
> + int val;
> + int ret = 0;
> +
> + down_read(&slots_lock);
> +
> + scan_update_old_index(ksm_scan);
> +
> + while (scan_npages > 0 && max_npages > 0) {

Should this loop maybe check kthread_run too?  It seems like you could loop
for a long time after kthread_run has been set to zero.

> +static int ksm_dev_open(struct inode *inode, struct file *filp)
> +{
> + try_module_get(THIS_MODULE);
> + return 0;
> +}
> +
> +static int ksm_dev_release(struct inode *inode, struct file *filp)
> +{
> + module_put(THIS_MODULE);
> + return 0;
> +}
> +
> +static struct file_operations ksm_chardev_ops = {
> + .open   = ksm_dev_open,
> + .release= ksm_dev_release,
> + .unlocked_ioctl = ksm_dev_ioctl,
> + .compat_ioctl   = ksm_dev_ioctl,
> +};

Why do you roll your own module reference counting?  Is there a reason you
don't just set .owner and let the VFS handle it?

Given that the KSM_REGISTER_MEMORY_REGION ioctl() creates unswappable
memory, should there be some sort of capability check done there?  A check
for starting/stopping the thread might also make sense.  Or is that
expected to be handled via permissions on /dev/ksm?

Actually, it occurs to me that there's no sanity checks on any of the
values passed in by ioctl().  What happens if the user tells KSM to scan a
bogus range of memory?

Any benchmarks on the runtime cost of having KSM running?

jon
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] add ksm kernel shared memory driver

2008-11-11 Thread Andrea Arcangeli
On Tue, Nov 11, 2008 at 12:38:51PM -0800, Andrew Morton wrote:
> Please fully document that interface in the changelog so that we can
> review your decisions here.  This is by far the most important
> consideration - we can change all the code, but interfaces are for
> ever.

Yes, this is the most important point in my view. Even after we make
the ksm pages swappable it'll remain an invisible change to anybody
but us (it'll work better under VM pressure, but that's about it).

> uh-oh, ioctls.

Yes, it's all ioctl based. In very short, it assigns the task and
memory region to scan, and allows to start/stop the kernel thread that
does the scan while selecting how many pages to execute per scan and
how many scans to execute per second. The more pages per scan and the
more scans per second, the higher cpu utilization of the kernel thread.

It would also be possible to submit ksm in a way that has no API at
all (besides kernel module params tunable later by sysfs to set
pages-per-scan and scan-frequency). Doing that would allow us to defer
the API decisions. But then all anonymous memory in the system will be
scanned unconditionally even if there may be little to share for
certain tasks. It would perform quite well, usually the sharable part
is the largest part so the rest wouldn't generate an huge amount of
cpu waste. There's some ram waste too, as some memory has to be
allocated for every page we want to possibly share.

In some ways removing the API would make it simpler to use for non
virtualization environments where they may want to enable it
system-wide.

> ooh, a comment!

8)

> > +   kpage = alloc_page(GFP_KERNEL |  __GFP_HIGHMEM);
> 
> Stray whitepace.
> 
> Replace with GFP_HIGHUSER.

So not a cleanup, but an improvement (I agree highuser is better, this
isn't by far any higher-priority kernel alloc and it deserves to have
lower prio in the watermarks).

> The term "shared memory" has specific meanings in Linux/Unix.  I'm
> suspecting that its use here was inappropriate but I have insufficient
> information to be able to suggest alternatives.

We could call it create_shared_anonymous_memory but then what if it
ever becomes capable of sharing pagecache too? (I doubt it will, not
in the short term at least ;)

Usually when we do these kind of tricks we use the word cloning, so
perhaps also create_cloned_memory_area, so you can later say cloned
anonymous memory or cloned shared memory page instead of just KSM
page. But then this module would become KCM and not KSM 8). Perhaps we
should just go recursive and use create_ksm_memory_area.

> Generally: this review was rather a waste of your time and of mine
> because the code is inadequately documented.

Well, this was a great review considering how little the code was
documented, definitely not a waste of time on our end, it was very
helpful and the good thing is I don't see any controversial stuff.

The two inner loops are the core of the ksm scan, and they're aren't
very simple I've to say. Documenting won't be trivial but it's surely
possible, Izik already working on it I think! Apologies if any time
was wasted on your side!!
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Izik Eidus

Christoph Lameter wrote:

On Tue, 11 Nov 2008, Avi Kivity wrote:

  

Christoph Lameter wrote:


page migration requires the page to be on the LRU. That could be changed
if you have a different means of isolating a page from its page tables.

  

Isn't rmap the means of isolating a page from its page tables?  I guess I'm
misunderstanding something.



In order to migrate a page one first has to make sure that userspace and
the kernel cannot access the page in any way. User space must be made to
generate page faults and all kernel references must be accounted for and
not be in use.
  

This is really not the case for ksm,
in ksm we allow the page to be accessed all the time, we dont have to 
swap the page

like migrate.c is doing...

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Christoph Lameter
On Tue, 11 Nov 2008, Avi Kivity wrote:

> Christoph Lameter wrote:
> > page migration requires the page to be on the LRU. That could be changed
> > if you have a different means of isolating a page from its page tables.
> >
>
> Isn't rmap the means of isolating a page from its page tables?  I guess I'm
> misunderstanding something.

In order to migrate a page one first has to make sure that userspace and
the kernel cannot access the page in any way. User space must be made to
generate page faults and all kernel references must be accounted for and
not be in use.

The user space portion involves changing the page tables so that faults
are generated.

The kernel portion isolates the page from the LRU (to exempt it from
kernel reclaim handling etc).

Only then can the page and its metadata be copied to a new location.

Guess you already have the LRU portion done. So you just need the user
space isolation portion?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Christoph Lameter
On Tue, 11 Nov 2008, Izik Eidus wrote:

> > What do you mean by kernel page? The kernel can allocate a page and then
> > point a user space pte to it. That is how page migration works.
> >
> i mean filebacked page (!AnonPage())

ok.

> ksm need the pte inside the vma to point from anonymous page into filebacked
> page
> can migrate.c do it without changes?

So change anonymous to filebacked page?

Currently page migration assumes that the page will continue to be part
of the existing file or anon vma.

What you want sounds like assigning a swap pte to an anonymous page? That
way a anon page gains membership in a file backed mapping.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Avi Kivity

Christoph Lameter wrote:

page migration requires the page to be on the LRU. That could be changed
if you have a different means of isolating a page from its page tables.
  


Isn't rmap the means of isolating a page from its page tables?  I guess 
I'm misunderstanding something.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Izik Eidus

Christoph Lameter wrote:



Currently page migration assumes that the page will continue to be part
of the existing file or anon vma.
  


exactly, and ksm really need it to get out of the existing anon vma!


What you want sounds like assigning a swap pte to an anonymous page? That
way a anon page gains membership in a file backed mapping.


  


No, i want pte that is found inside vma and point to anonymous page, 
will stop point into the anonymous page
and will point to a whole diffrent page that i chose (for ksm it is 
needed because this way we are mapping alot

of ptes into the same write_protected page and save memory)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Christoph Lameter
On Tue, 11 Nov 2008, Andrea Arcangeli wrote:

> btw, page_migration likely is buggy w.r.t. o_direct too (and now
> unfixable with gup_fast until the 2.4 brlock is added around it or
> similar) if it does the same thing but without any page_mapcount vs
> page_count check.

Details please?

> page_migration does too much for us, so us calling into migrate.c may
> not be ideal. It has to convert a fresh page to a VM page. In KSM we
> don't convert the newpage to be a VM page, we just replace the anon
> page with another page. The new page in the KSM case is not a page
> known by the VM, not in the lru etc...

A VM page as opposed to pages not in the VM? ???

page migration requires the page to be on the LRU. That could be changed
if you have a different means of isolating a page from its page tables.

> The way to go could be to change the page_migration to use
> replace_page (or __replace_page if called in some shared inner-lock
> context) after preparing the newpage to be a regular VM page. If we
> can do that, migrate.c will get the o_direct race fixed too for free.

Define a regular VM page? A page on the LRU?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Andrea Arcangeli
On Tue, Nov 11, 2008 at 03:21:45PM -0600, Christoph Lameter wrote:
> What do you mean by kernel page? The kernel can allocate a page and then
> point a user space pte to it. That is how page migration works.

Just to make an example, remove_migration_pte adds the page back to
rmap layer. We can't do that right now as rmap for the ksm pages will
be built inside ksm, or alternatively rmap.c will have to learn to
handle nonlinear anon-vma.

Migration simply migrates the page. The new page is identical to the
original one, just backed by different physical memory.

For us the new page is an entirely different beast that we build
ourself (we can't let migrate.c to pretend dealing with the newpage
like if it resembled the old page like it's doing now).

We replace a linear anon page with something that isn't an anonymous
page at all right now (in the future it may become a nonlinear anon
page if VM learns about it, or still an unknown page
external-rmappable if we go the external-rmap way).

There's clearly something to share, but the replace_page seem to be
the one that could be called from migrate.c. What is different is that
we don't need the migration pte placeholder, we never block releasing
locks, all atomic with pte wrprotected, and a final pte_same check
under PT lock before we replace the page. There isn't a whole lot to
share after all, but surely it'd be nice to share if we can. Us
calling into migrate.c isn't feasible right now without some
significant change to migrate.c where it would be misplaced IMHO as to
share we'd need migrate.c to call into VM core instead.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Allocate guest memory on host page boundaries.

2008-11-11 Thread Hollis Blanchard
Signed-off-by: Hollis Blanchard <[EMAIL PROTECTED]>
diff --git a/osdep.c b/osdep.c
index 8ff260a..30a1263 100644
--- a/osdep.c
+++ b/osdep.c
@@ -213,7 +213,7 @@ void *qemu_vmalloc(size_t size)
 #ifdef _BSD
 return valloc(size);
 #else
-return memalign(4096, size);
+return memalign(qemu_getpagesize(), size);
 #endif
 }
 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Izik Eidus

Christoph Lameter wrote:

page migration as far as i saw cant migrate anonymous page into kernel page.
if you want we can change page_migration to do that, but i thought you will
rather have ksm changes separate.



What do you mean by kernel page? The kernel can allocate a page and then
point a user space pte to it. That is how page migration works.
  

i mean filebacked page (!AnonPage())
ksm need the pte inside the vma to point from anonymous page into 
filebacked page

can migrate.c do it without changes?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Christoph Lameter
On Tue, 11 Nov 2008, Izik Eidus wrote:

> yes but it replace it with kernel allocated page.
> > page migration already kinda does that.  Is there common ground?
> >
> >
> page migration as far as i saw cant migrate anonymous page into kernel page.
> if you want we can change page_migration to do that, but i thought you will
> rather have ksm changes separate.

What do you mean by kernel page? The kernel can allocate a page and then
point a user space pte to it. That is how page migration works.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] rmap: add page_wrprotect() function,

2008-11-11 Thread Andrea Arcangeli
On Tue, Nov 11, 2008 at 01:01:49PM -0800, Andrew Morton wrote:
> On Tue, 11 Nov 2008 21:38:06 +0100
> Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
> 
> > > > + * set all the ptes pointed to a page as read only,
> > > > + * odirect_sync is set to 0 in case we cannot protect against race 
> > > > with odirect
> > > > + * return the number of ptes that were set as read only
> > > > + * (ptes that were read only before this function was called are 
> > > > couned as well)
> > > > + */
> > > 
> > > But it isn't.
> > 
> > What isn't?
> 
> This code comment had the kerneldoc marker ("/**") but it isn't a
> kerneldoc comment.

8) never mind, I thought it was referred to the quoted comment, like
if the comment pretended something the function wasn't doing.

> OK, well can we please update the code so these things are clearer.

Sure. Let's say this o_direct fix was done after confirmation that
this was the agreed fix to solve those kind of problems (same fix that
fork will need as in the third link).

> (It's a permanent problem I have.  I ask "what is this", but I really
> mean "the code should be changed so that readers will know what this is")

Agreed, this deserves much more commentary. But not much effort was
done in this area, because fork (and likely migrate) has the same race
and it isn't fixed yet so this is still a work-in-progress area. The
fix has to be the same for all places that have this bug, and we have
not agreed on a way to fix gup_fast yet (all I provided as suggestion
that will work fine is brlock but surely Nick will try to find a way
that remains non-blocking, which is the core of the problem, if it
can't block, we've to use RCU but we can't wait there as far as I can
tell as all sort of synchronous memory regions are involved).

I think once the problem will get fixed and patches will go in
mainline, more docs will emerge (I hope ;). We'll most certainly need
changes to cover gup_fast (including if we use brlock, the read side
of the lock will have to be taken around it). This was a fix we did at
the last minute just to reflect the latest status.

I CC Nick to this thread btw.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Andrea Arcangeli
On Tue, Nov 11, 2008 at 11:45:55AM -0800, Andrew Morton wrote:
> page migration already kinda does that.  Is there common ground?

btw, page_migration likely is buggy w.r.t. o_direct too (and now
unfixable with gup_fast until the 2.4 brlock is added around it or
similar) if it does the same thing but without any page_mapcount vs
page_count check.

page_migration does too much for us, so us calling into migrate.c may
not be ideal. It has to convert a fresh page to a VM page. In KSM we
don't convert the newpage to be a VM page, we just replace the anon
page with another page. The new page in the KSM case is not a page
known by the VM, not in the lru etc...

The way to go could be to change the page_migration to use
replace_page (or __replace_page if called in some shared inner-lock
context) after preparing the newpage to be a regular VM page. If we
can do that, migrate.c will get the o_direct race fixed too for free.

> I don't understand the restrictions on anonymous pages.  Please expand
> the changelog so that reviewers can understand the reasons for this
> restriction.

That's a good question but I don't have a definitive answer as I
didn't account for exactly how complex it would be yet.

Suppose a file has 0-4k equal to 4k-8k. A MAP_SHARED that maps both
pages with the same physical page sounds tricky. The shared pages are
KSM owned and invisible to the VM (later could be made visible with an
external-rmap), but pagecache can't be just KSM owned, they at least
must go in pagecache radix tree so that requires patching the radix
tree etc... All things we don't need for anon ram. I think first thing
to extend is to add external-rmap and make ksm swappable, then we can
think if something can be done about MAP_SHARED/MAP_PRIVATE too
(MAP_PRIVATE post-COW already works, the question is pre-COW). One
excuse of why I didn't think too much about it yet is because in
effect KSM it's mostly useful to the anon ram, the pagecache can be
solved in userland with hardlinks with openvz, and shared libs already
do all they can to share .text (the not-shared post dynamic link
should be covered by ksm already).

> Again, we could make the presence of this code selectable by subsystems
> which want it.

Indeed!!
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] rmap: add page_wrprotect() function,

2008-11-11 Thread Andrew Morton
On Tue, 11 Nov 2008 21:38:06 +0100
Andrea Arcangeli <[EMAIL PROTECTED]> wrote:

> > > + * set all the ptes pointed to a page as read only,
> > > + * odirect_sync is set to 0 in case we cannot protect against race with 
> > > odirect
> > > + * return the number of ptes that were set as read only
> > > + * (ptes that were read only before this function was called are couned 
> > > as well)
> > > + */
> > 
> > But it isn't.
> 
> What isn't?

This code comment had the kerneldoc marker ("/**") but it isn't a
kerneldoc comment.

> > I don't understand this odirect_sync thing.  What race?  Please expand
> > this comment to make the function of odirect_sync more understandable.
> 
> I should have answered this one with the above 3 links.

OK, well can we please update the code so these things are clearer.

(It's a permanent problem I have.  I ask "what is this", but I really
mean "the code should be changed so that readers will know what this is")

> > What do you think about making all this new code dependent upon some
> > CONFIG_ switch which CONFIG_KVM can select?
> 
> I like that too.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Izik Eidus

Andrew Morton wrote:

On Tue, 11 Nov 2008 15:21:39 +0200
Izik Eidus <[EMAIL PROTECTED]> wrote:

  

From: Izik Eidus <[EMAIL PROTECTED]>

this function is needed in cases you want to change the userspace
virtual mapping into diffrent physical page,



Not sure that I understand that description.  We want to replace a live
page in an anonymous VMA with a different one?

It looks that way.
  

yes but it replace it with kernel allocated page.

page migration already kinda does that.  Is there common ground?

  

page migration as far as i saw cant migrate anonymous page into kernel page.
if you want we can change page_migration to do that, but i thought you 
will rather have ksm changes separate.



KSM need this for merging the identical pages.

this function is working by removing the oldpage from the rmap and
calling put_page on it, and by setting the virtual address pte
to point into the new page.
(note that the new page (the page that we change the pte to map to)
cannot be anonymous page)




I don't understand the restrictions on anonymous pages.  Please expand
the changelog so that reviewers can understand the reasons for this
restriction.
  
the page that we are going to map into the pte going to be nonlinear 
from the point of view of anon-vma

therefore it cannot be anonymous.



  

---
 include/linux/mm.h |3 ++
 mm/memory.c|   68 
 2 files changed, 71 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ffee2f7..4da7fa8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1207,6 +1207,9 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned 
long addr,
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);
 
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,

+struct page *newpage, pte_t orig_pte, pgprot_t prot);
+
 struct page *follow_page(struct vm_area_struct *, unsigned long address,
unsigned int foll_flags);
 #define FOLL_WRITE 0x01/* check pte is writable */
diff --git a/mm/memory.c b/mm/memory.c
index 164951c..b2c542c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1472,6 +1472,74 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned 
long addr,
 }
 EXPORT_SYMBOL(vm_insert_mixed);
 
+/**

+ * replace _page - replace the pte mapping related to vm area between two pages



s/replace _page/replace_page/

  

+ * (from oldpage to newpage)
+ * NOTE: you should take into consideration the impact on the VM when replacing
+ * anonymous pages with kernel non swappable pages.
+ */



This _is_ a kerneldoc comment, but kernedoc comments conventionally
document the arguments and the return value also.

  

+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+struct page *newpage, pte_t orig_pte, pgprot_t prot)
+{
+   struct mm_struct *mm = vma->vm_mm;
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *ptep;
+   spinlock_t *ptl;
+   unsigned long addr;
+   int ret;
+
+   BUG_ON(PageAnon(newpage));
+
+   ret = -EFAULT;
+   addr = page_address_in_vma(oldpage, vma);
+   if (addr == -EFAULT)
+   goto out;
+
+   pgd = pgd_offset(mm, addr);
+   if (!pgd_present(*pgd))
+   goto out;
+
+   pud = pud_offset(pgd, addr);
+   if (!pud_present(*pud))
+   goto out;
+
+   pmd = pmd_offset(pud, addr);
+   if (!pmd_present(*pmd))
+   goto out;
+
+   ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+   if (!ptep)
+   goto out;
+
+   if (!pte_same(*ptep, orig_pte)) {
+   pte_unmap_unlock(ptep, ptl);
+   goto out;
+   }
+
+   ret = 0;
+   get_page(newpage);
+   page_add_file_rmap(newpage);
+
+   flush_cache_page(vma, addr, pte_pfn(*ptep));
+   ptep_clear_flush(vma, addr, ptep);
+   set_pte_at(mm, addr, ptep, mk_pte(newpage, prot));
+
+   page_remove_rmap(oldpage, vma);
+   if (PageAnon(oldpage)) {
+   dec_mm_counter(mm, anon_rss);
+   inc_mm_counter(mm, file_rss);
+   }
+   put_page(oldpage);
+
+   pte_unmap_unlock(ptep, ptl);
+
+out:
+   return ret;
+}
+EXPORT_SYMBOL(replace_page);



Again, we could make the presence of this code selectable by subsystems
which want it.

]


sure.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] add ksm kernel shared memory driver

2008-11-11 Thread Andrew Morton
On Tue, 11 Nov 2008 15:21:40 +0200
Izik Eidus <[EMAIL PROTECTED]> wrote:

> From: Izik Eidus <[EMAIL PROTECTED]>
> 
> ksm is driver that allow merging identical pages between one or more
> applications in way unvisible to the application that use it.
> pages that are merged are marked as readonly and are COWed when any 
> application
> try to change them.
> 
> ksm is working by walking over the memory pages of the applications it scan
> in order to find identical pages.
> it uses an hash table to find in effective way the identical pages.
> 
> when ksm find two identical pages, it marked them as readonly and merge them
> into single one page,
> after the pages are marked as readonly and merged into one page, linux
> will treat this pages as normal copy_on_write pages and will fork them
> when write access will happen to them.
> 
> ksm scan just memory areas that were registred to be scanned by it.
> 

This driver apepars to implement a new userspace interface, in /dev/ksm

Please fully document that interface in the changelog so that we can
review your decisions here.  This is by far the most important
consideration - we can change all the code, but interfaces are for
ever.

> diff --git a/drivers/Kconfig b/drivers/Kconfig
> index d38f43f..c1c701f 100644
> --- a/drivers/Kconfig
> +++ b/drivers/Kconfig
> @@ -105,4 +105,9 @@ source "drivers/uio/Kconfig"
>  source "drivers/xen/Kconfig"
>  
>  source "drivers/staging/Kconfig"
> +
> +config KSM
> + bool "KSM driver support"
> + help
> + ksm is driver for merging identical pages between applciations

s/is/is a/

"applications" is misspelt.

>  endmenu
> diff --git a/include/linux/ksm.h b/include/linux/ksm.h
> new file mode 100644
> index 000..f873502
> --- /dev/null
> +++ b/include/linux/ksm.h
> @@ -0,0 +1,53 @@
> +#ifndef __LINUX_KSM_H
> +#define __LINUX_KSM_H
> +
> +/*
> + * Userspace interface for /dev/ksm - kvm shared memory
> + */
> +
> +#include 
> +#include 
> +
> +#define KSM_API_VERSION 1
> +
> +/* for KSM_REGISTER_MEMORY_REGION */
> +struct ksm_memory_region {
> + __u32 npages; /* number of pages to share */
> + __u32 pad;
> + __u64 addr; /* the begining of the virtual address */
> +};
> +
> +struct ksm_user_scan {
> + __u32 pages_to_scan;
> + __u32 max_pages_to_merge;
> +};
> +
> +struct ksm_kthread_info {
> + __u32 sleep; /* number of microsecoends to sleep */
> + __u32 pages_to_scan; /* number of pages to scan */
> + __u32 max_pages_to_merge;
> + __u32 running;
> +};
> +
> +#define KSMIO 0xAB
> +
> +/* ioctls for /dev/ksm */
> +#define KSM_GET_API_VERSION  _IO(KSMIO,   0x00)
> +#define KSM_CREATE_SHARED_MEMORY_AREA_IO(KSMIO,   0x01) /* return SMA fd 
> */
> +#define KSM_CREATE_SCAN  _IO(KSMIO,   0x02) /* return SCAN 
> fd */
> +#define KSM_START_STOP_KTHREAD_IOW(KSMIO,  0x03,\
> +   struct ksm_kthread_info)
> +#define KSM_GET_INFO_KTHREAD  _IOW(KSMIO,  0x04,\
> +   struct ksm_kthread_info) 
> +
> +
> +/* ioctls for SMA fds */
> +#define KSM_REGISTER_MEMORY_REGION   _IOW(KSMIO,  0x20,\
> +   struct ksm_memory_region)
> +#define KSM_REMOVE_MEMORY_REGION _IO(KSMIO,   0x21)
> +
> +/* ioctls for SCAN fds */
> +#define KSM_SCAN _IOW(KSMIO,  0x40,\
> +   struct ksm_user_scan)
> +
> +#endif

uh-oh, ioctls.

Please do document that interface for us.

> diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
> index 26433ec..adc2435 100644
> --- a/include/linux/miscdevice.h
> +++ b/include/linux/miscdevice.h
> @@ -30,6 +30,7 @@
>  #define TUN_MINOR 200
>  #define  HPET_MINOR   228
>  #define KVM_MINOR232
> +#define KSM_MINOR233
>  
>  struct device;
>  
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 5b5790f..e7f0061 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -222,3 +222,6 @@ config UNEVICTABLE_LRU
>  
>  config MMU_NOTIFIER
>   bool
> +
> +config KSM
> + bool
> diff --git a/mm/Makefile b/mm/Makefile
> index c06b45a..9722afe 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -26,6 +26,7 @@ obj-$(CONFIG_TMPFS_POSIX_ACL) += shmem_acl.o
>  obj-$(CONFIG_TINY_SHMEM) += tiny-shmem.o
>  obj-$(CONFIG_SLOB) += slob.o
>  obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
> +obj-$(CONFIG_KSM) += ksm.o
>  obj-$(CONFIG_SLAB) += slab.o
>  obj-$(CONFIG_SLUB) += slub.o
>  obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
> diff --git a/mm/ksm.c b/mm/ksm.c
> new file mode 100644
> index 000..977eb37
> --- /dev/null
> +++ b/mm/ksm.c
> @@ -0,0 +1,1202 @@
> +/*
> + * Memory merging driver for Linux
> + *
> + * This module enables dynamic sharing of identical pages found in different
> + * memory areas, even if they are not shared by fork()
> + *
> + * Copyright (C) 2008 Red Hat, Inc.
> + * Authors:
> + * 

Re: [PATCH 1/4] rmap: add page_wrprotect() function,

2008-11-11 Thread Andrea Arcangeli
On Tue, Nov 11, 2008 at 11:39:48AM -0800, Andrew Morton wrote:
> > +static int page_wrprotect_one(struct page *page, struct vm_area_struct 
> > *vma,
> > + int *odirect_sync)
> > +{
> > +   struct mm_struct *mm = vma->vm_mm;
> > +   unsigned long address;
> > +   pte_t *pte;
> > +   spinlock_t *ptl;
> > +   int ret = 0;
> > +
> > +   address = vma_address(page, vma);
> > +   if (address == -EFAULT)
> > +   goto out;
> > +
> > +   pte = page_check_address(page, mm, address, &ptl, 0);
> > +   if (!pte)
> > +   goto out;
> > +
> > +   if (pte_write(*pte)) {
> > +   pte_t entry;
> > +
> > +   if (page_mapcount(page) != page_count(page)) {
> > +   *odirect_sync = 0;
> > +   goto out_unlock;
> > +   }
> > +   flush_cache_page(vma, address, pte_pfn(*pte));
> > +   entry = ptep_clear_flush_notify(vma, address, pte);
> > +   entry = pte_wrprotect(entry);
> > +   set_pte_at(mm, address, pte, entry);
> > +   }
> > +   ret = 1;
> > +
> > +out_unlock:
> > +   pte_unmap_unlock(pte, ptl);
> > +out:
> > +   return ret;
> > +}
> 
> OK.  I think.  We need to find a way of provoking Hugh to look at it.

Yes. Please focus on the page_mapcount != page_count, which is likely
missing from migrate.c too and in turn page migration currently breaks
O_DIRECT like fork() is buggy as well as discussed here:

http://marc.info/?l=linux-mm&m=122236799302540&w=2
http://marc.info/?l=linux-mm&m=122524107519182&w=2
http://marc.info/?l=linux-mm&m=122581116713932&w=2

The fix implemented in ksm currently handles older kernels (like
rhel/sles) not current mainline that does
get_user_pages_fast. get_user_pages_fast is unfixable yet (see my last
email to Nick above asking for a way to block gup_fast).

The fix proposed by Nick plus my additional fix, should stop the
corruption in fork the same way the above check fixes it for ksm. But
todate gup_fast remains unfixable.

> > +static int page_wrprotect_file(struct page *page, int *odirect_sync)
> > +{
> > +   struct address_space *mapping;
> > +   struct prio_tree_iter iter;
> > +   struct vm_area_struct *vma;
> > +   pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> > +   int ret = 0;
> > +
> > +   mapping = page_mapping(page);
> 
> What pins *mapping in memory?  Usually this is done by requiring that
> the caller has locked the page.  But no such precondition is documented
> here.

Looks buggy but we never call it from ksm 8). I guess Izik added it
for completeness when preparing for mainline submission. We've the
option to get rid of page_wrprotect_file entirely and only implement a
page_wrprotect_anon! Otherwise we can add a BUG_ON(!PageLocked(page))
before the above page_mapping to protect against truncate.

> > + * set all the ptes pointed to a page as read only,
> > + * odirect_sync is set to 0 in case we cannot protect against race with 
> > odirect
> > + * return the number of ptes that were set as read only
> > + * (ptes that were read only before this function was called are couned as 
> > well)
> > + */
> 
> But it isn't.

What isn't?

> I don't understand this odirect_sync thing.  What race?  Please expand
> this comment to make the function of odirect_sync more understandable.

I should have answered this one with the above 3 links.

> What do you think about making all this new code dependent upon some
> CONFIG_ switch which CONFIG_KVM can select?

I like that too.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

2008-11-11 Thread Izik Eidus

Izik Eidus wrote:

Andrew Morton wrote:

On Tue, 11 Nov 2008 21:18:23 +0200
Izik Eidus <[EMAIL PROTECTED]> wrote:

 

hm.

There has been the occasional discussion about idenfifying all-zeroes
pages and scavenging them, repointing them at the zero page.  Could
this infrastructure be used for that?  (And how much would we gain 
from

it?)

[I'm looking for reasons why this is more than a 
muck-up-the-vm-for-kvm

thing here ;) ]
  


^^ this?

 
KSM is separate driver , it doesn't change anything in the VM but 
adding two helper functions.



What, you mean I should actually read the code?   Oh well, OK.
  

Andrea i think what is happening here is my fault

Sorry, meant to write here Andrew :-)

i will try to give here much more information about KSM:
first the bad things:
KSM shared pages are right now (we have patch that can change it but 
we want to wait with it) unswappable
this mean that the entire memory of the guest is swappable but the 
pages that are shared are not.
(when the pages are splited back by COW they become anonymous again 
with the help of do_wp_page()
the reason that the pages are not swappable is beacuse the way the 
Linux Rmap is working, this not allow us to create nonlinear anonymous 
pages
(we dont want to use nonlinear vma for kvm, as it will make swapping 
for kvm very slow)
the reason that ksm pages need to have nonlinear reverse mapping is 
that for one guest identical page can be found in whole diffrent 
offset than other guest have it

(this is from the userspace VM point of view)

the rest is quite simple:
it is walking over the entire guest memory (or only some of it) and 
scan for identical pages using hash table

it merge the pages into one single write protected page

numbers for ksm is something that i have just for desktops and just 
the numbers i gave you

what is do know is:
big overcommit like 300% is possible just when you take into account 
that some of the guest memory will be free
we are sharing mostly the DLLs/ KERNEL / ZERO pages, for the DLLS and 
KERNEL PAGEs this pages likely will never break
but ZERO pages will be break when windows will allocate them and will 
come back when windows will free the memory.
(i wouldnt suggest 300% overcommit for servers workload, beacuse you 
can end up swapping in that case,
but for desktops after runing in production and passed some seiroes qa 
tress tests it seems like 300% is a real number that can be use)


i just ran test on two fedora 8 guests and got that results (using 
GNOME in both of them)
9959 root  15   0  730m 537m 281m S8  3.4   0:44.28 
kvm


9956 root  15   0  730m 537m 246m S4  3.4   0:41.43 kvm
as you can see the physical sharing was 281mb and 246mb (kernel pages 
are counted as shared)
there is small lie in this numbers beacuse pages that was shared 
across two guests and was splited by writing from guest number 1 will 
still have 1 refernce count to it
and will still be kernel page (untill the other guest (num 2) will 
write to it as well)



anyway i am willing to make much better testing or everything that 
needed for this patchs to be merged.

(just tell me what and i will do it)

beside that you should know that patch 4 is not a must, it is just 
nice optimization...


thanks.

--
To unsubscribe from this list: send the line "unsubscribe 
linux-kernel" in

the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

2008-11-11 Thread Andrea Arcangeli
Hi Andrew,

thanks for looking into this.

On Tue, Nov 11, 2008 at 11:11:10AM -0800, Andrew Morton wrote:
> What userspace-only changes could fix this?  Identify the common data,
> write it to a flat file and mmap it, something like that?

The whole idea is to do something that works transparently and isn't
tailored for kvm. The mmu notifier change_pte method can be dropped as
well if you want (I recommended not to have it in the first submission
but Izik preferred to keep it because it will optimize away a kvm
shadow pte minor fault the first time kvm access the page after
sharing it). The page_wrprotect and replace_page can also be embedded
in ksm.

So the idea is that while we could do something specific to ksm that
keeps most of the code in userland, it'd be more tricky as it'd
require some communication with the core VM anyway (we can't just do
it in userland with mprotect, memcpy, mmap(MAP_PRIVATE) as it wouldn't
be atomic and second it'd be inefficient in terms of vma-buildup for
the same reason nonlinear-vmas exist), but most important: it wouldn't
work for all other regular process. With KSM we can share anonymous
memory for the whole system, KVM is just a random user.

This implementation is on the simple side because it can't
swap. Swapping and perhaps the limitation of sharing anonymous memory
is the only trouble here but those will be addressed in the
future. ksm is a new device driver so it's like /dev/mem, so no
swapping isn't a blocker here.

By sharing anon pages, in short we're making anonymous vmas nonlinear,
and this isn't supported by the current rmap code. So swapping can't
work unless we mark those anon-vmas nonlinear and we either build the
equivalent of the old pte_chains on demand just for those nonlinear
shared pages, or we do a full scan of all ptes in the nonlinear
anon-vmas. An external rmap infrastructure can allow ksm to build
whatever it wants inside ksm.c to track the nonlinear anon-pages
inside a regular anon-vma and rmap.c can invoke those methods to find
the ptes for those nonlinear pages. The core VM won't get more complex
and ksm can decide if to do a full nonlinear scan of the vma, or to
build the equivalent of pte_chains. This again has to be added later
and once everybody sees ksm, it'll be easier to agree on a
external-rmap API to allow it to swap. While the pte_chains are very
inefficent to reverse the regular anonymous mappings, they're
efficient solution as an exception for the shared KSM pages that gets
scattered over the linear anon-vmas.

It's a bit like the initial kvm that was merged despite it couldn't
swap. Then we added mmu notifiers, and now kvm can swap. So we add ksm
now without swapping and later we build an external-rmap to allow ksm
to swap after we agree ksm is useful and people starts using it.

> There has been the occasional discussion about idenfifying all-zeroes
> pages and scavenging them, repointing them at the zero page.  Could
> this infrastructure be used for that?  (And how much would we gain from
> it?)

Zero pages makes a lot of difference for windows, but they're totally
useless for linux. With current ksm all guest pagecache is 100% shared
across hosts, so when you start an app the .text runs on the same
physical memory on both guests. Works fine and code is quite simple in
this round. Once we add swapping it'll be a bit more complex in VM
terms as it'll have to handle nonlinear anon-vmas.

If we ever decide to share MAP_SHARED pagecache it'll be even more
complicated than just adding the external-rmap... I think this can be
done incrementally if needed at all. OpenVZ if the install is smart
enough could share the pagecache by just hardlinking the equal
binaries.. but AFIK they don't do that normally. For the anon ram they
need this too, they can't solve equal anon ram in userland as it has
to be handled atomically at runtime.

The folks at CERN LHC (was visiting it last month) badly need KSM too
for certain apps they're running that are allocating huge arrays
(aligned) in anon memory and they're mostly equal for all
processes. They tried to work around it with fork but it's not working
well, KSM would solve their problem (it'd solve it both on the same OS
and across OS with kvm as virtualization engine running on the same host).

So I think this is good stuff, and I'd focus discussions and reviews
on the KSM API of /dev/ksm that if merged will be longstanding and
much more troublesome than the rest of the code if changed later (if
we change the ksm internals at any time nobody will notice), and
post-merging we can focus on the external-rmap to make KSM pages first
class citizens in VM terms. But then anything can be changed here, so
suggestions welcome!

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

2008-11-11 Thread Izik Eidus

Andrew Morton wrote:

On Tue, 11 Nov 2008 21:18:23 +0200
Izik Eidus <[EMAIL PROTECTED]> wrote:

  

hm.

There has been the occasional discussion about idenfifying all-zeroes
pages and scavenging them, repointing them at the zero page.  Could
this infrastructure be used for that?  (And how much would we gain from
it?)

[I'm looking for reasons why this is more than a muck-up-the-vm-for-kvm
thing here ;) ]
  


^^ this?

  
KSM is separate driver , it doesn't change anything in the VM but adding 
two helper functions.



What, you mean I should actually read the code?   Oh well, OK.
  

Andrea i think what is happening here is my fault
i will try to give here much more information about KSM:
first the bad things:
KSM shared pages are right now (we have patch that can change it but we 
want to wait with it) unswappable
this mean that the entire memory of the guest is swappable but the pages 
that are shared are not.
(when the pages are splited back by COW they become anonymous again with 
the help of do_wp_page()
the reason that the pages are not swappable is beacuse the way the Linux 
Rmap is working, this not allow us to create nonlinear anonymous pages
(we dont want to use nonlinear vma for kvm, as it will make swapping for 
kvm very slow)
the reason that ksm pages need to have nonlinear reverse mapping is that 
for one guest identical page can be found in whole diffrent offset than 
other guest have it

(this is from the userspace VM point of view)

the rest is quite simple:
it is walking over the entire guest memory (or only some of it) and scan 
for identical pages using hash table

it merge the pages into one single write protected page

numbers for ksm is something that i have just for desktops and just the 
numbers i gave you

what is do know is:
big overcommit like 300% is possible just when you take into account 
that some of the guest memory will be free
we are sharing mostly the DLLs/ KERNEL / ZERO pages, for the DLLS and 
KERNEL PAGEs this pages likely will never break
but ZERO pages will be break when windows will allocate them and will 
come back when windows will free the memory.
(i wouldnt suggest 300% overcommit for servers workload, beacuse you can 
end up swapping in that case,
but for desktops after runing in production and passed some seiroes qa 
tress tests it seems like 300% is a real number that can be use)


i just ran test on two fedora 8 guests and got that results (using GNOME 
in both of them)
9959 root  15   0  730m 537m 281m S8  3.4   0:44.28 
kvm


9956 root  15   0  730m 537m 246m S4  3.4   0:41.43 kvm
as you can see the physical sharing was 281mb and 246mb (kernel pages 
are counted as shared)
there is small lie in this numbers beacuse pages that was shared across 
two guests and was splited by writing from guest number 1 will still 
have 1 refernce count to it
and will still be kernel page (untill the other guest (num 2) will write 
to it as well)



anyway i am willing to make much better testing or everything that 
needed for this patchs to be merged.

(just tell me what and i will do it)

beside that you should know that patch 4 is not a must, it is just nice 
optimization...


thanks.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Andrew Morton
On Tue, 11 Nov 2008 15:21:39 +0200
Izik Eidus <[EMAIL PROTECTED]> wrote:

> From: Izik Eidus <[EMAIL PROTECTED]>
> 
> this function is needed in cases you want to change the userspace
> virtual mapping into diffrent physical page,

Not sure that I understand that description.  We want to replace a live
page in an anonymous VMA with a different one?

It looks that way.

page migration already kinda does that.  Is there common ground?

> KSM need this for merging the identical pages.
> 
> this function is working by removing the oldpage from the rmap and
> calling put_page on it, and by setting the virtual address pte
> to point into the new page.
> (note that the new page (the page that we change the pte to map to)
> cannot be anonymous page)
> 

I don't understand the restrictions on anonymous pages.  Please expand
the changelog so that reviewers can understand the reasons for this
restriction.


> ---
>  include/linux/mm.h |3 ++
>  mm/memory.c|   68 
> 
>  2 files changed, 71 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ffee2f7..4da7fa8 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1207,6 +1207,9 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned 
> long addr,
>  int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
>   unsigned long pfn);
>  
> +int replace_page(struct vm_area_struct *vma, struct page *oldpage,
> +  struct page *newpage, pte_t orig_pte, pgprot_t prot);
> +
>  struct page *follow_page(struct vm_area_struct *, unsigned long address,
>   unsigned int foll_flags);
>  #define FOLL_WRITE   0x01/* check pte is writable */
> diff --git a/mm/memory.c b/mm/memory.c
> index 164951c..b2c542c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1472,6 +1472,74 @@ int vm_insert_mixed(struct vm_area_struct *vma, 
> unsigned long addr,
>  }
>  EXPORT_SYMBOL(vm_insert_mixed);
>  
> +/**
> + * replace _page - replace the pte mapping related to vm area between two 
> pages

s/replace _page/replace_page/

> + * (from oldpage to newpage)
> + * NOTE: you should take into consideration the impact on the VM when 
> replacing
> + * anonymous pages with kernel non swappable pages.
> + */

This _is_ a kerneldoc comment, but kernedoc comments conventionally
document the arguments and the return value also.

> +int replace_page(struct vm_area_struct *vma, struct page *oldpage,
> +  struct page *newpage, pte_t orig_pte, pgprot_t prot)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *ptep;
> + spinlock_t *ptl;
> + unsigned long addr;
> + int ret;
> +
> + BUG_ON(PageAnon(newpage));
> +
> + ret = -EFAULT;
> + addr = page_address_in_vma(oldpage, vma);
> + if (addr == -EFAULT)
> + goto out;
> +
> + pgd = pgd_offset(mm, addr);
> + if (!pgd_present(*pgd))
> + goto out;
> +
> + pud = pud_offset(pgd, addr);
> + if (!pud_present(*pud))
> + goto out;
> +
> + pmd = pmd_offset(pud, addr);
> + if (!pmd_present(*pmd))
> + goto out;
> +
> + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> + if (!ptep)
> + goto out;
> +
> + if (!pte_same(*ptep, orig_pte)) {
> + pte_unmap_unlock(ptep, ptl);
> + goto out;
> + }
> +
> + ret = 0;
> + get_page(newpage);
> + page_add_file_rmap(newpage);
> +
> + flush_cache_page(vma, addr, pte_pfn(*ptep));
> + ptep_clear_flush(vma, addr, ptep);
> + set_pte_at(mm, addr, ptep, mk_pte(newpage, prot));
> +
> + page_remove_rmap(oldpage, vma);
> + if (PageAnon(oldpage)) {
> + dec_mm_counter(mm, anon_rss);
> + inc_mm_counter(mm, file_rss);
> + }
> + put_page(oldpage);
> +
> + pte_unmap_unlock(ptep, ptl);
> +
> +out:
> + return ret;
> +}
> +EXPORT_SYMBOL(replace_page);

Again, we could make the presence of this code selectable by subsystems
which want it.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] rmap: add page_wrprotect() function,

2008-11-11 Thread Andrew Morton
On Tue, 11 Nov 2008 15:21:38 +0200
Izik Eidus <[EMAIL PROTECTED]> wrote:

> From: Izik Eidus <[EMAIL PROTECTED]>
> 
> this function is useful for cases you want to compare page and know
> that its value wont change during you compare it.
> 
> this function is working by walking over the whole rmap of a page
> and mark every pte related to the page as write_protect.
> 
> the odirect_sync paramter is used to notify the caller of
> page_wrprotect() if one pte or more was not marked readonly
> in order to avoid race with odirect.

The grammar is a bit mangled.  Missing apostrophes.  Sentences start
with capital letters.

Yeah, it's a nit, but we don't actually gain anything from deliberately
mangling the language.

>
> ...
>
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -576,6 +576,103 @@ int page_mkclean(struct page *page)
>  }
>  EXPORT_SYMBOL_GPL(page_mkclean);
>  
> +static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
> +   int *odirect_sync)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + unsigned long address;
> + pte_t *pte;
> + spinlock_t *ptl;
> + int ret = 0;
> +
> + address = vma_address(page, vma);
> + if (address == -EFAULT)
> + goto out;
> +
> + pte = page_check_address(page, mm, address, &ptl, 0);
> + if (!pte)
> + goto out;
> +
> + if (pte_write(*pte)) {
> + pte_t entry;
> +
> + if (page_mapcount(page) != page_count(page)) {
> + *odirect_sync = 0;
> + goto out_unlock;
> + }
> + flush_cache_page(vma, address, pte_pfn(*pte));
> + entry = ptep_clear_flush_notify(vma, address, pte);
> + entry = pte_wrprotect(entry);
> + set_pte_at(mm, address, pte, entry);
> + }
> + ret = 1;
> +
> +out_unlock:
> + pte_unmap_unlock(pte, ptl);
> +out:
> + return ret;
> +}

OK.  I think.  We need to find a way of provoking Hugh to look at it.

> +static int page_wrprotect_file(struct page *page, int *odirect_sync)
> +{
> + struct address_space *mapping;
> + struct prio_tree_iter iter;
> + struct vm_area_struct *vma;
> + pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> + int ret = 0;
> +
> + mapping = page_mapping(page);

What pins *mapping in memory?  Usually this is done by requiring that
the caller has locked the page.  But no such precondition is documented
here.

> + if (!mapping)
> + return ret;
> +
> + spin_lock(&mapping->i_mmap_lock);
> +
> + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
> + ret += page_wrprotect_one(page, vma, odirect_sync);
> +
> + spin_unlock(&mapping->i_mmap_lock);
> +
> + return ret;
> +}
> +
> +static int page_wrprotect_anon(struct page *page, int *odirect_sync)
> +{
> + struct vm_area_struct *vma;
> + struct anon_vma *anon_vma;
> + int ret = 0;
> +
> + anon_vma = page_lock_anon_vma(page);
> + if (!anon_vma)
> + return ret;
> +
> + list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
> + ret += page_wrprotect_one(page, vma, odirect_sync);
> +
> + page_unlock_anon_vma(anon_vma);
> +
> + return ret;
> +}
> +
> +/**

This token means "this is a kerneldoc comment".

> + * set all the ptes pointed to a page as read only,
> + * odirect_sync is set to 0 in case we cannot protect against race with 
> odirect
> + * return the number of ptes that were set as read only
> + * (ptes that were read only before this function was called are couned as 
> well)
> + */

But it isn't.

I don't understand this odirect_sync thing.  What race?  Please expand
this comment to make the function of odirect_sync more understandable.

> +int page_wrprotect(struct page *page, int *odirect_sync)
> +{
> + int ret =0;
> +
> + *odirect_sync = 1;
> + if (PageAnon(page))
> + ret = page_wrprotect_anon(page, odirect_sync);
> + else
> + ret = page_wrprotect_file(page, odirect_sync);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(page_wrprotect);

What do you think about making all this new code dependent upon some
CONFIG_ switch which CONFIG_KVM can select?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

2008-11-11 Thread Andrew Morton
On Tue, 11 Nov 2008 21:18:23 +0200
Izik Eidus <[EMAIL PROTECTED]> wrote:

> > hm.
> >
> > There has been the occasional discussion about idenfifying all-zeroes
> > pages and scavenging them, repointing them at the zero page.  Could
> > this infrastructure be used for that?  (And how much would we gain from
> > it?)
> >
> > [I'm looking for reasons why this is more than a muck-up-the-vm-for-kvm
> > thing here ;) ]

^^ this?

> KSM is separate driver , it doesn't change anything in the VM but adding 
> two helper functions.

What, you mean I should actually read the code?   Oh well, OK.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

2008-11-11 Thread Avi Kivity

Andrew Morton wrote:
For kvm, the kernel never knew those pages were shared.  They are loaded 
from independent (possibly compressed and encrypted) disk images.  These 
images are different; but some pages happen to be the same because they 
came from the same installation media.



What userspace-only changes could fix this?  Identify the common data,
write it to a flat file and mmap it, something like that?

  


This was considered.  You can't scan the image, because it may be 
encrypted/compressed/offset (typical images _are_ offset because the 
first partition starts at sector 63...).  The data may come from the 
network and not a disk image.  You can't scan in userspace because the 
images belong to different users and contain sensitive data.  Pages may 
come from several images (multiple disk images per guest) so you end up 
with one vma per page.


So you have to scan memory, after the guest has retrieved it from 
disk/network/manufactured it somehow, decompressed and encrypted it, 
written it to the offset it wants.  You can't scan from userspace since 
it's sensitive data, and of course the actual merging need to be done 
atomically, which can only be done from the holy of holies, the vm.


For OpenVZ the situation is less clear, but if you allow users to 
independently upgrade their chroots you will eventually arrive at the 
same scenario (unless of course you apply the same merging strategy at 
the filesystem level).



hm.

There has been the occasional discussion about idenfifying all-zeroes
pages and scavenging them, repointing them at the zero page.  Could
this infrastructure be used for that?  


Yes, trivially.  ksm may be an overkill for this, though.


(And how much would we gain from
it?)
  


A lot of zeros.


[I'm looking for reasons why this is more than a muck-up-the-vm-for-kvm
thing here ;) ]
  


I sympathize -- us too.  Consider the typical multiuser gnome 
minicomputer with all 150 users reading lwn.net at the same time instead 
of working.  You could share the firefox rendered page cache, reducing 
memory utilization drastically.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Solaris 10 x86_64 guest on kvm keeps restarting

2008-11-11 Thread Carlo Marcelo Arenas Belon
On Tue, Nov 11, 2008 at 06:51:26PM +0530, mr ashok wrote:
> Hi all,
> 
> I have a F9 box with 2.6.27.4-58.fc10.x86_64 version installed with
> kvm-74-5.fc10.x86_64 installed.

you are going to need a patched version of the kvm module to be able to run
Solaris 10 amd64 as explained in the Guest Support Status page.

patch is available from :

  
http://tapir.sajinet.com.pe/gentoo/portage/app-emulation/kvm/files/kvm-57-kernel-longmode.patch

> The solaris fail-safe does work though

fail-safe runs in 32bit mode and doesn't need the patch.

> Any ideas as to why this happens.

because of the following bug :

  
http://sourceforge.net/tracker/index.php?func=detail&aid=1842160&group_id=180599&atid=893831

it has been apparently worked around in OpenSolaris, but the patch from kvm
was reverted because it somehow affected windows 64bit, which I am sure is by
far a more common guest.

Carlo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

2008-11-11 Thread Andrew Morton
On Tue, 11 Nov 2008 21:07:10 +0200
Izik Eidus <[EMAIL PROTECTED]> wrote:

> we have used KSM in production for about half year and the numbers that 
> came from our QA is:
> using KSM for desktop (KSM was tested just for windows desktop workload) 
> you can run as many as
> 52 windows xp with 1 giga ram each on server with just 16giga ram. (this 
> is more than 300% overcommit)
> the reason is that most of the kernel/dlls of this guests is shared and 
> in addition we are sharing the windows zero
> (windows keep making all its free memory as zero, so every time windows 
> release memory we take the page back to the host)
> there is slide that give this numbers you can find at:
> http://kvm.qumranet.com/kvmwiki/KvmForum2008?action=AttachFile&do=get&target=kdf2008_3.pdf
>  
> (slide 27)
> beside more i gave presentation about ksm that can be found at:
> http://kvm.qumranet.com/kvmwiki/KvmForum2008?action=AttachFile&do=get&target=kdf2008_12.pdf

OK, 300% isn't chicken feed.

It is quite important that information such as this be prepared, added to
the patch changelogs and maintained.  For a start, without this basic
information, there is no reason for anyone to look at any of the code!
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

2008-11-11 Thread Izik Eidus

Andrew Morton wrote:

On Tue, 11 Nov 2008 20:48:16 +0200
Avi Kivity <[EMAIL PROTECTED]> wrote:

  

Andrew Morton wrote:


The whole approach seems wrong to me.  The kernel lost track of these
pages and then we run around post-facto trying to fix that up again. 
Please explain (for the changelog) why the kernel cannot get this right

via the usual sharing, refcounting and COWing approaches.
  
  
For kvm, the kernel never knew those pages were shared.  They are loaded 
from independent (possibly compressed and encrypted) disk images.  These 
images are different; but some pages happen to be the same because they 
came from the same installation media.



What userspace-only changes could fix this?  Identify the common data,
write it to a flat file and mmap it, something like that?

  
For OpenVZ the situation is less clear, but if you allow users to 
independently upgrade their chroots you will eventually arrive at the 
same scenario (unless of course you apply the same merging strategy at 
the filesystem level).



hm.

There has been the occasional discussion about idenfifying all-zeroes
pages and scavenging them, repointing them at the zero page.  Could
this infrastructure be used for that?  (And how much would we gain from
it?)

[I'm looking for reasons why this is more than a muck-up-the-vm-for-kvm
thing here ;) ]
KSM is separate driver , it doesn't change anything in the VM but adding 
two helper functions.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

2008-11-11 Thread Andrew Morton
On Tue, 11 Nov 2008 20:48:16 +0200
Avi Kivity <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > The whole approach seems wrong to me.  The kernel lost track of these
> > pages and then we run around post-facto trying to fix that up again. 
> > Please explain (for the changelog) why the kernel cannot get this right
> > via the usual sharing, refcounting and COWing approaches.
> >   
> 
> For kvm, the kernel never knew those pages were shared.  They are loaded 
> from independent (possibly compressed and encrypted) disk images.  These 
> images are different; but some pages happen to be the same because they 
> came from the same installation media.

What userspace-only changes could fix this?  Identify the common data,
write it to a flat file and mmap it, something like that?

> For OpenVZ the situation is less clear, but if you allow users to 
> independently upgrade their chroots you will eventually arrive at the 
> same scenario (unless of course you apply the same merging strategy at 
> the filesystem level).

hm.

There has been the occasional discussion about idenfifying all-zeroes
pages and scavenging them, repointing them at the zero page.  Could
this infrastructure be used for that?  (And how much would we gain from
it?)

[I'm looking for reasons why this is more than a muck-up-the-vm-for-kvm
thing here ;) ]
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

2008-11-11 Thread Izik Eidus

Avi Kivity wrote:

Andrew Morton wrote:

The whole approach seems wrong to me.  The kernel lost track of these
pages and then we run around post-facto trying to fix that up again. 
Please explain (for the changelog) why the kernel cannot get this right

via the usual sharing, refcounting and COWing approaches.
  


For kvm, the kernel never knew those pages were shared.  They are 
loaded from independent (possibly compressed and encrypted) disk 
images.  These images are different; but some pages happen to be the 
same because they came from the same installation media.


As Avi said, in kvm we cannot know how the guest is going to map its 
pages, we have nothing to do but to scan for the identical pages
(you can have pages that are shared that are in whole different offset 
inside the guest)




For OpenVZ the situation is less clear, but if you allow users to 
independently upgrade their chroots you will eventually arrive at the 
same scenario (unless of course you apply the same merging strategy at 
the filesystem level).




--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

2008-11-11 Thread Izik Eidus

Andrew Morton wrote:

On Tue, 11 Nov 2008 15:21:37 +0200 Izik Eidus <[EMAIL PROTECTED]> wrote:

  

KSM is a linux driver that allows dynamicly sharing identical memory pages
between one or more processes.

unlike tradtional page sharing that is made at the allocation of the
memory, ksm do it dynamicly after the memory was created.
Memory is periodically scanned; identical pages are identified and merged.
the sharing is unnoticeable by the process that use this memory.
(the shared pages are marked as readonly, and in case of write
do_wp_page() take care to create new copy of the page)

this driver is very useful for KVM as in cases of runing multiple guests
operation system of the same type, many pages are sharable.
this driver can be useful by OpenVZ as well.



These benefits should be quantified, please.  Also any benefits to any
other workloads should be identified and quantified.
  

Sure,
we have used KSM in production for about half year and the numbers that 
came from our QA is:
using KSM for desktop (KSM was tested just for windows desktop workload) 
you can run as many as
52 windows xp with 1 giga ram each on server with just 16giga ram. (this 
is more than 300% overcommit)
the reason is that most of the kernel/dlls of this guests is shared and 
in addition we are sharing the windows zero
(windows keep making all its free memory as zero, so every time windows 
release memory we take the page back to the host)

there is slide that give this numbers you can find at:
http://kvm.qumranet.com/kvmwiki/KvmForum2008?action=AttachFile&do=get&target=kdf2008_3.pdf 
(slide 27)

beside more i gave presentation about ksm that can be found at:
http://kvm.qumranet.com/kvmwiki/KvmForum2008?action=AttachFile&do=get&target=kdf2008_12.pdf

if more numbers are wanted for other workloads i can test it.
(the idea of ksm is to run it slowly slowy at low priority and let it 
merge pages when no one need the cpu)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Boot Problem (and fix)

2008-11-11 Thread Avi Kivity

Avi Kivity wrote:
Ok.  Since 2.6.25 is out of maintenance mode I suggest you forward 
this patch to your distro's kernel maintainers.  I'll queue it for 
2.6.2[67].




2.6.26.1 has the fix, so all you need to do is upgrade to that or a 
later version.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

2008-11-11 Thread Avi Kivity

Andrew Morton wrote:

The whole approach seems wrong to me.  The kernel lost track of these
pages and then we run around post-facto trying to fix that up again. 
Please explain (for the changelog) why the kernel cannot get this right

via the usual sharing, refcounting and COWing approaches.
  


For kvm, the kernel never knew those pages were shared.  They are loaded 
from independent (possibly compressed and encrypted) disk images.  These 
images are different; but some pages happen to be the same because they 
came from the same installation media.


For OpenVZ the situation is less clear, but if you allow users to 
independently upgrade their chroots you will eventually arrive at the 
same scenario (unless of course you apply the same merging strategy at 
the filesystem level).


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

2008-11-11 Thread Andrew Morton
On Tue, 11 Nov 2008 15:21:37 +0200 Izik Eidus <[EMAIL PROTECTED]> wrote:

> KSM is a linux driver that allows dynamicly sharing identical memory pages
> between one or more processes.
> 
> unlike tradtional page sharing that is made at the allocation of the
> memory, ksm do it dynamicly after the memory was created.
> Memory is periodically scanned; identical pages are identified and merged.
> the sharing is unnoticeable by the process that use this memory.
> (the shared pages are marked as readonly, and in case of write
> do_wp_page() take care to create new copy of the page)
> 
> this driver is very useful for KVM as in cases of runing multiple guests
> operation system of the same type, many pages are sharable.
> this driver can be useful by OpenVZ as well.

These benefits should be quantified, please.  Also any benefits to any
other workloads should be identified and quantified.

The whole approach seems wrong to me.  The kernel lost track of these
pages and then we run around post-facto trying to fix that up again. 
Please explain (for the changelog) why the kernel cannot get this right
via the usual sharing, refcounting and COWing approaches.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hrtimer_start: Unknown symbol in module (kvm-78)

2008-11-11 Thread Martin Vogt
Martin Vogt wrote:
> Hello list,
> 
> I just downloaded kvm-78 and after I build it it writes:
> 
>> mediag0a:/usr/share/slg_qemu_kvm/kvm-78/kernel # modprobe kvm
>> FATAL: Error inserting kvm
>> (/lib/modules/2.6.16.54-0.2.5-smp/extra/kvm.ko): Unknown symbol in
>> module, or unknown parameter (see dmesg)
>> mediag0a:/usr/share/slg_qemu_kvm/kvm-78/kernel #  dmesg | tail -2
>> kvm: module not supported by Novell, setting U taint flag.
>> kvm: Unknown symbol hrtimer_start
> 
> Its on a SuSE Enterprise, which is a 2.6.16++
> 
> 

>Is this a regression?  That is, did earlier versions build (which?)
>We did add SLES compatibility at one point.

I'm currently runnin kvm-77 on 2.6.16.54-0.2.5-smp and updated
today to kvm-78, because I tried SuSE 1.11-beta4 in kvm.

(beta4 didnt work with kvm-77, therefore I tried that update--
which had this missing symbol.)

Overall kvm-77 worked so far pretty well on that kernel.

regards,

Martin
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Boot Problem (and fix)

2008-11-11 Thread Avi Kivity

the uni wrote:

That fixed it. Thanks.
  


Ok.  Since 2.6.25 is out of maintenance mode I suggest you forward this 
patch to your distro's kernel maintainers.  I'll queue it for 2.6.2[67].


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [v2] linux: virtio: Standardize virtio's concept of "page size"

2008-11-11 Thread Avi Kivity

Hollis Blanchard wrote:

Avi, will you commit the matching qemu patch? (I'm not clear on the
patch flow for kvm-userspace qemu virtio patches.)
  


At this time, I'm accepting qemu/virtio patches (and will apply this 
one); once virtio is merged in qemu upstream, patches will have to go 
there as well.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Boot Problem (and fix)

2008-11-11 Thread the uni
That fixed it. Thanks.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [v2] linux: virtio: Standardize virtio's concept of "page size"

2008-11-11 Thread Hollis Blanchard
On Tue, 2008-11-11 at 23:47 +1030, Rusty Russell wrote:
> On Tuesday 11 November 2008 10:07:09 Hollis Blanchard wrote:
> 
> > Both sides of the virtio interface must agree about how big a pfn
> really
> 
> > is. This is particularly an issue on architectures where the page
> size is
> 
> > configurable (e.g. PowerPC, IA64) -- the interface must be
> independent of
> 
> > PAGE_SHIFT.
> 
> Thanks Hollis! Applied, with the following overpatch:

Avi, will you commit the matching qemu patch? (I'm not clear on the
patch flow for kvm-userspace qemu virtio patches.)

-- 
Hollis Blanchard
IBM Linux Technology Center

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Boot Problem (and fix)

2008-11-11 Thread Avi Kivity

the uni wrote:

Can you try modules from older versions (keeping userspace and the bios) to
see which version introduced the fix?




Keeping userspace and bios, versions 71 and up are fixed... 70 is broken.
  


Please try kvm-70 with the attached patch attached.

(from 214fc251236fd1d0e0f2b7c3b7b7ad5243928fa9)

--
error compiling committee.c: too many arguments to function

commit 214fc251236fd1d0e0f2b7c3b7b7ad5243928fa9
Author: Mohammed Gamal <[EMAIL PROTECTED]>
Date:   Sun Jul 6 16:51:26 2008 +0300

KVM: x86 emulator: Fix HLT instruction

This patch fixes issue encountered with HLT instruction
under FreeDOS's HIMEM XMS Driver.

The HLT instruction jumped directly to the done label and
skips updating the EIP value, therefore causing the guest
to spin endlessly on the same instruction.

The patch changes the instruction so that it writes back
the updated EIP value.

Signed-off-by: Mohammed Gamal <[EMAIL PROTECTED]>
Signed-off-by: Avi Kivity <[EMAIL PROTECTED]>

diff --git a/kernel/x86_emulate.c b/kernel/x86_emulate.c
index dd4efe1..62e71b6 100644
--- a/kernel/x86_emulate.c
+++ b/kernel/x86_emulate.c
@@ -1732,7 +1732,7 @@ special_insn:
break;
case 0xf4:  /* hlt */
ctxt->vcpu->arch.halt_request = 1;
-   goto done;
+   break;
case 0xf5:  /* cmc */
/* complement carry flag from eflags reg */
ctxt->eflags ^= EFLG_CF;


Re: Boot Problem (and fix)

2008-11-11 Thread the uni
> Can you try modules from older versions (keeping userspace and the bios) to
> see which version introduced the fix?


Keeping userspace and bios, versions 71 and up are fixed... 70 is broken.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] KVM: Fix kernel allocated memory slot

2008-11-11 Thread Hollis Blanchard
On Tue, 2008-11-11 at 15:30 +0800, Sheng Yang wrote:
> Commit 7fd49de9773fdcb7b75e823b21c1c5dc1e218c14 "KVM: ensure that memslot
> userspace addresses are page-aligned" broke kernel space allocated memory
> slot, for the userspace_addr is invalid.
> 
> Signed-off-by: Sheng Yang <[EMAIL PROTECTED]>
> ---
>  virt/kvm/kvm_main.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 0a0a959..4727c08 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -715,7 +715,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>   goto out;
>   if (mem->guest_phys_addr & (PAGE_SIZE - 1))
>   goto out;
> - if (mem->userspace_addr & (PAGE_SIZE - 1))
> + if (user_alloc && (mem->userspace_addr & (PAGE_SIZE - 1)))
>   goto out;
>   if (mem->slot >= KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS)
>   goto out;

Wow, I didn't realize we still had kernel-allocated guest memory. Why is
that?

Anyways, the above patch seems fine to me.
Acked-by: Hollis Blanchard <[EMAIL PROTECTED]>

-- 
Hollis Blanchard
IBM Linux Technology Center

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hrtimer_start: Unknown symbol in module (kvm-78)

2008-11-11 Thread Avi Kivity

Martin Vogt wrote:

Hello list,

I just downloaded kvm-78 and after I build it it writes:

  

mediag0a:/usr/share/slg_qemu_kvm/kvm-78/kernel # modprobe kvm
FATAL: Error inserting kvm
(/lib/modules/2.6.16.54-0.2.5-smp/extra/kvm.ko): Unknown symbol in
module, or unknown parameter (see dmesg)
mediag0a:/usr/share/slg_qemu_kvm/kvm-78/kernel #  dmesg | tail -2
kvm: module not supported by Novell, setting U taint flag.
kvm: Unknown symbol hrtimer_start



Its on a SuSE Enterprise, which is a 2.6.16++

  


Is this a regression?  That is, did earlier versions build (which?)

We did add SLES compatibility at one point.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


hrtimer_start: Unknown symbol in module (kvm-78)

2008-11-11 Thread Martin Vogt

Hello list,

I just downloaded kvm-78 and after I build it it writes:

>mediag0a:/usr/share/slg_qemu_kvm/kvm-78/kernel # modprobe kvm
>FATAL: Error inserting kvm
>(/lib/modules/2.6.16.54-0.2.5-smp/extra/kvm.ko): Unknown symbol in
>module, or unknown parameter (see dmesg)
>mediag0a:/usr/share/slg_qemu_kvm/kvm-78/kernel #  dmesg | tail -2
>kvm: module not supported by Novell, setting U taint flag.
>kvm: Unknown symbol hrtimer_start

Its on a SuSE Enterprise, which is a 2.6.16++

regards,

Martin


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/4] rmap: add page_wrprotect() function,

2008-11-11 Thread Izik Eidus
From: Izik Eidus <[EMAIL PROTECTED]>

this function is useful for cases you want to compare page and know
that its value wont change during you compare it.

this function is working by walking over the whole rmap of a page
and mark every pte related to the page as write_protect.

the odirect_sync paramter is used to notify the caller of
page_wrprotect() if one pte or more was not marked readonly
in order to avoid race with odirect.

Signed-off-by: Izik Eidus <[EMAIL PROTECTED]>
---
 include/linux/rmap.h |7 
 mm/rmap.c|   97 ++
 2 files changed, 104 insertions(+), 0 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 89f0564..2a37fb7 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -121,6 +121,8 @@ static inline int try_to_munlock(struct page *page)
 }
 #endif
 
+int page_wrprotect(struct page *page, int *odirect_sync);
+
 #else  /* !CONFIG_MMU */
 
 #define anon_vma_init()do {} while (0)
@@ -135,6 +137,11 @@ static inline int page_mkclean(struct page *page)
return 0;
 }
 
+static inline int page_wrprotect(struct page *page, int *odirect_sync)
+{
+   return 0;
+}
+
 
 #endif /* CONFIG_MMU */
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 1099394..3684edd 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -576,6 +576,103 @@ int page_mkclean(struct page *page)
 }
 EXPORT_SYMBOL_GPL(page_mkclean);
 
+static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
+ int *odirect_sync)
+{
+   struct mm_struct *mm = vma->vm_mm;
+   unsigned long address;
+   pte_t *pte;
+   spinlock_t *ptl;
+   int ret = 0;
+
+   address = vma_address(page, vma);
+   if (address == -EFAULT)
+   goto out;
+
+   pte = page_check_address(page, mm, address, &ptl, 0);
+   if (!pte)
+   goto out;
+
+   if (pte_write(*pte)) {
+   pte_t entry;
+
+   if (page_mapcount(page) != page_count(page)) {
+   *odirect_sync = 0;
+   goto out_unlock;
+   }
+   flush_cache_page(vma, address, pte_pfn(*pte));
+   entry = ptep_clear_flush_notify(vma, address, pte);
+   entry = pte_wrprotect(entry);
+   set_pte_at(mm, address, pte, entry);
+   }
+   ret = 1;
+
+out_unlock:
+   pte_unmap_unlock(pte, ptl);
+out:
+   return ret;
+}
+
+static int page_wrprotect_file(struct page *page, int *odirect_sync)
+{
+   struct address_space *mapping;
+   struct prio_tree_iter iter;
+   struct vm_area_struct *vma;
+   pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+   int ret = 0;
+
+   mapping = page_mapping(page);
+   if (!mapping)
+   return ret;
+
+   spin_lock(&mapping->i_mmap_lock);
+
+   vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
+   ret += page_wrprotect_one(page, vma, odirect_sync);
+
+   spin_unlock(&mapping->i_mmap_lock);
+
+   return ret;
+}
+
+static int page_wrprotect_anon(struct page *page, int *odirect_sync)
+{
+   struct vm_area_struct *vma;
+   struct anon_vma *anon_vma;
+   int ret = 0;
+
+   anon_vma = page_lock_anon_vma(page);
+   if (!anon_vma)
+   return ret;
+
+   list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
+   ret += page_wrprotect_one(page, vma, odirect_sync);
+
+   page_unlock_anon_vma(anon_vma);
+
+   return ret;
+}
+
+/**
+ * set all the ptes pointed to a page as read only,
+ * odirect_sync is set to 0 in case we cannot protect against race with odirect
+ * return the number of ptes that were set as read only
+ * (ptes that were read only before this function was called are couned as 
well)
+ */
+int page_wrprotect(struct page *page, int *odirect_sync)
+{
+   int ret =0;
+
+   *odirect_sync = 1;
+   if (PageAnon(page))
+   ret = page_wrprotect_anon(page, odirect_sync);
+   else
+   ret = page_wrprotect_file(page, odirect_sync);
+
+   return ret;
+}
+EXPORT_SYMBOL(page_wrprotect);
+
 /**
  * __page_set_anon_rmap - setup new anonymous rmap
  * @page:  the page to add the mapping to
-- 
1.6.0.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/4] MMU_NOTIFIRES: add set_pte_at_notify()

2008-11-11 Thread Izik Eidus
From: Izik Eidus <[EMAIL PROTECTED]>

this function is optimzation for kvm/users of mmu_notifiers for COW
pages, it is useful for kvm when ksm is used beacuse it allow kvm
not to have to recive VMEXIT and only then map the shared page into
the mmu shadow pages, but instead map it directly at the same time
linux map the page into the host page table.

this mmu notifer macro is working by calling to callback that will map
directly the physical page into the shadow page tables.

(users of mmu_notifiers that didnt implement the set_pte_at_notify()
call back will just recive the mmu_notifier_invalidate_page callback)

Signed-off-by: Izik Eidus <[EMAIL PROTECTED]>
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/mmu.c  |   55 ++-
 include/linux/mmu_notifier.h|   33 +++
 mm/memory.c |   12 ++--
 mm/mmu_notifier.c   |   20 ++
 virt/kvm/kvm_main.c |   14 ++
 6 files changed, 125 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 65679d0..a5d01d4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -748,5 +748,6 @@ asmlinkage void kvm_handle_fault_on_reboot(void);
 #define KVM_ARCH_WANT_MMU_NOTIFIER
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 int kvm_age_hva(struct kvm *kvm, unsigned long hva);
+void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 99c239c..652a51c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -663,7 +663,8 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn)
kvm_flush_remote_tlbs(kvm);
 }
 
-static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
+static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
+  unsigned long data)
 {
u64 *spte;
int need_tlb_flush = 0;
@@ -678,8 +679,41 @@ static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long 
*rmapp)
return need_tlb_flush;
 }
 
+static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
+unsigned long data)
+{
+   u64 *spte, new_spte;
+   u64 *cur_spte;
+   pte_t *ptep = (pte_t *)data;
+   pte_t pte;
+   struct page *new_page;
+   struct page *old_page;
+
+   pte = *ptep;
+   new_page = pfn_to_page(pte_pfn(pte));
+   cur_spte = rmap_next(kvm, rmapp, NULL);
+   while (cur_spte) {
+   spte = cur_spte;
+   BUG_ON(!(*spte & PT_PRESENT_MASK));
+   rmap_printk("kvm_set_pte_rmapp: spte %p %llx\n", spte, *spte);
+   new_spte = *spte &~ (PT64_BASE_ADDR_MASK);
+   new_spte |= pte_pfn(pte);
+   if (!pte_write(pte))
+   new_spte &= ~PT_WRITABLE_MASK;
+   old_page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> 
PAGE_SHIFT);
+   get_page(new_page);
+   cur_spte = rmap_next(kvm, rmapp, spte);
+   set_shadow_pte(spte, new_spte);
+   kvm_flush_remote_tlbs(kvm);
+   put_page(old_page);
+   }
+   return 0;
+}
+
 static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
- int (*handler)(struct kvm *kvm, unsigned long *rmapp))
+ unsigned long data,
+ int (*handler)(struct kvm *kvm, unsigned long *rmapp,
+unsigned long data))
 {
int i;
int retval = 0;
@@ -700,11 +734,12 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long 
hva,
end = start + (memslot->npages << PAGE_SHIFT);
if (hva >= start && hva < end) {
gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
-   retval |= handler(kvm, &memslot->rmap[gfn_offset]);
+   retval |= handler(kvm, &memslot->rmap[gfn_offset], 
data);
retval |= handler(kvm,
  &memslot->lpage_info[
  gfn_offset /
- 
KVM_PAGES_PER_HPAGE].rmap_pde);
+ KVM_PAGES_PER_HPAGE].rmap_pde,
+ data);
}
}
 
@@ -713,10 +748,16 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long 
hva,
 
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
 {
-   return kvm_handle_hva(kvm, hva, kvm_unmap_rmapp);
+   return kvm_handle_hva(kvm, hva, 0, kvm_unmap_rmapp);
+}
+
+void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
+{
+   kvm_handle_hva(kvm, hva, (unsigned long)&pte, kvm_set_pte_rmapp);
 }
 
-static int kvm_age_rmapp(st

[PATCH 3/4] add ksm kernel shared memory driver

2008-11-11 Thread Izik Eidus
From: Izik Eidus <[EMAIL PROTECTED]>

ksm is driver that allow merging identical pages between one or more
applications in way unvisible to the application that use it.
pages that are merged are marked as readonly and are COWed when any application
try to change them.

ksm is working by walking over the memory pages of the applications it scan
in order to find identical pages.
it uses an hash table to find in effective way the identical pages.

when ksm find two identical pages, it marked them as readonly and merge them
into single one page,
after the pages are marked as readonly and merged into one page, linux
will treat this pages as normal copy_on_write pages and will fork them
when write access will happen to them.

ksm scan just memory areas that were registred to be scanned by it.

Signed-off-by: Izik Eidus <[EMAIL PROTECTED]>
---
 drivers/Kconfig|5 +
 include/linux/ksm.h|   53 ++
 include/linux/miscdevice.h |1 +
 mm/Kconfig |3 +
 mm/Makefile|1 +
 mm/ksm.c   | 1202 
 6 files changed, 1265 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/ksm.h
 create mode 100644 mm/ksm.c

diff --git a/drivers/Kconfig b/drivers/Kconfig
index d38f43f..c1c701f 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -105,4 +105,9 @@ source "drivers/uio/Kconfig"
 source "drivers/xen/Kconfig"
 
 source "drivers/staging/Kconfig"
+
+config KSM
+   bool "KSM driver support"
+   help
+   ksm is driver for merging identical pages between applciations
 endmenu
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
new file mode 100644
index 000..f873502
--- /dev/null
+++ b/include/linux/ksm.h
@@ -0,0 +1,53 @@
+#ifndef __LINUX_KSM_H
+#define __LINUX_KSM_H
+
+/*
+ * Userspace interface for /dev/ksm - kvm shared memory
+ */
+
+#include 
+#include 
+
+#define KSM_API_VERSION 1
+
+/* for KSM_REGISTER_MEMORY_REGION */
+struct ksm_memory_region {
+   __u32 npages; /* number of pages to share */
+   __u32 pad;
+   __u64 addr; /* the begining of the virtual address */
+};
+
+struct ksm_user_scan {
+   __u32 pages_to_scan;
+   __u32 max_pages_to_merge;
+};
+
+struct ksm_kthread_info {
+   __u32 sleep; /* number of microsecoends to sleep */
+   __u32 pages_to_scan; /* number of pages to scan */
+   __u32 max_pages_to_merge;
+   __u32 running;
+};
+
+#define KSMIO 0xAB
+
+/* ioctls for /dev/ksm */
+#define KSM_GET_API_VERSION  _IO(KSMIO,   0x00)
+#define KSM_CREATE_SHARED_MEMORY_AREA_IO(KSMIO,   0x01) /* return SMA fd */
+#define KSM_CREATE_SCAN  _IO(KSMIO,   0x02) /* return SCAN fd 
*/
+#define KSM_START_STOP_KTHREAD  _IOW(KSMIO,  0x03,\
+ struct ksm_kthread_info)
+#define KSM_GET_INFO_KTHREAD_IOW(KSMIO,  0x04,\
+ struct ksm_kthread_info) 
+
+
+/* ioctls for SMA fds */
+#define KSM_REGISTER_MEMORY_REGION   _IOW(KSMIO,  0x20,\
+ struct ksm_memory_region)
+#define KSM_REMOVE_MEMORY_REGION _IO(KSMIO,   0x21)
+
+/* ioctls for SCAN fds */
+#define KSM_SCAN _IOW(KSMIO,  0x40,\
+ struct ksm_user_scan)
+
+#endif
diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
index 26433ec..adc2435 100644
--- a/include/linux/miscdevice.h
+++ b/include/linux/miscdevice.h
@@ -30,6 +30,7 @@
 #define TUN_MINOR   200
 #defineHPET_MINOR   228
 #define KVM_MINOR232
+#define KSM_MINOR233
 
 struct device;
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 5b5790f..e7f0061 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -222,3 +222,6 @@ config UNEVICTABLE_LRU
 
 config MMU_NOTIFIER
bool
+
+config KSM
+   bool
diff --git a/mm/Makefile b/mm/Makefile
index c06b45a..9722afe 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_TMPFS_POSIX_ACL) += shmem_acl.o
 obj-$(CONFIG_TINY_SHMEM) += tiny-shmem.o
 obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
+obj-$(CONFIG_KSM) += ksm.o
 obj-$(CONFIG_SLAB) += slab.o
 obj-$(CONFIG_SLUB) += slub.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
diff --git a/mm/ksm.c b/mm/ksm.c
new file mode 100644
index 000..977eb37
--- /dev/null
+++ b/mm/ksm.c
@@ -0,0 +1,1202 @@
+/*
+ * Memory merging driver for Linux
+ *
+ * This module enables dynamic sharing of identical pages found in different
+ * memory areas, even if they are not shared by fork()
+ *
+ * Copyright (C) 2008 Red Hat, Inc.
+ * Authors:
+ * Izik Eidus
+ * Andrea Arcangeli
+ * Chris Wright
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 

[PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Izik Eidus
From: Izik Eidus <[EMAIL PROTECTED]>

this function is needed in cases you want to change the userspace
virtual mapping into diffrent physical page,
KSM need this for merging the identical pages.

this function is working by removing the oldpage from the rmap and
calling put_page on it, and by setting the virtual address pte
to point into the new page.
(note that the new page (the page that we change the pte to map to)
cannot be anonymous page)

Signed-off-by: Izik Eidus <[EMAIL PROTECTED]>
---
 include/linux/mm.h |3 ++
 mm/memory.c|   68 
 2 files changed, 71 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ffee2f7..4da7fa8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1207,6 +1207,9 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned 
long addr,
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);
 
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+struct page *newpage, pte_t orig_pte, pgprot_t prot);
+
 struct page *follow_page(struct vm_area_struct *, unsigned long address,
unsigned int foll_flags);
 #define FOLL_WRITE 0x01/* check pte is writable */
diff --git a/mm/memory.c b/mm/memory.c
index 164951c..b2c542c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1472,6 +1472,74 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned 
long addr,
 }
 EXPORT_SYMBOL(vm_insert_mixed);
 
+/**
+ * replace _page - replace the pte mapping related to vm area between two pages
+ * (from oldpage to newpage)
+ * NOTE: you should take into consideration the impact on the VM when replacing
+ * anonymous pages with kernel non swappable pages.
+ */
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+struct page *newpage, pte_t orig_pte, pgprot_t prot)
+{
+   struct mm_struct *mm = vma->vm_mm;
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *ptep;
+   spinlock_t *ptl;
+   unsigned long addr;
+   int ret;
+
+   BUG_ON(PageAnon(newpage));
+
+   ret = -EFAULT;
+   addr = page_address_in_vma(oldpage, vma);
+   if (addr == -EFAULT)
+   goto out;
+
+   pgd = pgd_offset(mm, addr);
+   if (!pgd_present(*pgd))
+   goto out;
+
+   pud = pud_offset(pgd, addr);
+   if (!pud_present(*pud))
+   goto out;
+
+   pmd = pmd_offset(pud, addr);
+   if (!pmd_present(*pmd))
+   goto out;
+
+   ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+   if (!ptep)
+   goto out;
+
+   if (!pte_same(*ptep, orig_pte)) {
+   pte_unmap_unlock(ptep, ptl);
+   goto out;
+   }
+
+   ret = 0;
+   get_page(newpage);
+   page_add_file_rmap(newpage);
+
+   flush_cache_page(vma, addr, pte_pfn(*ptep));
+   ptep_clear_flush(vma, addr, ptep);
+   set_pte_at(mm, addr, ptep, mk_pte(newpage, prot));
+
+   page_remove_rmap(oldpage, vma);
+   if (PageAnon(oldpage)) {
+   dec_mm_counter(mm, anon_rss);
+   inc_mm_counter(mm, file_rss);
+   }
+   put_page(oldpage);
+
+   pte_unmap_unlock(ptep, ptl);
+
+out:
+   return ret;
+}
+EXPORT_SYMBOL(replace_page);
+
 /*
  * maps a range of physical memory into the requested pages. the old
  * mappings are removed. any references to nonexistent pages results
-- 
1.6.0.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/4] ksm - dynamic page sharing driver for linux

2008-11-11 Thread Izik Eidus
KSM is a linux driver that allows dynamicly sharing identical memory pages
between one or more processes.

unlike tradtional page sharing that is made at the allocation of the
memory, ksm do it dynamicly after the memory was created.
Memory is periodically scanned; identical pages are identified and merged.
the sharing is unnoticeable by the process that use this memory.
(the shared pages are marked as readonly, and in case of write
do_wp_page() take care to create new copy of the page)

this driver is very useful for KVM as in cases of runing multiple guests
operation system of the same type, many pages are sharable.
this driver can be useful by OpenVZ as well.

KSM right now scan just memory that was registered to used by it, it
does not
scan the whole system memory (this can be changed, but the changes to
find
identical pages in normal linux system that doesnt run multiple guests)

KSM can run as kernel thread or as userspace application (or both (it is
allowed to run more than one scanner in a time)).

example for how to control the kernel thread:


ksmctl.c

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include "ksm.h"

int main(int argc, char *argv[])
{
int fd;
int used = 0;
int fd_start;
struct ksm_kthread_info info;


if (argc < 2) {
fprintf(stderr, "usage: %s {start npages sleep | stop |
info}\n", argv[0]);
exit(1);
}

fd = open("/dev/ksm", O_RDWR | O_TRUNC, (mode_t)0600);
if (fd == -1) {
fprintf(stderr, "could not open /dev/ksm\n");
exit(1);
}

if (!strncmp(argv[1], "start", strlen(argv[1]))) {
used = 1;
if (argc < 5) {
fprintf(stderr, "usage: %s start npages_to_scan",
argv[0]);
fprintf(stderr, "npages_max_merge sleep\n");
exit(1);
}
info.pages_to_scan = atoi(argv[2]);
info.max_pages_to_merge = atoi(argv[3]);
info.sleep = atoi(argv[4]);
info.running = 1;

fd_start = ioctl(fd, KSM_START_STOP_KTHREAD, &info);
if (fd_start == -1) {
fprintf(stderr, "KSM_START_KTHREAD failed\n");
exit(1);
}
printf("created scanner\n");
}

if (!strncmp(argv[1], "stop", strlen(argv[1]))) {
used = 1;
info.running = 0;
fd_start = ioctl(fd, KSM_START_STOP_KTHREAD, &info);
if (fd_start == -1) {
fprintf(stderr, "KSM_START_STOP_KTHREAD failed\n");
exit(1);
}
printf("stopped scanner\n");
}

if (!strncmp(argv[1], "info", strlen(argv[1]))) {
used = 1;
fd_start = ioctl(fd, KSM_GET_INFO_KTHREAD, &info);
if (fd_start == -1) {
fprintf(stderr, "KSM_GET_INFO_KTHREAD failed\n");
exit(1);
}
printf("running %d, pages_to_scan %d pages_max_merge %d",
info.running, info.pages_to_scan,
info.max_pages_to_merge);
printf("sleep_time %d\n", info.sleep);
}

if (!used)
fprintf(stderr, "unknown command %s\n", argv[1]);

return 0;
}


example of how to register qemu to ksm (or any userspace application)

diff --git a/qemu/vl.c b/qemu/vl.c
index 4721fdd..7785bf9 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -21,6 +21,7 @@
  * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
  * DEALINGS IN
  * THE SOFTWARE.
  */
+#include "ksm.h"
 #include "hw/hw.h"
 #include "hw/boards.h"
 #include "hw/usb.h"
@@ -5799,6 +5800,37 @@ static void termsig_setup(void)
 
 #endif
 
+int ksm_register_memory(void)
+{
+int fd;
+int ksm_fd;
+int r = 1;
+struct ksm_memory_region ksm_region;
+
+fd = open("/dev/ksm", O_RDWR | O_TRUNC, (mode_t)0600);
+if (fd == -1)
+goto out;
+
+ksm_fd = ioctl(fd, KSM_CREATE_SHARED_MEMORY_AREA);
+if (ksm_fd == -1)
+goto out_free;
+
+ksm_region.npages = phys_ram_size / TARGET_PAGE_SIZE;
+ksm_region.addr = phys_ram_base;
+r = ioctl(ksm_fd, KSM_REGISTER_MEMORY_REGION, &ksm_region);
+if (r)
+goto out_free1;
+
+return r;
+
+out_free1:
+close(ksm_fd);
+out_free:
+close(fd);
+out:
+return r;
+}
+
 int main(int argc, char **argv)
 {
 #ifdef CONFIG_GDBSTUB
@@ -6735,6 +6767,8 @@ int main(int argc, char **argv)
 /* init the dynamic translator */
 cpu_exec_init_all(tb_size * 1024 * 1024);
 
+ksm_register_memory();
+
 bdrv_init();
 
 /* we always create the cdrom drive, even if no disk is there */
--
To uns

Solaris 10 x86_64 guest on kvm keeps restarting

2008-11-11 Thread mr ashok
Hi all,

I have a F9 box with 2.6.27.4-58.fc10.x86_64 version installed with
kvm-74-5.fc10.x86_64 installed.

I have installed solaris 10 x86_64 , the installation went fine but
after installation, the solaris 10 after the grub, where it starts the
solaris, it again  restarts. So
the solaris guest keeps restarting.  The solaris fail-safe does work though

The solaris guests is started with below options

"/usr/bin/qemu-kvm -S -M pc -m 700 -smp 1 -name solaris-10-x86_64
-monitor pty -boot c -drive
file=/dev/kvm-guests/solaris-10-x86-64,if=ide,index=0,boot=on -net
nic,macaddr=54:52:00:45:88:32,vlan=0 -net
tap,fd=23,script=,vlan=0,ifname=vnet3 -serial pty -parallel none -usb
-vnc 127.0.0.1:3 -k en-us -soundhw es1370"

On the host system , the following messages keeps coming


set_cr4: #GP, clearing PAE while in long mode
set_cr4: #GP, clearing PAE while in long mode
set_cr4: #GP, clearing PAE while in long mode
set_cr4: #GP, clearing PAE while in long mode
set_cr4: #GP, clearing PAE while in long mode


Any ideas as to why this happens.

Regards
Ashok
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Boot Problem (and fix)

2008-11-11 Thread Avi Kivity

the uni wrote:

Confirmed that vers 77 and 78 both work fine.
  


Can you try modules from older versions (keeping userspace and the bios) 
to see which version introduced the fix?


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] remove vcpu_info array v5

2008-11-11 Thread Avi Kivity

Jes Sorensen wrote:

Avi Kivity wrote:
The code I mentioned only runs if the -no-kvm-irqchip option is 
passed.  It's not the highest performing option...


What I meant was that I was able to compile the code, and there was only
one piece left that needed that function, which is why I moved it and
made it static in the acpi code.



That's because there is another static array in acpi.c...

So this isn't used a lot.  But cpus definitely use cpu numbers (as 
apic ids), so qemu needs to be prepared to handle this.  As to 
scalability, that takes will take a lot more work than 
removing/changing arbitrary limits.  Look at qemu_mutex and weep.


Well we have to start somewhere. Just because there's limits in other
areas too, doesn't mean we should discard valid improvements elsewhere.
I'll get to some of that at some point, but there's soo many cards used
to build this house


Your patch builds the 16383rd floor while ignoring the 15th.  Qemu 
scaling issues are much, much more painful than what you were addressing.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

2008-11-11 Thread Avi Kivity

Greg KH wrote:

 

arch/x86/kvm/vtd.c: iommu integration (allows assigning the device's memory 
resources)



That file is not in 2.6.28-rc4 :(

  


Sorry, was moved to virt/kvm/ for ia64's benefit.

  
virt/kvm/irq*: interrupt redirection (allows assigning the device's 
interrupt resources)



I only see virt/kvm/irq_comm.c in 2.6.28-rc4.

  


kvm_main.c in that directory also has some related bits.


the rest (pci config space, pio redirection) are in userspace.



So you don't need these pci core changes at all?

  


Not beyond those required for SR-IOV and iommu support.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html