Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2012-01-13 Thread Benoit Hudzia
On 13 January 2012 02:15, Isaku Yamahata  wrote:
> One more question.
> Does your architecture/implementation (in theory) allow KVM memory
> features like swap, KSM, THP?

* Swap: yes, we support swap to disk (the page is pulled from swap
before being sent over), and the swap process does its job on the
other side.
* KSM: same, we support KSM. A KSM-merged page is broken down and its
copies are sent individually (yes, sub-optimal, but it keeps the
protocol less messy), and we let the KSM daemon do its job on the
other side.
* THP: trickier here. Due to time constraints we decided to support it
only partially. What that means: if we encounter a THP we break it
down to standard page granularity, as that is the memory unit we
currently manipulate. As a result you can have THP on the source, but
you won't have THP on the other side.
   _ Note: we didn't fully explore the ramifications of THP with
RDMA; I don't know whether THP plays well with the MMU of a hardware
RDMA NIC. One thing I would like to explore is whether it is possible
to break a THP down into standard pages and then reassemble them on
the other side (does anyone know if it is possible to aggregate pages
to form a THP in kernel?).
* cgroup: should be working transparently, but we need to do more
testing to confirm that.




>
>
> On Fri, Jan 13, 2012 at 11:03:23AM +0900, Isaku Yamahata wrote:
>> Very interesting. We can cooperate for better (postcopy) live migration.
>> The code doesn't seem available yet, I'm eager for it.
>>
>>
>> On Fri, Jan 13, 2012 at 01:09:30AM +, Benoit Hudzia wrote:
>> > Hi,
>> >
>> > Sorry to jump to hijack the thread  like that , however i would like
>> > to just to inform you  that we recently achieve a milestone out of the
>> > research project I'm leading. We enhanced KVM in order to deliver
>> > post copy live migration using RDMA at kernel level.
>> >
>> > Few point on the architecture of the system :
>> >
>> > * RDMA communication engine in kernel ( you can use soft iwarp or soft
>> > ROCE if you don't have hardware acceleration, however we also support
>> > standard RDMA enabled NIC) .
>>
>> Do you mean infiniband subsystem?
>>
>>
>> > * Naturally Page are transferred with Zerop copy protocol
>> > * Leverage the async page fault system.
>> > * Pre paging / faulting
>> > * No context switch as everything is handled within kernel and using
>> > the page fault system.
>> > * Hybrid migration ( pre + post copy) available
>>
>> Ah, I've been also planing this.
>> After pre-copy phase, is the dirty bitmap sent?
>>
>> So far I've thought naively that pre-copy phase would be finished by the
>> number of iterations. On the other hand your choice is timeout of
>> pre-copy phase. Do you have rationale? or it was just natural for you?
>>
>>
>> > * Rely on an independent Kernel Module
>> > * No modification to the KVM kernel Module
>> > * Minimal Modification to the Qemu-Kvm code
>> > * We plan to add the page prioritization algo in order to optimise the
>> > pre paging algo and background transfer
>>
>> Where do you plan to implement? in qemu or in your kernel module?
>> This algo could be shared.
>>
>> thanks in advance.
>>
>> > You can learn a little bit more and see a demo here:
>> > http://tinyurl.com/8xa2bgl
>> > I hope to be able to provide more detail on the design soon. As well
>> > as more concrete demo of the system ( live migration of VM running
>> > large  enterprise apps such as ERP or In memory DB)
>> >
>> > Note: this is just a step stone as the post copy live migration mainly
>> > enable us to validate the architecture design and  code.
>> >
>> > Regards
>> > Benoit
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > Regards
>> > Benoit
>> >
>> >
>> > On 12 January 2012 13:59, Avi Kivity  wrote:
>> > > On 01/04/2012 05:03 AM, Isaku Yamahata wrote:
>> > >> Yes, it's quite doable in user space(qemu) with a kernel-enhancement.
>> > >> And it would be easy to convert a separated daemon process into a thread
>> > >> in qemu.
>> > >>
>> > >> I think it should be done out side of qemu process for some reasons.
>> > >> (I just repeat same discussion at the KVM-forum because no one remembers
>> > >> it)
>> > >>
>> > >> - ptrace (and its variant)
>> > >>   Some people want to investigate guest ram on host (qemu stopped or
>> > >> lively).
>> > >>   For example, enhance crash utility and it will attach qemu process
>> > >> and
>> > >>   debug guest kernel.
>> > >
>> > > To debug the guest kernel you don't need to stop qemu itself.   I agree
>> > > it's a problem for qemu debugging though.
>> > >
>> > >>
>> > >> - core dump
>> > >>   qemu process may core-dump.
>> > >>   As postmortem analysis, people want to investigate guest RAM.
>> > >>   Again enhance crash utility and it will read the core file and
>> > >> analyze
>> > >>   guest kernel.
>> > >>   When creating core, the qemu process is already dead.
>> > >
>> > > Yes, strong point.
>> > >
>> > >> It precludes the above possibilities to handle fault in qemu process.

Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2012-01-13 Thread Benoit Hudzia
On 13 January 2012 02:03, Isaku Yamahata  wrote:
> Very interesting. We can cooperate for better (postcopy) live migration.
> The code doesn't seem available yet, I'm eager for it.
>
>
> On Fri, Jan 13, 2012 at 01:09:30AM +, Benoit Hudzia wrote:
>> Hi,
>>
>> Sorry to jump to hijack the thread  like that , however i would like
>> to just to inform you  that we recently achieve a milestone out of the
>> research project I'm leading. We enhanced KVM in order to deliver
>> post copy live migration using RDMA at kernel level.
>>
>> Few point on the architecture of the system :
>>
>> * RDMA communication engine in kernel ( you can use soft iwarp or soft
>> ROCE if you don't have hardware acceleration, however we also support
>> standard RDMA enabled NIC) .
>
> Do you mean infiniband subsystem?

Yes, basically any software or hardware implementation that supports
the standard RDMA / OFED verbs stack in kernel.
>
>
>> * Naturally Page are transferred with Zerop copy protocol
>> * Leverage the async page fault system.
>> * Pre paging / faulting
>> * No context switch as everything is handled within kernel and using
>> the page fault system.
>> * Hybrid migration ( pre + post copy) available
>
> Ah, I've been also planing this.
> After pre-copy phase, is the dirty bitmap sent?

Yes, we send over the dirty bitmap in order to identify what is left
to be transferred. Combined with the priority algo, we then
prioritise the pages for the background transfer.
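The handover step described here can be sketched as follows (illustrative C; the function name and interface are invented, not the project's API): walk the received dirty bitmap and collect the page frame numbers that still have to be pulled, which a priority algorithm can then reorder for background transfer.

```c
/* Sketch: collect dirty page indices from the bitmap sent at handover.
 * Bit i set => page i was dirtied after its last pre-copy pass and
 * must still be transferred. Illustrative, not the project's code. */
#include <stdint.h>
#include <stddef.h>

size_t scan_dirty_bitmap(const uint64_t *bitmap, size_t nbits,
                         size_t *out, size_t out_cap)
{
    size_t found = 0;
    for (size_t i = 0; i < nbits && found < out_cap; i++) {
        if (bitmap[i / 64] & (1ULL << (i % 64)))
            out[found++] = i;   /* page i still needs to be pulled */
    }
    return found;
}
```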

>
> So far I've thought naively that pre-copy phase would be finished by the
> number of iterations. On the other hand your choice is timeout of
> pre-copy phase. Do you have rationale? or it was just natural for you?


The main rationale behind that is that any normal sysadmin tends to
be human, and a live-migration iteration cycle has no meaning for
him. As a result we preferred to provide a time constraint rather
than an iteration constraint. Also, it is hard to estimate how much
bandwidth would be used per iteration cycle, which leads to poor
determinism.

>
>
>> * Rely on an independent Kernel Module
>> * No modification to the KVM kernel Module
>> * Minimal Modification to the Qemu-Kvm code
>> * We plan to add the page prioritization algo in order to optimise the
>> pre paging algo and background transfer
>
> Where do you plan to implement? in qemu or in your kernel module?
> This algo could be shared.

Yes, we actually plan to release the algo first, before the RDMA
post-copy. The algo can be used for standard optimisation of the
normal pre-copy process (as demonstrated in my talk at the KVM
Forum), and the priority is reversed for the post-copy page pull. My
colleague Aidan Shribman is done with the implementation, and we are
now in a testing phase in order to quantify the improvement.


>
> thanks in advance.
>
>> You can learn a little bit more and see a demo here:
>> http://tinyurl.com/8xa2bgl
>> I hope to be able to provide more detail on the design soon. As well
>> as more concrete demo of the system ( live migration of VM running
>> large  enterprise apps such as ERP or In memory DB)
>>
>> Note: this is just a step stone as the post copy live migration mainly
>> enable us to validate the architecture design and  code.
>>
>> Regards
>> Benoit
>>
>>
>>
>>
>>
>>
>>
>> Regards
>> Benoit
>>
>>
>> On 12 January 2012 13:59, Avi Kivity  wrote:
>> > On 01/04/2012 05:03 AM, Isaku Yamahata wrote:
>> >> Yes, it's quite doable in user space(qemu) with a kernel-enhancement.
>> >> And it would be easy to convert a separated daemon process into a thread
>> >> in qemu.
>> >>
>> >> I think it should be done out side of qemu process for some reasons.
>> >> (I just repeat same discussion at the KVM-forum because no one remembers
>> >> it)
>> >>
>> >> - ptrace (and its variant)
>> >>   Some people want to investigate guest ram on host (qemu stopped or
>> >> lively).
>> >>   For example, enhance crash utility and it will attach qemu process and
>> >>   debug guest kernel.
>> >
>> > To debug the guest kernel you don't need to stop qemu itself.   I agree
>> > it's a problem for qemu debugging though.
>> >
>> >>
>> >> - core dump
>> >>   qemu process may core-dump.
>> >>   As postmortem analysis, people want to investigate guest RAM.
>> >>   Again enhance crash utility and it will read the core file and analyze
>> >>   guest kernel.
>> >>   When creating core, the qemu process is already dead.
>> >
>> > Yes, strong point.
>> >
>> >> It precludes the above possibilities to handle fault in qemu process.
>> >
>> > I agree.
>> >
>> >
>> > --
>> > error compiling committee.c: too many arguments to function
>> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>> > the body of a message to majord...@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> " The production of too many useful things results in too many useless 
>> people"
>>
>
> --
> yamahata



-- 
" The production of too many useful things results in too many useless people"

Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2012-01-13 Thread Benoit Hudzia
Yes, we plan to release the patch as soon as we have cleaned up the
code and we get the green light from our company (and sadly that can
take months...)

On 13 January 2012 01:31, Takuya Yoshikawa
 wrote:
> (2012/01/13 10:09), Benoit Hudzia wrote:
>>
>> Hi,
>>
>> Sorry to jump to hijack the thread  like that , however i would like
>> to just to inform you  that we recently achieve a milestone out of the
>> research project I'm leading. We enhanced KVM in order to deliver
>> post copy live migration using RDMA at kernel level.
>>
>> Few point on the architecture of the system :
>>
>> * RDMA communication engine in kernel ( you can use soft iwarp or soft
>> ROCE if you don't have hardware acceleration, however we also support
>> standard RDMA enabled NIC) .
>> * Naturally Page are transferred with Zerop copy protocol
>> * Leverage the async page fault system.
>> * Pre paging / faulting
>> * No context switch as everything is handled within kernel and using
>> the page fault system.
>> * Hybrid migration ( pre + post copy) available
>> * Rely on an independent Kernel Module
>> * No modification to the KVM kernel Module
>> * Minimal Modification to the Qemu-Kvm code
>> * We plan to add the page prioritization algo in order to optimise the
>> pre paging algo and background transfer
>>
>>
>> You can learn a little bit more and see a demo here:
>> http://tinyurl.com/8xa2bgl
>> I hope to be able to provide more detail on the design soon. As well
>> as more concrete demo of the system ( live migration of VM running
>> large  enterprise apps such as ERP or In memory DB)
>>
>> Note: this is just a step stone as the post copy live migration mainly
>> enable us to validate the architecture design and  code.
>
>
> Do you have any plan to send the patch series of your implementation?
>
>        Takuya



-- 
" The production of too many useful things results in too many useless people"
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2012-01-12 Thread Isaku Yamahata
One more question.
Does your architecture/implementation (in theory) allow KVM memory
features like swap, KSM, THP?


On Fri, Jan 13, 2012 at 11:03:23AM +0900, Isaku Yamahata wrote:
> Very interesting. We can cooperate for better (postcopy) live migration.
> The code doesn't seem available yet, I'm eager for it.
> 
> 
> On Fri, Jan 13, 2012 at 01:09:30AM +, Benoit Hudzia wrote:
> > Hi,
> > 
> > Sorry to jump to hijack the thread  like that , however i would like
> > to just to inform you  that we recently achieve a milestone out of the
> > research project I'm leading. We enhanced KVM in order to deliver
> > post copy live migration using RDMA at kernel level.
> > 
> > Few point on the architecture of the system :
> > 
> > * RDMA communication engine in kernel ( you can use soft iwarp or soft
> > ROCE if you don't have hardware acceleration, however we also support
> > standard RDMA enabled NIC) .
> 
> Do you mean infiniband subsystem?
> 
> 
> > * Naturally Page are transferred with Zerop copy protocol
> > * Leverage the async page fault system.
> > * Pre paging / faulting
> > * No context switch as everything is handled within kernel and using
> > the page fault system.
> > * Hybrid migration ( pre + post copy) available
> 
> Ah, I've been also planing this.
> After pre-copy phase, is the dirty bitmap sent?
> 
> So far I've thought naively that pre-copy phase would be finished by the
> number of iterations. On the other hand your choice is timeout of
> pre-copy phase. Do you have rationale? or it was just natural for you?
> 
> 
> > * Rely on an independent Kernel Module
> > * No modification to the KVM kernel Module
> > * Minimal Modification to the Qemu-Kvm code
> > * We plan to add the page prioritization algo in order to optimise the
> > pre paging algo and background transfer
> 
> Where do you plan to implement? in qemu or in your kernel module?
> This algo could be shared.
> 
> thanks in advance.
> 
> > You can learn a little bit more and see a demo here:
> > http://tinyurl.com/8xa2bgl
> > I hope to be able to provide more detail on the design soon. As well
> > as more concrete demo of the system ( live migration of VM running
> > large  enterprise apps such as ERP or In memory DB)
> > 
> > Note: this is just a step stone as the post copy live migration mainly
> > enable us to validate the architecture design and  code.
> > 
> > Regards
> > Benoit
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > Regards
> > Benoit
> > 
> > 
> > On 12 January 2012 13:59, Avi Kivity  wrote:
> > > On 01/04/2012 05:03 AM, Isaku Yamahata wrote:
> > >> Yes, it's quite doable in user space(qemu) with a kernel-enhancement.
> > >> And it would be easy to convert a separated daemon process into a thread
> > >> in qemu.
> > >>
> > >> I think it should be done out side of qemu process for some reasons.
> > >> (I just repeat same discussion at the KVM-forum because no one remembers
> > >> it)
> > >>
> > >> - ptrace (and its variant)
> > >>   Some people want to investigate guest ram on host (qemu stopped or
> > >> lively).
> > >>   For example, enhance crash utility and it will attach qemu process and
> > >>   debug guest kernel.
> > >
> > > To debug the guest kernel you don't need to stop qemu itself.   I agree
> > > it's a problem for qemu debugging though.
> > >
> > >>
> > >> - core dump
> > >>   qemu process may core-dump.
> > >>   As postmortem analysis, people want to investigate guest RAM.
> > >>   Again enhance crash utility and it will read the core file and analyze
> > >>   guest kernel.
> > >>   When creating core, the qemu process is already dead.
> > >
> > > Yes, strong point.
> > >
> > >> It precludes the above possibilities to handle fault in qemu process.
> > >
> > > I agree.
> > >
> > >
> > > --
> > > error compiling committee.c: too many arguments to function
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > > the body of a message to majord...@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> > 
> > -- 
> > " The production of too many useful things results in too many useless 
> > people"
> > 
> 
> -- 
> yamahata
> 

-- 
yamahata


Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2012-01-12 Thread Andrea Arcangeli
On Thu, Jan 12, 2012 at 03:59:59PM +0200, Avi Kivity wrote:
> On 01/04/2012 05:03 AM, Isaku Yamahata wrote:
> > Yes, it's quite doable in user space(qemu) with a kernel-enhancement.
> > And it would be easy to convert a separated daemon process into a thread
> > in qemu.
> >
> > I think it should be done out side of qemu process for some reasons.
> > (I just repeat same discussion at the KVM-forum because no one remembers
> > it)
> >
> > - ptrace (and its variant)
> >   Some people want to investigate guest ram on host (qemu stopped or 
> > lively).
> >   For example, enhance crash utility and it will attach qemu process and
> >   debug guest kernel.
> 
> To debug the guest kernel you don't need to stop qemu itself.   I agree
> it's a problem for qemu debugging though.

But you need to debug postcopy migration too with gdb, or not? I
don't see a big benefit in trying to prevent gdb from seeing what is
really going on in the qemu image.

> > - core dump
> >   qemu process may core-dump.
> >   As postmortem analysis, people want to investigate guest RAM.
> >   Again enhance crash utility and it will read the core file and analyze
> >   guest kernel.
> >   When creating core, the qemu process is already dead.
> 
> Yes, strong point.
> 
> > It precludes the above possibilities to handle fault in qemu process.
> 
> I agree.

On the receiving node, if the memory is not there yet (and it isn't),
I'm not sure how you plan to get a clean core dump (as if live
migration weren't running) by preventing the kernel from dumping
zeroes when qemu crashes during post-copy migration. It surely won't
be the kernel crash handler completing the post-copy migration; it
won't even know where to write the data in memory.


Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2012-01-12 Thread Andrea Arcangeli
On Thu, Jan 12, 2012 at 03:57:47PM +0200, Avi Kivity wrote:
> On 01/03/2012 04:25 PM, Andrea Arcangeli wrote:
> >  
> > > > So the problem is if we do it in
> > > > userland with the current functionality you'll run out of VMAs and
> > > > slowdown performance too much.
> > > >
> > > > But all you need is the ability to map single pages in the address
> > > > space.
> > > 
> > > Would this also let you set different pgprots for different pages in the 
> > > same VMA?  It would be useful for write barriers in garbage collectors 
> > > (such as boehm-gc).  These do not have _that_ many VMAs, because every 
> > > GC cycles could merge all of them back to a single VMA with PROT_READ 
> > > permissions; however, they still put some strain on the VM subsystem.
> >
> > Changing permission sounds more tricky as more code may make
> > assumptions on the vma before checking the pte.
> >
> > Adding a magic unmapped pte entry sounds fairly safe because there's
> > the migration pte already used by migrate which halts page faults and
> > wait, that creates a precedent. So I guess we could reuse the same
> > code that already exists for the migration entry and we'd need to fire
> > a signal and returns to userland instead of waiting. The signal should
> > be invoked before the page fault will trigger again. 
> 
> Delivering signals is slow, and you can't use signalfd for it, because
> that can be routed to a different task.  I would like an fd based
> protocol with an explicit ack so the other end can be implemented by the
> kernel, to use with RDMA.  Kind of like how vhost-net talks to a guest
> via a kvm ioeventfd/irqfd.

As long as we tell qemu to run some per-vcpu handler (or io-thread
handler in case of access through the I/O thread) before accessing the
same memory address again, I don't see a problem in using a faster
mechanism than signals. In theory we could even implement this
functionality in kvm itself and just add two ioctl to the kvm fd,
instead of a syscall in fact but it'd be mangling over VM things like
page->index.
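The fd-based, explicit-ack protocol under discussion can be modelled in miniature (purely illustrative: a socketpair stands in for the kernel/user channel, and all names and message shapes are invented). The faulting side posts the missing page's number, the handler fetches the page and answers with an ack, and only then does the faulting side retry the access:

```c
/* Toy model of an fd-based page-fault protocol with explicit acks.
 * Illustrative only: a real implementation would live in the kernel,
 * and the "fetch" step would be an RDMA read of the missing page. */
#include <sys/socket.h>
#include <unistd.h>
#include <stdint.h>

struct fault_msg { uint64_t pfn; };                 /* "page X is missing" */
struct fault_ack { uint64_t pfn; int32_t status; }; /* "page X is in place" */

/* Faulting side: post the fault without blocking the handler. */
int send_fault(int fd, uint64_t pfn)
{
    struct fault_msg msg = { pfn };
    return write(fd, &msg, sizeof msg) == (ssize_t)sizeof msg ? 0 : -1;
}

/* Handler side: take one fault request, fetch the page, ack it. */
int serve_one_fault(int fd)
{
    struct fault_msg msg;
    if (read(fd, &msg, sizeof msg) != (ssize_t)sizeof msg)
        return -1;
    /* ...RDMA-read page msg.pfn into place here... */
    struct fault_ack ack = { msg.pfn, 0 };
    return write(fd, &ack, sizeof ack) == (ssize_t)sizeof ack ? 0 : -1;
}

/* Faulting side: block until the explicit ack, then retry the access. */
int wait_ack(int fd, uint64_t pfn)
{
    struct fault_ack ack;
    if (read(fd, &ack, sizeof ack) != (ssize_t)sizeof ack || ack.pfn != pfn)
        return -1;
    return ack.status;
}
```

The explicit ack is the point: the faulting vcpu (or I/O thread) must not touch the address again until the handler has confirmed the page is in place, without any signal delivery in the path.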

> > Of course if the
> > signal returns and does nothing it'll loop at 100% cpu load but that's
> > ok. Maybe it's possible to tweak the permissions but it will need a
> > lot more thoughts. Specifically for anon pages marking them readonly
> > sounds possible if they are supposed to behave like regular COWs (not
> > segfaulting or anything), as you already can have a mixture of
> > readonly and read-write ptes (not to tell readonly KSM pages), but for
> > any other case it's non trivial. Last but not the least the API here
> > would be like a vma-less-mremap, moving a page from one address to
> > another without modifying the vmas, the permission tweak sounds more
> > like an mprotect, so I'm unsure if it could do both or if it should be
> > an optimization to consider independently.
> 
> Doesn't this stuff require tlb flushes across all threads?

It does, to do it zerocopy and atomic we must move the pte.

> > In theory I suspect we could also teach mremap to do a
> > not-vma-mangling mremap if we move pages that aren't shared and so we
> > can adjust the page->index of the pages, instead of creating new vmas
> > at the dst address with an adjusted vma->vm_pgoff, but I suspect a
> > syscall that only works on top of fault-unmapped areas is simpler and
> > safer. mremap semantics requires nuking the dst region before the move
> > starts. If we would teach mremap how to handle the fault-unmapped
> > areas we could just add one syscall prepare_fault_area (or whatever
> > name you choose).
> >
> > The locking of doing a vma-less-mremap still sounds tricky but I doubt
> > you can avoid that locking complexity by using the chardevice as long
> > as the chardevice backed-memory still allows THP, migration and swap,
> > if you want to do it atomic-zerocopy and I think zerocopy would be
> > better especially if the network card is fast and all vcpus are
> > faulting into unmapped pages simultaneously so triggering heavy amount
> > of copying from all physical cpus.
> >
> > I don't mean the current device driver doing a copy_user won't work or
> > is bad idea, it's more self contained and maybe easier to merge
> > upstream. I'm just presenting another option more VM integrated
> > zerocopy with just 2 syscalls.
> 
> Zerocopy is really interesting here, esp. w/ RDMA.  But while adding
> ptes is cheap, removing them is not.  I wonder if we can make a
> write-only page?  Of course it's unmapped for cpu access, but we can
> allow DMA write access from the NIC.  Probably too wierd.

Keeping it mapped in two places gives problems with the non-linearity
of page->index. We have one page->index and two different
vma->vm_pgoff, so it's not just a problem of read-only. We could even
let it stay read-write as long as it is nuked when we swap. The
problem is that we can't leave it there: we must update page->index,
and if we don't, the rmap walk breaks, and swap/migrate with it... we
would allow any thread to see random memory th

Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2012-01-12 Thread Isaku Yamahata
Very interesting. We can cooperate for better (postcopy) live migration.
The code doesn't seem available yet, I'm eager for it.


On Fri, Jan 13, 2012 at 01:09:30AM +, Benoit Hudzia wrote:
> Hi,
> 
> Sorry to jump to hijack the thread  like that , however i would like
> to just to inform you  that we recently achieve a milestone out of the
> research project I'm leading. We enhanced KVM in order to deliver
> post copy live migration using RDMA at kernel level.
> 
> Few point on the architecture of the system :
> 
> * RDMA communication engine in kernel ( you can use soft iwarp or soft
> ROCE if you don't have hardware acceleration, however we also support
> standard RDMA enabled NIC) .

Do you mean infiniband subsystem?


> * Naturally Page are transferred with Zerop copy protocol
> * Leverage the async page fault system.
> * Pre paging / faulting
> * No context switch as everything is handled within kernel and using
> the page fault system.
> * Hybrid migration ( pre + post copy) available

Ah, I've been also planing this.
After pre-copy phase, is the dirty bitmap sent?

So far I've thought naively that pre-copy phase would be finished by the
number of iterations. On the other hand your choice is timeout of
pre-copy phase. Do you have rationale? or it was just natural for you?


> * Rely on an independent Kernel Module
> * No modification to the KVM kernel Module
> * Minimal Modification to the Qemu-Kvm code
> * We plan to add the page prioritization algo in order to optimise the
> pre paging algo and background transfer

Where do you plan to implement? in qemu or in your kernel module?
This algo could be shared.

thanks in advance.

> You can learn a little bit more and see a demo here:
> http://tinyurl.com/8xa2bgl
> I hope to be able to provide more detail on the design soon. As well
> as more concrete demo of the system ( live migration of VM running
> large  enterprise apps such as ERP or In memory DB)
> 
> Note: this is just a step stone as the post copy live migration mainly
> enable us to validate the architecture design and  code.
> 
> Regards
> Benoit
> 
> 
> 
> 
> 
> 
> 
> Regards
> Benoit
> 
> 
> On 12 January 2012 13:59, Avi Kivity  wrote:
> > On 01/04/2012 05:03 AM, Isaku Yamahata wrote:
> >> Yes, it's quite doable in user space(qemu) with a kernel-enhancement.
> >> And it would be easy to convert a separated daemon process into a thread
> >> in qemu.
> >>
> >> I think it should be done out side of qemu process for some reasons.
> >> (I just repeat same discussion at the KVM-forum because no one remembers
> >> it)
> >>
> > >> - ptrace (and its variant)
> > >>   Some people want to investigate guest ram on host (qemu stopped or
> > >> lively).
> > >>   For example, enhance crash utility and it will attach qemu process and
> > >>   debug guest kernel.
> > >
> > > To debug the guest kernel you don't need to stop qemu itself.   I agree
> > > it's a problem for qemu debugging though.
> > >
> > >>
> > >> - core dump
> > >>   qemu process may core-dump.
> > >>   As postmortem analysis, people want to investigate guest RAM.
> > >>   Again enhance crash utility and it will read the core file and analyze
> > >>   guest kernel.
> > >>   When creating core, the qemu process is already dead.
> > >
> > > Yes, strong point.
> > >
> > >> It precludes the above possibilities to handle fault in qemu process.
> >
> > I agree.
> >
> >
> > --
> > error compiling committee.c: too many arguments to function
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> -- 
> " The production of too many useful things results in too many useless people"
> 

-- 
yamahata


Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2012-01-12 Thread Takuya Yoshikawa

(2012/01/13 10:09), Benoit Hudzia wrote:

Hi,

Sorry to jump to hijack the thread  like that , however i would like
to just to inform you  that we recently achieve a milestone out of the
research project I'm leading. We enhanced KVM in order to deliver
post copy live migration using RDMA at kernel level.

Few point on the architecture of the system :

* RDMA communication engine in kernel ( you can use soft iwarp or soft
ROCE if you don't have hardware acceleration, however we also support
standard RDMA enabled NIC) .
* Naturally Page are transferred with Zerop copy protocol
* Leverage the async page fault system.
* Pre paging / faulting
* No context switch as everything is handled within kernel and using
the page fault system.
* Hybrid migration ( pre + post copy) available
* Rely on an independent Kernel Module
* No modification to the KVM kernel Module
* Minimal Modification to the Qemu-Kvm code
* We plan to add the page prioritization algo in order to optimise the
pre paging algo and background transfer


You can learn a little bit more and see a demo here:
http://tinyurl.com/8xa2bgl
I hope to be able to provide more detail on the design soon. As well
as more concrete demo of the system ( live migration of VM running
large  enterprise apps such as ERP or In memory DB)

Note: this is just a step stone as the post copy live migration mainly
enable us to validate the architecture design and  code.


Do you have any plan to send the patch series of your implementation?

Takuya


Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2012-01-12 Thread Benoit Hudzia
Hi,

Sorry to hijack the thread like that; however, I would just like to
inform you that we recently achieved a milestone in the research
project I'm leading. We enhanced KVM in order to deliver post-copy
live migration using RDMA at kernel level.

A few points on the architecture of the system:

* RDMA communication engine in kernel (you can use soft-iWARP or
soft-RoCE if you don't have hardware acceleration; however, we also
support standard RDMA-enabled NICs).
* Naturally, pages are transferred with a zero-copy protocol.
* Leverages the async page fault system.
* Pre-paging / faulting.
* No context switch, as everything is handled within the kernel using
the page fault system.
* Hybrid migration (pre- + post-copy) available.
* Relies on an independent kernel module.
* No modification to the KVM kernel module.
* Minimal modification to the qemu-kvm code.
* We plan to add the page prioritization algo in order to optimise
the pre-paging algo and background transfer.


You can learn a little bit more and see a demo here:
http://tinyurl.com/8xa2bgl
I hope to be able to provide more detail on the design soon, as well
as a more concrete demo of the system (live migration of a VM running
large enterprise apps such as an ERP or in-memory DB).

Note: this is just a stepping stone, as the post-copy live migration
mainly enables us to validate the architecture design and code.

Regards
Benoit


On 12 January 2012 13:59, Avi Kivity  wrote:
> On 01/04/2012 05:03 AM, Isaku Yamahata wrote:
>> Yes, it's quite doable in user space(qemu) with a kernel-enhancement.
>> And it would be easy to convert a separated daemon process into a thread
>> in qemu.
>>
>> I think it should be done out side of qemu process for some reasons.
>> (I just repeat same discussion at the KVM-forum because no one remembers
>> it)
>>
>> - ptrace (and its variant)
>>   Some people want to investigate guest ram on host (qemu stopped or lively).
>>   For example, enhance crash utility and it will attach qemu process and
>>   debug guest kernel.
>
> To debug the guest kernel you don't need to stop qemu itself.   I agree
> it's a problem for qemu debugging though.
>
>>
>> - core dump
>>   qemu process may core-dump.
>>   As postmortem analysis, people want to investigate guest RAM.
>>   Again enhance crash utility and it will read the core file and analyze
>>   guest kernel.
>>   When creating core, the qemu process is already dead.
>
> Yes, strong point.
>
>> It precludes the above possibilities to handle fault in qemu process.
>
> I agree.
>
>
> --
> error compiling committee.c: too many arguments to function
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
" The production of too many useful things results in too many useless people"


Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2012-01-12 Thread Avi Kivity
On 01/04/2012 05:03 AM, Isaku Yamahata wrote:
> Yes, it's quite doable in user space(qemu) with a kernel-enhancement.
> And it would be easy to convert a separated daemon process into a thread
> in qemu.
>
> I think it should be done out side of qemu process for some reasons.
> (I just repeat same discussion at the KVM-forum because no one remembers
> it)
>
> - ptrace (and its variant)
>   Some people want to investigate guest ram on host (qemu stopped or lively).
>   For example, enhance crash utility and it will attach qemu process and
>   debug guest kernel.

To debug the guest kernel you don't need to stop qemu itself.   I agree
it's a problem for qemu debugging though.

>
> - core dump
>   qemu process may core-dump.
>   As postmortem analysis, people want to investigate guest RAM.
>   Again enhance crash utility and it will read the core file and analyze
>   guest kernel.
>   When creating core, the qemu process is already dead.

Yes, strong point.

> It precludes the above possibilities to handle fault in qemu process.

I agree.


-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2012-01-12 Thread Avi Kivity
On 01/03/2012 04:25 PM, Andrea Arcangeli wrote:
>  
> > > So the problem is if we do it in
> > > userland with the current functionality you'll run out of VMAs and
> > > slowdown performance too much.
> > >
> > > But all you need is the ability to map single pages in the address
> > > space.
> > 
> > Would this also let you set different pgprots for different pages in the 
> > same VMA?  It would be useful for write barriers in garbage collectors 
> > (such as boehm-gc).  These do not have _that_ many VMAs, because every 
> > GC cycles could merge all of them back to a single VMA with PROT_READ 
> > permissions; however, they still put some strain on the VM subsystem.
>
> Changing permission sounds more tricky as more code may make
> assumptions on the vma before checking the pte.
>
> Adding a magic unmapped pte entry sounds fairly safe because there's
> the migration pte already used by migrate which halts page faults and
> wait, that creates a precedent. So I guess we could reuse the same
> code that already exists for the migration entry and we'd need to fire
> a signal and returns to userland instead of waiting. The signal should
> be invoked before the page fault will trigger again. 

Delivering signals is slow, and you can't use signalfd for it, because
that can be routed to a different task.  I would like an fd based
protocol with an explicit ack so the other end can be implemented by the
kernel, to use with RDMA.  Kind of like how vhost-net talks to a guest
via a kvm ioeventfd/irqfd.

> Of course if the
> signal returns and does nothing it'll loop at 100% cpu load but that's
> ok. Maybe it's possible to tweak the permissions but it will need a
> lot more thoughts. Specifically for anon pages marking them readonly
> sounds possible if they are supposed to behave like regular COWs (not
> segfaulting or anything), as you already can have a mixture of
> readonly and read-write ptes (not to tell readonly KSM pages), but for
> any other case it's non trivial. Last but not the least the API here
> would be like a vma-less-mremap, moving a page from one address to
> another without modifying the vmas, the permission tweak sounds more
> like an mprotect, so I'm unsure if it could do both or if it should be
> an optimization to consider independently.

Doesn't this stuff require tlb flushes across all threads?

>
> In theory I suspect we could also teach mremap to do a
> not-vma-mangling mremap if we move pages that aren't shared and so we
> can adjust the page->index of the pages, instead of creating new vmas
> at the dst address with an adjusted vma->vm_pgoff, but I suspect a
> syscall that only works on top of fault-unmapped areas is simpler and
> safer. mremap semantics requires nuking the dst region before the move
> starts. If we would teach mremap how to handle the fault-unmapped
> areas we could just add one syscall prepare_fault_area (or whatever
> name you choose).
>
> The locking of doing a vma-less-mremap still sounds tricky but I doubt
> you can avoid that locking complexity by using the chardevice as long
> as the chardevice backed-memory still allows THP, migration and swap,
> if you want to do it atomic-zerocopy and I think zerocopy would be
> better especially if the network card is fast and all vcpus are
> faulting into unmapped pages simultaneously so triggering heavy amount
> of copying from all physical cpus.
>
> I don't mean the current device driver doing a copy_user won't work or
> is bad idea, it's more self contained and maybe easier to merge
> upstream. I'm just presenting another option more VM integrated
> zerocopy with just 2 syscalls.

Zerocopy is really interesting here, esp. w/ RDMA.  But while adding
ptes is cheap, removing them is not.  I wonder if we can make a
write-only page?  Of course it's unmapped for cpu access, but we can
allow DMA write access from the NIC.  Probably too weird.

>
> vmas must not be involved in the mremap for reliability, or too much
> memory could get pinned in vmas even if we temporary lift the
> /proc/sys/vm/max_map_count for the process. Plus sending another
> signal (not sigsegv or sigbus) should be more reliable in case the
> migration crashes for real.


-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2012-01-03 Thread Isaku Yamahata
On Mon, Jan 02, 2012 at 06:05:51PM +0100, Andrea Arcangeli wrote:
> On Thu, Dec 29, 2011 at 06:01:45PM +0200, Avi Kivity wrote:
> > On 12/29/2011 06:00 PM, Avi Kivity wrote:
> > > The NFS client has exactly the same issue, if you mount it with the intr
> > > option.  In fact you could use the NFS client as a trivial umem/cuse
> > > prototype.
> > 
> > Actually, NFS can return SIGBUS, it doesn't care about restarting daemons.
> 
> During KVMForum I suggested to a few people that it could be done
> entirely in userland with PROT_NONE. So the problem is if we do it in
> userland with the current functionality you'll run out of VMAs and
> slowdown performance too much.
> 
> But all you need is the ability to map single pages in the address
> space. The only special requirement is that a new vma must not be
> created during the map operation. It'd be very similar to
> remap_file_pages for MAP_SHARED, it also was created to avoid having
> to create new vmas on a large MAP_SHARED mapping and no other reason
> at all. In our case we deal with a large MAP_ANONYMOUS mapping and we
> must alter the pte without creating new vmas but the problem is very
> similar to remap_file_pages.
> 
> Qemu in the dst node can do:
> 
>   mmap(MAP_ANONYMOUS)
>   fault_area_prepare(start, end, signalnr)
> 
> prepare_fault_area will map the range with the magic pte.
> 
> Then when the signalnr fires, you do:
> 
>  send(givemepageX)
>  recv(&tmpaddr_aligned, PAGE_SIZE,...);
>  fault_area_map(final_dest_aligned, tmpaddr_aligned, size)
> 
> map_fault_area will check the pgprot of the two vmas mapping
> final_dest_aligned and tmpaddr_aligned have the same vma->vm_pgprot
> and various other vma bits, and if all ok, it'll just copy the pte
> from tmpaddr_aligned, to final_dest_aligned and it'll update the
> page->index. It can fail if the page is shared to avoid dealing with
> the non-linearity of the page mapped in multiple vmas.
> 
> You basically need a bypass to avoid altering the pgprot of the vma,
> and enter into the pte a "magic" thing that fires signal handlers
> if accessed, without having to create new vmas. gup/gup_fast and stuff
> should just always fallback into handle_mm_fault when encountering such a
> thing, so returning failure as if gup_fast was run on a address beyond
> the end of the i_size in the MAP_SHARED case.

Yes, it's quite doable in user space (qemu) with a kernel enhancement.
And it would be easy to convert a separate daemon process into a
thread in qemu.

I think it should be done outside of the qemu process, though, for
several reasons. (I'm just repeating the same discussion from the KVM
Forum because no one remembers it.)

- ptrace (and its variants)
  Some people want to investigate guest RAM on the host (with qemu
  stopped or live). For example, the crash utility could be enhanced
  to attach to the qemu process and debug the guest kernel.

- core dump
  The qemu process may core-dump.
  For postmortem analysis, people want to investigate guest RAM.
  Again, the crash utility could be enhanced to read the core file and
  analyze the guest kernel.
  When the core is created, the qemu process is already dead.

Handling the fault in the qemu process precludes these possibilities.


> THP already works on /dev/zero mmaps as long as it's a MAP_PRIVATE,
> KSM should work too but I doubt anybody tested it on MAP_PRIVATE of
> /dev/zero.

Oh, great. It seems they work with anonymous pages generally, even in
a non-anonymous VMA. Is that right?
If so, THP/KSM would work with mmap(MAP_PRIVATE, /dev/umem...), wouldn't they?


> The device driver provides an advantage in being self contained but I
> doubt it's simpler. I suppose after migration is complete you'll still
> switch the vma back to regular anonymous vma so leading to the same
> result?

Yes, that was my original intention.
The page is anonymous, but the vma isn't. I was concerned that
KSM/THP wouldn't work with such pages.
If they do, it isn't necessary to switch the VMA back to anonymous.


> The patch 2/2 is small and self contained so it's quite attractive, I
> didn't see patch 1/2, was it posted?

Posted. It's quite short and trivial; it just does EXPORT_SYMBOL_GPL of
mem_cgroup_cache_charge and shmem_zero_setup.
I include it here for convenience.

From e8bfda16a845eef4381872a331c6f0f200c3f7d7 Mon Sep 17 00:00:00 2001
Message-Id: 

In-Reply-To: 
References: 
From: Isaku Yamahata 
Date: Thu, 11 Aug 2011 20:05:28 +0900
Subject: [PATCH 1/2] export necessary symbols

Signed-off-by: Isaku Yamahata 
---
 mm/memcontrol.c |1 +
 mm/shmem.c  |1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b63f5f7..85530fc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2807,6 +2807,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 
return ret;
 }
+EXPORT_SYMBOL_GPL(mem_cgroup_cache_charge);
 
 /*
  * While swap-in, try_charge -> commit or cancel, the page is locked.
diff --git a/mm/shmem.c b/mm/shmem.c
index d672250..d137a37 100644

Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2012-01-03 Thread Andrea Arcangeli
On Mon, Jan 02, 2012 at 06:55:18PM +0100, Paolo Bonzini wrote:
> On 01/02/2012 06:05 PM, Andrea Arcangeli wrote:
> > On Thu, Dec 29, 2011 at 06:01:45PM +0200, Avi Kivity wrote:
> >> On 12/29/2011 06:00 PM, Avi Kivity wrote:
> >>> The NFS client has exactly the same issue, if you mount it with the intr
> >>> option.  In fact you could use the NFS client as a trivial umem/cuse
> >>> prototype.
> >>
> >> Actually, NFS can return SIGBUS, it doesn't care about restarting daemons.
> >
> > During KVMForum I suggested to a few people that it could be done
> > entirely in userland with PROT_NONE.
> 
> Or MAP_NORESERVE.

MAP_NORESERVE has no effect with the default
/proc/sys/vm/overcommit_memory == 0, and in general has no effect until you
run out of memory. It's an accounting on/off switch only, mostly a noop.

> Anything you do that is CUSE-based should be doable in a separate QEMU 
> thread (rather than a different process that talks to CUSE).  If a 
> userspace CUSE-based solution could be done with acceptable performance, 
> the same thing would have the same or better performance if done 
> entirely within QEMU.

It should be somehow doable within qemu, and the source node could
handle one connection per vcpu thread for the async network page-ins.
 
> > So the problem is if we do it in
> > userland with the current functionality you'll run out of VMAs and
> > slowdown performance too much.
> >
> > But all you need is the ability to map single pages in the address
> > space.
> 
> Would this also let you set different pgprots for different pages in the 
> same VMA?  It would be useful for write barriers in garbage collectors 
> (such as boehm-gc).  These do not have _that_ many VMAs, because every 
> GC cycles could merge all of them back to a single VMA with PROT_READ 
> permissions; however, they still put some strain on the VM subsystem.

Changing permissions sounds more tricky, as more code may make
assumptions about the vma before checking the pte.

Adding a magic unmapped pte entry sounds fairly safe, because there's
a precedent: the migration pte already used by migrate, which halts
page faults and waits. So I guess we could reuse the code that already
exists for the migration entry; we'd just need to fire a signal and
return to userland instead of waiting. The signal should be invoked
before the page fault triggers again. Of course, if the signal handler
returns and does nothing, it'll loop at 100% cpu load, but that's ok.
Maybe it's possible to tweak the permissions, but that will need a lot
more thought. Specifically, for anon pages, marking them readonly
sounds possible if they are supposed to behave like regular COWs (not
segfaulting or anything), as you can already have a mixture of
readonly and read-write ptes (not to mention readonly KSM pages), but
for any other case it's non-trivial. Last but not least, the API here
would be like a vma-less mremap, moving a page from one address to
another without modifying the vmas; the permission tweak sounds more
like an mprotect, so I'm unsure whether one call could do both or
whether it should be an optimization considered independently.

In theory I suspect we could also teach mremap to do a
non-vma-mangling mremap when moving pages that aren't shared, so we
could adjust the page->index of the pages instead of creating new vmas
at the dst address with an adjusted vma->vm_pgoff; but I suspect a
syscall that only works on top of fault-unmapped areas is simpler and
safer. mremap semantics require nuking the dst region before the move
starts. If we taught mremap how to handle fault-unmapped areas, we
could just add one syscall, prepare_fault_area (or whatever name you
choose).

The locking of a vma-less mremap still sounds tricky, but I doubt you
can avoid that locking complexity by using the char device, as long as
the char-device-backed memory still allows THP, migration and swap, if
you want to do it atomically and zero-copy. And I think zero-copy
would be better, especially if the network card is fast and all vcpus
are faulting into unmapped pages simultaneously, triggering a heavy
amount of copying from all physical cpus.

I don't mean the current device driver doing a copy_user won't work or
is a bad idea; it's more self-contained and maybe easier to merge
upstream. I'm just presenting another option, more VM-integrated and
zero-copy, with just 2 syscalls.

For reliability, vmas must not be involved in the mremap, or too much
memory could get pinned in vmas even if we temporarily lift
/proc/sys/vm/max_map_count for the process. Plus, sending another
signal (not SIGSEGV or SIGBUS) should be more reliable in case the
migration crashes for real.


Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2012-01-02 Thread Paolo Bonzini

On 01/02/2012 06:05 PM, Andrea Arcangeli wrote:
> On Thu, Dec 29, 2011 at 06:01:45PM +0200, Avi Kivity wrote:
> > On 12/29/2011 06:00 PM, Avi Kivity wrote:
> > > The NFS client has exactly the same issue, if you mount it with the intr
> > > option.  In fact you could use the NFS client as a trivial umem/cuse
> > > prototype.
> >
> > Actually, NFS can return SIGBUS, it doesn't care about restarting daemons.
>
> During KVMForum I suggested to a few people that it could be done
> entirely in userland with PROT_NONE.

Or MAP_NORESERVE.

Anything you do that is CUSE-based should be doable in a separate QEMU
thread (rather than a different process that talks to CUSE).  If a
userspace CUSE-based solution could be done with acceptable performance,
the same thing would have the same or better performance if done
entirely within QEMU.

> So the problem is if we do it in
> userland with the current functionality you'll run out of VMAs and
> slowdown performance too much.
>
> But all you need is the ability to map single pages in the address
> space.

Would this also let you set different pgprots for different pages in the
same VMA?  It would be useful for write barriers in garbage collectors
(such as boehm-gc).  These do not have _that_ many VMAs, because every
GC cycles could merge all of them back to a single VMA with PROT_READ
permissions; however, they still put some strain on the VM subsystem.


Paolo


Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2012-01-02 Thread Andrea Arcangeli
On Thu, Dec 29, 2011 at 06:01:45PM +0200, Avi Kivity wrote:
> On 12/29/2011 06:00 PM, Avi Kivity wrote:
> > The NFS client has exactly the same issue, if you mount it with the intr
> > option.  In fact you could use the NFS client as a trivial umem/cuse
> > prototype.
> 
> Actually, NFS can return SIGBUS, it doesn't care about restarting daemons.

During KVMForum I suggested to a few people that it could be done
entirely in userland with PROT_NONE. The problem is that if we do it
in userland with the current functionality, you'll run out of VMAs and
slow down performance too much.

But all you need is the ability to map single pages in the address
space. The only special requirement is that a new vma must not be
created during the map operation. It'd be very similar to
remap_file_pages for MAP_SHARED, which was likewise created to avoid
having to create new vmas on a large MAP_SHARED mapping and for no
other reason. In our case we deal with a large MAP_ANONYMOUS mapping,
and we must alter the pte without creating new vmas; the problem is
very similar to remap_file_pages.

Qemu in the dst node can do:

mmap(MAP_ANONYMOUS)
fault_area_prepare(start, end, signalnr)

prepare_fault_area will map the range with the magic pte.

Then when the signalnr fires, you do:

 send(givemepageX)
 recv(&tmpaddr_aligned, PAGE_SIZE,...);
 fault_area_map(final_dest_aligned, tmpaddr_aligned, size)

map_fault_area will check the pgprot of the two vmas mapping
final_dest_aligned and tmpaddr_aligned have the same vma->vm_pgprot
and various other vma bits, and if all ok, it'll just copy the pte
from tmpaddr_aligned, to final_dest_aligned and it'll update the
page->index. It can fail if the page is shared to avoid dealing with
the non-linearity of the page mapped in multiple vmas.

You basically need a bypass that avoids altering the pgprot of the vma
and enters into the pte a "magic" thing that fires signal handlers
when accessed, without having to create new vmas. gup/gup_fast and the
like should just always fall back into handle_mm_fault when
encountering such a thing, returning failure as if gup_fast was run on
an address beyond the end of i_size in the MAP_SHARED case.

THP already works on /dev/zero mmaps as long as it's a MAP_PRIVATE,
KSM should work too but I doubt anybody tested it on MAP_PRIVATE of
/dev/zero.

The device driver provides an advantage in being self-contained, but I
doubt it's simpler. I suppose after migration is complete you'll still
switch the vma back to a regular anonymous vma, leading to the same
result?

Patch 2/2 is small and self-contained, so it's quite attractive. I
didn't see patch 1/2; was it posted?

Thanks,
Andrea


Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2011-12-29 Thread Avi Kivity
On 12/29/2011 06:00 PM, Avi Kivity wrote:
> The NFS client has exactly the same issue, if you mount it with the intr
> option.  In fact you could use the NFS client as a trivial umem/cuse
> prototype.

Actually, NFS can return SIGBUS, it doesn't care about restarting daemons.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2011-12-29 Thread Avi Kivity
On 12/29/2011 05:53 PM, Isaku Yamahata wrote:
> On Thu, Dec 29, 2011 at 04:55:11PM +0200, Avi Kivity wrote:
> > On 12/29/2011 04:49 PM, Isaku Yamahata wrote:
> > > > > Great, then we agreed with list/reattach basically.
> > > > > (Maybe identity scheme needs reconsideration.)
> > > > 
> > > > I guess we miscommunicated.  Why is reattach needed?  If you have the
> > > > fd, nothing else is needed.
> > >
> > > What if malicious process close the fd and does page fault intentionally?
> > > Unkillable process issue remains.
> > > I think we are talking not only qemu case but also general case.
> > 
> > It's not unkillable.  If you sleep with TASK_INTERRUPTIBLE then you can
> > process signals.  This includes SIGKILL.
>
> Hmm, you said that the fault handler doesn't resolve the page fault.
>
> > > Don't resolve the page fault.  It's up to the user/system to make sure
> > > it happens.  qemu can easily do it by watching for the daemon's death
> > > and respawning it.
>
> To kill the process, the fault handler must return resolving the fault.
> It must return something. What do you expect? VM_FAULT_SIGBUS? zero page?

	if (signal_pending(current))
		return VM_FAULT_RETRY;

for SIGKILL, the process dies immediately.  For other unblocked signals,
the process starts executing the signal handler, which isn't dependent
on the faulting page (of course the signal handler may itself fault).

The NFS client has exactly the same issue, if you mount it with the intr
option.  In fact you could use the NFS client as a trivial umem/cuse
prototype.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2011-12-29 Thread Isaku Yamahata
On Thu, Dec 29, 2011 at 04:55:11PM +0200, Avi Kivity wrote:
> On 12/29/2011 04:49 PM, Isaku Yamahata wrote:
> > > > Great, then we agreed with list/reattach basically.
> > > > (Maybe identity scheme needs reconsideration.)
> > > 
> > > I guess we miscommunicated.  Why is reattach needed?  If you have the
> > > fd, nothing else is needed.
> >
> > What if malicious process close the fd and does page fault intentionally?
> > Unkillable process issue remains.
> > I think we are talking not only qemu case but also general case.
> 
> It's not unkillable.  If you sleep with TASK_INTERRUPTIBLE then you can
> process signals.  This includes SIGKILL.

Hmm, you said that the fault handler doesn't resolve the page fault.

> > Don't resolve the page fault.  It's up to the user/system to make sure
> > it happens.  qemu can easily do it by watching for the daemon's death
> > and respawning it.

To kill the process, the fault handler must return resolving the fault.
It must return something. What do you expect? VM_FAULT_SIGBUS? zero page?
-- 
yamahata


Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2011-12-29 Thread Avi Kivity
On 12/29/2011 04:49 PM, Isaku Yamahata wrote:
> > > Great, then we agreed with list/reattach basically.
> > > (Maybe identity scheme needs reconsideration.)
> > 
> > I guess we miscommunicated.  Why is reattach needed?  If you have the
> > fd, nothing else is needed.
>
> What if malicious process close the fd and does page fault intentionally?
> Unkillable process issue remains.
> I think we are talking not only qemu case but also general case.

It's not unkillable.  If you sleep with TASK_INTERRUPTIBLE then you can
process signals.  This includes SIGKILL.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2011-12-29 Thread Isaku Yamahata
On Thu, Dec 29, 2011 at 04:35:36PM +0200, Avi Kivity wrote:
> On 12/29/2011 04:18 PM, Isaku Yamahata wrote:
> > > >
> > > > The issue is how to solve the page fault, not whether 
> > > > TASK_INTERRUPTIBLE or
> > > > TASK_UNINTERRUPTIBLE.
> > > > I can think of several options.
> > > > - When daemon X is dead, all page faults are served by zero pages.
> > > > - When daemon X is dead, all page faults are resolved as VM_FAULT_SIGBUS
> > > > - list/reattach: complications. You don't like it
> > > > - other?
> > > 
> > > Don't resolve the page fault.  It's up to the user/system to make sure
> > > it happens.  qemu can easily do it by watching for the daemon's death
> > > and respawning it.
> > > 
> > > When the new daemon is started, it can ask the kernel for a list of
> > > pending requests, and service them.
> >
> > Great, then we agreed with list/reattach basically.
> > (Maybe identity scheme needs reconsideration.)
> 
> I guess we miscommunicated.  Why is reattach needed?  If you have the
> fd, nothing else is needed.

What if a malicious process closes the fd and then page-faults
intentionally? The unkillable-process issue remains.
I think we are talking about not only the qemu case but also the
general case.
-- 
yamahata


Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2011-12-29 Thread Avi Kivity
On 12/29/2011 04:18 PM, Isaku Yamahata wrote:
> > >
> > > The issue is how to solve the page fault, not whether TASK_INTERRUPTIBLE 
> > > or
> > > TASK_UNINTERRUPTIBLE.
> > > I can think of several options.
> > > - When daemon X is dead, all page faults are served by zero pages.
> > > - When daemon X is dead, all page faults are resolved as VM_FAULT_SIGBUS
> > > - list/reattach: complications. You don't like it
> > > - other?
> > 
> > Don't resolve the page fault.  It's up to the user/system to make sure
> > it happens.  qemu can easily do it by watching for the daemon's death
> > and respawning it.
> > 
> > When the new daemon is started, it can ask the kernel for a list of
> > pending requests, and service them.
>
> Great, then we agreed with list/reattach basically.
> (Maybe identity scheme needs reconsideration.)

I guess we miscommunicated.  Why is reattach needed?  If you have the
fd, nothing else is needed.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2011-12-29 Thread Isaku Yamahata
On Thu, Dec 29, 2011 at 03:52:58PM +0200, Avi Kivity wrote:
> On 12/29/2011 03:49 PM, Isaku Yamahata wrote:
> > > 
> > > qemu can have an extra thread that wait4()s the daemon, and relaunch
> > > it.  This extra thread would not be blocked by the page fault.  It can
> > > keep the fd so it isn't lost.
> > > 
> > > The unkillability of process A is a security issue; it could be done on
> > > purpose.  Is it possible to change umem to sleep with
> > > TASK_INTERRUPTIBLE, so it can be killed?
> >
> > The issue is how to solve the page fault, not whether TASK_INTERRUPTIBLE or
> > TASK_UNINTERRUPTIBLE.
> > I can think of several options.
> > - When daemon X is dead, all page faults are served by zero pages.
> > - When daemon X is dead, all page faults are resolved as VM_FAULT_SIGBUS
> > - list/reattach: complications. You don't like it
> > - other?
> 
> Don't resolve the page fault.  It's up to the user/system to make sure
> it happens.  qemu can easily do it by watching for the daemon's death
> and respawning it.
> 
> When the new daemon is started, it can ask the kernel for a list of
> pending requests, and service them.

Great, then we agreed with list/reattach basically.
(Maybe identity scheme needs reconsideration.)
-- 
yamahata


Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2011-12-29 Thread Avi Kivity
On 12/29/2011 03:49 PM, Isaku Yamahata wrote:
> > 
> > qemu can have an extra thread that wait4()s the daemon, and relaunch
> > it.  This extra thread would not be blocked by the page fault.  It can
> > keep the fd so it isn't lost.
> > 
> > The unkillability of process A is a security issue; it could be done on
> > purpose.  Is it possible to change umem to sleep with
> > TASK_INTERRUPTIBLE, so it can be killed?
>
> The issue is how to solve the page fault, not whether TASK_INTERRUPTIBLE or
> TASK_UNINTERRUPTIBLE.
> I can think of several options.
> - When daemon X is dead, all page faults are served by zero pages.
> - When daemon X is dead, all page faults are resolved as VM_FAULT_SIGBUS
> - list/reattach: complications. You don't like it
> - other?

Don't resolve the page fault.  It's up to the user/system to make sure
it happens.  qemu can easily do it by watching for the daemon's death
and respawning it.

When the new daemon is started, it can ask the kernel for a list of
pending requests, and service them.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2011-12-29 Thread Isaku Yamahata
On Thu, Dec 29, 2011 at 02:55:42PM +0200, Avi Kivity wrote:
> On 12/29/2011 02:39 PM, Isaku Yamahata wrote:
> > > > ioctl commands:
> > > >
> > > > UMEM_DEV_CRATE_UMEM: create umem device for qemu
> > > > UMEM_DEV_LIST: list created umem devices
> > > > UMEM_DEV_REATTACH: re-attach the created umem device
> > > >   UMEM_DEV_LIST and UMEM_DEV_REATTACH are used when
> > > >   the process that services page faults disappears or gets stuck.
> > > >   Then, the administrator can list the umem devices and unblock
> > > >   the process which is waiting for a page.
> > > 
> > > Ah, I asked about this in my patch comments.  I think this is done
> > > better by using SCM_RIGHTS to pass fds along, or asking qemu to launch a
> > > new process.
> >
> > Can you please elaborate? I think the ways you are suggesting don't solve
> > the issue. Let me clarify the problem.
> >
> >   process A (typically incoming qemu)
> >  |
> >  | mmap("/dev/umem") and access those pages triggering page faults
> >  | (the file descriptor might be closed after mmap() before page faults)
> >  |
> >  V
> >/dev/umem
> >  ^
> >  |
> >  |
> >daemon X resolving page faults triggered by process A
> >(typically this daemon forked from incoming qemu:process A)
> >
> > If daemon X disappears accidentally, there is no one that resolves
> > page faults of process A. At this moment process A is blocked due to page
> > fault. There is no file descriptor available corresponding to the VMA.
> > At that point there is no way to kill process A short of a system reboot.
> 
> qemu can have an extra thread that wait4()s the daemon, and relaunches
> it.  This extra thread would not be blocked by the page fault.  It can
> keep the fd so it isn't lost.
> 
> The unkillability of process A is a security issue; it could be done on
> purpose.  Is it possible to change umem to sleep with
> TASK_INTERRUPTIBLE, so it can be killed?

The issue is how to solve the page fault, not whether TASK_INTERRUPTIBLE or
TASK_UNINTERRUPTIBLE.
I can think of several options.
- When daemon X is dead, all page faults are served by zero pages.
- When daemon X is dead, all page faults are resolved as VM_FAULT_SIGBUS
- list/reattach: complications. You don't like it
- other?


> > > Introducing a global namespace has a lot of complications attached.
> > > 
> > > >
> > > > UMEM_GET_PAGE_REQUEST: retrieve page faults of the qemu process
> > > > UMEM_MARK_PAGE_CACHED: mark the specified pages (pulled from the source
> > > >by the daemon) as cached
> > > >
> > > > UMEM_MAKE_VMA_ANONYMOUS: make the specified vma in the qemu process
> > > >anonymous. This is _NOT_ implemented yet;
> > > >I'm not sure whether this can be implemented
> > > >or not.
> > > 
> > > How do we find out?  This is fairly important, stuff like transparent
> > > hugepages and ksm only work on anonymous memory.
> >
> > I agree that this is important.
> > At KVM-forum 2011, Andrea said THP and KSM work with non-anonymous VMAs.
> > (Or at least he'll look into that stuff. My memory is vague, though.
> >  Please correct me if I'm wrong)
> 
> += Andrea (who can also provide feedback on umem in general)
> 
> -- 
> error compiling committee.c: too many arguments to function
> 

-- 
yamahata


Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2011-12-29 Thread Avi Kivity
On 12/29/2011 02:39 PM, Isaku Yamahata wrote:
> > > ioctl commands:
> > >
> > > UMEM_DEV_CRATE_UMEM: create umem device for qemu
> > > UMEM_DEV_LIST: list created umem devices
> > > UMEM_DEV_REATTACH: re-attach the created umem device
> > > UMEM_DEV_LIST and UMEM_DEV_REATTACH are used when
> > > the process that services page faults disappears or gets stuck.
> > > Then, the administrator can list the umem devices and unblock
> > > the process which is waiting for a page.
> > 
> > Ah, I asked about this in my patch comments.  I think this is done
> > better by using SCM_RIGHTS to pass fds along, or asking qemu to launch a
> > new process.
>
> Can you please elaborate? I think the ways you are suggesting don't solve
> the issue. Let me clarify the problem.
>
>   process A (typically incoming qemu)
>  |
>  | mmap("/dev/umem") and access those pages triggering page faults
>  | (the file descriptor might be closed after mmap() before page faults)
>  |
>  V
>/dev/umem
>  ^
>  |
>  |
>daemon X resolving page faults triggered by process A
>(typically this daemon forked from incoming qemu:process A)
>
> If daemon X disappears accidentally, there is no one that resolves
> page faults of process A. At this moment process A is blocked due to page
> fault. There is no file descriptor available corresponding to the VMA.
> At that point there is no way to kill process A short of a system reboot.

qemu can have an extra thread that wait4()s the daemon, and relaunches
it.  This extra thread would not be blocked by the page fault.  It can
keep the fd so it isn't lost.

The unkillability of process A is a security issue; it could be done on
purpose.  Is it possible to change umem to sleep with
TASK_INTERRUPTIBLE, so it can be killed?

> > Introducing a global namespace has a lot of complications attached.
> > 
> > >
> > > UMEM_GET_PAGE_REQUEST: retrieve page faults of the qemu process
> > > UMEM_MARK_PAGE_CACHED: mark the specified pages (pulled from the source
> > >by the daemon) as cached
> > >
> > > UMEM_MAKE_VMA_ANONYMOUS: make the specified vma in the qemu process
> > >anonymous. This is _NOT_ implemented yet;
> > >I'm not sure whether this can be implemented
> > >or not.
> > 
> > How do we find out?  This is fairly important, stuff like transparent
> > hugepages and ksm only work on anonymous memory.
>
> I agree that this is important.
> At KVM-forum 2011, Andrea said THP and KSM work with non-anonymous VMAs.
> (Or at least he'll look into that stuff. My memory is vague, though.
>  Please correct me if I'm wrong)

+= Andrea (who can also provide feedback on umem in general)

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2011-12-29 Thread Isaku Yamahata
On Thu, Dec 29, 2011 at 01:24:32PM +0200, Avi Kivity wrote:
> On 12/29/2011 03:26 AM, Isaku Yamahata wrote:
> > This is a Linux kernel driver for qemu/kvm postcopy live migration.
> > It is used by the qemu/kvm postcopy live migration patch.
> >
> > TODO:
> > - Consider FUSE/CUSE option
> >   So far several mmap patches for FUSE/CUSE are floating around (their
> >   purpose isn't different from ours, though). They haven't been merged
> >   upstream yet.
> >   The driver-specific part of the qemu patches is modularized, so I expect
> >   it wouldn't be difficult to switch to a CUSE-based driver.
> 
> It would be good to get more input about this, please involve lkml and
> the FUSE/CUSE people.

Okay.


> > ioctl commands:
> >
> > UMEM_DEV_CRATE_UMEM: create umem device for qemu
> > UMEM_DEV_LIST: list created umem devices
> > UMEM_DEV_REATTACH: re-attach the created umem device
> >   UMEM_DEV_LIST and UMEM_DEV_REATTACH are used when
> >   the process that services page faults disappears or gets stuck.
> >   Then, the administrator can list the umem devices and unblock
> >   the process which is waiting for a page.
> 
> Ah, I asked about this in my patch comments.  I think this is done
> better by using SCM_RIGHTS to pass fds along, or asking qemu to launch a
> new process.

Can you please elaborate? I think the ways you are suggesting don't solve
the issue. Let me clarify the problem.

  process A (typically incoming qemu)
 |
 | mmap("/dev/umem") and access those pages triggering page faults
 | (the file descriptor might be closed after mmap() before page faults)
 |
 V
   /dev/umem
 ^
 |
 |
   daemon X resolving page faults triggered by process A
   (typically this daemon forked from incoming qemu:process A)

If daemon X disappears accidentally, there is no one that resolves
page faults of process A. At this moment process A is blocked due to page
fault. There is no file descriptor available corresponding to the VMA.
At that point there is no way to kill process A short of a system reboot.


> Introducing a global namespace has a lot of complications attached.
> 
> >
> > UMEM_GET_PAGE_REQUEST: retrieve page faults of the qemu process
> > UMEM_MARK_PAGE_CACHED: mark the specified pages (pulled from the source
> >by the daemon) as cached
> >
> > UMEM_MAKE_VMA_ANONYMOUS: make the specified vma in the qemu process
> >anonymous. This is _NOT_ implemented yet;
> >I'm not sure whether this can be implemented
> >or not.
> 
> How do we find out?  This is fairly important, stuff like transparent
> hugepages and ksm only work on anonymous memory.

I agree that this is important.
At KVM-forum 2011, Andrea said THP and KSM work with non-anonymous VMAs.
(Or at least he'll look into that stuff. My memory is vague, though.
 Please correct me if I'm wrong)
-- 
yamahata


Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2011-12-29 Thread Avi Kivity
On 12/29/2011 03:26 AM, Isaku Yamahata wrote:
> This is a Linux kernel driver for qemu/kvm postcopy live migration.
> It is used by the qemu/kvm postcopy live migration patch.
>
> TODO:
> - Consider FUSE/CUSE option
>   So far several mmap patches for FUSE/CUSE are floating around (their
>   purpose isn't different from ours, though). They haven't been merged
>   upstream yet.
>   The driver-specific part of the qemu patches is modularized, so I expect
>   it wouldn't be difficult to switch to a CUSE-based driver.

It would be good to get more input about this, please involve lkml and
the FUSE/CUSE people.

> ioctl commands:
>
> UMEM_DEV_CRATE_UMEM: create umem device for qemu
> UMEM_DEV_LIST: list created umem devices
> UMEM_DEV_REATTACH: re-attach the created umem device
> UMEM_DEV_LIST and UMEM_DEV_REATTACH are used when
> the process that services page faults disappears or gets stuck.
> Then, the administrator can list the umem devices and unblock
> the process which is waiting for a page.

Ah, I asked about this in my patch comments.  I think this is done
better by using SCM_RIGHTS to pass fds along, or asking qemu to launch a
new process.

Introducing a global namespace has a lot of complications attached.

>
> UMEM_GET_PAGE_REQUEST: retrieve page faults of the qemu process
> UMEM_MARK_PAGE_CACHED: mark the specified pages (pulled from the source
>by the daemon) as cached
>
> UMEM_MAKE_VMA_ANONYMOUS: make the specified vma in the qemu process
>anonymous. This is _NOT_ implemented yet;
>I'm not sure whether this can be implemented
>or not.

How do we find out?  This is fairly important, stuff like transparent
hugepages and ksm only work on anonymous memory.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2011-12-28 Thread Isaku Yamahata
On Thu, Dec 29, 2011 at 10:26:16AM +0900, Isaku Yamahata wrote:

> UMEM_DEV_LIST: list created umem devices
> UMEM_DEV_REATTACH: re-attach the created umem device
> UMEM_DEV_LIST and UMEM_DEV_REATTACH are used when
> the process that services page faults disappears or gets stuck.
> Then, the administrator can list the umem devices and unblock
> the process which is waiting for a page.

Here is a simple utility which cleans up umem devices.

---

/*
 * simple cleanup utility for umem devices
 *
 * Copyright (c) 2011,
 * National Institute of Advanced Industrial Science and Technology
 *
 * https://sites.google.com/site/grivonhome/quick-kvm-migration
 * Author: Isaku Yamahata 
 *
 * This program is free software; you can redistribute it and/or modify it
 * under the terms and conditions of the GNU General Public License,
 * version 2, as published by the Free Software Foundation.
 *
 * This program is distributed in the hope it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
 * more details.
 *
 * You should have received a copy of the GNU General Public License along
 * with this program; if not, see <http://www.gnu.org/licenses/>.
 */

#include <err.h>
#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#include <sys/ioctl.h>
#include <unistd.h>

#include <linux/umem.h>

void mark_all_pages_cached(int umem_dev_fd, const char *id, const char *name)
{
    struct umem_create create;
    memset(&create, 0, sizeof(create));
    strncpy(create.name.id, id, sizeof(create.name.id));
    strncpy(create.name.name, name, sizeof(create.name.name));

    if (ioctl(umem_dev_fd, UMEM_DEV_REATTACH, &create) < 0) {
        err(EXIT_FAILURE, "UMEM_DEV_REATTACH");
    }

    close(create.shmem_fd);
    long page_size = sysconf(_SC_PAGESIZE);
    int page_shift = ffs(page_size) - 1;
    int umem_fd = create.umem_fd;
    printf("umem_fd %d size %"PRId64"\n", umem_fd, (uint64_t)create.size);

    __u64 i;
    __u64 e_pgoff = (create.size + page_size - 1) >> page_shift;
#define UMEM_CACHED_MAX 512
    __u64 pgoffs[UMEM_CACHED_MAX];
    struct umem_page_cached page_cached = {
        .nr = 0,
        .pgoffs = pgoffs,
    };

    for (i = 0; i < e_pgoff; i++) {
        page_cached.pgoffs[page_cached.nr] = i;
        page_cached.nr++;
        if (page_cached.nr == UMEM_CACHED_MAX) {
            if (ioctl(umem_fd, UMEM_MARK_PAGE_CACHED,
                      &page_cached) < 0) {
                err(EXIT_FAILURE, "UMEM_MARK_PAGE_CACHED");
            }
            page_cached.nr = 0;
        }
    }
    if (page_cached.nr > 0) {
        if (ioctl(umem_fd, UMEM_MARK_PAGE_CACHED, &page_cached) < 0) {
            err(EXIT_FAILURE, "UMEM_MARK_PAGE_CACHED");
        }
    }
    close(umem_fd);
}

#define DEV_UMEM "/dev/umem"

int main(int argc, char **argv)
{
    const char *id = NULL;
    const char *name = NULL;
    if (argc >= 2) {
        id = argv[1];
    }
    if (argc >= 3) {
        name = argv[2];
    }

    int umem_dev_fd = open(DEV_UMEM, O_RDWR);
    if (umem_dev_fd < 0) {
        perror("can't open "DEV_UMEM);
        exit(EXIT_FAILURE);
    }

    struct umem_list tmp_ulist = {
        .nr = 0,
    };
    if (ioctl(umem_dev_fd, UMEM_DEV_LIST, &tmp_ulist) < 0) {
        err(EXIT_FAILURE, "UMEM_DEV_LIST");
    }
    if (tmp_ulist.nr == 0) {
        printf("no umem files\n");
        exit(EXIT_SUCCESS);
    }
    struct umem_list *ulist = malloc(
        sizeof(*ulist) + sizeof(ulist->names[0]) * tmp_ulist.nr);
    ulist->nr = tmp_ulist.nr;
    if (ioctl(umem_dev_fd, UMEM_DEV_LIST, ulist) < 0) {
        err(EXIT_FAILURE, "UMEM_DEV_LIST");
    }

    uint32_t i;
    for (i = 0; i < ulist->nr; ++i) {
        char *u_id = ulist->names[i].id;
        char *u_name = ulist->names[i].name;

        char tmp_id_c = u_id[UMEM_ID_MAX - 1];
        char tmp_name_c = u_name[UMEM_NAME_MAX - 1];
        u_id[UMEM_ID_MAX - 1] = '\0';
        u_name[UMEM_NAME_MAX - 1] = '\0';
        printf("%d: id: %s name: %s\n", i, u_id, u_name);

        if ((id != NULL || name != NULL) &&
            (id == NULL || strncmp(id, u_id, UMEM_ID_MAX) == 0) &&
            (name == NULL ||
             strncmp(name, u_name, UMEM_NAME_MAX) == 0)) {
            printf("marking cached: %d: id: %s name: %s\n",
                   i, u_id, u_name);
            /* the archived message is truncated here; the remainder
             * presumably restores the saved bytes and marks the pages */
            u_id[UMEM_ID_MAX - 1] = tmp_id_c;
            u_name[UMEM_NAME_MAX - 1] = tmp_name_c;
            mark_all_pages_cached(umem_dev_fd, u_id, u_name);
        }
    }
    close(umem_dev_fd);
    return EXIT_SUCCESS;
}

[PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

2011-12-28 Thread Isaku Yamahata
This is a Linux kernel driver for qemu/kvm postcopy live migration.
It is used by the qemu/kvm postcopy live migration patch.

TODO:
- Consider FUSE/CUSE option
  So far several mmap patches for FUSE/CUSE are floating around (their
  purpose isn't different from ours, though). They haven't been merged
  upstream yet.
  The driver-specific part of the qemu patches is modularized, so I expect
  it wouldn't be difficult to switch to a CUSE-based driver.

ioctl commands:

UMEM_DEV_CRATE_UMEM: create umem device for qemu
UMEM_DEV_LIST: list created umem devices
UMEM_DEV_REATTACH: re-attach the created umem device
  UMEM_DEV_LIST and UMEM_DEV_REATTACH are used when
  the process that services page faults disappears or gets stuck.
  Then, the administrator can list the umem devices and unblock
  the process which is waiting for a page.

UMEM_GET_PAGE_REQUEST: retrieve page faults of the qemu process
UMEM_MARK_PAGE_CACHED: mark the specified pages (pulled from the source
   by the daemon) as cached

UMEM_MAKE_VMA_ANONYMOUS: make the specified vma in the qemu process
 anonymous. This is _NOT_ implemented yet;
 I'm not sure whether this can be implemented
 or not.


---
Changes version 1 -> 2:
- make ioctl structures padded to align
- un-KVM
  KVM_VMEM -> UMEM
- dropped some ioctl commands as Avi requested

Isaku Yamahata (2):
  export necessary symbols
  umem: chardevice for kvm postcopy

 drivers/char/Kconfig  |9 +
 drivers/char/Makefile |1 +
 drivers/char/umem.c   |  898 +
 include/linux/umem.h  |   83 +
 mm/memcontrol.c   |1 +
 mm/shmem.c|1 +
 6 files changed, 993 insertions(+), 0 deletions(-)
 create mode 100644 drivers/char/umem.c
 create mode 100644 include/linux/umem.h
