Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-11-26 Thread Juan Quintela
Hi,

your console works great, but the rest of the patches assume:

arch/x86/boot/compressed/notes-xen.c
arch/x86/xen/early.c

at least.  It looks as if another patch is missing; could you
take a look, please?
Otherwise, I will take a look at what is missing.

It breaks with:

Intel machine check architecture supported.
(XEN) traps.c:1734:d0 Domain attempted WRMSR 0404 from :0001 to
:.
Intel machine check reporting enabled on CPU#0.
general protection fault:  [#1] SMP
Modules linked in:

Pid: 1, comm: swapper Not tainted (2.6.24-rc3-q2 #10)
EIP: 0061:[] EFLAGS: 00010082 CPU: 0
EIP is at native_write_cr0+0x0/0x4
EAX: c005003b EBX: c03902a0 ECX: ed03f288 EDX: 0005
ESI: c1c10c80 EDI: ed054200 EBP: 0001 ESP: ed027eb8
 DS: 007b ES: 007b FS: 00d8 GS:  SS: e021
Process swapper (pid: 1, ti=ed027000 task=ed03ebb0 task.ti=ed027000)
Stack: c01125e9  c03902a0 c1c10c80 ed054200 c01128c6 c03900a0 0008
   c010e0aa c037b48d  ed00efa0 ed027f24 000a c035215c c01e20a7
   c1c10c80 8008 06f4 00020800 c0143563 ed03ebb0 017fe000 c03902a0
Call Trace:
 [] prepare_set+0x20/0x86
 [] generic_set_all+0x28/0x34a
 [] identify_cpu+0x525/0x52d
 [] kvasprintf+0x3f/0x48
 [] trace_hardirqs_off+0x28/0xa1
 [] mtrr_ap_init+0x33/0x5d
 [] smp_store_cpu_info+0x32/0xb9
 [] xen_cpu_up+0x22c/0x3b4
 [] _cpu_up+0xab/0x120
 [] cpu_up+0x4e/0x61
 [] kernel_init+0x9e/0x2c6
 [] restore_nocheck+0x12/0x15
 [] kernel_init+0x0/0x2c6
 [] kernel_init+0x0/0x2c6
 [] kernel_thread_helper+0x7/0x10
 ===
Code: 53 89 cb 83 ec 08 89 14 24 89 da 8b 04 24 89 4c 24 04 89 f9 0f 30 31 c0 5a
 59 5b 5e 5f c3 0f 31 c3 0f 33 c3 0f 06 c3 0f 20 c0 c3 <0f> 22 c0 c3 0f 20 e0 c3
 31 c0 0f 20 e0 c3 0f 09 c3 0f 01 00 c3
EIP: [] native_write_cr0+0x0/0x4 SS:ESP e021:ed027eb8
Kernel panic - not syncing: Attempted to kill init!


Later, Juan.


On Nov 22, 2007 12:12 AM, Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:
> Stephen C. Tweedie wrote:
> > I've been looking at the next steps to try to get Xen running fully on
> > top of pv_ops.  To that end, I've (just) started looking at one of the
> > next major jobs --- i686 dom0 on pv_ops.
> >
>
> Great!
>
> > There are still a number of things needing done to reach parity with
> > xen-unstable:
> >
> >   x86_64 xen on pv_ops
> >
>
> I think once pvops has been unified, Xen support should be fairly
> straightforward.  I wrote most of the existing code with 64-bit in mind,
> so I'm hoping I got it right...
>
> >   Paravirt framebuffer/keyboard
> >   CPU hotplug
> >   Balloon
> >
>
> I've done some preliminary work on balloon and hotplug.  I think balloon
> should make more use of memory hotplug, but a straight port would be a
> good first step.
>
> >   kexec
> >   driver domains
> >
> > but it looks like these can largely proceed in parallel if desired.
> >
> > My short-term goal with this is simply to come up with a first-pass
> > merge of the linux-2.6.18-xen.hg dom0 support into the current
> > kernel.org tree's pv_ops support.  No major refactoring in the first
> > pass, but absolutely no *-xen.c code copying.
> >
>
> Yes.  #ifdefs are the way to go here.
>
> > I'm just starting this, but at least with the version magic check (see
> >
> >   
> > http://lists.xensource.com/archives/html/xen-devel/2007-11/msg00601.html
> >
>
> I was just about to post a fix for this.
>
> > ) out of the way, an SMP dom0 running pv_ops gets all the way through
> > start_kernel() and into rest_init() before dying with an unsupported cr0
> > write.  (I'm using direct console hypercalls for printk for now, full
> > xencons is not working yet.)
> >
>
> I have some early dom0 patches already, though they're a few months old
> now.  Not much there, but I did do an early console implementation.
>
> > I'm happy to put up a git tree for this once it gets anywhere.  We'd
> > need to decide which tree to track for that purpose --- Linus's, or
> > perhaps the tglx or mingo x86 merge tree might make more sense.
> >
>
> Yes, I think the x86 tree is where we need to be, since there's a lot of
> activity there.
>
> I'll attach my dom0 patches for whatever use you can make of them.  They
> definitely won't apply to anything, not least because of the arch merge
> (though it looks like they did get converted by script), but also
> because they're based on some defunct experimental booting-from-bzImage
> patches.  But perhaps there's some useful stuff in there.
>
> I've also attached my xen-balloon and hotplug patches as-is.  They don't
> work completely, but they should be closer to applying.
>
> J
>
> ___
> Xen-devel mailing list
> [EMAIL PROTECTED]
> http://lists.xensource.com/xen-devel
>
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-11-26 Thread Jeremy Fitzhardinge
Juan Quintela wrote:
> Hi,
>
> your console works great, but the rest of the patches assume:
>
> arch/x86/boot/compressed/notes-xen.c
> arch/x86/xen/early.c
>   

Yes, those are leftovers from a somewhat unsuccessful attempt at getting
ELF-in-bzImage booting working.  I need to go back and make bzImage
booting work properly.

I posted those patches as a source of possibly useful code
snippets/summary of things I've looked at so far, rather than something
that can be directly used.

> at least.  It looks as if another patch is missing; could you
> take a look, please?
> Otherwise, I will take a look at what is missing.
>
> It breaks with:
>
> Intel machine check architecture supported.
> (XEN) traps.c:1734:d0 Domain attempted WRMSR 0404 from :0001 to
> :.
> Intel machine check reporting enabled on CPU#0.
> general protection fault:  [#1] SMP
> Modules linked in:
>   

Hm.  Looks like Xen is getting upset about dom0 trying to disable
caching.  No, wait: 0x:?  That's strange; I wonder if
it's just misreporting the value, because the code doesn't look like it's
trying to write that.

Either way, the fix is to implement xen_write_cr0, and mask off any bits
that Xen won't want us to set/clear (or if it doesn't allow dom0 to
change cr0, just ignore all updates).
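
A minimal sketch of that masking idea, written as plain user-space C rather
than kernel code.  The bit positions are the architectural CR0 definitions;
which bits Xen would actually let dom0 change is not settled here, so
GUEST_CR0_ALLOWED (only TS in this example) and the helper name
filter_cr0_write are assumptions for illustration, not the real
xen_write_cr0:

```c
/* Architectural CR0 bit positions. */
#define CR0_TS (1UL << 3)   /* task switched */
#define CR0_NW (1UL << 29)  /* not write-through */
#define CR0_CD (1UL << 30)  /* cache disable */

/*
 * ASSUMPTION: the set of bits the hypervisor lets the guest toggle.
 * Only TS is allowed here, purely as an example policy.
 */
#define GUEST_CR0_ALLOWED CR0_TS

/*
 * Hypothetical helper: given the current cr0 and the value the kernel
 * asked for, return the value that would actually be applied.  Bits
 * outside GUEST_CR0_ALLOWED keep their current state, so a request to
 * set CD (as prepare_set() makes) is dropped instead of faulting.
 */
static unsigned long filter_cr0_write(unsigned long cur,
                                      unsigned long requested)
{
    return (cur & ~GUEST_CR0_ALLOWED) | (requested & GUEST_CR0_ALLOWED);
}
```

With an empty allowed mask the same helper degenerates to "ignore all
updates", which covers the second option above.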

J


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-11-27 Thread Jan Beulich
>> It breaks with:
>>
>> Intel machine check architecture supported.
>> (XEN) traps.c:1734:d0 Domain attempted WRMSR 0404 from :0001 to
>> :.
>> Intel machine check reporting enabled on CPU#0.
>> general protection fault:  [#1] SMP
>> Modules linked in:
>>   
>
>Hm.  Looks like Xen is getting upset about dom0 trying to disable
>caching.  No, wait: 0x:?  That's strange; I wonder if
>it's just misreporting the value, because the code doesn't look like it's
>trying to write that.
>
>Either way, the fix is to implement xen_write_cr0, and mask off any bits
>that Xen won't want us to set/clear (or if it doesn't allow dom0 to
>change cr0, just ignore all updates).

Why do you think that's a CR0 write? The messages clearly indicate an
MSR write, and these writes are clearly visible in intel_p{4,6}_mcheck_init()
and amd_mcheck_init(). The question is why intel_p4_mcheck_init() doesn't
check CPUID bits before trying to touch any registers... (And similarly
amd_mcheck_init() is checking only the MCE bit, not the MCA one.)

But then I just noticed that Xen itself doesn't clear the MCE/MCA bits either
in emulate_forced_invalid_op(), apparently under the assumption that PV
guests wouldn't try to make use of this feature.

A simple workaround would be to force mce_disabled to 1 in early Xen
initialization.
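
A toy sketch of that workaround, assuming the machine-check setup consults
a single mce_disabled flag before touching any MSRs.  The toy_* function
names and the msr_writes counter are invented for illustration; only the
flag name comes from the text above:

```c
static int mce_disabled;   /* same name as the kernel flag, toy copy */
static int msr_writes;     /* counts simulated MCE-related MSR writes */

/* Stand-in for the per-CPU machine-check setup. */
static void toy_mcheck_init(void)
{
    if (mce_disabled == 1)
        return;            /* skip all MCE/MCA register programming */
    msr_writes++;          /* stands in for the wrmsr calls */
}

/* Hypothetical early Xen setup hook: force machine checks off. */
static void toy_xen_early_init(void)
{
    mce_disabled = 1;
}
```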

Jan




Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-11-27 Thread Jeremy Fitzhardinge
Jan Beulich wrote:
>>> It breaks with:
>>>
>>> Intel machine check architecture supported.
>>> (XEN) traps.c:1734:d0 Domain attempted WRMSR 0404 from :0001 to
>>> :.
>>> Intel machine check reporting enabled on CPU#0.
>>> general protection fault:  [#1] SMP
>>> Modules linked in:
>>>   
>>>   
>> Hm.  Looks like Xen is getting upset about dom0 trying to disable
>> caching.  No, wait: 0x:?  That's strange; I wonder if
>> it's just misreporting the value, because the code doesn't look like it's
>> trying to write that.
>>
>> Either way, the fix is to implement xen_write_cr0, and mask off any bits
>> that Xen won't want us to set/clear (or if it doesn't allow dom0 to
>> change cr0, just ignore all updates).
>> 
>
> Why do you think that's a CR0 write? 

Well, the oops says "EIP is at native_write_cr0+0x0/0x4", and the caller
is prepare_set(), which does:

        /*  Enter the no-fill (CD=1, NW=0) cache mode and flush caches. */
        cr0 = read_cr0() | X86_CR0_CD;
        write_cr0(cr0);
        wbinvd();

This is in preparation for setting up the MTRRs, which all needs to be
skipped anyway.

> The messages clearly indicate an
> MSR write, and these writes are clearly visible in intel_p{4,6}_mcheck_init()
> and amd_mcheck_init(). The question is why intel_p4_mcheck_init() doesn't
> check CPUID bits before trying to touch any registers... (And similarly
> amd_mcheck_init() is checking only the MCE bit, not the MCA one.)
>   

The oops and backtrace don't suggest it's an MSR write.  Does a crX
write take the same path through the emulator as an MSR write?

> A simple workaround would be to force mce_disabled to 1 in early Xen
> initialization.
>   

That's probably necessary too.

J


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-11-27 Thread Jan Beulich
>The oops and backtrace don't suggest it's an MSR write.  Does a crX

Oh, right, the MSR write is being ignored, not failed.

>write take the same path through the emulator as an MSR write?

No, the two operations take different paths.

Jan




Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-11-27 Thread Stephen C. Tweedie
Hi,

On Tue, 2007-11-27 at 09:00 -0800, Jeremy Fitzhardinge wrote:

> > Why do you think that's a CR0 write? 
> 
> Well, the oops says "EIP is at native_write_cr0+0x0/0x4", and the caller
> is prepare_set(), which does:
> 
>   /*  Enter the no-fill (CD=1, NW=0) cache mode and flush caches. */
>   cr0 = read_cr0() | X86_CR0_CD;
>   write_cr0(cr0);
>   wbinvd();
> 
> This is in preparation for setting up the MTRRs, 

Right: cpu 0 gets past the early mtrr init (on the boot CPU, all the
kernel does is to probe the existing mtrr config), but it dies on cpu 1
trying to copy the mtrr config across.  (As a consequence, we don't hit
this problem on UP configs.)

> which all needs to be skipped anyway.

We _could_ just skip it, but we still want some mtrr support for dom0.
Fortunately the kernel's mtrr interfaces are nicely modular already, so
I'm currently starting to plug the 2.6.18 mtrr/main-xen.c into pv_ops as
a modular mtrr provider.
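
The modularity in question is a table of function pointers that the generic
MTRR code calls through.  A toy user-space sketch of how a Xen provider
could slot in follows; the struct and function names are hypothetical
stand-ins, not the kernel's mtrr interface or the 2.6.18 main-xen.c code,
and the empty Xen set_all is an assumption about what dom0 would need:

```c
/* Hypothetical, simplified provider interface. */
struct mtrr_backend {
    const char *name;
    /* Program one range; returns 0 on success. */
    int (*set)(unsigned int reg, unsigned long base, unsigned long size);
    /* Replicate the boot CPU's settings on a secondary CPU. */
    void (*set_all)(void);
};

static int native_calls;   /* counts simulated direct register accesses */
static int xen_calls;      /* counts simulated hypercalls */

static int native_set(unsigned int reg, unsigned long base, unsigned long size)
{
    (void)reg; (void)base; (void)size;
    native_calls++;        /* would write cr0 and the MTRR MSRs */
    return 0;
}

static void native_set_all(void)
{
    native_calls++;        /* the path that faulted on CPU 1 */
}

static int xen_set(unsigned int reg, unsigned long base, unsigned long size)
{
    (void)reg; (void)base; (void)size;
    xen_calls++;           /* would ask the hypervisor instead */
    return 0;
}

static void xen_set_all(void)
{
    /* ASSUMPTION: nothing to copy, the hypervisor owns per-CPU state. */
}

static const struct mtrr_backend native_backend = {
    "native", native_set, native_set_all
};
static const struct mtrr_backend xen_backend = {
    "xen", xen_set, xen_set_all
};

static const struct mtrr_backend *mtrr_if = &native_backend;

/* Pick the provider once, early in boot. */
static void mtrr_select_backend(int running_on_xen)
{
    mtrr_if = running_on_xen ? &xen_backend : &native_backend;
}

/* Generic code run during secondary CPU bring-up. */
static void mtrr_ap_init_sketch(void)
{
    mtrr_if->set_all();
}
```

With the Xen provider selected, secondary CPU bring-up never reaches the
cr0 write, while explicit range requests still go somewhere useful.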

> > The messages clearly indicate an
> > MSR write, and these writes are clearly visible in 
> > intel_p{4,6}_mcheck_init()
> > and amd_mcheck_init(). The question is why intel_p4_mcheck_init() doesn't
> > check CPUID bits before trying to touch any registers... (And similarly
> > amd_mcheck_init() is checking only the MCE bit, not the MCA one.)

> The oops and backtrace don't suggest it's an MSR write.  Does a crX
> write take the same path through the emulator as an MSR write?

We get a slew of MCE-related MSR write warnings from the HV on both boot
and auxiliary processor bring-up, but the kernel doesn't crash at those
points.  (Which is not necessarily a good thing, as it implies the
kernel thinks it has registered its MCE handling, but the MSR writes
have not actually been honoured.)  So it's not a showstopper right now,
but is something we'll still want to deal with at some stage.

> > A simple workaround would be to force mce_disabled to 1 in early Xen
> > initialization.

> That's probably necessary too.

It doesn't seem to be necessary, given that the kernel does get past
this point; it's probably desirable, though, at least until such time as
we can actually do the MCE support correctly.

--Stephen




Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-03 Thread Gerd Hoffmann
Derek Murray wrote:
> I take the blame for that one. I added the hook because, if a process
> were to die whilst holding one or more grants, there were no hooks that
> would make it possible to carry out the grant-unmap. All existing hooks
> on either the device or the VMA were called *after* the PTEs were cleared.

Hmm.  What exactly is the issue here?

This is about *userspace* mappings, right?  As far as I can see from a
quick scan of the code there is an additional kernel space mapping for
the grants and the userspace mapping is optional.  I don't see any
problems with userspace mapping going away without *instant*
notification.  Cleaning up a bit later, called from the
file_ops->release callback maybe, should work ok.

The problem I see with the additional vm_ops callback is that I suspect
you'll have to come up with some *very* good arguments to get it
accepted by the VM (as in "virtual memory") folks and merged mainline.

> It gets better, though. The same hook is used in the version of blktap
> in linux-2.6.18-xen (not, as far as I can see, in the sparse tree for
> xen-3.1-testing):

Oh, I'm thinking more in the direction of killing blktap altogether in
favor of a pure userspace implementation on top of gntdev.

cheers,
  Gerd




Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-03 Thread Derek Murray

Gerd Hoffmann wrote:
> Derek Murray wrote:
>> I take the blame for that one. I added the hook because, if a process
>> were to die whilst holding one or more grants, there were no hooks that
>> would make it possible to carry out the grant-unmap. All existing hooks
>> on either the device or the VMA were called *after* the PTEs were cleared.
>
> Hmm.  What exactly is the issue here?
>
> This is about *userspace* mappings, right?  As far as I can see from a
> quick scan of the code there is an additional kernel space mapping for
> the grants and the userspace mapping is optional.  I don't see any
> problems with userspace mapping going away without *instant*
> notification.  Cleaning up a bit later, called from the
> file_ops->release callback maybe, should work ok.


If we let Linux zap the page tables before we unmap the grant reference, 
then it is not possible to unmap the grant reference. The 
unmap_grant_ref hypercall ultimately calls destroy_grant_pte_mapping in 
xen/arch/x86/mm.c, which ensures that the PTE does in fact point to the 
granted frame. Note also the comment further up in that file (in 
put_page_from_l1e):


/*
 * Check if this is a mapping that was established via a grant reference.
 * If it was then we should not be here: we require that such mappings are
 * explicitly destroyed via the grant-table interface.
 *
 * The upshot of this is that the guest can end up with active grants that
 * it cannot destroy (because it no longer has a PTE to present to the
 * grant-table interface). This can lead to subtle hard-to-catch bugs,
 * hence a special grant PTE flag can be enabled to catch the bug early.
 *
 * (Note that the undestroyable active grants are not a security hole in
 * Xen. All active grants can safely be cleaned up when the domain dies.)
 */

Effectively, there is a debug option that sets a bit in PTEs that map 
granted pages, and this can be used to force a domain_crash in the event 
that a VM tries to zap the entries normally. The normal behaviour is to 
silently accept the zap operation, and leak granted pages until the 
grantee domain is killed.
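
The ordering constraint can be illustrated with a toy user-space model.
Everything here (the toy_* names, the one-grant, one-PTE state) is
hypothetical and only mimics the check described above; it is not the Xen
implementation:

```c
#include <stdbool.h>

struct toy_grant {
    unsigned long frame;  /* frame granted by the remote domain */
    bool active;          /* hypervisor still counts the grant as mapped */
};

struct toy_pte {
    unsigned long frame;
    bool present;
};

/* Map: install a PTE pointing at the granted frame. */
static void toy_map_grant(struct toy_grant *g, struct toy_pte *pte)
{
    pte->frame = g->frame;
    pte->present = true;
    g->active = true;
}

/* Ordinary VM teardown: clears the PTE and knows nothing about grants. */
static void toy_zap_pte(struct toy_pte *pte)
{
    pte->present = false;
    pte->frame = 0;
}

/*
 * Mimics the unmap check: it succeeds only if the PTE still points at
 * the granted frame.  Returns 0 on success, -1 if the grant is left
 * active (leaked until the domain dies).
 */
static int toy_unmap_grant(struct toy_grant *g, struct toy_pte *pte)
{
    if (!pte->present || pte->frame != g->frame)
        return -1;
    pte->present = false;
    pte->frame = 0;
    g->active = false;
    return 0;
}
```

Calling toy_unmap_grant before toy_zap_pte releases the grant; calling
them in the other order leaves it active, which is the leak the hook is
meant to prevent.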



> The problem I see with the additional vm_ops callback is that I suspect
> you'll have to come up with some *very* good arguments to get it
> accepted by the VM (as in "virtual memory") folks and merged mainline.


On this point I completely agree with you! If anyone has any less 
radical suggestions, then I'd be delighted to refactor the gntdev code 
to use them. However, I'm not currently aware of any alternative that 
maintains robustness to process crashes.



>> It gets better, though. The same hook is used in the version of blktap
>> in linux-2.6.18-xen (not, as far as I can see, in the sparse tree for
>> xen-3.1-testing):
>
> Oh, I'm thinking more in the direction of killing blktap altogether in
> favor of a pure userspace implementation on top of gntdev.


I think this would represent good progress, though I wonder if there 
would be a performance penalty due to performing the mapping and 
unmapping in user-space (multiple syscalls per mapping versus a single 
hypercall).


Cheers,

Derek Murray.


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-03 Thread Derek Murray
I take the blame for that one. I added the hook because, if a process 
were to die whilst holding one or more grants, there were no hooks that 
would make it possible to carry out the grant-unmap. All existing hooks 
on either the device or the VMA were called *after* the PTEs were cleared.


It gets better, though. The same hook is used in the version of blktap 
in linux-2.6.18-xen (not, as far as I can see, in the sparse tree for 
xen-3.1-testing):


http://xenbits.xensource.com/linux-2.6.18-xen.hg?file/fd879c0688bf/drivers/xen/blktap/blktap.c

Reverting back to the old (hookless) behaviour would be a retrograde 
step IMHO.


Cheers,

Derek Murray.

Gerd Hoffmann wrote:

Stephen C. Tweedie wrote:

Hi all,

  driver domains


Looked at the gntdev (grant table mappings for user space) driver,
noticed that one is not self-contained.  It needs a hook for page unmapping:

  http://xenbits.xensource.com/xen-3.1-testing.hg?rev/7180d2e61f92
  plus an s/ptep_get_and_clear_full/zap_pte/ fixup a few changesets
  later.

Upstreaming that one could become *uhm* interesting.  Nevertheless the
gntdev functionality is quite useful for writing pure userspace
backend drivers ...

cheers,
  Gerd



Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-03 Thread Mark Williamson
> >> It gets better, though. The same hook is used in the version of blktap
> >> in linux-2.6.18-xen (not, as far as I can see, in the sparse tree for
> >> xen-3.1-testing):
> >
> > Oh, I'm thinking more in the direction of killing blktap altogether in
> > favor of a pure userspace implementation on top of gntdev.
>
> I think this would represent good progress, though I wonder if there
> would be a performance penalty due to performing the mapping and
> unmapping in user-space (multiple syscalls per mapping versus a single
> hypercall).

Maybe a change to the gntdev userspace API to allow batching of mapping 
requests?

I'm not aware of a batched mmap interface, which would seem to be the ideal 
solution; but it should be possible to batch this stuff somehow.  Although it 
seems like some kind of really weird ioctl might be needed :-S to do it 
*without* such a batched interface...

blktap in userspace, if any performance problems can be addressed, would seem 
to be a far nicer way of doing things.  And it's less code to merge 
upstream ;-)

Cheers,
Mark

-- 
Dave: Just a question. What use is a unicyle with no seat?  And no pedals!
Mark: To answer a question with a question: What use is a skateboard?
Dave: Skateboards have wheels.
Mark: My wheel has a wheel!


RE: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-03 Thread D.G. Murray
Hi Mark, 

> Maybe a change to the gntdev userspace API to allow batching 
> of mapping requests?

Something along the lines of the following?

/**
 * Memory maps one or more grant references from one or more domains to a
 * contiguous local address range. Mappings should be unmapped with
 * xc_gnttab_munmap. Returns NULL on failure.
 *
 * @parm xcg_handle a handle on an open grant table interface
 * @parm count the number of grant references to be mapped
 * @parm domids an array of @count domain IDs by which the corresponding @refs
 *        were granted
 * @parm refs an array of @count grant references to be mapped
 * @parm prot same flag as in mmap()
 */
void *xc_gnttab_map_grant_refs(int xcg_handle,
                               uint32_t count,
                               uint32_t *domids,
                               uint32_t *refs,
                               int prot);

http://xenbits.xensource.com/xen-unstable.hg?file/3057f813da14/tools/libxc/xenctrl.h

> blktap in userspace, if any performance problems can be 
> addressed, would seem to be a far nicer way of doing things.  
> And it's less code to merge upstream ;-)

Agreed.

Cheers,

Derek.



Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-03 Thread Mark Williamson
> Hi Mark,
>
> > Maybe a change to the gntdev userspace API to allow batching
> > of mapping requests?
>
> Something along the lines of the following?

Just like that :-D

When you said "multiple syscalls per mapping" I assumed you meant that we'd 
lose the batching you get by doing a multicall.  If it's just a couple of 
syscalls (plus, presumably a couple of hypercalls) per batch of mappings, my 
gut says it's probably not going to hurt block performance.  My guts have 
been wrong in (many!) ways before of course...

I guess the overhead *could* be reduced even more by just having a magic ioctl 
that did all the mmap-ing stuff in one operation, but that'd probably be 
really gross if it wasn't necessary!  And I doubt it'd make upstream very 
happy...

We'll also be eliminating the overheads involved in having a blktap ring for 
talking to userspace and having to move requests between that ring and the 
real block ring, so there's some definite wins in overheads as well.

Cheers,
Mark

-- 
Dave: Just a question. What use is a unicyle with no seat?  And no pedals!
Mark: To answer a question with a question: What use is a skateboard?
Dave: Skateboards have wheels.
Mark: My wheel has a wheel!


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-03 Thread Gerd Hoffmann
Derek Murray wrote:
> If we let Linux zap the page tables before we unmap the grant reference,
> then it is not possible to unmap the grant reference. The
> unmap_grant_ref hypercall ultimately calls destroy_grant_pte_mapping in
> xen/arch/x86/mm.c, which ensures that the PTE does in fact point to the
> granted frame.

Hmm, I see.  You have to do that for every mapping, not just the last
(kernel) one, to release the grant.  And just dropping that check is
probably out of the question because the guest could fool Xen's reference
counting then?

> On this point I completely agree with you! If anyone has any less
> radical suggestions, then I'd be delighted to refactor the gntdev code
> to use them. However, I'm not currently aware of any alternative that
> maintains robustness to process crashes.

Oh, for me it isn't robust at all, it crashes on the first munmap
syscall.  It is the Fedora 8 kernel.  See attachment.  Didn't try
xensource 2.6.18 yet.

Ideas what is wrong?
Who uses the gntdev device right now?

> I think this would represent good progress, though I wonder if there
> would be a performance penalty due to performing the mapping and
> unmapping in user-space (multiple syscalls per mapping versus a single
> hypercall).

I'd expect the hard disk (and how I/O is scheduled) to be the
bottleneck, not the syscall overhead.  Nevertheless I plan to benchmark
it once I have it up and running.

cheers,
  Gerd
Linux version 2.6.21-2952.fc8xen ([EMAIL PROTECTED]) (gcc version 4.1.2 
20070925 (Red Hat 4.1.2-33)) #1 SMP Mon Nov 19 07:06:55 EST 2007
BIOS-provided physical RAM map:
sanitize start
sanitize bail 0
copy_e820_map() start:  size: 7491e000 end: 
7491e000 type: 1
 Xen:  - 7491e000 (usable)
1137MB HIGHMEM available.
727MB LOWMEM available.
NX (Execute Disable) protection: active
Entering add_active_range(0, 0, 477470) 0 entries of 256 used
Zone PFN ranges:
  DMA 0 ->   186366
  Normal 186366 ->   186366
  HighMem186366 ->   477470
early_node_map[1] active PFN ranges
0:0 ->   477470
On node 0 totalpages: 477470
  DMA zone: 1455 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 184911 pages, LIFO batch:31
  Normal zone: 0 pages used for memmap
  HighMem zone: 2274 pages used for memmap
  HighMem zone: 288830 pages, LIFO batch:31
found SMP MP-table at 000ff780
DMI present.
ACPI: RSDP 000F9990, 0014 (r0 ACPIAM)
ACPI: RSDT 7D6B, 0044 (r1 A M I  OEMRSDT   5000708 MSFT   97)
ACPI: FACP 7D6B0200, 0084 (r2 A M I  OEMFACP   5000708 MSFT   97)
ACPI Warning (tbfadt-0360): Ignoring BIOS FADT r2 C-state control [20070126]
ACPI: DSDT 7D6B0490, 6643 (r1 SDBLI9 SDBLI944   44 INTL 20051117)
ACPI: FACS 7D6BE000, 0040
ACPI: APIC 7D6B0390, 006C (r1 A M I  OEMAPIC   5000708 MSFT   97)
ACPI: MCFG 7D6B0450, 003C (r1 A M I  OEMMCFG   5000708 MSFT   97)
ACPI: OEMB 7D6BE040, 0079 (r1 A M I  AMI_OEM   5000708 MSFT   97)
ACPI: ASF! 7D6B6AE0, 0099 (r32 LEGEND I865PASF1 INTL 20051117)
ACPI: GSCI 7D6BE0C0, 2024 (r1 A M I  GMCHSCI   5000708 MSFT   97)
ACPI: iEIT 7D6C00F0, 00B0 (r1 A M I  EITTABLE  5000708 MSFT   97)
ACPI: SSDT 7D6C0BC0, 0877 (r1 DpgPmmCpuPm   12 INTL 20051117)
ACPI: Local APIC address 0xfee0
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x82] disabled)
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x83] disabled)
ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0])
IOAPIC[0]: apic_id 2, version 32, address 0xfec0, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Enabling APIC mode:  Flat.  Using 1 I/O APICs
Using ACPI (MADT) for SMP configuration information
Detected 2992.804 MHz processor.
Built 1 zonelists.  Total pages: 473741
Kernel command line: ro root=/dev/xeni/fedora32 console=tty1 xencons=xvc0 
console=xvc0 panic=30
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
CPU 0 irqstacks, hard=c136c000 soft=c134c000
PID hash table entries: 4096 (order: 12, 16384 bytes)
Xen reported: 2992.594 MHz processor.
Console: colour VGA+ 80x50
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
Software IO TLB enabled: 
 Aperture: 2 megabytes
 Kernel range: c033c000 - c053c000
 Address size: 24 bits
vmalloc area: ee00-f4ffe000, maxmem 2d7fe000
Memory: 1866468k/1909880k available (2071k kernel code, 34068k reserved, 1080k 
data, 188k init, 1164416k highmem)
virtual kernel memory layout:
fixmap  : 0xf5315000 - 0xf57fe000   (5028 kB)
pkmap   : 0xf500 - 0xf520   (2048 kB)
vmalloc : 0xee00 - 0xf4ffe000   ( 111 MB)
lowmem  : 0xc000 - 0xed7fe000   ( 727 M

Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-04 Thread Derek Murray

Gerd Hoffmann wrote:
>> On this point I completely agree with you! If anyone has any less
>> radical suggestions, then I'd be delighted to refactor the gntdev code
>> to use them. However, I'm not currently aware of any alternative that
>> maintains robustness to process crashes.
>
> Oh, for me it isn't robust at all, it crashes on the first munmap
> syscall.  It is the Fedora 8 kernel.  See attachment.  Didn't try
> xensource 2.6.18 yet.


My gut feeling is that something changed in mm between 2.6.18 and 
2.6.21, but that seems like a cop out so...



> Ideas what is wrong?


Since the bug appears to be in page_remove_rmap, that would tend to 
imply that there is never a corresponding page_add_*_rmap 
(page_add_file_rmap?). My knowledge of the Linux mm code is a bit shaky 
here: should gntdev be doing this? Should we be using install_page (or a 
modified version thereof) to set the PTE?
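
To make the suspicion concrete, here is a toy model of the accounting,
assuming reverse-mapping bookkeeping behaves like a per-page counter of
mappings.  The toy_* names are hypothetical and the counter representation
is simplified; this is not the kernel's rmap code:

```c
struct toy_page {
    int mapcount;          /* number of PTEs accounted for this page */
};

/* Accounting a new mapping, as a page_add_*_rmap style call would. */
static void toy_add_rmap(struct toy_page *page)
{
    page->mapcount++;
}

/*
 * Dropping a mapping.  Returns 0 normally, -1 when there was no
 * matching add, which is the situation where a real kernel would
 * complain loudly instead of returning.
 */
static int toy_remove_rmap(struct toy_page *page)
{
    if (page->mapcount == 0)
        return -1;
    page->mapcount--;
    return 0;
}
```

A PTE written directly, without the matching add, makes the later remove
fail in this model, which would fit a crash on the very first munmap.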


Also, does a simple program that opens gntdev, maps a grant, 
accesses/writes to the page, and unmaps it (all using the xc_gnttab_* 
functions) work?



> Who uses the gntdev device right now?


Good question! I'm aware of it being used in a few research projects, 
and it seems to work for them (though I think it is mostly used with the 
linux-2.6.18-xen kernel). Anyone else?



>> I think this would represent good progress, though I wonder if there
>> would be a performance penalty due to performing the mapping and
>> unmapping in user-space (multiple syscalls per mapping versus a single
>> hypercall).
>
> I'd expect the hard disk (and how I/O is scheduled) to be the
> bottleneck, not the syscall overhead.  Nevertheless I plan to benchmark
> it once I have it up and running.


Great to hear that you're working on this! Let me know if there's any 
other help I can provide with gntdev.


Cheers,

Derek.


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-04 Thread Stephen C. Tweedie
Hi,

On Tue, 2007-12-04 at 13:01 +0100, Gerd Hoffmann wrote:

> >> Who uses the gntdev device right now?
> > 
> > Good question! I'm aware of it being used in a few research projects,
> > and it seems to work for them (though I think it is mostly used with the
> > linux-2.6.18-xen kernel). Anyone else?
> 
> So it effectively got no real-world testing yet ...

So... the interface (a) cannot be used on the Linux VM without at least
one invasive VM modification, due to the requirement of ptes being
explicitly unmapped via hypercall; and (b) isn't used significantly in
real life yet.

I can't help wondering if this is a hint that now is the time to find a
better API, which doesn't have the requirement (a) that seems to be
causing such trouble?  Are other PV guests --- *BSD, Solaris --- going
to have the same problems with their VM layers if they try to implement
this API?  Upstream Linux pv_ops certainly will, and it would be good if
we could avoid tying unprivileged guests to ABIs which cannot hope to be
merged into pv_ops.

(Just what is the cost of not having this functionality in blktap,
anyway?)

--Stephen




Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-04 Thread Gerd Hoffmann
Derek Murray wrote:
> Gerd Hoffmann wrote:
>> Oh, for me it isn't robust at all, it crashes on the first munmap
>> syscall.  It is the Fedora 8 kernel.  See attachment.  Didn't try
>> xensource 2.6.18 yet.
> 
> My gut feeling is that something changed in mm between 2.6.18 and
> 2.6.21, but that seems like a cop out so...

Could be.  Cross-checking failed though: 2.6.18 doesn't boot the machine
in question (Intel devel box with ICH9).  It doesn't find the disk.
Probably the ahci driver is too old.

>> Ideas what is wrong?
> 
> Since the bug appears to be in page_remove_rmap, that would tend to
> imply that there is never a corresponding page_add_*_rmap
> (page_add_file_rmap?). My knowledge of the Linux mm code is a bit shaky
> here: should gntdev be doing this? Should we be using install_page (or a
> modified version thereof) to set the PTE?

Don't know, I'm just trying to use it.  I did some mm handling for
device drivers back in my video4linux days, but there I didn't need to
get involved in setting/clearing pte entries.  I just had a
->nopage handler allocate the pages the way I needed them for the
userspace mappings of video dma buffers.

> Also, does a simple program that opens gntdev, maps a grant,
> accesses/writes to the page, and unmaps it (all using the xc_gnttab_*
> functions) work?

Didn't try yet.  The application in question (blkbackd) does this:

  * map blk shared ring
  * see the first request come in (kernel trying to read the
    partition table).
  * map the grants of the request.
  * perform I/O.
  * Try to unmap the grants of the request.  On the first unmap call
    the kernel oopses.

All this happens without even starting a guest; I'm just using
"xm block-attach" to create a blkfront device in Dom0.

>> Who uses the gntdev device right now?
> 
> Good question! I'm aware of it being used in a few research projects,
> and it seems to work for them (though I think it is mostly used with the
> linux-2.6.18-xen kernel). Anyone else?

So it effectively got no real-world testing yet ...

cheers,
  Gerd




Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-04 Thread Gerd Hoffmann
Stephen C. Tweedie wrote:
> Hi,
> 
> On Tue, 2007-12-04 at 13:01 +0100, Gerd Hoffmann wrote:
> 
>>>> Who uses the gntdev device right now?
>>> Good question! I'm aware of it being used in a few research projects,
>>> and it seems to work for them (though I think it is mostly used with the
>>> linux-2.6.18-xen kernel). Anyone else?
>> So it effectively got no real-world testing yet ...
> 
> So... the interface (a) cannot be used on the Linux VM without at least
> one invasive VM modification, due to the requirement of ptes being
> explicitly unmapped via hypercall; and (b) isn't used significantly in
> real life yet.

(c) seems not to work for anything non-trivial.  I've compiled and
tested a xensource 2.6.18 kernel (3.1 testing mercurial tree head,
should be 3.1.2-release), it fails in a similar way.  See attachment.

Want to reproduce?  Here we go:

  * grab xenner 0.8 from http://dl.bytesex.org/releases/xenner/
  * grab a xenified dom0 kernel without blktap driver (either not
    compiled or module not loaded).
  * start xend
  * start blkbackd from xenner package (you probably want the -d switch
    for debug output, twice for more).
  * run "xm block-attach 0 tap:aio:/path/to/some/file xvda r"
  * watch it blow up ;)

> I can't help wondering if this is a hint that now is the time to find a
> better API, which doesn't have the requirement (a) that seems to be
> causing such trouble?  Are other PV guests --- *BSD, Solaris --- going
> to have the same problems with their VM layers if they try to implement
> this API?  Upstream Linux pv_ops certainly will, and it would be good if
> we could avoid tying unprivileged guests to ABIs which cannot hope to be
> merged into pv_ops.

And I fear the problems I've run into so far are only the tip of
the iceberg.  What happens if an application with active grant table
mappings calls fork()?

cheers,
  Gerd
Linux version 2.6.18-xen ([EMAIL PROTECTED]) (gcc version 4.1.2 20070925 (Red 
Hat 4.1.2-33)) #1 SMP Tue Dec 4 18:17:24 CET 2007
BIOS-provided physical RAM map:
 Xen:  - 0adc3000 (usable)
0MB HIGHMEM available.
173MB LOWMEM available.
On node 0 totalpages: 44483
  DMA zone: 44483 pages, LIFO batch:7
DMI 2.3 present.
ACPI: RSDP (v000 OID_00) @ 0x000f0010
ACPI: RSDT (v001 OID_00 RSDT_000 0x30303030 & 0x0001) @ 0x0bfffbd0
ACPI: FADT (v001 OID_00 FACP_000 0x30303030 & 0x0001) @ 0x0bfffb20
ACPI: BOOT (v001 OID_00 BOOT_000 0x30303030 & 0x0001) @ 0x0bfffba0
ACPI: DSDT (v001 INT440 SYSFexxx 0x1001 MSFT 0x010b) @ 0x
ACPI: Vendor "INT440" System "SYSFexxx" Revision 0x1001 has a known ACPI BIOS 
problem.
ACPI: Reason: Does not use _REG to protect EC OpRegions. This is a 
non-recoverable error
ACPI: Disabling ACPI support
Allocating PCI resources starting at 1000 (gap: 0c00:f3fc)
Detected 600.047 MHz processor.
Built 1 zonelists.  Total pages: 44483
Kernel command line: ro root=/dev/zen/rhel5 apm=off vga=0x317 panic=30
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
PID hash table entries: 1024 (order: 10, 4096 bytes)
Xen reported: 600.034 MHz processor.
Console: colour VGA+ 80x50
Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
Software IO TLB enabled: 
 Aperture: 2 megabytes
 Kernel range: c0aad000 - c0cad000
 Address size: 24 bits
vmalloc area: cb80-f51fe000, maxmem 2d7fe000
Memory: 155572k/177932k available (1972k kernel code, 14020k reserved, 693k 
data, 192k init, 0k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay using timer specific routine.. 1502.07 BogoMIPS (lpj=7510358)
Security Framework v1.0.0 initialized
Capability LSM initialized
Mount-cache hash table entries: 512
CPU: After generic identify, caps: 0387d1f1     
 
CPU: After vendor identify, caps: 0387d1f1     
 
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 256K
CPU serial number disabled.
CPU: After all inits, caps: 0383d1f1   0040  
 
Checking 'hlt' instruction... OK.
SMP alternatives: switching to UP code
Freeing SMP alternatives: 12k freed
Brought up 1 CPUs
migration_cost=0
checking if image is initramfs... it is
Freeing initrd memory: 6538k freed
NET: Registered protocol family 16
PCI: Using configuration type 1
Setting up standard PCI resources
ACPI: Interpreter disabled.
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI: disabled
xen_mem: Initialising balloon driver.
PCI: Probing PCI hardware
PCI: Probing PCI hardware (bus 00)
PCI quirk: region 1000-103f claimed by PIIX4 ACPI
PCI quirk: region 1400-140f claimed by PIIX4 SMB
PIIX4 devres C PIO at 0398-0399
Boot video device is :00:09.0
PCI: Using IRQ router PIIX/ICH [8086/7198] at 

Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-04 Thread Mark Williamson
> I am not quite clear about the purpose of pv-ops. What do we want to
> deal with by developing "pv-ops"? Is it used for HVM, for PV, for KVM,
> or something else? I have seen it on the list for a few months, and
> "pv-ops" is an active project, but I am not clear about what the aim
> of "pv-ops" is. Could you give me an explanation?

PV-ops is an API within Linux which is used to support paravirtualisation.

paravirt-ops makes it possible to compile a Linux kernel which can boot on 
bare hardware, or on Xen, or using VMI (VMware's paravirtualised interface), 
lguest, or any other VMM that is supported.  The resulting kernel can then 
boot on any of those and make proper use of paravirtualisation.

For instance, with 2.6.23 from kernel.org you should be able to compile a 
kernel that will boot both on bare hardware and in a Xen domU in PV mode.  
Various tricks are used to ensure that it will run with good performance on 
both.

pv-ops mostly deals with the paravirtualisation of the CPU.  IO devices such 
as block and network are handled using Xen-aware drivers rather similar to 
those in the XenSource Linux kernels; they are not part of pv-ops.

Cheers,
Mark


> Thanks in advance
>
> Mark Williamson wrote:
> >> Hi Mark,
> >>
> >>> Maybe a change to the gntdev userspace API to allow batching
> >>> of mapping requests?
> >>
> >> Something along the lines of the following?
> >
> > Just like that :-D
> >
> > When you said "multiple syscalls per mapping" I assumed you meant that
> > we'd lose the batching you get by doing a multicall.  If it's just a
> > couple of syscalls (plus, presumably a couple of hypercalls) per batch of
> > mappings, my gut says it's probably not going to hurt block performance. 
> > My guts have been wrong in (many!) ways before of course...
> >
> > I guess the overhead *could* be reduced even more by just having a magic
> > ioctl that did all the mmap-ing stuff in one operation, but that'd
> > probably be really gross if it wasn't necessary!  And I doubt it'd make
> > upstream very happy...
> >
> > We'll also be eliminating the overheads involved in having a blktap ring
> > for talking to userspace and having to move requests between that ring
> > and the real block ring, so there's some definite wins in overheads as
> > well.
> >
> > Cheers,
> > Mark



-- 
Dave: Just a question. What use is a unicycle with no seat?  And no pedals!
Mark: To answer a question with a question: What use is a skateboard?
Dave: Skateboards have wheels.
Mark: My wheel has a wheel!



Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Derek Murray

Hi Gerd,

Gerd Hoffmann wrote:
> Want to reproduce?  Here we go:
>
>   * grab xenner 0.8 from http://dl.bytesex.org/releases/xenner/
>   * grab a xenified dom0 kernel without blktap driver (either not
>     compiled or module not loaded).
>   * start xend
>   * start blkbackd from xenner package (you probably want the -d switch
>     for debug output, twice for more).
>   * run "xm block-attach 0 tap:aio:/path/to/some/file xvda r"
>   * watch it blow up ;)


Thanks for the repro details. I'll have a go at this later. One thing we 
haven't tested AFAIK is mapping grants in the same domain: could you 
check to see if the bug is the same if you attach a block device to a 
domain other than Dom0? Also, could you send any Xen console output, if 
it contains errors or warnings?



>> I can't help wondering if this is a hint that now is the time to find a
>> better API, which doesn't have the requirement (a) that seems to be
>> causing such trouble?  Are other PV guests --- *BSD, Solaris --- going
>> to have the same problems with their VM layers if they try to implement
>> this API?  Upstream Linux pv_ops certainly will, and it would be good if
>> we could avoid tying unprivileged guests to ABIs which cannot hope to be
>> merged into pv_ops.


> And I fear the problems I've run into so far are only the tip of
> the iceberg.  What happens if an application with active grant table
> mappings calls fork()?


Ultimately, fork calls dup_mm, which calls dup_mmap, which calls 
copy_{page,pud,pmd,pte}_range, which calls copy_one_pte, which calls 
set_pte_at, which hypercalls HYPERVISOR_update_va_mapping.


The hypercall will not succeed and will return an error code indicating 
the reason for this. Therefore the PTE will not be set. There appears to 
be no way to propagate this error through the Linux VM code, because 
there is no concept of a PTE update failing. I could add return codes to 
all those functions, but I don't fancy their chances upstream
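
To make the shape of that problem concrete, here is a minimal user-space
sketch in plain C.  It is not kernel code: every sketch_* name, the flag
bit, and the -EINVAL return value are hypothetical stand-ins chosen for
illustration.  The only point it tries to show is that a PTE setter with a
void return type has nowhere to report a failed hypercall, so the copy
path above it carries on unaware.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical software bit marking a PTE that maps a granted page. */
#define SKETCH_PTE_GRANTED (UINT64_C(1) << 9)

typedef uint64_t sketch_pte_t;

/* Counts hypercall failures that no caller ever got to see. */
static int sketch_dropped_errors;

/*
 * Stand-in for the hypercall.  Assumption: it refuses to install a PTE
 * for a granted page and reports that with an error code (-EINVAL here
 * is an arbitrary choice).
 */
static int sketch_update_va_mapping(sketch_pte_t *slot, sketch_pte_t val)
{
	if (val & SKETCH_PTE_GRANTED)
		return -EINVAL;
	*slot = val;
	return 0;
}

/*
 * Stand-in for set_pte_at().  The return type is void, so the failure
 * cannot travel any further up the call chain.
 */
static void sketch_set_pte_at(sketch_pte_t *slot, sketch_pte_t val)
{
	if (sketch_update_va_mapping(slot, val) != 0)
		sketch_dropped_errors++;	/* the error stops here */
}

/* Stand-in for the per-PTE copy done on behalf of fork(). */
static void sketch_copy_one_pte(sketch_pte_t *dst, const sketch_pte_t *src)
{
	sketch_set_pte_at(dst, *src);
}
```

Copying an ordinary PTE through sketch_copy_one_pte() fills the child's
slot; copying one with SKETCH_PTE_GRANTED set leaves the slot untouched
and only bumps the counter, which is the "PTE will not be set" outcome
described above.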


A possibility for solving that might be to carry out the mappings upon a 
page fault: I believe this would be compatible with copy_page_range.


(In fact, it's possible that a forked process would attempt to 
demand-page in the granted page, bypassing the copy_page_range code. 
Since there is no nopage handler for a gntdev VMA, that would lead to an 
anonymous page being mapped into memory instead.)


So, as far as I can tell, there would be no kernel BUG() or 
domain_crash() in the event of a fork(). It looks like implementing 
nopage in gntdev would enable grants to be remapped after a fork() and 
the correct behaviour to happen.


Regards,

Derek.


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Gerd Hoffmann
Stephen C. Tweedie wrote:
> I can't help wondering if this is a hint that now is the time to find a
> better API, which doesn't have the requirement (a) that seems to be
> causing such trouble?  Are other PV guests --- *BSD, Solaris --- going
> to have the same problems with their VM layers if they try to implement
> this API?

Well, it isn't that easy unfortunately.  We have to separate two things here:

  (a) the grant table hypercall API (linux kernel <-> xen).
  (b) the grant table device (userspace interface).

The hypercall API *is* heavily used, block and network drivers are using
it for example.  It works quite well as long as the drivers are living
in kernel space, thus the grants are also mapped in kernel space only.
It isn't very hard to control map and unmap then.

The problems start when the gntdev comes into play, which wants to allow
userspace applications to map grant references.  At this point the whole VM
subsystem becomes involved.  And the requirement of the hypercall API to
do any pte manipulation using grant table hypercalls becomes a real
burden.  The linux VM design simply doesn't allow that.

Consequently the current gntdev implementation tries to get the job done
by bypassing the VM (and hooking into it).  It establishes mappings by
doing the page table manipulations itself in the fops->mmap function.
It tears down mappings using the hook discussed earlier.

gntdev doesn't even try to handle forking.  I wouldn't be surprised if
that is a great way to kill Domain-0.  The xen hypervisor will most
likely not be amused to find a pte referring to a granted (but foreign)
page which wasn't established using the grant table interface.  Pinning
the pgd of the child process will most likely fail and make the kernel
BUG().

cheers,
  Gerd


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Derek Murray

Gerd,

Can you try the attached patch against linux-2.6.18-xen.hg?

I think the problem was that the gntdev VMA is not marked as being 
VM_PFNMAP, therefore it tries to get a struct page_struct for each 
granted page when it is unmapped (and maybe sometimes succeeds 
(incorrectly), which could be why I haven't seen the bug). With this 
flag, vm_normal_page will return NULL in zap_pte_range, and so the code 
that decrements that reference count will not be executed.
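
As a rough illustration of that reasoning, here is a user-space model in
plain C, loosely patterned on how vm_normal_page() treats VM_PFNMAP areas
in kernels of this era.  It is a sketch under assumptions, not the kernel
source: the sketch_* names, the flag values, and the tiny mem_map array
are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT  12
/* Flag values are illustrative; only the names echo the kernel's. */
#define SKETCH_VM_SHARED   0x1UL
#define SKETCH_VM_MAYWRITE 0x2UL
#define SKETCH_VM_PFNMAP   0x4UL
#define SKETCH_MAX_PFN     16UL

struct sketch_page { int mapcount; };

struct sketch_vma {
	unsigned long vm_start;
	unsigned long vm_pgoff;
	unsigned long vm_flags;
};

/* Toy memory map: one descriptor per local page frame. */
static struct sketch_page sketch_mem_map[SKETCH_MAX_PFN];

/* A private mapping that may be written is treated as copy-on-write. */
static int sketch_is_cow(unsigned long flags)
{
	return (flags & (SKETCH_VM_SHARED | SKETCH_VM_MAYWRITE))
		== SKETCH_VM_MAYWRITE;
}

/*
 * Returns the page descriptor that unmap-time accounting would touch,
 * or NULL when the mapping must be left alone.
 */
static struct sketch_page *
sketch_vm_normal_page(const struct sketch_vma *vma, unsigned long addr,
		      unsigned long pfn)
{
	if (vma->vm_flags & SKETCH_VM_PFNMAP) {
		unsigned long off = (addr - vma->vm_start) >> SKETCH_PAGE_SHIFT;

		if (pfn == vma->vm_pgoff + off)
			return NULL;	/* linear raw-PFN mapping */
		if (!sketch_is_cow(vma->vm_flags))
			return NULL;	/* shared raw-PFN mapping */
	}
	if (pfn >= SKETCH_MAX_PFN)
		return NULL;		/* no local descriptor at all */
	return &sketch_mem_map[pfn];
}
```

In this model a shared mapping whose PFN happens to fall inside the local
range yields a descriptor unless the PFNMAP flag is set, which matches the
"sometimes succeeds (incorrectly)" case described above.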


Regards,

Derek.
# HG changeset patch
# User [EMAIL PROTECTED]
# Date 1196860382 0
# Node ID af26b3dd23822190acbec1872a47259e1fed88b8
# Parent  b2768401db943e66af9d64bd610ffa225f560c0b
Set gntdev VMA to be VM_PFNMAP.

diff -r b2768401db94 -r af26b3dd2382 drivers/xen/gntdev/gntdev.c
--- a/drivers/xen/gntdev/gntdev.c	Mon Dec 03 08:50:12 2007 +
+++ b/drivers/xen/gntdev/gntdev.c	Wed Dec 05 13:13:02 2007 +
@@ -501,6 +501,17 @@ static int gntdev_mmap (struct file *fli
 
 	/* The VM area contains pages from another VM. */
 	vma->vm_flags |= VM_FOREIGN;
+
+	/* The VM area contains pages that are not backed by page_structs in
+	 * this domain's memory map.
+	 *
+	 * TODO/FIXME?: We should probably use the VM_FOREIGN workaround as
+	 *  used by get_user_pages() to provide access to the
+	 *  page_structs for each page, but I'm not sure if that's
+	 *  necessary.
+	 */
+	vma->vm_flags |= VM_PFNMAP;
+
 	vma->vm_private_data = kzalloc(size * sizeof(struct page_struct *), 
    GFP_KERNEL);
 	if (vma->vm_private_data == NULL) {

Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Derek Murray

Keir Fraser wrote:
> Is this patch to go into linux-2.6.18-xen.hg then?


Yes, even if it doesn't fix the exact bug we're seeing here, I think it 
should go in. I've attached a version with my signed-off-by and a better 
commit comment.


Cheers,

Derek.
# HG changeset patch
# User [EMAIL PROTECTED]
# Date 1196860382 0
# Node ID af26b3dd23822190acbec1872a47259e1fed88b8
# Parent  b2768401db943e66af9d64bd610ffa225f560c0b
Add VM_PFNMAP flag to gntdev-mmaped VM areas. This prevents an attempt in
zap_pte_range to decrement the reverse-mapping count of the non-existant
(but occasionally spuriously present) page_struct associated with the
granted PFN.

Signed-off-by: Derek Murray <[EMAIL PROTECTED]>

diff -r b2768401db94 -r af26b3dd2382 drivers/xen/gntdev/gntdev.c
--- a/drivers/xen/gntdev/gntdev.c	Mon Dec 03 08:50:12 2007 +
+++ b/drivers/xen/gntdev/gntdev.c	Wed Dec 05 13:13:02 2007 +
@@ -501,6 +501,17 @@ static int gntdev_mmap (struct file *fli
 
 	/* The VM area contains pages from another VM. */
 	vma->vm_flags |= VM_FOREIGN;
+
+	/* The VM area contains pages that are not backed by page_structs in
+	 * this domain's memory map.
+	 *
+	 * TODO/FIXME?: We should probably use the VM_FOREIGN workaround as
+	 *  used by get_user_pages() to provide access to the
+	 *  page_structs for each page, but I'm not sure if that's
+	 *  necessary.
+	 */
+	vma->vm_flags |= VM_PFNMAP;
+
 	vma->vm_private_data = kzalloc(size * sizeof(struct page_struct *), 
    GFP_KERNEL);
 	if (vma->vm_private_data == NULL) {

Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Derek Murray

Keir Fraser wrote:
> Yes, this would work okay I suspect. Good enough as a stop-gap measure? Are
> there any other responsibilities that you acquire if you make use of
> VM_FOREIGN (in particular, how would this affect get_user_pages)?


VM_FOREIGN is already set for the gntdev VMA (mostly because it's 
directly based on the blktap code). That means that it has the array of 
page_structs in its vm_private_data, which can be used to fulfill a 
get_user_pages call. I've attached a patch based on this fix.


Regards,

Derek.
# HG changeset patch
# User [EMAIL PROTECTED]
# Date 1196878124 0
# Node ID df7d0555ec3847bd5915063d8ee79123d6ebc67a
# Parent  ba918cb2cf7520604dee724dd80dad5ce4bee8a1
Changed vm_normal_page to return NULL when presented with a VMA marked
as being VM_FOREIGN.

Signed-off-by: Derek Murray <[EMAIL PROTECTED]>

diff -r ba918cb2cf75 -r df7d0555ec38 mm/memory.c
--- a/mm/memory.c	Tue Dec 04 11:54:22 2007 +
+++ b/mm/memory.c	Wed Dec 05 18:08:44 2007 +
@@ -395,6 +395,9 @@ struct page *vm_normal_page(struct vm_ar
 		if (!is_cow_mapping(vma->vm_flags))
 			return NULL;
 	}
+
+	if (unlikely(vma->vm_flags & VM_FOREIGN))
+		return NULL;
 
 	/*
 	 * Add some anal sanity checks for now. Eventually,

Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Derek Murray

Jeremy Fitzhardinge wrote:
> Could we use one of the software-defined bits in the PTE to indicate
> that this is a foreign/granted PTE, and have set_pte_at behave
> differently if you pass it a pte with this bit set?


Actually, as Gerd pointed out in his answer to his own question, the use 
of VM_DONTCOPY cuts out this entire code path, so we don't need to worry 
about it.
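
For readers unfamiliar with the flag, the effect can be sketched as a small
user-space model in plain C.  The idea, as far as I understand the fork
path, is that the loop which duplicates the parent's VMAs skips any area
marked VM_DONTCOPY before it ever reaches the page-table copy.  The
sketch_* names and the flag value below are hypothetical, and the real
dup_mmap does a good deal more than this.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_VM_DONTCOPY 0x1UL	/* illustrative value */

struct sketch_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_flags;
};

/*
 * Copy the parent's VMAs into child[], skipping any marked DONTCOPY.
 * child[] must have room for n entries.  Returns how many VMAs the
 * child ends up with.
 */
static size_t sketch_dup_mmap(const struct sketch_vma *parent, size_t n,
			      struct sketch_vma *child)
{
	size_t copied = 0;

	for (size_t i = 0; i < n; i++) {
		/* A skipped VMA never reaches the per-PTE copy code. */
		if (parent[i].vm_flags & SKETCH_VM_DONTCOPY)
			continue;
		child[copied++] = parent[i];
	}
	return copied;
}
```

With a gntdev-style area flagged this way, the child simply has no mapping
at those addresses, so no PTE for a granted page is ever written on its
behalf.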


Mind you, it looks like we're going to go ahead and use one of the PTE 
bits to signify foreign PTEs anyway, per Keir's suggestion. Either way, 
it's going to involve making Xen-specific changes to the mm code... have 
you any ideas how we can either (i) get rid of the zap_pte hook in the 
vm_operations_struct, or (ii) make a really compelling case to the 
kernel maintainers that it really should get in?


Regards,

Derek.


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Derek Murray

Keir Fraser wrote:
> Actually I'm not so sure now. Presumably you add VM_PFNMAP to make
> vm_normal_page() return NULL? But actually I would expect pte_pfn() to
> return max_mapnr because the mapped page is not a local page. And that
> should cause vm_normal_page() to return NULL always, regardless of whether
> you assert VM_PFNMAP. Is gntdev being used to grant-and-map local pages in
> the test that causes the crash?


That's right (gntdev is being used to map (but not grant) a local page). 
The test case creates a virtual block device in Dom0, and attempts to 
map its ring buffer in a user-space daemon in Dom0. Therefore pte_pfn 
succeeds.


Regards,

Derek.


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Derek Murray

Keir Fraser wrote:
> Need to bite the bullet and fix this properly by setting a software flag in
> ptes that are not subject to reference counting.


Could we get away with testing the VM_FOREIGN flag in vm_normal_page()? 
Although I get the impression that this wouldn't be easily justified if 
trying to merge with upstream Linux.



> Unfortunately that also needs a hypervisor interface change, to allow
> setting of those pte flags. Easily done though, and we should definitely get
> that piece in for 3.2.0.


Alternatively, could we use the _PAGE_GNTTAB PTE flag that is used for 
debugging? Indeed, if we did this, could we obviate the need for the 
PTE-zapping hook, by instead catching the case where this flag is set, 
and unmapping the grant implicitly?


Otherwise, what would the semantics of this new flag be?

Regards,

Derek.


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Jeremy Fitzhardinge
Derek Murray wrote:
> Ultimately, fork calls dup_mm, which calls dup_mmap, which calls
> copy_{page,pud,pmd,pte}_range, which calls copy_one_pte, which calls
> set_pte_at, which hypercalls HYPERVISOR_update_va_mapping.
>
> The hypercall will not succeed and will return an error code
> indicating the reason for this. Therefore the PTE will not be set.
> There appears to be no way to propagate this error through the Linux
> VM code, because there is no concept of a PTE update failing. I could
> add return codes to all those functions, but I don't fancy their
> chances upstream

Could we use one of the software-defined bits in the PTE to indicate
that this is a foreign/granted PTE, and have set_pte_at behave
differently if you pass it a pte with this bit set?

J


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Jeremy Fitzhardinge
Derek Murray wrote:
> Jeremy Fitzhardinge wrote:
>> Could we use one of the software-defined bits in the PTE to indicate
>> that this is a foreign/granted PTE, and have set_pte_at behave
>> differently if you pass it a pte with this bit set?
>
> Actually, as Gerd pointed out in his answer to his own question, the
> use of VM_DONTCOPY cuts out this entire code path, so we don't need to
> worry about it.
>
> Mind you, it looks like we're going to go ahead and use one of the PTE
> bits to signify foreign PTEs anyway, per Keir's suggestion. Either
> way, it's going to involve making Xen-specific changes to the mm code... 

Sneaking in a user for the otherwise completely unused PTE bits should
be fairly straightforward.

> have you any ideas how we can either (i) get rid of the zap_pte hook
> in the vm_operations_struct, or (ii) make a really compelling case to
> the kernel maintainers that it really should get in? 

Hm, I haven't spent much time looking at how grant tables and their
mappings work yet, so I can't say I really understand all this myself. 
Hence, questions:

Can we take a different approach from the zap_pte hook?  Given that
we're 1) planning on claiming a pte bit for grant mappings, and 2) need
to hook ptep_get_and_clear anyway to solve the mprotect performance
problems, couldn't we just special-case grant mapping pte_clears?

In 2.6.18-xen the only two implementations of zap_pte are
blktap_clear_pte and gntdev_clear_pte.  Given a ptep with the
grant-mapping bit set, could we determine which of these need calling
and do the appropriate thing?  Do we even need separate implementations
of the core pte-clearing functionality?  Could we just say something like:

    if (pte & _PAGE_XEN_FOREIGN)
        HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref, ...);
    else
        xen_set_pte_at(...);


blktap_clear_pte and gntdev_clear_pte do other housekeeping, but do they
have to be done at the same instant as the grant mapping clear?  Could
they be done via some other hook?

(I see Gerd just proposed this, pretty much.)

J


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Keir Fraser
On 5/12/07 17:17, "Derek Murray" <[EMAIL PROTECTED]> wrote:

>> Actually I'm not so sure now. Presumably you add VM_PFNMAP to make
>> vm_normal_page() return NULL? But actually I would expect pte_pfn() to
>> return max_mapnr because the mapped page is not a local page. And that
>> should cause vm_normal_page() to return NULL always, regardless of whether
>> you assert VM_PFNMAP. Is gntdev being used to grant-and-map local pages in
>> the test that causes the crash?
> 
> That's right (gntdev is being used to map (but not grant) a local page).
> The test case creates a virtual block device in Dom0, and attempts to
> map its ring buffer in a user-space daemon in Dom0. Therefore pte_pfn
> succeeds.

Need to bite the bullet and fix this properly by setting a software flag in
ptes that are not subject to reference counting.

Unfortunately that also needs a hypervisor interface change, to allow
setting of those pte flags. Easily done though, and we should definitely get
that piece in for 3.2.0.

Setting VM_PFNMAP is bogus. We used to do that for privcmd mappings too, but
we stopped because IIRC it had other unwanted side effects.

 -- Keir




Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Keir Fraser
On 5/12/07 20:15, "Jeremy Fitzhardinge" <[EMAIL PROTECTED]> wrote:

> In 2.6.18-xen the only two implementations of zap_pte are
> blktap_clear_pte and gntdev_clear_pte.  Given a ptep with the
> grant-mapping bit set, could we determine which of these need calling
> and do the appropriate thing?  Do we even need separate implementations
> of the core pte-clearing functionality?  Could we just say something like:
> 
>     if (pte & _PAGE_XEN_FOREIGN)
>         HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref, ...);
>     else
>         xen_set_pte_at(...);

You'd need to track pte->grant_handle mappings somewhere, but it could
certainly be done this way, yes.

 -- Keir




Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Keir Fraser
On 5/12/07 17:48, "Derek Murray" <[EMAIL PROTECTED]> wrote:

> Keir Fraser wrote:
>> Need to bite the bullet and fix this properly by setting a software flag in
>> ptes that are not subject to reference counting.
> 
> Could we get away with testing the VM_FOREIGN flag in vm_normal_page()?
> Although I get the impression that this wouldn't be easily justified if
> trying to merge with upstream Linux

Yes, this would work okay I suspect. Good enough as a stop-gap measure? Are
there any other responsibilities that you acquire if you make use of
VM_FOREIGN (in particular, how would this affect get_user_pages)?

> Alternatively, could we use the _PAGE_GNTTAB PTE flag that is used for
> debugging? Indeed, if we did this, could we obviate the need for the
> PTE-zapping hook, by instead catching the case where this flag is set,
> and unmapping the grant implicitly?

Well, in the general case you don't have enough info to know which grant to
release (a single page can be granted multiple times).

> Otherwise, what would the semantics of this new flag be?

It would cause pte_pfn() to return max_mapnr. It would be set for any
foreign page mapping, and replace mfn_to_local_pfn() in pte_pfn().
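
A minimal sketch of those semantics, as a user-space model in plain C.
Everything here is an assumption made for illustration: the bit position,
the value of max_mapnr, and the sketch_* names are hypothetical, and the
machine-to-pseudo-physical translation that the real pte_pfn() performs
for local pages is deliberately left out (the frame number is returned
as-is).

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT   12
#define SKETCH_PAGE_FOREIGN (UINT64_C(1) << 9)	/* hypothetical software bit */
#define SKETCH_MAX_MAPNR    UINT64_C(0x10000)	/* illustrative limit */

typedef uint64_t sketch_pte_t;

/*
 * A flagged PTE always reports max_mapnr, i.e. "no local page here",
 * without consulting any translation table.
 */
static uint64_t sketch_pte_pfn(sketch_pte_t pte)
{
	if (pte & SKETCH_PAGE_FOREIGN)
		return SKETCH_MAX_MAPNR;
	return pte >> SKETCH_PAGE_SHIFT;	/* translation elided */
}

/* Anything at or above max_mapnr has no page descriptor. */
static int sketch_pfn_valid(uint64_t pfn)
{
	return pfn < SKETCH_MAX_MAPNR;
}
```

Code that checks the returned frame number against max_mapnr before
touching a page descriptor would then skip reference counting for every
flagged PTE, whether or not the underlying frame happens to be local.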

 -- Keir




Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Gerd Hoffmann
>> Alternatively, could we use the _PAGE_GNTTAB PTE flag that is used for
>> debugging? Indeed, if we did this, could we obviate the need for the
>> PTE-zapping hook, by instead catching the case where this flag is set,
>> and unmapping the grant implicitly?
> 
> Well, in the general case you don't have enough info to know which grant to
> release (a single page can be granted multiple times).

You'll also get the mm and the addr which should make it sufficiently
unique, so this looks like a doable approach to me.

ptep_get_and_clear_full() in include/asm-x86/pgtable_32.h needs to be
changed to take care of this, but that should be possible to do and the change is
local to x86 paravirt_ops, which looks much better to me than touching
generic mm code.
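
One way to picture the bookkeeping this would need is a table keyed by
(mm, addr) that remembers the grant handle returned at map time.  The
following is only a user-space sketch in plain C under stated assumptions:
the sketch_* names and the fixed-size array are hypothetical, the handle is
assumed to be a 32-bit value, and a real implementation would need locking
and a better data structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_MAPPINGS 64
#define SKETCH_NO_HANDLE    UINT32_MAX

struct sketch_grant_mapping {
	const void *mm;		/* owning address space */
	unsigned long addr;	/* page-aligned virtual address */
	uint32_t handle;	/* grant handle obtained at map time */
	int in_use;
};

static struct sketch_grant_mapping sketch_table[SKETCH_MAX_MAPPINGS];

/* Record a handle at map time.  Returns 0, or -1 if the table is full. */
static int sketch_track(const void *mm, unsigned long addr, uint32_t handle)
{
	for (size_t i = 0; i < SKETCH_MAX_MAPPINGS; i++) {
		if (!sketch_table[i].in_use) {
			sketch_table[i] = (struct sketch_grant_mapping){
				mm, addr, handle, 1
			};
			return 0;
		}
	}
	return -1;
}

/*
 * Look up and forget the handle when the PTE is cleared.  Returns
 * SKETCH_NO_HANDLE if nothing was tracked for this (mm, addr) pair.
 */
static uint32_t sketch_untrack(const void *mm, unsigned long addr)
{
	for (size_t i = 0; i < SKETCH_MAX_MAPPINGS; i++) {
		struct sketch_grant_mapping *m = &sketch_table[i];

		if (m->in_use && m->mm == mm && m->addr == addr) {
			m->in_use = 0;
			return m->handle;
		}
	}
	return SKETCH_NO_HANDLE;
}
```

Because the key includes the address space as well as the address, the
same page mapped through two different grants resolves to two distinct
handles, which is the ambiguity raised in the quoted text.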

cheers,
  Gerd




Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Keir Fraser
On 5/12/07 14:30, "Derek Murray" <[EMAIL PROTECTED]> wrote:

> Keir Fraser wrote:
>> Is this patch to go into linux-2.6.18-xen.hg then?
> 
> Yes, even if it doesn't fix the exact bug we're seeing here, I think it
> should go in. I've attached a version with my signed-off-by and a better
> commit comment.

Actually I'm not so sure now. Presumably you add VM_PFNMAP to make
vm_normal_page() return NULL? But actually I would expect pte_pfn() to
return max_mapnr because the mapped page is not a local page. And that
should cause vm_normal_page() to return NULL always, regardless of whether
you assert VM_PFNMAP. Is gntdev being used to grant-and-map local pages in
the test that causes the crash?

 -- Keir




Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Geoffrey Lefebvre
> Can we take a different approach from the zap_pte hook?  Given that
> we're 1) planning on claiming a pte bit for grant mappings, and 2) need
> to hook ptep_get_and_clear anyway to solve the mprotect performance
> problems, couldn't we just special-case grant mapping pte_clears?
>
> In 2.6.18-xen the only two implementations of zap_pte are
> blktap_clear_pte and gntdev_clear_pte.  Given a ptep with the
> grant-mapping bit set, could we determine which of these need calling
> and do the appropriate thing?  Do we even need separate implementations
> of the core pte-clearing functionality?  Could we just say something like:
>
>     if (pte & _PAGE_XEN_FOREIGN)
>         HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref, ...);
>     else
>         xen_set_pte_at(...);
>

Hi,

In order to unmap a grant, you need the grant handle obtained when the
grant is mapped. That handle needs to be stored somewhere for the
lifetime of the mapping. Where would the handle be stored (as Gerd
proposed) in order to be able to unmap from ptep_get_and_clear_full?

I haven't looked at the paravirt ops in detail, so I could be missing
something obvious here.

cheers,

geoffrey


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Derek Murray

Stephen C. Tweedie wrote:
> So... the interface (a) cannot be used on the Linux VM without at least
> one invasive VM modification, due to the requirement of ptes being
> explicitly unmapped via hypercall;

Also there is the use of VM_FOREIGN
(http://xenbits.xensource.com/linux-2.6.18-xen.hg?file/b2768401db94/mm/memory.c
lines 1040--1059), which has been used quite happily in blktap since
2005
(http://lists.xensource.com/archives/html/xen-changelog/2005-07/msg00053.html).
While it may not be a priority to get gntdev into pv-ops Linux, I should
imagine that blktap would be fairly critical.

> I can't help wondering if this is a hint that now is the time to find a
> better API, which doesn't have the requirement (a) that seems to be
> causing such trouble?  Are other PV guests --- *BSD, Solaris --- going
> to have the same problems with their VM layers if they try to implement
> this API?  Upstream Linux pv_ops certainly will, and it would be good if
> we could avoid tying unprivileged guests to ABIs which cannot hope to be
> merged into pv_ops.

I'm open to suggestions... but I think it always reduces to needing a
hook that is called on process exit before the PTEs are zapped.

> (Just what is the cost of not having this functionality in blktap,
> anyway?)

If tapdisk dies whilst holding a granted page, the page can never be
ungranted, so we leak that page.


Regards,

Derek.


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Keir Fraser
On 5/12/07 14:12, "Gerd Hoffmann" <[EMAIL PROTECTED]> wrote:

>> Thanks for the repro details. I'll have a go at this later. One thing we
>> haven't tested AFAIK is mapping grants in the same domain: could you
>> check to see if the bug is the same if you attach a block device to a
>> domain other than Dom0? Also, could you send any Xen console output, if
>> it contains errors or warnings?
> 
> Attaching to another domain works better.  blkbackd needs some fixes as
> well though ...

Is this patch to go into linux-2.6.18-xen.hg then?

It needs a signed-off-by line if so.

 -- Keir




Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Gerd Hoffmann
  Hi,

> Thanks for the repro details. I'll have a go at this later. One thing we
> haven't tested AFAIK is mapping grants in the same domain: could you
> check to see if the bug is the same if you attach a block device to a
> domain other than Dom0? Also, could you send any Xen console output, if
> it contains errors or warnings?

Attaching to another domain works better.  blkbackd needs some fixes as
well though ...

cheers,
  Gerd




Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Gerd Hoffmann
  Hi,

> gntdev doesn't even try to handle forking.  I wouldn't be surprised if
> that is a great way to kill Domain-0.  The xen hypervisor will most
> likely not be amused to find a pte referring to a granted (but foreign)
> page which wasn't established using the grant table interface.  Pinning
> the pgd of the child process will most likely fail and make the kernel
> BUG().

OK, it isn't that bad, thanks to VM_DONTCOPY.  The child just doesn't
get the grant mapping.

cheers,
  Gerd



Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-06 Thread Derek Murray

Keir Fraser wrote:
> You'd need to track pte->grant_handle mappings somewhere, but it could
> certainly be done this way, yes.

At the moment, blktap and gntdev provide struct pages to get_user_pages
by smuggling them in the vm_private_data field of the relevant
vm_area_struct. Could we use this field to get the handles to
ptep_get_and_clear_full as well?

Only downside that I can see is that we would need to find the vma for
each PTE that needs to be cleared this way (since we don't get this
passed to ptep_get_and_clear_full), but this is mitigated by (i) it only
happening in the erroneous, unclean-shutdown case, and (ii) getting a
hit in the mm->mmap_cache for consecutive runs of mapped grants.


Regards,

Derek.


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-06 Thread Gerd Hoffmann
Geoffrey Lefebvre wrote:
> In order to unmap a grant, you need the grant handle obtained when the
> grant is mapped. That handle needs to be stored somewhere for the
> lifetime of the mapping. Where would the handle be stored (as Gerd
> proposed) in order to be able to unmap from ptep_get_and_clear_full?

Sure, the kernel has to keep track of the grant mappings somewhere, so
it can look up the grant handle from the available information.  Hashing
by machine address should work reasonably fast.  It's probably useful to
have an in-kernel API for that which can then be used by both gntdev and
the in-kernel backend drivers.  This API can also abstract out
arch-specific bits to make life easier for the ia64 guys ...
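
A rough userspace sketch of such a lookup, only to illustrate the idea:
a chained hash table keyed by the machine address of the granted page,
returning the handle needed for the unmap.  All names here
(grant_track_add, grant_track_remove, the table size, the hash
constant) are made up for illustration and are not an existing kernel
or Xen API; locking and the arch-specific parts are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical names throughout; this is not an existing kernel API. */
#define GRANT_HASH_BITS      8
#define GRANT_HASH_SIZE      (1u << GRANT_HASH_BITS)
#define SKETCH_PAGE_SHIFT    12
#define SKETCH_PAGE_MASK     (~((1ull << SKETCH_PAGE_SHIFT) - 1))
#define INVALID_GRANT_HANDLE ((uint32_t)~0u)

struct grant_track {
    uint64_t maddr;            /* page-aligned machine address */
    uint32_t handle;           /* handle returned when the grant was mapped */
    struct grant_track *next;  /* chain for colliding buckets */
};

static struct grant_track *grant_hash[GRANT_HASH_SIZE];

/* Hash on the machine frame number, ignoring the offset within the page. */
static unsigned int grant_hash_fn(uint64_t maddr)
{
    uint64_t mfn = maddr >> SKETCH_PAGE_SHIFT;

    return (unsigned int)((mfn * 0x9E3779B97F4A7C15ull)
                          >> (64 - GRANT_HASH_BITS));
}

/* Record a mapping right after a successful map.  Returns 0 or -1. */
static int grant_track_add(uint64_t maddr, uint32_t handle)
{
    struct grant_track *t;
    unsigned int b = grant_hash_fn(maddr);

    assert(handle != INVALID_GRANT_HANDLE);
    t = malloc(sizeof(*t));
    if (!t)
        return -1;
    t->maddr = maddr & SKETCH_PAGE_MASK;
    t->handle = handle;
    t->next = grant_hash[b];
    grant_hash[b] = t;
    return 0;
}

/* Find and drop the entry for maddr; returns the handle for the unmap,
 * or INVALID_GRANT_HANDLE if the address was never tracked. */
static uint32_t grant_track_remove(uint64_t maddr)
{
    uint64_t page = maddr & SKETCH_PAGE_MASK;
    struct grant_track **pp;

    for (pp = &grant_hash[grant_hash_fn(maddr)]; *pp; pp = &(*pp)->next) {
        if ((*pp)->maddr == page) {
            struct grant_track *t = *pp;
            uint32_t handle = t->handle;

            *pp = t->next;
            free(t);
            return handle;
        }
    }
    return INVALID_GRANT_HANDLE;
}
```

A pte-clearing path would call grant_track_remove() with the address
taken from the pte and pass the result to the unmap hypercall.  If the
same machine page can be mapped more than once, the key would need
something extra (the sketch simply returns the most recent entry).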

cheers,
  Gerd




Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-06 Thread Gerd Hoffmann
D.G. Murray wrote:
> Hi Mark, 
> 
>> Maybe a change to the gntdev userspace API to allow batching 
>> of mapping requests?
> 
> Something along the lines of the following?
> 
> void *xc_gnttab_map_grant_refs(int xcg_handle,
>                                uint32_t count,
>                                uint32_t *domids,
>                                uint32_t *refs,
>                                int prot);

Yes, except that it should actually work ;)

It doesn't for me (Fedora 8 again).  Grab xenner 0.9 (just uploaded),
edit blkbackd.c and flip the BATCH_MAPS from 0 to 1, compile, run, see
it not work.

With BATCH_MAPS being 0 blkbackd works nicely as blktap/tapdisk drop-in
replacement.

cheers,
  Gerd


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-06 Thread Derek Murray

Gerd Hoffmann wrote:
> Yes, except that it should actually work ;)
>
> It doesn't for me (Fedora 8 again).  Grab xenner 0.9 (just uploaded),
> edit blkbackd.c and flip the BATCH_MAPS from 0 to 1, compile, run, see
> it not work.

Which version of the Xen tools are you using? There was a bug in the
version released with Xen 3.1, which should have been cleaned up in the
subsequent minor versions. Try grabbing the patch to libxc at:

http://xenbits.xensource.com/xen-3.1-testing.hg?raw-rev/135d5088909f

Otherwise, if this doesn't work/is some other issue, could you post the
OOPS and relevant Xen console output?


Thanks,

Derek.


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-06 Thread Gerd Hoffmann

> Which version of the Xen tools are you using? There was a bug in the
> version released with Xen 3.1, which should have been cleaned up in the
> subsequent minor versions. Try grabbing the patch to libxc at:
> 
> http://xenbits.xensource.com/xen-3.1-testing.hg?raw-rev/135d5088909f

It is probably this one: according to rpm the version is 3.1.0, so most
likely the fix isn't there.

> Otherwise, if this doesn't work/is some other issue, could you post the
> OOPS and relevant Xen console output?

There isn't any, the mapping just doesn't work (libxc returning NULL).

thanks,
  Gerd




Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-06 Thread Jeremy Fitzhardinge
Derek Murray wrote:
> Keir Fraser wrote:
>> You'd need to track pte->grant_handle mappings somewhere, but it could
>> certainly be done this way, yes.
>
> At the moment, blktap and gntdev provide struct pages to
> get_user_pages by smuggling them in the vm_private_data field of the
> relevant vm_area_struct. Could we use this field to get the handles to
> ptep_get_and_clear_full as well?

Yes.  Given the mm and a vaddr passed to ptep_get_and_clear, find_vma()
will return the vm_area_struct.  If we assert that anyone who sets the
"I'm foreign" bit in a pte has a standard format for the vm_private_data
field, then we can stash a callback pointer there and make the
appropriate callback.
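
Roughly, and only as a userspace model with simplified stand-ins for
the kernel structures: the types and functions below (vma_sketch,
mm_sketch, foreign_map_info, clear_foreign_pte, count_clears) are
invented for illustration, and the real vm_area_struct, find_vma() and
pte hooks look different.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's vm_area_struct and mm_struct. */
struct vma_sketch {
    unsigned long vm_start;      /* first address in the area */
    unsigned long vm_end;        /* first address past the area */
    void *vm_private_data;       /* points at a struct foreign_map_info */
    struct vma_sketch *vm_next;  /* areas sorted by address */
};

struct mm_sketch {
    struct vma_sketch *mmap;
};

/* Hypothetical "standard format" that every driver setting the foreign
 * bit would have to place in vm_private_data. */
struct foreign_map_info {
    void (*clear_pte)(struct vma_sketch *vma, unsigned long vaddr,
                      void *priv);
    void *priv;                  /* driver state, e.g. its handle array */
};

/* Same contract as find_vma(): first area whose vm_end is above addr. */
static struct vma_sketch *find_vma_sketch(struct mm_sketch *mm,
                                          unsigned long addr)
{
    struct vma_sketch *vma;

    for (vma = mm->mmap; vma; vma = vma->vm_next)
        if (vma->vm_end > addr)
            return vma;
    return NULL;
}

/* What a pte-clearing hook could do once it has seen the foreign bit.
 * Returns 1 if a driver callback ran, 0 if the caller should fall back
 * to the normal path. */
static int clear_foreign_pte(struct mm_sketch *mm, unsigned long vaddr)
{
    struct vma_sketch *vma;
    struct foreign_map_info *info;

    assert(mm != NULL);
    vma = find_vma_sketch(mm, vaddr);
    if (!vma || vaddr < vma->vm_start || !vma->vm_private_data)
        return 0;
    info = vma->vm_private_data;
    if (!info->clear_pte)
        return 0;
    info->clear_pte(vma, vaddr, info->priv);
    return 1;
}

/* Example driver callback: just counts how often it was invoked. */
static void count_clears(struct vma_sketch *vma, unsigned long vaddr,
                         void *priv)
{
    (void)vma;
    (void)vaddr;
    ++*(int *)priv;
}
```

A driver such as blktap or gntdev would put its own unmap routine where
count_clears sits and keep its grant handles behind priv.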

> Only downside that I can see is that we would need to find the vma for
> each PTE that needs to be cleared this way (since we don't get this
> passed to ptep_get_and_clear_full), but this is mitigated by (i) it
> only happening in the erroneous, unclean-shutdown case, and (ii)
> getting a hit in the mm->mmap_cache for consecutive runs of mapped
> grants.

Yes.  find_vma is fairly hot, since it's used on every fault, so it
should be reasonably fast.  And it doesn't sound like our case is
particularly performance critical.

J


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-12 Thread Keir Fraser
We already make the VM_FOREIGN check conditional on defined(CONFIG_XEN). We
could add defined(CONFIG_X86) as well? This would seem reasonable as a
temporary measure for the old 2.6.18 tree.

 -- Keir

On 12/12/07 08:27, "Isaku Yamahata" <[EMAIL PROTECTED]> wrote:

> This patch breaks blktap and gntdev on ia64.
> With auto-translated physmap mode enabled, blktap/gntdev update
> the pte entry with vm_insert_page() rather than updating it directly
> with the hypercall.
> So when zapping the pte entry, it is necessary to drop the page
> reference count, undo the rmap, and so on. Thus vm_normal_page() has
> to return the struct page when auto-translated physmap mode is enabled.
> 
> How about passing a struct page ** to the zap_pte callback
> and setting it to NULL if necessary?
> (Or can the condition be changed to check for auto-translated physmap
> mode? Or should the cleanup be done in the zap_pte callback?)




Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-12 Thread Isaku Yamahata
On Wed, Dec 05, 2007 at 06:15:49PM +, Derek Murray wrote:
> Keir Fraser wrote:
> >Yes, this would work okay I suspect. Good enough as a stop-gap measure? Are
> >there any other responsibilities that you acquire if you make use of
> >VM_FOREIGN (in particular, how would this affect get_user_pages)?
> 
> VM_FOREIGN is already set for the gntdev VMA (mostly because it's 
> directly based on the blktap code). That means that it has the array of 
> page_structs in its vm_private_data, which can be used to fulfill a 
> get_user_pages call. I've attached a patch based on this fix.
> 
> Regards,
> 
> Derek.

Hi Derek. Sorry for this late alert.

This patch breaks blktap and gntdev on ia64.
With auto-translated physmap mode enabled, blktap/gntdev update
the pte entry with vm_insert_page() rather than updating it directly
with the hypercall.
So when zapping the pte entry, it is necessary to drop the page
reference count, undo the rmap, and so on. Thus vm_normal_page() has
to return the struct page when auto-translated physmap mode is enabled.

How about passing a struct page ** to the zap_pte callback
and setting it to NULL if necessary?
(Or can the condition be changed to check for auto-translated physmap
mode? Or should the cleanup be done in the zap_pte callback?)
-- 
yamahata


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-12 Thread Isaku Yamahata
On Wed, Dec 12, 2007 at 08:39:41AM +, Keir Fraser wrote:
> We already make the VM_FOREIGN check conditional on defined(CONFIG_XEN). We
> could add defined(CONFIG_X86) as well? This would seem reasonable as a
> temporary measure for the old 2.6.18 tree.

Yes, ok for IA64.
-- 
yamahata


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-21 Thread Gerd Hoffmann
D.G. Murray wrote:
> Hi Mark, 
> void *xc_gnttab_map_grant_refs(int xcg_handle,
>                                uint32_t count,
>                                uint32_t *domids,
>                                uint32_t *refs,
>                                int prot);

Fedora 8 has 3.1.2 packages now, still doesn't work for me though.

Bored at xmas?  Want to try fixing it?  Fetch xenner 0.15 from
http://dl.bytesex.org/releases/xenner/, build ("make blkbackd"), run it
as a drop-in replacement for blktap.  You have to pass the "-b" switch to
make it try batching grant maps.  Code is in ioreq_map(), blkbackd.c.

Oh, and I think the limit should be raised.  32 requests with up
to 11 segments each sums up to 352 pages, which is way beyond the
current 128 grants limit, so it may fail under heavy I/O load.
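
The arithmetic, spelled out as a small sketch.  The constants are the
numbers quoted above (32 requests, up to 11 pages each, 128 grants);
the names are invented for illustration and are not taken from the
blkif or gntdev headers.

```c
#include <assert.h>

/* Values taken from the message above; the names are illustrative only. */
enum {
    RING_REQUESTS     = 32,   /* requests that can be in flight at once */
    PAGES_PER_REQUEST = 11,   /* upper bound on pages one request maps */
    GNTDEV_MAX_GRANTS = 128   /* current limit quoted above */
};

/* Grants needed if every in-flight request maps its maximum page count. */
static int worst_case_grants(int requests, int pages_per_request)
{
    assert(requests >= 0 && pages_per_request >= 0);
    return requests * pages_per_request;
}

/* Number of completely full requests that fit under a given grant limit. */
static int full_requests_within(int limit, int pages_per_request)
{
    assert(limit >= 0 && pages_per_request > 0);
    return limit / pages_per_request;
}
```

With these numbers worst_case_grants(RING_REQUESTS, PAGES_PER_REQUEST)
is 352, while full_requests_within(GNTDEV_MAX_GRANTS, PAGES_PER_REQUEST)
is 11, i.e. only 11 full-sized requests (121 pages) fit under the
current limit.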

cheers and happy xmas,

  Gerd

