Re: [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

2019-03-14 Thread Andrea Arcangeli
Hello Jason,

On Thu, Mar 14, 2019 at 09:49:03PM +0800, Jason Wang wrote:
> Yes since we don't want to slow down 32bit.

If you have a lot of RAM there's no justification to stick to a 32bit
kernel, so I don't think there's a need to maintain a separate model
just for 32bit. I really wouldn't care about the performance of 32bit
with >700MB of RAM if that would cause any maintenance burden. Let's
focus on the best 64bit implementation, which will work equally
optimally on 32bit with <= 700M of RAM.

Talking to Jerome about the set_page_dirty issue, he raised the point
of what happens if two threads call an mmu notifier invalidate
simultaneously. The first mmu notifier could call set_page_dirty and
then proceed into try_to_free_buffers or page_mkclean, and then the
concurrent mmu notifier that arrives second must not call
set_page_dirty a second time.

With KVM spte mappings and vhost mappings you would call
set_page_dirty (if you invoked gup with FOLL_WRITE) only when
effectively tearing down any secondary mapping (you've got pointers in
both cases for the mapping). So there's no way to risk a double
set_page_dirty from concurrent mmu notifier invalidates, because the
invalidate takes a lock when it has to tear down the mapping, so
set_page_dirty is only run in the first invalidate method and not in
the second. The spte case is even better, as you wouldn't need to call
it even at teardown time unless the spte is dirty (if the shadow mask
allows dirty sptes with EPT2 or NPT or shadow MMU).

If you instead had to invalidate a secondary MMU mapping that isn't
tracked by the driver (again: not vhost nor KVM case), you could have
used the dirty bit of the kernel pagetable to call set_page_dirty and
disambiguate but that's really messy, and it would prevent the use of
gigapages in the direct mapping too and it'd require vmap for 4k
tracking.

To make sure set_page_dirty is run a single time, no matter whether
the invalidate knows when a mapping is torn down, I suggested the
below model:

  access = FOLL_WRITE

repeat:
  page = gup_fast(access)
  put_page(page) /* need a way to drop FOLL_GET from gup_fast instead! */

  spin_lock(mmu_notifier_lock);
  if (race with invalidate) {
    spin_unlock..
    goto repeat;
  }
  if (access == FOLL_WRITE)
    set_page_dirty(page)
  establish writable mapping in secondary MMU on page
  spin_unlock

(replace spin_lock with mutex_lock for vhost of course if you stick to
a mutex and _start/_end instead of non-sleepable ->invalidate_range)

"race with invalidate" is the usual "mmu_notifier_retry" in kvm_host.h
to be implemented for vhost.
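For reference, here is a minimal sketch of what such a retry check
could look like on the vhost side, mirroring kvm's
mmu_notifier_retry(); the structure and field names (invalidate_seq,
invalidate_count, etc.) are assumptions for illustration, not code
from the posted series:

#include <linux/spinlock.h>
#include <linux/types.h>
#include <linux/mm_types.h>

/* Hypothetical per-ring state; all names are made up for the sketch. */
struct vhost_ring_map {
	spinlock_t lock;		/* the "mmu_notifier_lock" above */
	unsigned long invalidate_seq;	/* bumped by invalidate _end */
	int invalidate_count;		/* >0 between _start and _end */
	struct page *page;		/* the established "mapping", or NULL */
};

/*
 * Equivalent of kvm's mmu_notifier_retry(): with map->lock held, report
 * whether a gup_fast done while "seq" was sampled may have raced with an
 * invalidate and therefore has to be repeated.
 */
static bool vhost_gup_retry(struct vhost_ring_map *map, unsigned long seq)
{
	if (map->invalidate_count)
		return true;	/* an invalidate is running right now */
	if (map->invalidate_seq != seq)
		return true;	/* an invalidate completed since the sample */
	return false;
}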

We could add a FOLL_DIRTY flag, to be used together with FOLL_TOUCH,
to move the set_page_dirty inside GUP and force it (currently it's not
forced if the linux pte is already dirty). And we could remove FOLL_GET.

Then if you have the ability to disambiguate which is the first
invalidate that tears down a mapping to any given page (vhost can do
that trivially, it's just a pointer to a page struct to kmap), in the
mmu notifier invalidate just before dropping the spinlock you would
do this check:

def vhost_invalidate_range_start():
    [..]
    spin_lock(mmu_notifier_lock);
    [..]
    if (vhost->page_pointer) {
        if (access == FOLL_WRITE)
            VM_WARN_ON(!PageDirty(vhost->page_pointer));
        vhost->page_pointer = NULL;
        /* no put_page, already done at gup time */
    }
    spin_unlock(..

Generally speaking set_page_dirty is safer done after the last
modification to the data of the page. However, the way stable pages
work, as long as the mmu notifier invalidate didn't run, the PG_dirty
cannot go away.

So this model solves the issue by guaranteeing a single
set_page_dirty is run before page_mkclean or try_to_free_buffers can
run, even for drivers that implement the invalidate as a generic "TLB
flush" across the whole secondary MMU and that cannot disambiguate the
first invalidate from a second invalidate if they're issued
concurrently on the same address by two different CPUs.

So for those drivers that can disambiguate trivially (like KVM and
vhost) we'll just add a debug check in the invalidate to validate the
common code for all mmu notifier users.

This is the solution for RDMA hardware and everything else that can
support mmu notifiers: they can take infinitely long secondary MMU
mappings (i.e. long term pins, but without pinning), perfectly safe
and transparent to the whole stable page code, without interfering
with stable pages at all.

I think O_DIRECT for stable pages shall be solved by taking the page
lock or writeback lock or a new rwsem in the inode that is taken for
writing by page_mkclean and try_to_free_buffers and for reading by
outstanding O_DIRECT in-flight I/O, like I suggested probably ages ago,
but then we only made GUP take the page pin, which is fine for mmu
notifier actually (except those didn't exist back then). To solve
O_DIRECT we can leverage the 100% guarantee that the pin will be
dropped ASAP and stop page_mkclean and stop or trylock in
try_to_fr

Re: [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

2019-03-12 Thread Andrea Arcangeli
On Tue, Mar 12, 2019 at 03:02:54PM -0700, James Bottomley wrote:
> I'm sure there must be workarounds elsewhere in the other arch code
> otherwise things like this, which appear all over drivers/, wouldn't
> work:
> 
> drivers/scsi/isci/request.c:1430
> 
>   kaddr = kmap_atomic(page);
>   memcpy(kaddr + sg->offset, src_addr, copy_len);
>   kunmap_atomic(kaddr);
> 

Are you sure "page" is a userland page with an alias address?

sg->page_link = (unsigned long)virt_to_page(addr);

page_link seems to point to kernel memory.

I found an apparent solution similar to parisc's in the arm 32bit code:

void __kunmap_atomic(void *kvaddr)
{
	unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;
	int idx, type;

	if (kvaddr >= (void *)FIXADDR_START) {
		type = kmap_atomic_idx();
		idx = FIX_KMAP_BEGIN + type + KM_TYPE_NR * smp_processor_id();

		if (cache_is_vivt())
			__cpuc_flush_dcache_area((void *)vaddr, PAGE_SIZE);

However on arm 64bit kunmap_atomic is not implemented at all, and
other 32bit implementations don't do it either; sparc for example
seems to do the cache flush too, but only if the kernel is built with
CONFIG_DEBUG_HIGHMEM (which makes the flushing conditional on the
debug option).

The kunmap_atomic where fixmap is used is flushing the tlb lazily, so
even on 32bit you can't be sure there was a tlb flush for each single
page you unmapped. So it's hard to see how the above can work safely
if "page" had been a userland page mapped with an aliased CPU cache.

> the sequence dirties the kernel virtual address but doesn't flush
> before doing kunmap.  There are hundreds of other examples which is why
> I think adding flush_kernel_dcache_page() is an already lost cause.

In lots of cases kmap is needed just to modify kernel memory, not to
modify userland memory (where get/put_user is more commonly used
instead..); there's no cache aliasing in such a case.

> Actually copy_user_page() is unused in the main kernel.  The big
> problem is copy_user_highpage() but that's mostly highly optimised by
> the VIPT architectures (in other words you can fiddle with kmap without
> impacting it).

copy_user_page is not unused, it's called precisely by
copy_user_highpage, which is why the cache flushes are done inside
copy_user_page.

static inline void copy_user_highpage(struct page *to, struct page *from,
	unsigned long vaddr, struct vm_area_struct *vma)
{
	char *vfrom, *vto;

	vfrom = kmap_atomic(from);
	vto = kmap_atomic(to);
	copy_user_page(vto, vfrom, vaddr, to);
	kunmap_atomic(vto);
	kunmap_atomic(vfrom);
}


Re: [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

2019-03-12 Thread Andrea Arcangeli
On Tue, Mar 12, 2019 at 02:19:15PM -0700, James Bottomley wrote:
> I mean in the sequence
> 
> flush_dcache_page(page);
> flush_dcache_page(page);
> 
> The first flush_dcache_page did all the work and the second it a
> tightly pipelined no-op.  That's what I mean by there not really being
> a double hit.

Ok I wasn't sure it was clear there was a double (profiling) hit on
that function.

void flush_kernel_dcache_page_addr(void *addr)
{
	unsigned long flags;

	flush_kernel_dcache_page_asm(addr);
	purge_tlb_start(flags);
	pdtlb_kernel(addr);
	purge_tlb_end(flags);
}

#define purge_tlb_start(flags)	spin_lock_irqsave(&pa_tlb_lock, flags)
#define purge_tlb_end(flags)	spin_unlock_irqrestore(&pa_tlb_lock, flags)

You've got a system-wide spinlock in there that won't just go away
the second time. So it's a bit more than a tightly pipelined "no-op".

Your logic of adding the flush on kunmap makes sense; all I'm saying
is that it's sacrificing some performance for safety. You asked
"optimized what": I meant to optimize away all the above quoted code
that will end up running twice for each vhost set_bit when it should
run just once like in other archs. And it clearly paid off until now
(until now it ran just once and it was the only safe one).

Before we can leverage your idea to flush the dcache on kunmap in
common code without having to sacrifice performance in arch code, we'd
need to change all other archs to add the cache flushes on kunmap too,
and then remove the cache flushes from the other places like copy_page,
or we'd waste CPU. Then you'd have the best of both worlds: no double
flush, and kunmap would be enough.

Thanks,
Andrea


Re: [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

2019-03-12 Thread Andrea Arcangeli
On Tue, Mar 12, 2019 at 01:53:37PM -0700, James Bottomley wrote:
> I've got to say: optimize what?  What code do we ever have in the
> kernel that kmap's a page and then doesn't do anything with it? You can
> guarantee that on kunmap the page is either referenced (needs
> invalidating) or updated (needs flushing). The in-kernel use of kmap is
> always
> 
> kmap
> do something with the mapped page
> kunmap
> 
> In a very short interval.  It seems just a simplification to make
> kunmap do the flush if needed rather than try to have the users
> remember.  The thing which makes this really simple is that on most
> architectures flush and invalidate is the same operation.  If you
> really want to optimize you can use the referenced and dirty bits on
> the kmapped pte to tell you what operation to do, but if your flush is
> your invalidate, you simply assume the data needs flushing on kunmap
> without checking anything.

Except other archs like arm64 and sparc do the cache flushing on
copy_to_user_page and copy_user_page, not on kunmap.

#define copy_user_page(to,from,vaddr,pg) __cpu_copy_user_page(to, from, vaddr)
void __cpu_copy_user_page(void *kto, const void *kfrom, unsigned long vaddr)
{
struct page *page = virt_to_page(kto);
copy_page(kto, kfrom);
flush_dcache_page(page);
}
#define copy_user_page(to, from, vaddr, page)   \
do {copy_page(to, from);\
sparc_flush_page_to_ram(page);  \
} while (0)

And they do nothing on kunmap:

static inline void kunmap(struct page *page)
{
	BUG_ON(in_interrupt());
	if (!PageHighMem(page))
		return;
	kunmap_high(page);
}
void kunmap_high(struct page *page)
{
	unsigned long vaddr;
	unsigned long nr;
	unsigned long flags;
	int need_wakeup;
	unsigned int color = get_pkmap_color(page);
	wait_queue_head_t *pkmap_map_wait;

	lock_kmap_any(flags);
	vaddr = (unsigned long)page_address(page);
	BUG_ON(!vaddr);
	nr = PKMAP_NR(vaddr);

	/*
	 * A count must never go down to zero
	 * without a TLB flush!
	 */
	need_wakeup = 0;
	switch (--pkmap_count[nr]) {
	case 0:
		BUG();
	case 1:
		/*
		 * Avoid an unnecessary wake_up() function call.
		 * The common case is pkmap_count[] == 1, but
		 * no waiters.
		 * The tasks queued in the wait-queue are guarded
		 * by both the lock in the wait-queue-head and by
		 * the kmap_lock.  As the kmap_lock is held here,
		 * no need for the wait-queue-head's lock.  Simply
		 * test if the queue is empty.
		 */
		pkmap_map_wait = get_pkmap_wait_queue_head(color);
		need_wakeup = waitqueue_active(pkmap_map_wait);
	}
	unlock_kmap_any(flags);

	/* do wake-up, if needed, race-free outside of the spin lock */
	if (need_wakeup)
		wake_up(pkmap_map_wait);
}
static inline void kunmap(struct page *page)
{
}

because they already did it just above.


> > Which means after we fix vhost to add the flush_dcache_page after
> > kunmap, Parisc will get a double hit (but it also means Parisc was
> > the only one of those archs needed explicit cache flushes, where
> > vhost worked correctly so far.. so it kinds of proofs your point of
> > giving up being the safe choice).
> 
> What double hit?  If there's no cache to flush then cache flush is a
> no-op.  It's also a highly piplineable no-op because the CPU has the L1
> cache within easy reach.  The only event when flush takes a large
> amount time is if we actually have dirty data to write back to main
> memory.

The double hit is in parisc copy_to_user_page:

#define copy_to_user_page(vma, page, vaddr, dst, src, len) \
do { \
flush_cache_page(vma, vaddr, page_to_pfn(page)); \
memcpy(dst, src, len); \
flush_kernel_dcache_range_asm((unsigned long)dst, (unsigned long)dst + len); \
} while (0)

That is executed just before kunmap:

static inline void kunmap(struct page *page)
{
	flush_kernel_dcache_page_addr(page_address(page));
}

I can't argue with the fact that your "safer" kunmap is safer, but we
cannot rely on common code unless we remove some optimization from the
common code abstractions and make all archs do kunmap like parisc.

Thanks,
Andrea


Re: [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

2019-03-12 Thread Andrea Arcangeli
On Tue, Mar 12, 2019 at 08:46:50AM -0700, James Bottomley wrote:
> On Tue, 2019-03-12 at 07:54 -0400, Michael S. Tsirkin wrote:
> > On Tue, Mar 12, 2019 at 03:17:00PM +0800, Jason Wang wrote:
> > > 
> > > > On 2019/3/12 11:52 AM, Michael S. Tsirkin wrote:
> > > > On Tue, Mar 12, 2019 at 10:59:09AM +0800, Jason Wang wrote:
> [...]
> > > At least for -stable, we need the flush?
> > > 
> > > 
> > > > Three atomic ops per bit is way to expensive.
> > > 
> > > 
> > > Yes.
> > > 
> > > Thanks
> > 
> > See James's reply - I stand corrected we do kunmap so no need to
> > flush.
> 
> Well, I said that's what we do on Parisc.  The cachetlb document
> definitely says if you alter the data between kmap and kunmap you are
> responsible for the flush.  It's just that flush_dcache_page() is a no-
> op on x86 so they never remember to add it and since it will crash
> parisc if you get it wrong we finally gave up trying to make them.
> 
> But that's the point: it is a no-op on your favourite architecture so
> it costs you nothing to add it.

Yes, the fact Parisc gave up and is doing it on kunmap is a
reasonable approach for Parisc, but it doesn't move the needle as far
as vhost common code is concerned, because other archs don't flush any
cache on kunmap.

So either all other archs give up trying to optimize, or vhost still
has to call flush_dcache_page() after kunmap.

Which means after we fix vhost to add the flush_dcache_page after
kunmap, Parisc will get a double hit (but it also means Parisc was the
only one of those archs that needed explicit cache flushes, while
vhost worked correctly so far on the others.. so it kind of proves
your point that giving up is the safe choice).

Thanks,
Andrea


Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-03-11 Thread Andrea Arcangeli
On Mon, Mar 11, 2019 at 08:48:37AM -0400, Michael S. Tsirkin wrote:
> Using copyXuser is better I guess.

It certainly would be faster there, but I don't think it's needed if
that would be the only use case left that justifies supporting two
different models. On small 32bit systems with little RAM kmap won't
perform measurably different on 32bit or 64bit systems. If the 32bit
host has a lot of ram it all gets slow anyway at accessing RAM above
the direct mapping, if compared to 64bit host kernels, it's not just
an issue for vhost + mmu notifier + kmap and the best way to optimize
things is to run 64bit host kernels.

Like Christoph pointed out, the main use case for retaining the
copy-user model would be CPUs with virtually indexed, not physically
tagged, data caches (they'll still suffer from the spectre-v1 fix,
although I rule out that they have to suffer the SMAP
slowdown/feature). Those may require some additional flushing beyond
what the current copy-user model requires.

As a rule of thumb, any arch where copy_user_page doesn't simply
define to copy_page will require some additional cache flushing after
the kmap. Supposedly with vmap, the vmap layer should have taken care
of that (I didn't verify that yet).

There are some accessors like copy_to_user_page() and
copy_from_user_page() that could work, and they obviously define to a
raw memcpy on x86 (the main con is they don't provide word granular
access), but at least on sparc they're tailored to ptrace assumptions,
so we'd need to evaluate what happens if they're used outside of the
ptrace context. kmap has generally been used to access whole pages
(i.e. copy_user_page), so ptrace may actually be the only use case
with subpage granularity access.

#define copy_to_user_page(vma, page, vaddr, dst, src, len)  \
do {\
flush_cache_page(vma, vaddr, page_to_pfn(page));\
memcpy(dst, src, len);  \
flush_ptrace_access(vma, page, vaddr, src, len, 0); \
} while (0)

So I wouldn't rule out the need for a dual model until we solve how
to run this stably on non-x86 arches whose caches are not physically
tagged.

Thanks,
Andrea


Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-03-08 Thread Andrea Arcangeli
Hello Jason,

On Fri, Mar 08, 2019 at 04:50:36PM +0800, Jason Wang wrote:
> Just to make sure I understand here. For boosting through huge TLB, do 
> you mean we can do that in the future (e.g by mapping more userspace 
> pages to kenrel) or it can be done by this series (only about three 4K 
> pages were vmapped per virtqueue)?

When I answered about the advantages of mmu notifier and I mentioned
guaranteed 2m/gigapages where available, I overlooked the detail you
were using vmap instead of kmap. So with vmap you're actually doing
the opposite, it slows down the access because it will always use a 4k
TLB even if QEMU runs on THP or gigapages hugetlbfs.

If there's just one page (or a few pages) in each vmap there's no
need for vmap; the linearity vmap provides doesn't pay off in such a
case.

So likely there's further room for improvement here that you can
achieve in the current series by just dropping vmap/vunmap.

You can just use kmap (or kmap_atomic if you're in a non-preemptible
section; it should work from bh/irq).

In short, in the mmu notifier, the invalidate only sets a "struct page *
userringpage" pointer to NULL, without calls to vunmap.

In all cases, immediately after gup_fast returns you can always call
put_page (which explains why I'd like an option to drop FOLL_GET from
gup_fast to speed it up).

Then you can check the sequence counter and the inc/dec counter
increased by _start/_end. That will tell you whether the page you got
(on which you already called put_page to immediately unpin it, or even
free it) is guaranteed not to go away from under you until the
invalidate is called.

If the sequence counter and counter tell you that gup_fast raced with
any mmu notifier invalidate, you can just repeat gup_fast. Otherwise
you're done: the page cannot go away from under you, and the host
virtual to host physical mapping cannot change either. And the page is
not pinned either. So you can just set "struct page * userringpage =
page" where "page" was the one set up by gup_fast.

When the invalidate later runs, you can just call set_page_dirty if
gup_fast was called with "write = 1", and then clear the pointer:
"userringpage = NULL".

When you need to read/write to the memory
kmap/kmap_atomic(userringpage) should work.

In short, because there's no hardware involvement here, the
established mapping is just the pointer to the page; there is no need
to set up any pagetables or to do any TLB flushes (except on 32bit
archs if the page is above the direct mapping, but that never happens
on 64bit archs).
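
To make that concrete, here is a minimal sketch of the scheme under
discussion, written against invented names (struct
vhost_userring_sketch, invalidate_seq, invalidate_count,
userring_page); it is not the code of the posted series, and the
gup_fast() flags argument is a guess since its signature differs
between kernel versions (older kernels take a plain write flag):

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/mutex.h>
#include <linux/errno.h>

/* Hypothetical per-virtqueue state; all field names are assumptions. */
struct vhost_userring_sketch {
	struct mutex mutex;		/* also taken by _start/_end */
	unsigned long invalidate_seq;	/* bumped by invalidate_range_end */
	int invalidate_count;		/* >0 between _start and _end */
	struct page *userring_page;	/* the established "mapping" */
	bool write;
};

static int vhost_map_userring(struct vhost_userring_sketch *vq,
			      unsigned long uaddr, bool write)
{
	struct page *page;
	unsigned long seq;

again:
	seq = READ_ONCE(vq->invalidate_seq);
	if (get_user_pages_fast(uaddr, 1, write ? FOLL_WRITE : 0, &page) != 1)
		return -EFAULT;
	put_page(page);		/* drop the pin right away, as described */

	mutex_lock(&vq->mutex);
	if (vq->invalidate_count || vq->invalidate_seq != seq) {
		mutex_unlock(&vq->mutex);
		goto again;	/* raced with an invalidate, redo gup_fast */
	}
	vq->userring_page = page;
	vq->write = write;
	mutex_unlock(&vq->mutex);
	return 0;
}

/* Called from invalidate_range_start covering uaddr, with vq->mutex held. */
static void vhost_unmap_userring(struct vhost_userring_sketch *vq)
{
	if (vq->userring_page) {
		if (vq->write)
			set_page_dirty(vq->userring_page);
		vq->userring_page = NULL;	/* no put_page: dropped at gup time */
	}
}

/* Reading/writing the ring data goes through kmap_atomic of the page. */
static u8 vhost_ring_read_byte(struct vhost_userring_sketch *vq,
			       unsigned int offset)
{
	u8 *kaddr = kmap_atomic(vq->userring_page);
	u8 val = kaddr[offset];

	kunmap_atomic(kaddr);
	return val;
}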

Thanks,
Andrea


Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-03-08 Thread Andrea Arcangeli
On Fri, Mar 08, 2019 at 04:58:44PM +0800, Jason Wang wrote:
> Can I simply can set_page_dirty() before vunmap() in the mmu notifier 
> callback, or is there any reason that it must be called within vumap()?

I also don't see any problem in doing it before vunmap. As far as the
mmu notifier and set_page_dirty is concerned vunmap is just
put_page. It's just slower and potentially unnecessary.


Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-03-08 Thread Andrea Arcangeli
On Fri, Mar 08, 2019 at 05:13:26PM +0800, Jason Wang wrote:
> Actually not wrapping around,  the pages for used ring was marked as 
> dirty after a round of virtqueue processing when we're sure vhost wrote 
> something there.

Thanks for the clarification. So we need to convert it to
set_page_dirty and move it to the mmu notifier invalidate, but only in
those cases where gup_fast was called with write=1 (1 out of 3).

If using ->invalidate_range, the page pin also must be removed
immediately after get_user_pages returns (it's not ok to hold the pin
in vmap until ->invalidate_range is called), to avoid false positive
gup pin checks in things like KSM; alternatively the pin must be
released in invalidate_range_start (which is called before the pin
checks).

Here's why:

/*
 * Check that no O_DIRECT or similar I/O is in progress on the
 * page
 */
if (page_mapcount(page) + 1 + swapped != page_count(page)) {
	set_pte_at(mm, pvmw.address, pvmw.pte, entry);
	goto out_unlock;
}
[..]
set_pte_at_notify(mm, pvmw.address, pvmw.pte, entry);
  ^^^ too late to release the pin here,
      the check above already failed

->invalidate_range cannot be used with a mutex anyway, so you need to
go back to invalidate_range_start/end anyway; it's just that in such a
case the pin must be released in _start at the latest.

My preference is generally to call gup_fast() followed by an
immediate put_page(), because I always want to drop FOLL_GET from
gup_fast as a whole to avoid 2 useless atomic ops per gup_fast.

I'll write more about vmap in answer to the other email.

Thanks,
Andrea


Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-03-07 Thread Andrea Arcangeli
Hello Jerome,

On Thu, Mar 07, 2019 at 03:17:22PM -0500, Jerome Glisse wrote:
> So for the above the easiest thing is to call set_page_dirty() from
> the mmu notifier callback. It is always safe to use the non locking
> variant from such callback. Well it is safe only if the page was
> map with write permission prior to the callback so here i assume
> nothing stupid is going on and that you only vmap page with write
> if they have a CPU pte with write and if not then you force a write
> page fault.

So if the GUP doesn't set FOLL_WRITE, set_page_dirty simply shouldn't
be called in such case. It only ever makes sense if the pte is
writable.

On a side note, the reason the write bit being enabled on the pte
avoids the need for the _lock suffix is because of the stable page
writeback guarantees?

> Basicly from mmu notifier callback you have the same right as zap
> pte has.

Good point.

Related to this, I was already wondering why the set_page_dirty is
not done in the invalidate. Reading the patch it looks like the page
is marked dirty when the ring wraps around, not in the invalidate;
Jason can tell if I misread something there.

For transient data passing through the ring, nobody should care if
it's lost. It's not user-journaled anyway so it could hit the disk in
any order. The only reason to flush it to disk is if there's memory
pressure (to pageout like a swapout), and in such a case it's enough
to mark it dirty only in the mmu notifier invalidate like you pointed
out (and only if GUP was called with FOLL_WRITE).

> O_DIRECT can suffer from the same issue but the race window for that
> is small enough that it is unlikely it ever happened. But for device

Ok that clarifies things.

> driver that GUP page for hours/days/weeks/months ... obviously the
> race window is big enough here. It affects many fs (ext4, xfs, ...)
> in different ways. I think ext4 is the most obvious because of the
> kernel log trace it leaves behind.
> 
> Bottom line is for set_page_dirty to be safe you need the following:
> lock_page()
> page_mkwrite()
> set_pte_with_write()
> unlock_page()

I also wondered why ext4 writepage doesn't recreate the bh if they
got dropped by the VM and page->private is 0. I mean, page->index and
page->mapping are still there; that's enough info for writepage itself
to take a slow path and call page_mkwrite to find where to write the
page on disk.

> Now when loosing the write permission on the pte you will first get
> a mmu notifier callback so anyone that abide by mmu notifier is fine
> as long as they only write to the page if they found a pte with
> write as it means the above sequence did happen and page is write-
> able until the mmu notifier callback happens.
>
> When you lookup a page into the page cache you still need to call
> page_mkwrite() before installing a write-able pte.
> 
> Here for this vmap thing all you need is that the original user
> pte had the write flag. If you only allow write in the vmap when
> the original pte had write and you abide by mmu notifier then it
> is ok to call set_page_dirty from the mmu notifier (but not after).
> 
> Hence why my suggestion is a special vunmap that call set_page_dirty
> on the page from the mmu notifier.

Agreed, that will solve all issues in the vhost context with regard
to set_page_dirty, including the case where the memory is backed by
VM_SHARED ext4.

Thanks!
Andrea


Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-03-07 Thread Andrea Arcangeli
On Thu, Mar 07, 2019 at 02:09:10PM -0500, Jerome Glisse wrote:
> I thought this patch was only for anonymous memory ie not file back ?

Yes, the other common usages are on hugetlbfs/tmpfs that also don't
need to implement writeback and are obviously safe too.

> If so then set dirty is mostly useless it would only be use for swap
> but for this you can use an unlock version to set the page dirty.

It's not a practical issue but perhaps a security issue: you can
change the KVM userland to run on VM_SHARED ext4 as guest physical
memory; you could do that with the qemu command line that is used to
place it on tmpfs or hugetlbfs for example, and some proprietary KVM
userland may do so for other reasons. In general it shouldn't be
possible to crash the kernel with this, and it wouldn't be nice to
fail if somebody decides to use VM_SHARED ext4 (we could easily allow
the vhost ring to be backed only by anon or tmpfs or hugetlbfs to
solve this of course).

It sounds like we should at least optimize away the _lock from
set_page_dirty if it's anon/hugetlbfs/tmpfs, would be nice if there
was a clean way to do that.

Now assuming we don't nak the use on ext4 VM_SHARED and we stick to
set_page_dirty_lock for such a case: could you recap how that
__writepage ext4 crash was solved if try_to_free_buffers() runs on a
pinned GUP page (in our vhost case try_to_unmap would have gotten rid
of the pins through the mmu notifier and the page would have been
freed just fine)?

The first thing that comes to mind is that we can easily forbid
try_to_free_buffers() if the page might be pinned by GUP; that check
has false positives with the speculative pagecache lookups but it
cannot give false negatives. We use those checks to know when a page
is pinned by GUP, for example, where we cannot merge KSM pages with
gup pins etc... However, what if the elevated refcount wasn't there
when try_to_free_buffers ran and is there when __remove_mapping runs?

What I mean is that it sounds easy to forbid try_to_free_buffers for
the long term pins, but that still won't prevent the same exact issue
for a transient pin (except the window to trigger it will be much smaller).

I basically don't see how long term GUP pins break stuff in ext4
while transient short term GUP pins like O_DIRECT don't. The VM code
isn't able to disambiguate whether the pin is short or long term, and
it won't even be able to tell the difference between a GUP pin (long
or short term) and a speculative get_page_unless_zero run by the
speculative pagecache lookup. Even a random speculative pagecache
lookup that runs just before __remove_mapping can cause
__remove_mapping to fail despite try_to_free_buffers() having
succeeded before it (just as if there was a transient or long term GUP
pin). Speculative lookups can happen on any page struct at any time
and they will cause page_ref_freeze in __remove_mapping to fail.

I'm sure I'm missing details on the ext4 __writepage problem and how
set_page_dirty_lock broke stuff with long term GUP pins, so I'm
asking...

Thanks!
Andrea


Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

2019-03-07 Thread Andrea Arcangeli
On Thu, Mar 07, 2019 at 12:56:45PM -0500, Michael S. Tsirkin wrote:
> On Thu, Mar 07, 2019 at 10:47:22AM -0500, Michael S. Tsirkin wrote:
> > On Wed, Mar 06, 2019 at 02:18:12AM -0500, Jason Wang wrote:
> > > +static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
> > > + .invalidate_range = vhost_invalidate_range,
> > > +};
> > > +
> > >  void vhost_dev_init(struct vhost_dev *dev,
> > >   struct vhost_virtqueue **vqs, int nvqs, int iov_limit)
> > >  {
> > 
> > I also wonder here: when page is write protected then
> > it does not look like .invalidate_range is invoked.
> > 
> > E.g. mm/ksm.c calls
> > 
> > mmu_notifier_invalidate_range_start and
> > mmu_notifier_invalidate_range_end but not mmu_notifier_invalidate_range.
> > 
> > Similarly, rmap in page_mkclean_one will not call
> > mmu_notifier_invalidate_range.
> > 
> > If I'm right vhost won't get notified when page is write-protected since you
> > didn't install start/end notifiers. Note that end notifier can be called
> > with page locked, so it's not as straight-forward as just adding a call.
> > Writing into a write-protected page isn't a good idea.
> > 
> > Note that documentation says:
> > it is fine to delay the mmu_notifier_invalidate_range
> > call to mmu_notifier_invalidate_range_end() outside the page table lock.
> > implying it's called just later.
> 
> OK I missed the fact that _end actually calls
> mmu_notifier_invalidate_range internally. So that part is fine but the
> fact that you are trying to take page lock under VQ mutex and take same
> mutex within notifier probably means it's broken for ksm and rmap at
> least since these call invalidate with lock taken.

Yes, this lock inversion needs more thought.

> And generally, Andrea told me offline one can not take mutex under
> the notifier callback. I CC'd Andrea for why.

Yes, the problem then is that ->invalidate_range is called under the
PT lock, so it cannot take a mutex; you also cannot take the
page_lock, it can at most take a spinlock or trylock_page.

So it must switch back to the _start/_end methods unless you rewrite
the locking.

The difference with _start/_end is that ->invalidate_range basically
avoids the _start callback, but to avoid the _start callback safely it
has to be called in between the ptep_clear_flush and the set_pte_at
whenever the pfn changes, like during a COW. So it cannot be coalesced
into a single TLB flush that invalidates all sptes in a range like we
prefer for performance reasons for example in KVM. It also cannot
sleep.

In short ->invalidate_range must be really fast (it shouldn't require
sending IPIs to all other CPUs like KVM may require during an
invalidate_range_start) and it must not sleep, in order for it to be
preferred over _start/_end.

I.e. the invalidate of the secondary MMU that walks the linux
pagetables in hardware (in vhost case with GUP in software) has to
happen while the linux pagetable is zero, otherwise a concurrent
hardware pagetable lookup could re-instantiate a mapping to the old
page in between the set_pte_at and the invalidate_range_end (which
internally calls ->invalidate_range). Jerome documented it nicely in
Documentation/vm/mmu_notifier.rst .

Now you don't really walk the pagetable in hardware in vhost, but if
you use gup_fast after use_mm() it's similar.

For vhost the invalidate would be really fast, there are no IPI to
deliver at all, the problem is just the mutex.

> That's a separate issue from set_page_dirty when memory is file backed.

Yes. I don't yet know why the ext4 internal __writepage cannot
re-create the bh if they've been freed by the VM and why such race
where the bh are freed for a pinned VM_SHARED ext4 page doesn't even
exist for transient pins like O_DIRECT (does it work by luck?), but
with mmu notifiers there are no long term pins anyway, so this works
normally and it's like the memory isn't pinned. In any case I think
that's a kernel bug in either __writepage or try_to_free_buffers, so I
would ignore it considering qemu will only use anon memory or tmpfs or
hugetlbfs as backing store for the virtio ring. It wouldn't make sense
for qemu to risk triggering I/O on a VM_SHARED ext4, so we shouldn't
be even exposed to what seems to be an orthogonal kernel bug.

I suppose whatever solution fixes the set_page_dirty_lock on
VM_SHARED ext4 for the other places that don't or can't use mmu
notifiers will then work for vhost too, which uses mmu notifiers and
will be less affected from the start, if at all.

Reading the lwn link about the discussion of the long term GUP pin
from Jan vs set_page_dirty_lock: I can only agree with the last part,
where Jerome correctly pointed out at the end that mellanox RDMA got
it right by completely avoiding long term pins by using mmu notifier,
and in general mmu notifier is the standard solution to avoid long
term pins. Nothing should ever take long term GUP pins; if it does it
means the software is bad or the hardware lacks features to support on
demand paging. Still I

Re: [PATCH v7 08/11] x86, kvm/x86.c: support vcpu preempted check

2016-12-19 Thread Andrea Arcangeli
Hello,

On Wed, Nov 02, 2016 at 05:08:35AM -0400, Pan Xinhui wrote:
> Support the vcpu_is_preempted() functionality under KVM. This will
> enhance lock performance on overcommitted hosts (more runnable vcpus
> than physical cpus in the system) as doing busy waits for preempted
> vcpus will hurt system performance far worse than early yielding.
> 
> Use one field of struct kvm_steal_time ::preempted to indicate that if
> one vcpu is running or not.
> 
> Signed-off-by: Pan Xinhui 
> Acked-by: Paolo Bonzini 
> ---
>  arch/x86/include/uapi/asm/kvm_para.h |  4 +++-
>  arch/x86/kvm/x86.c   | 16 
>  2 files changed, 19 insertions(+), 1 deletion(-)
> 
[..]
> +static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
> +{
> + if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
> + return;
> +
> + vcpu->arch.st.steal.preempted = 1;
> +
> + kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.st.stime,
> + &vcpu->arch.st.steal.preempted,
> + offsetof(struct kvm_steal_time, preempted),
> + sizeof(vcpu->arch.st.steal.preempted));
> +}
> +
>  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>  {
> + kvm_steal_time_set_preempted(vcpu);
>   kvm_x86_ops->vcpu_put(vcpu);
>   kvm_put_guest_fpu(vcpu);
>   vcpu->arch.last_host_tsc = rdtsc();

You can't call kvm_steal_time_set_preempted in atomic context
(neither in the sched_out notifier nor in vcpu_put() after
preempt_disable()). __copy_to_user in kvm_write_guest_offset_cached
schedules and locks up the host.

kvm->srcu (or kvm->slots_lock) is also not taken, and
kvm_write_guest_offset_cached needs to call kvm_memslots which
requires it.
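
For illustration only, a minimal sketch (not the patches applied
below) of what taking the srcu read side around the call would look
like, on top of the kvm_arch_vcpu_put() quoted above; the
atomic-context problem is a separate issue and is what the first patch
below works around:

void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
{
	int idx;

	/*
	 * kvm_write_guest_offset_cached() ends up in kvm_memslots(),
	 * which requires kvm->srcu (or kvm->slots_lock) to be held.
	 */
	idx = srcu_read_lock(&vcpu->kvm->srcu);
	kvm_steal_time_set_preempted(vcpu);
	srcu_read_unlock(&vcpu->kvm->srcu, idx);

	kvm_x86_ops->vcpu_put(vcpu);
	kvm_put_guest_fpu(vcpu);
	vcpu->arch.last_host_tsc = rdtsc();
}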

This I think is why postcopy live migration locks up with current
upstream, and it doesn't seem related to userfaultfd at all (initially
I suspected the vmf conversion but it wasn't that) and in theory it
can happen with heavy swapping or page migration too.

It's just that the page is written so frequently that it's unlikely
to be swapped out. The page being written so frequently also means
it's very likely found re-dirtied when postcopy starts, and that
pretty much guarantees a userfault will trigger a scheduling event in
kvm_steal_time_set_preempted on the destination. There are opposite
probabilities of reproducing this with swapping vs postcopy live
migration.

For now I applied the below two patches, but this will just skip the
write and only prevent the host instability, as nobody checks the
retval of __copy_to_user (what happens to the guest after the write is
skipped is not as clear and should be investigated, but at least the
host will survive, and not all guests will care about this flag being
updated). For this to be fully safe the preempted information should
be just a hint and not fundamental for correct functionality of the
guest pv spinlock code.

This bug was introduced in commit
0b9f6c4615c993d2b552e0d2bd1ade49b56e5beb in v4.9-rc7.

From 458897fd44aa9b91459a006caa4051a7d1628a23 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli 
Date: Sat, 17 Dec 2016 18:43:52 +0100
Subject: [PATCH 1/2] kvm: fix schedule in atomic in
 kvm_steal_time_set_preempted()

kvm_steal_time_set_preempted() isn't disabling the pagefaults before
calling __copy_to_user and the kernel debug notices.

Signed-off-by: Andrea Arcangeli 
---
 arch/x86/kvm/x86.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1f0d238..2dabaeb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2844,7 +2844,17 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
+   /*
+* Disable page faults because we're in atomic context here.
+* kvm_write_guest_offset_cached() would call might_fault()
+* that relies on pagefault_disable() to tell if there's a
+* bug. NOTE: the write to guest memory may not go through if
+* during postcopy live migration or if there's heavy guest
+* paging.
+*/
+   pagefault_disable();
kvm_steal_time_set_preempted(vcpu);
+   pagefault_enable();
    kvm_x86_ops->vcpu_put(vcpu);
kvm_put_guest_fpu(vcpu);
vcpu->arch.last_host_tsc = rdtsc();


From 2845eba22ac74c5e313e3b590f9dac33e1b3cfef Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli 
Date: Sat, 17 Dec 2016 19:13:32 +0100
Subject: [PATCH 2/2] kvm: take srcu lock around kvm_steal_time_set_preempted()

kvm_memslots() will be called by kvm_write_guest_offset_cached() so
take the srcu lock.

Signed-off-by: Andrea Arcangeli 
---
 arch/x86/kvm/x86.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2dabaeb..02e6ab4 100644
---

Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-16 Thread Andrea Arcangeli
On Thu, Dec 15, 2016 at 05:40:45PM -0800, Dave Hansen wrote:
> On 12/15/2016 05:38 PM, Li, Liang Z wrote:
> > 
> > Use 52 bits for 'pfn', 12 bits for 'length', when the 12 bits is not long 
> > enough for the 'length'
> > Set the 'length' to a special value to indicate the "actual length in next 
> > 8 bytes".
> > 
> > That will be much more simple. Right?
> 
> Sounds fine to me.
> 

Sounds fine to me too indeed.

I'm only wondering what the major point is of compressing gpfn+len
into 8 bytes in the common case; you already use sg_init_table to send
down two pages, we could send three as well and avoid all the math and
bit shifts and ors, or not?

I agree with the above because from a performance perspective I tend
to think the above proposal will run at least theoretically faster,
because the other way is to waste double the amount of CPU cache, and
bit mangling in the encoding and the later decoding on the qemu side
should be faster than accessing an array of double the size; but then
I'm not sure it's a measurable optimization. So I'd be curious to know
the exact motivation, and whether it is to reduce the CPU cache usage
or if there's some other fundamental reason to compress it.

The header already tells qemu how big is the array payload, couldn't
we just add more pages if one isn't enough?
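
For what it's worth, here is a small userspace sketch of the 52-bit
pfn + 12-bit length encoding quoted above; the LEN_EXTENDED sentinel
("actual length in next 8 bytes") and the helper names are assumptions
for illustration:

#include <stdint.h>
#include <stdio.h>

#define LEN_BITS	12
#define LEN_MASK	((1ULL << LEN_BITS) - 1)
#define LEN_EXTENDED	LEN_MASK	/* "actual length in next 8 bytes" */

/* Pack pfn and len into one 64bit word, spilling large lengths to *extra. */
static uint64_t encode(uint64_t pfn, uint64_t len, uint64_t *extra, int *has_extra)
{
	if (len < LEN_EXTENDED) {
		*has_extra = 0;
		return (pfn << LEN_BITS) | len;
	}
	*has_extra = 1;
	*extra = len;			/* full length goes in the next 8 bytes */
	return (pfn << LEN_BITS) | LEN_EXTENDED;
}

int main(void)
{
	uint64_t extra = 0;
	int has_extra;
	uint64_t e = encode(0x12345, 512, &extra, &has_extra);

	printf("pfn=%#llx len=%llu extended=%d\n",
	       (unsigned long long)(e >> LEN_BITS),
	       (unsigned long long)(e & LEN_MASK), has_extra);
	return 0;
}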

Thanks,
Andrea


Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-16 Thread Andrea Arcangeli
On Fri, Dec 16, 2016 at 01:12:21AM +, Li, Liang Z wrote:
> There still exist the case if the MAX_ORDER is configured to a large value, 
> e.g. 36 for a system
> with huge amount of memory, then there is only 28 bits left for the pfn, 
> which is not enough.

Not related to the balloon but how would it help to set MAX_ORDER to
36?

What the MAX_ORDER affects is that you won't be able to ask the kernel
page allocator for contiguous memory bigger than 1<<(MAX_ORDER-1), but
that's a driver issue not relevant to the amount of RAM. Drivers won't
suddenly start to ask the kernel allocator to allocate compound pages
at orders >= 11 just because more RAM was added.

The higher the MAX_ORDER the slower the kernel runs, quite simply, so
the smaller the MAX_ORDER the better.

> Should  we limit the MAX_ORDER? I don't think so.

We shouldn't strictly depend on MAX_ORDER value but it's mostly
limited already even if configurable at build time.

We definitely need it to reach at least the hugepage size, then it's
mostly driver issue, but drivers requiring large contiguous
allocations should rely on CMA only or vmalloc if they only require it
virtually contiguous, and not rely on larger MAX_ORDER that would
slowdown all kernel allocations/freeing.


Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-09 Thread Andrea Arcangeli
Hello,

On Fri, Dec 09, 2016 at 05:35:45AM +, Li, Liang Z wrote:
> > On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > > What's the conclusion of your discussion? It seems you want some
> > > statistic before deciding whether to  ripping the bitmap from the ABI,
> > > am I right?
> > 
> > I think Andrea and David feel pretty strongly that we should remove the
> > bitmap, unless we have some data to support keeping it.  I don't feel as
> > strongly about it, but I think their critique of it is pretty valid.  I 
> > think the
> > consensus is that the bitmap needs to go.
> > 
> 
> Thanks for you clarification.
> 
> > The only real question IMNHO is whether we should do a power-of-2 or a
> > length.  But, if we have 12 bits, then the argument for doing length is 
> > pretty
> > strong.  We don't need anywhere near 12 bits if doing power-of-2.
> > 
> So each item can max represent 16MB Bytes, seems not big enough,
> but enough for most case.
> Things became much more simple without the bitmap, and I like simple solution 
> too. :)
> 
> I will prepare the v6 and remove all the bitmap related stuffs. Thank you all!

Sounds great!

I suggested checking the statistics because collecting those stats
looked simpler and quicker than removing all bitmap related stuff from
the patchset. However, if you prefer to prepare a v6 without the
bitmap, another perhaps more interesting way to evaluate the
usefulness of the bitmap is to just run the same benchmark and verify
that there is no regression compared to the bitmap enabled code.

The other issue with the bitmap is that its best case is ever less
likely to materialize the more RAM is added to the guest. It won't
regress linearly, because after all there can be some locality bias in
the buddy splits, but if sync compaction is used in the large order
allocations tried before reaching order 0, the bitmap payoff will
regress close to linearly with the increase of RAM.

So it'd be good to check the stats or the benchmark on large guests,
at least one hundred gigabytes or so.

Changing topic but still about the ABI features needed, so it may be
relevant for this discussion:

1) vNUMA locality: i.e. allowing host to specify which vNODEs to take
   memory from, using alloc_pages_node in guest. So you can ask to
   take X pages from vnode A, Y pages from vnode B, in one vmenter.

2) allowing qemu to tell the guest to stop inflating the balloon and
   report a fragmentation limit being hit, when sync compaction
   powered allocations fails at certain power-of-two order granularity
   passed by qemu to the guest. This order constraint will be passed
   by default for hugetlbfs guests with 2MB hpage size, while it can
   be used optionally on THP backed guests. This option with THP
   guests would allow a highlevel management software to provide a
   "don't reduce guest performance" while shrinking the memory size of
   the guest from the GUI. If you deselect the option, you can shrink
   down to the last freeable 4k guest page, but doing so may have to
   split THP in the host (you don't know for sure if they were really
   THP but they could have been), and it may regress
   performance. Inflating the balloon while passing a minimum
   granularity "order" of the pages being zapped, will guarantee
   inflating the balloon cannot decrease guest performance
   instead. Plus it's needed for hugetlbfs anyway as far as I can
   tell. hugetlbfs would not be host enforceable even if the idea is
   not to free memory but only reduce the available memory of the
   guest (not without major changes that maps a hugetlb page with 4k
   ptes at least). While for a more cooperative usage of hugetlbfs
   guests, it's simply not useful to inflate the balloon at anything
   less than the "HPAGE_SIZE" hugetlbfs granularity.

We also plan to use userfaultfd to make the balloon driver host
enforced (will work fine on hugetlbfs 2M and tmpfs too) but that's
going to be invisible to the ABI so it's not strictly relevant for
this discussion.

On a side note, registering userfaultfd on the ballooned range will
keep khugepaged at bay, so it won't risk re-inflating the
MADV_DONTNEED zapped sub-THP fragments no matter the sysfs tunings.

Thanks!
Andrea


Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-07 Thread Andrea Arcangeli
On Wed, Dec 07, 2016 at 11:54:34AM -0800, Dave Hansen wrote:
> We're talking about a bunch of different stuff which is all being
> conflated.  There are 3 issues here that I can see.  I'll attempt to
> summarize what I think is going on:
> 
> 1. Current patches do a hypercall for each order in the allocator.
>This is inefficient, but independent from the underlying data
>structure in the ABI, unless bitmaps are in play, which they aren't.
> 2. Should we have bitmaps in the ABI, even if they are not in use by the
>guest implementation today?  Andrea says they have zero benefits
>over a pfn/len scheme.  Dave doesn't think they have zero benefits
>but isn't that attached to them.  QEMU's handling gets more
>complicated when using a bitmap.
> 3. Should the ABI contain records each with a pfn/len pair or a
>pfn/order pair?
>3a. 'len' is more flexible, but will always be a power-of-two anyway
>   for high-order pages (the common case)

Len wouldn't be a power of two practically only if we detect adjacent
pages of smaller order that may merge into larger orders we already
allocated (or the other way around).

[addr=2M, len=2M] allocated at order 9 pass
[addr=4M, len=1M] allocated at order 8 pass -> merge as [addr=2M, len=3M]

Not sure if it would be worth it, but unless we do this, page-order
vs len won't make much difference.
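
As an illustration only, a tiny userspace sketch of merging adjacent
[addr,len] records as in the example above; the struct and function
names are made up:

#include <stdint.h>
#include <stdio.h>

struct range { uint64_t addr, len; };

/* Append a new range, merging it with the previous one when adjacent. */
static size_t add_range(struct range *r, size_t n, uint64_t addr, uint64_t len)
{
	if (n && r[n - 1].addr + r[n - 1].len == addr) {
		r[n - 1].len += len;	/* e.g. [2M,2M] + [4M,1M] -> [2M,3M] */
		return n;
	}
	r[n].addr = addr;
	r[n].len = len;
	return n + 1;
}

int main(void)
{
	struct range r[4];
	size_t n = 0;

	n = add_range(r, n, 2ULL << 20, 2ULL << 20);	/* order 9 block at 2M */
	n = add_range(r, n, 4ULL << 20, 1ULL << 20);	/* order 8 block at 4M */
	printf("%zu range(s): addr=%#llx len=%#llx\n", n,
	       (unsigned long long)r[0].addr, (unsigned long long)r[0].len);
	return 0;
}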

>3b. if we decide not to have a bitmap, then we basically have plenty
>   of space for 'len' and should just do it
>3c. It's easiest for the hypervisor to turn pfn/len into the
>madvise() calls that it needs.
> 
> Did I miss anything?

I think you summarized fine all my arguments in your summary.

> FWIW, I don't feel that strongly about the bitmap.  Li had one
> originally, but I think the code thus far has demonstrated a huge
> benefit without even having a bitmap.
> 
> I've got no objections to ripping the bitmap out of the ABI.

I think we need to see a statistic showing the number of bits set in
each bitmap on average, after some uptime and lru churn, e.g. after
running a stresstest app for a while with I/O and then inflating the
balloon, and count:

1) how many bits were set vs total number of bits used in bitmaps

2) how many times bitmaps were used vs bitmap_len = 0 case of single
   page

My guess would be like very low percentage for both points.

> Surely we can think of a few ways...
> 
> A bitmap is 64x more dense if the lists are unordered.  It means being
> able to store ~32k*2M=64G worth of 2M pages in one data page vs. ~1G.
> That's 64x fewer cachelines to touch, 64x fewer pages to move to the
> hypervisor and lets us allocate 1/64th the memory.  Given a maximum
> allocation that we're allowed, it lets us do 64x more per-pass.
> 
> Now, are those benefits worth it?  Maybe not, but let's not pretend they
> don't exist. ;)

In the best case there are benefits obviously; the question is how
common the best case is.

The best case, if I understand correctly, is all high orders not
available, but plenty of order 0 pages available at phys addresses X,
X+8k, X+16k, ..., X+(8k*nr_bits_in_bitmap). How common is it that
order 0 pages exist but none of them is at an address < X or >
X+(8k*nr_bits_in_bitmap)?

> Yes, the current code sends one batch of pages up to the hypervisor per
> order.  But, this has nothing to do with the underlying data structure,
> or the choice to have an order vs. len in the ABI.
> 
> What you describe here is obviously more efficient.

And it isn't possible with the current ABI.

So there is a connection between the MAX_ORDER..0 allocation loop and
the ABI change, but I agree any of the proposed ABIs would still allow
this logic to be used. Bitmap or no bitmap, the loop would still
work.
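
For reference, a schematic (not the real driver code) of the
MAX_ORDER..0 allocation loop being discussed; the report_range()
callback, which would batch [addr,len] ranges for a single vmexit, is
hypothetical:

#include <linux/gfp.h>
#include <linux/mm.h>

/* Hypothetical callback that records one [pfn, nr_pages] range. */
void report_range(unsigned long pfn, unsigned long nr_pages);

static void balloon_inflate_sketch(unsigned long nr_pages_wanted)
{
	int order;

	for (order = MAX_ORDER - 1; order >= 0 && nr_pages_wanted; order--) {
		while (nr_pages_wanted >= (1UL << order)) {
			struct page *page;

			page = alloc_pages(GFP_HIGHUSER_MOVABLE |
					   __GFP_NORETRY | __GFP_NOWARN,
					   order);
			if (!page)
				break;	/* fall back to the next lower order */
			report_range(page_to_pfn(page), 1UL << order);
			nr_pages_wanted -= 1UL << order;
		}
	}
}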



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-07 Thread Andrea Arcangeli
On Wed, Dec 07, 2016 at 10:44:31AM -0800, Dave Hansen wrote:
> On 12/07/2016 10:38 AM, Andrea Arcangeli wrote:
> >> > and leaves room for the bitmap size to be encoded as well, if we decide
> >> > we need a bitmap in the future.
> > How would a bitmap ever be useful with very large page-order?
> 
> Please, guys.  Read the patches.  *Please*.

I did read the code but you didn't answer my question.

Why should a feature exist in the code that will never be useful?
Why do you think we could ever decide we'll need the bitmap in the
future for high order pages?

> The current code doesn't even _use_ a bitmap.

It's not being used right now; my question is exactly when it will
ever be used.

Leaving the bitmap only for order 0 allocations, when you already
wiped all higher page orders from the buddy, doesn't seem a very good
idea overall, as the chance you got order 0 pages with close physical
addresses doesn't seem very high. It would be high if the loop that
eats into every possible higher order didn't run first, but that loop
just ran and already wiped everything.

Also note, we need to call compaction very aggressively before
falling back from order 9 down to order 8. Ideally we should never use
the page_shift = PAGE_SHIFT case at all! Which leaves the bitmap at
best as an optimization for a case that is already suboptimal. If the
bitmap starts to pay off it means the admin made a mistake and shrunk
the guest too much.


Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-07 Thread Andrea Arcangeli
Hello,

On Wed, Dec 07, 2016 at 08:57:01AM -0800, Dave Hansen wrote:
> It is more space-efficient.  We're fitting the order into 6 bits, which
> would allows the full 2^64 address space to be represented in one entry,

Very large order is the same as very large len, 6 bits of order or 8
bytes of len won't really move the needle here, simpler code is
preferable.

The main benefit of "len" is that it can be more granular, plus it's
simpler than the bitmap too. Eventually all this stuff has to end up
into a madvisev (not yet upstream but somebody posted it for jemalloc
and should get merged eventually).

So the bitmap will have to be demuxed to an addr,len array anyway;
the bitmap won't ever be sent to the madvise syscall, which makes the
intermediate representation with the bitmap a complication with
basically no benefits compared to a (N, [addr1,len1], .., [addrN,
lenN]) representation.

If you prefer 1 byte of order (not just 6 bits) instead of 8 bytes of
len that's possible too, I wouldn't be against that; the conversion
before calling madvise would be pretty efficient too.
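
Purely as an illustration, converting a (pfn, order) record into the
(addr, len) pair a later madvise()/madvisev() call would want; the
names are made up and PAGE_SHIFT is assumed to be 12:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12

static void record_to_range(uint64_t pfn, unsigned int order,
			    uint64_t *addr, uint64_t *len)
{
	*addr = pfn << PAGE_SHIFT;
	*len  = (uint64_t)1 << (order + PAGE_SHIFT);
}

int main(void)
{
	uint64_t addr, len;

	record_to_range(0x200, 9, &addr, &len);	/* one 2MB (order 9) block */
	printf("addr=%#llx len=%llu\n",
	       (unsigned long long)addr, (unsigned long long)len);
	return 0;
}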

> and leaves room for the bitmap size to be encoded as well, if we decide
> we need a bitmap in the future.

How would a bitmap ever be useful with very large page-order?

> If that was purely a length, we'd be limited to 64*4k pages per entry,
> which isn't even a full large page.

I don't follow here.

What we suggest is to send the data down represented as (N,
[addr1,len1], ..., [addrN, lenN]), which allows an unlimited number of
ranges each of maximum length 2^64, so 2^64 multiplied an unlimited
number of times if you wish. That simplifies the code, with no bitmap
at all and no :6 bitfields either.

The high order to low order loop of allocations is the interesting part
here, not the bitmap, and the fact of doing a single vmexit to send
the large ranges.

Once we pull out the largest order regions, we just add them to the
array as [addr,1UL

Re: [patch 0/6] Guest page hinting version 6.

2008-03-13 Thread Andrea Arcangeli
On Thu, Mar 13, 2008 at 10:45:07AM -0700, Zachary Amsden wrote:
> What doesn't appear to be useful however, is support for this under
> VMware.  It can be done, even without the writable pte support (yes,
> really).  But due to us exploiting optimizations at lower layers, it
> doesn't appear that it will gain us any performance - and we must
> already have the complex working set algorithms to support
> non-paravirtualized guests.

With non-paravirt all you can do is swap the guest physical memory
(mmu notifiers allow linux to do that) or share memory (mmu notifiers
+ ksm allow linux to do that too). We also have complex working set
algorithms that we use to find which parts of the guest physical
address space are best to swap first: the core linux VM.

What paravirt allows us to do (and that's the whole point of the
paper I guess) is to go one step further than just guest swapping and
to ask the guest if the page really needs to be swapped or if it can
be freed right away. So this would be an extension of the mmu
notifiers (this also shows how the EMM API is too restrictive, while
MMU notifiers will allow that extension in the future) to avoid I/O
sometimes, if the guest tells us through paravirt ops that it's not
necessary to swap.

When talking with friends about ballooning I already once suggested
auto-inflating the balloon with pages from the freelist.

Now this paper goes well beyond the pages in the freelist (called
U/unused in the paper), this also covers cache and mapped-clean cache
in the guest. That would have been the next step.

Anyway plain ballooning remains useful for rss limiting or numa
compartments in the linux hypervisor, to provide unfairness to certain
guests.

I didn't read the patch yet, but I think paravirt knowledge about
U/unused pages is needed to avoid guest swapping. The cache and mapped
cache in the guest is a gray area, because linux as hypervisor will be
extremely efficient at swapping out and swapping in the guest cache
(host-swapping the guest cache may be faster than the guest re-issuing
a read-I/O to refill the cache by itself, clearly with the guest using
paravirt). Let's say I'm mostly interested in page-hinting for the U
pages initially.

I'm currently busy with two other features and with trying to get mmu
notifier #v9 into mainline, which is orders of magnitude more
important than avoiding a few swapouts sometimes (without mmu
notifiers everything else is irrelevant, including guest page hinting
and including ballooning too, because madvise(MADV_DONTNEED) won't
clear sptes and invalidate guest tlbs).


Re: [kvm-devel] [PATCH] kvm guest balloon driver

2008-01-09 Thread Andrea Arcangeli
On Tue, Jan 08, 2008 at 09:42:13AM -0600, Anthony Liguori wrote:
> Instead of allocating a node for each page, you could use page->private 

page->lru is probably better for this so splice still works
etc... (the struct page isn't visible to the guest VM so it's free to
use)