Re: [virtio-dev] Re: guest / host buffer sharing ...

2019-12-12 Thread David Stevens
> > > Without buffer sharing support the driver importing a virtio-gpu dma-buf
> > > can send the buffer scatter list to the host.  So both virtio-gpu and
> > > the other device would actually access the same guest pages, but they
> > > are not aware that the buffer is shared between devices.
> >
> > With the uuid approach, how should this case be handled? Should it be
> > equivalent to exporting and importing the buffer which was created
> > first? Should the spec say it's undefined behavior that might work as
> > expected but might not, depending on the device implementation? Does
> > the spec even need to say anything about it?
>
> Using the uuid is an optional optimization.  I'd expect the workflow be
> roughly this:
>
>   (1) exporting driver exports a dma-buf as usual, additionally attaches
>   a uuid to it and notifies the host (using device-specific commands).
>   (2) importing driver will ask the host to use the buffer referenced by
>   the given uuid.
>   (3) if (2) fails for some reason use the dma-buf scatter list instead.
>
> Of course only virtio drivers would try step (2), other drivers (when
> sharing buffers between intel gvt device and virtio-gpu for example)
> would go straight to (3).

For virtio-gpu as it is today, it's not clear to me that they're
equivalent. As I read it, the virtio-gpu spec makes a distinction
between the guest memory and the host resource. If virtio-gpu is
communicating with non-virtio devices, then obviously you'd just be
working with guest memory. But if it's communicating with another
virtio device, then there are potentially distinct guest and host
buffers that could be used. The spec shouldn't leave any room for
ambiguity as to how this distinction is handled.

> > Not just buffers not backed by guest ram, but things like fences. I
> > would suggest the uuids represent 'exported resources' rather than
> > 'exported buffers'.
>
> Hmm, I can't see how this is useful.  Care to outline how you envision
> this to work in a typical use case?

Looking at the spec again, it seems like there's some more work that
would need to be done before this would be possible. But the use case
I was thinking of would be to export a fence from virtio-gpu and share
it with a virtio decoder, to set up a decode pipeline that doesn't
need to go back into the guest for synchronization. I'm fine dropping
this point for now, though, and revisiting it as a separate proposal.

-David

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH RFC v4 13/13] virtio-mem: Drop slab objects when unplug continues to fail

2019-12-12 Thread David Hildenbrand
Start dropping slab objects after 30 minutes and repeat every 30 minutes
in case we can't unplug more memory using alloc_contig_range().
Log messages and make it configurable. Enable dropping slab objects by
default (especially reclaimable slab objects, which are not movable).

In the future, we might want to implement+use drop_slab_range(), which
will also come in handy for other users (e.g., offlining, gigantic huge
pages).

Cc: Alexander Viro 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Oscar Salvador 
Cc: Igor Mammedov 
Cc: Dave Young 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Stefan Hajnoczi 
Cc: Vlastimil Babka 
Cc: linux-fsde...@vger.kernel.org
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 3a57434f92ed..8f25f7453a08 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -25,6 +25,11 @@ static bool unplug_online = true;
 module_param(unplug_online, bool, 0644);
 MODULE_PARM_DESC(unplug_online, "Try to unplug online memory");
 
+static bool drop_slab_objects = true;
+module_param(drop_slab_objects, bool, 0644);
+MODULE_PARM_DESC(drop_slab_objects,
+"Drop slab objects when unplug continues to fail");
+
 enum virtio_mem_mb_state {
/* Unplugged, not added to Linux. Can be reused later. */
VIRTIO_MEM_MB_STATE_UNUSED = 0,
@@ -1384,6 +1389,7 @@ static int virtio_mem_mb_unplug_any_sb_online(struct virtio_mem *vm,
 static int virtio_mem_unplug_request(struct virtio_mem *vm, uint64_t diff)
 {
uint64_t nb_sb = diff / vm->subblock_size;
+   bool retried = false;
unsigned long mb_id;
int rc;
 
@@ -1421,6 +1427,7 @@ static int virtio_mem_unplug_request(struct virtio_mem *vm, uint64_t diff)
return 0;
}
 
+retry_locked:
/* Try to unplug subblocks of partially plugged online blocks. */
virtio_mem_for_each_mb_state(vm, mb_id,
 VIRTIO_MEM_MB_STATE_ONLINE_PARTIAL) {
@@ -1445,6 +1452,29 @@ static int virtio_mem_unplug_request(struct virtio_mem *vm, uint64_t diff)
}
 
	mutex_unlock(&vm->hotplug_mutex);
+
+	/*
+	 * If we can't unplug the requested amount of memory for a long time,
+	 * start freeing up memory in caches. This might harm performance,
+	 * is configurable, and we log a message. Retry immediately a second
+	 * time - then wait another VIRTIO_MEM_RETRY_TIMER_MAX_MS.
+	 */
+   if (nb_sb && !retried && drop_slab_objects &&
+   vm->retry_timer_ms == VIRTIO_MEM_RETRY_TIMER_MAX_MS) {
+   if (vm->nid == NUMA_NO_NODE) {
+			dev_info(&vm->vdev->dev, "dropping all slab objects\n");
+   drop_slab();
+   } else {
+			dev_info(&vm->vdev->dev,
+				 "dropping all slab objects on node=%d\n",
+				 vm->nid);
+   drop_slab_node(vm->nid);
+   }
+   retried = true;
+		mutex_lock(&vm->hotplug_mutex);
+   goto retry_locked;
+   }
+
return nb_sb ? -EBUSY : 0;
 out_unlock:
	mutex_unlock(&vm->hotplug_mutex);
-- 
2.23.0





[virtio-dev] [PATCH RFC v4 11/13] mm/vmscan: Move count_vm_event(DROP_SLAB) into drop_slab()

2019-12-12 Thread David Hildenbrand
Let's count within the function itself, so every invocation (of future
users) will be counted.

Cc: Alexander Viro 
Cc: Andrew Morton 
Cc: linux-fsde...@vger.kernel.org
Signed-off-by: David Hildenbrand 
---
 fs/drop_caches.c | 4 +---
 mm/vmscan.c  | 1 +
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index d31b6c72b476..a042da782fcd 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -61,10 +61,8 @@ int drop_caches_sysctl_handler(struct ctl_table *table, int write,
iterate_supers(drop_pagecache_sb, NULL);
count_vm_event(DROP_PAGECACHE);
}
-   if (sysctl_drop_caches & 2) {
+   if (sysctl_drop_caches & 2)
drop_slab();
-   count_vm_event(DROP_SLAB);
-   }
if (!stfu) {
pr_info("%s (%d): drop_caches: %d\n",
current->comm, task_pid_nr(current),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5a6445e86328..c3e53502a84a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -726,6 +726,7 @@ void drop_slab(void)
 
for_each_online_node(nid)
drop_slab_node(nid);
+   count_vm_event(DROP_SLAB);
 }
 
 static inline int is_page_cache_freeable(struct page *page)
-- 
2.23.0





[virtio-dev] [PATCH RFC v4 09/13] virtio-mem: Offline and remove completely unplugged memory blocks

2019-12-12 Thread David Hildenbrand
Let's offline+remove memory blocks once all subblocks are unplugged. We
can use the new Linux MM interface for that. As no memory is in use
anymore, this shouldn't take a long time and shouldn't fail. There might
be corner cases where the offlining could still fail (especially, if
another notifier NACKs the offlining request).

Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Igor Mammedov 
Cc: Dave Young 
Cc: Andrew Morton 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Stefan Hajnoczi 
Cc: Vlastimil Babka 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 47 +
 1 file changed, 43 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index a12a0f9c076b..807d4e393427 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -436,6 +436,28 @@ static int virtio_mem_mb_remove(struct virtio_mem *vm, unsigned long mb_id)
return remove_memory(nid, addr, memory_block_size_bytes());
 }
 
+/*
+ * Try to offline and remove a memory block from Linux.
+ *
+ * Must not be called with the vm->hotplug_mutex held (possible deadlock with
+ * onlining code).
+ *
+ * Will not modify the state of the memory block.
+ */
+static int virtio_mem_mb_offline_and_remove(struct virtio_mem *vm,
+   unsigned long mb_id)
+{
+   const uint64_t addr = virtio_mem_mb_id_to_phys(mb_id);
+   int nid = vm->nid;
+
+   if (nid == NUMA_NO_NODE)
+   nid = memory_add_physaddr_to_nid(addr);
+
+	dev_dbg(&vm->vdev->dev, "offlining and removing memory block: %lu\n",
+   mb_id);
+   return offline_and_remove_memory(nid, addr, memory_block_size_bytes());
+}
+
 /*
  * Trigger the workqueue so the device can perform its magic.
  */
@@ -529,7 +551,13 @@ static void virtio_mem_notify_offline(struct virtio_mem *vm,
break;
}
 
-   /* trigger the workqueue, maybe we can now unplug memory. */
+   /*
+* Trigger the workqueue, maybe we can now unplug memory. Also,
+* when we offline and remove a memory block, this will re-trigger
+* us immediately - which is often nice because the removal of
+* the memory block (e.g., memmap) might have freed up memory
+* on other memory blocks we manage.
+*/
virtio_mem_retry(vm);
 }
 
@@ -1275,7 +1303,8 @@ static int virtio_mem_mb_unplug_any_sb_offline(struct virtio_mem *vm,
  * Unplug the desired number of plugged subblocks of an online memory block.
  * Will skip subblock that are busy.
  *
- * Will modify the state of the memory block.
+ * Will modify the state of the memory block. Might temporarily drop the
+ * hotplug_mutex.
  *
  * Note: Can fail after some subblocks were successfully unplugged. Can
  *   return 0 even if subblocks were busy and could not get unplugged.
@@ -1331,9 +1360,19 @@ static int virtio_mem_mb_unplug_any_sb_online(struct virtio_mem *vm,
}
 
/*
-* TODO: Once all subblocks of a memory block were unplugged, we want
-* to offline the memory block and remove it.
+* Once all subblocks of a memory block were unplugged, offline and
+* remove it. This will usually not fail, as no memory is in use
+* anymore - however some other notifiers might NACK the request.
 */
+   if (virtio_mem_mb_test_sb_unplugged(vm, mb_id, 0, vm->nb_sb_per_mb)) {
+		mutex_unlock(&vm->hotplug_mutex);
+		rc = virtio_mem_mb_offline_and_remove(vm, mb_id);
+		mutex_lock(&vm->hotplug_mutex);
+   if (!rc)
+   virtio_mem_mb_set_state(vm, mb_id,
+   VIRTIO_MEM_MB_STATE_UNUSED);
+   }
+
return 0;
 }
 
-- 
2.23.0





[virtio-dev] [PATCH RFC v4 12/13] mm/vmscan: Export drop_slab() and drop_slab_node()

2019-12-12 Thread David Hildenbrand
We already have a way to trigger reclaiming of all reclaimable slab objects
from user space (echo 2 > /proc/sys/vm/drop_caches). Let's allow drivers
to also trigger this when they really want to make progress and know what
they are doing.

virtio-mem wants to use these functions when it failed to unplug memory
for quite some time (e.g., after 30 minutes). It will then try to
free up reclaimable objects by dropping the slab caches every now and
then (e.g., every 30 minutes) as long as necessary. There will be a way to
disable this feature and info messages will be logged.

In the future, we want to have a drop_slab_range() functionality
instead. Memory offlining code has similar demands and also other
alloc_contig_range() users (e.g., gigantic pages) could make good use of
this feature. Adding it, however, requires more work/thought.

Cc: Andrew Morton 
Cc: Michal Hocko 
Signed-off-by: David Hildenbrand 
---
 include/linux/mm.h | 4 ++--
 mm/vmscan.c| 2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 64799c5cb39f..483300f58be8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2706,8 +2706,8 @@ int drop_caches_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
 #endif
 
-void drop_slab(void);
-void drop_slab_node(int nid);
+extern void drop_slab(void);
+extern void drop_slab_node(int nid);
 
 #ifndef CONFIG_MMU
 #define randomize_va_space 0
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c3e53502a84a..4e1cdaaec5e6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -719,6 +719,7 @@ void drop_slab_node(int nid)
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
} while (freed > 10);
 }
+EXPORT_SYMBOL(drop_slab_node);
 
 void drop_slab(void)
 {
@@ -728,6 +729,7 @@ void drop_slab(void)
drop_slab_node(nid);
count_vm_event(DROP_SLAB);
 }
+EXPORT_SYMBOL(drop_slab);
 
 static inline int is_page_cache_freeable(struct page *page)
 {
-- 
2.23.0





[virtio-dev] [PATCH RFC v4 10/13] virtio-mem: Better retry handling

2019-12-12 Thread David Hildenbrand
Let's start with a retry interval of 30 seconds and double the time until
we reach 30 minutes, in case we keep getting errors. Reset the retry
interval in case we succeeded.

Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Igor Mammedov 
Cc: Dave Young 
Cc: Andrew Morton 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Stefan Hajnoczi 
Cc: Vlastimil Babka 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 807d4e393427..3a57434f92ed 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -137,7 +137,9 @@ struct virtio_mem {
 
/* Timer for retrying to plug/unplug memory. */
struct hrtimer retry_timer;
-#define VIRTIO_MEM_RETRY_TIMER_MS	30000
+	unsigned int retry_timer_ms;
+#define VIRTIO_MEM_RETRY_TIMER_MIN_MS	30000
+#define VIRTIO_MEM_RETRY_TIMER_MAX_MS	1800000
 
/* Memory notifier (online/offline events). */
struct notifier_block memory_notifier;
@@ -1537,6 +1539,7 @@ static void virtio_mem_run_wq(struct work_struct *work)
 
switch (rc) {
case 0:
+   vm->retry_timer_ms = VIRTIO_MEM_RETRY_TIMER_MIN_MS;
break;
case -ENOSPC:
/*
@@ -1552,8 +1555,7 @@ static void virtio_mem_run_wq(struct work_struct *work)
 */
case -ENOMEM:
/* Out of memory, try again later. */
-		hrtimer_start(&vm->retry_timer,
-			      ms_to_ktime(VIRTIO_MEM_RETRY_TIMER_MS),
+		hrtimer_start(&vm->retry_timer, ms_to_ktime(vm->retry_timer_ms),
  HRTIMER_MODE_REL);
break;
case -EAGAIN:
@@ -1573,6 +1575,9 @@ static enum hrtimer_restart virtio_mem_timer_expired(struct hrtimer *timer)
 retry_timer);
 
virtio_mem_retry(vm);
+   /* Racy (with reset in virtio_mem_run_wq), we ignore that for now. */
+   vm->retry_timer_ms = min_t(unsigned int, vm->retry_timer_ms * 2,
+  VIRTIO_MEM_RETRY_TIMER_MAX_MS);
return HRTIMER_NORESTART;
 }
 
@@ -1746,6 +1751,7 @@ static int virtio_mem_probe(struct virtio_device *vdev)
	spin_lock_init(&vm->removal_lock);
	hrtimer_init(&vm->retry_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
vm->retry_timer.function = virtio_mem_timer_expired;
+   vm->retry_timer_ms = VIRTIO_MEM_RETRY_TIMER_MIN_MS;
 
/* register the virtqueue */
rc = virtio_mem_init_vq(vm);
-- 
2.23.0





[virtio-dev] [PATCH RFC v4 03/13] virtio-mem: Paravirtualized memory hotunplug part 1

2019-12-12 Thread David Hildenbrand
Unplugging subblocks of memory blocks that are offline is easy. All we
have to do is watch out for concurrent onlining activity.

Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Igor Mammedov 
Cc: Dave Young 
Cc: Andrew Morton 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Stefan Hajnoczi 
Cc: Vlastimil Babka 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 109 +++-
 1 file changed, 107 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 8f4ceeab3d7c..f1af05def5df 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -119,7 +119,7 @@ struct virtio_mem {
 *
 * When this lock is held the pointers can't change, ONLINE and
 * OFFLINE blocks can't change the state and no subblocks will get
-* plugged.
+* plugged/unplugged.
 */
struct mutex hotplug_mutex;
bool hotplug_active;
@@ -321,6 +321,19 @@ static bool virtio_mem_mb_test_sb_plugged(struct virtio_mem *vm,
	return find_next_zero_bit(vm->sb_bitmap, bit + count, bit) >=
	       bit + count;
 }
 
+/*
+ * Test if all selected subblocks are unplugged.
+ */
+static bool virtio_mem_mb_test_sb_unplugged(struct virtio_mem *vm,
+   unsigned long mb_id, int sb_id,
+   int count)
+{
+   const int bit = (mb_id - vm->first_mb_id) * vm->nb_sb_per_mb + sb_id;
+
+   /* TODO: Helper similar to bitmap_set() */
+   return find_next_bit(vm->sb_bitmap, bit + count, bit) >= bit + count;
+}
+
 /*
  * Find the first plugged subblock. Returns vm->nb_sb_per_mb in case there is
  * none.
@@ -511,6 +524,9 @@ static void virtio_mem_notify_offline(struct virtio_mem *vm,
BUG();
break;
}
+
+   /* trigger the workqueue, maybe we can now unplug memory. */
+   virtio_mem_retry(vm);
 }
 
 static void virtio_mem_notify_online(struct virtio_mem *vm, unsigned long mb_id,
@@ -1123,6 +1139,93 @@ static int virtio_mem_plug_request(struct virtio_mem *vm, uint64_t diff)
return rc;
 }
 
+/*
+ * Unplug the desired number of plugged subblocks of an offline memory block.
+ * Will fail if any subblock cannot get unplugged (instead of skipping it).
+ *
+ * Will modify the state of the memory block. Might temporarily drop the
+ * hotplug_mutex.
+ *
+ * Note: Can fail after some subblocks were successfully unplugged.
+ */
+static int virtio_mem_mb_unplug_any_sb_offline(struct virtio_mem *vm,
+  unsigned long mb_id,
+  uint64_t *nb_sb)
+{
+   int rc;
+
+   rc = virtio_mem_mb_unplug_any_sb(vm, mb_id, nb_sb);
+
+   /* some subblocks might have been unplugged even on failure */
+   if (!virtio_mem_mb_test_sb_plugged(vm, mb_id, 0, vm->nb_sb_per_mb))
+   virtio_mem_mb_set_state(vm, mb_id,
+   VIRTIO_MEM_MB_STATE_OFFLINE_PARTIAL);
+   if (rc)
+   return rc;
+
+   if (virtio_mem_mb_test_sb_unplugged(vm, mb_id, 0, vm->nb_sb_per_mb)) {
+   /*
+* Remove the block from Linux - this should never fail.
+* Hinder the block from getting onlined by marking it
+* unplugged. Temporarily drop the mutex, so
+* any pending GOING_ONLINE requests can be serviced/rejected.
+*/
+   virtio_mem_mb_set_state(vm, mb_id,
+   VIRTIO_MEM_MB_STATE_UNUSED);
+
+		mutex_unlock(&vm->hotplug_mutex);
+		rc = virtio_mem_mb_remove(vm, mb_id);
+		BUG_ON(rc);
+		mutex_lock(&vm->hotplug_mutex);
+   }
+   return 0;
+}
+
+/*
+ * Try to unplug the requested amount of memory.
+ */
+static int virtio_mem_unplug_request(struct virtio_mem *vm, uint64_t diff)
+{
+   uint64_t nb_sb = diff / vm->subblock_size;
+   unsigned long mb_id;
+   int rc;
+
+   if (!nb_sb)
+   return 0;
+
+   /*
+* We'll drop the mutex a couple of times when it is safe to do so.
+* This might result in some blocks switching the state (online/offline)
+* and we could miss them in this run - we will retry again later.
+*/
+	mutex_lock(&vm->hotplug_mutex);
+
+   /* Try to unplug subblocks of partially plugged offline blocks. */
+   virtio_mem_for_each_mb_state(vm, mb_id,
+VIRTIO_MEM_MB_STATE_OFFLINE_PARTIAL) {
+		rc = virtio_mem_mb_unplug_any_sb_offline(vm, mb_id,
+							 &nb_sb);
+   if (rc || !nb_sb)
+   goto out_unlock;
+   cond_resched();
+   }
+
+   /* Try to unplug subblocks of plugged offline blocks. */
+   virtio_mem_for_each_mb_state(vm, mb_id, VIRTIO_MEM_MB_STATE_OFFLINE) {
+

[virtio-dev] [PATCH RFC v4 02/13] virtio-mem: Paravirtualized memory hotplug

2019-12-12 Thread David Hildenbrand
Each virtio-mem device owns exactly one memory region. It is responsible
for adding/removing memory from that memory region on request.

When the device driver starts up, the requested amount of memory is
queried and then plugged to Linux. On request, further memory can be
plugged or unplugged. This patch only implements the plugging part.

On x86-64, memory can currently be plugged in 4MB ("subblock") granularity.
When required, a new memory block will be added (e.g., usually 128MB on
x86-64) in order to plug more subblocks. Only x86-64 was tested for now.

The online_page callback is used to keep unplugged subblocks offline
when onlining memory - similar to the Hyper-V balloon driver. Unplugged
pages are marked PG_offline, to tell dump tools (e.g., makedumpfile) to
skip them.

User space is usually responsible for onlining the added memory. The
memory hotplug notifier is used to synchronize virtio-mem activity
against memory onlining/offlining.

Each virtio-mem device can belong to a NUMA node, which allows us to
easily add/remove small chunks of memory to/from a specific NUMA node by
using multiple virtio-mem devices. Something that works even when the
guest has no idea about the NUMA topology.

One way to view virtio-mem is as a "resizable DIMM" or a DIMM with many
"sub-DIMMS".

This patch directly introduces the basic infrastructure to implement memory
unplug. Especially the memory block states and subblock bitmaps will be
heavily used there.

Notes:
- Memory blocks that are partially unplugged must not be onlined to the
  MOVABLE_ZONE. They contain unmovable parts and might e.g., result
  in warnings when trying to offline them.
- In the kdump kernel, we don't want to load the driver, to not mess
  with memory to be dumped.
- In case memory is to be onlined by user space, we limit the amount of
  offline memory blocks, to not run out of memory.
- Suspend/Hibernate is not supported due to the way virtio-mem devices
  behave. Limited support might be possible in the future.
- Reloading the device driver is not supported.
- When unloading the driver, we have to remove partially plugged offline
  memory blocks. Otherwise they could get fully onlined later on, when
  we no longer have control over them via the online_page callback.
- As the hypervisor might suddenly be busy (during reboot, in-between
  requests, when adding of memory fails), we have to take care of some
  corner cases - especially virtio_mem_unplug_pending_mb() and
  virtio_mem_send_unplug_all_request(). The hypervisor could for example
  be busy if it is currently migrating the guest.

Future work:
- Reduce the amount of memory resources if that turns out to be an
  issue. Or try to speed up relevant code paths to deal with many
  resources.
- Allocate the vmemmap from the added memory. Makes hotplug more likely
  to succeed, the vmemmap is stored on the same NUMA node and that
  unmovable memory will later not hinder unplug.

Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Igor Mammedov 
Cc: Dave Young 
Cc: Andrew Morton 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Stefan Hajnoczi 
Cc: Vlastimil Babka 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: linux-a...@vger.kernel.org
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/Kconfig  |   17 +
 drivers/virtio/Makefile |1 +
 drivers/virtio/virtio_mem.c | 1569 +++
 include/uapi/linux/virtio_ids.h |1 +
 include/uapi/linux/virtio_mem.h |  204 
 5 files changed, 1792 insertions(+)
 create mode 100644 drivers/virtio/virtio_mem.c
 create mode 100644 include/uapi/linux/virtio_mem.h

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 078615cf2afc..294720d53057 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -64,6 +64,23 @@ config VIRTIO_BALLOON
 
 If unsure, say M.
 
+config VIRTIO_MEM
+   tristate "Virtio mem driver"
+   default m
+   depends on X86_64
+   depends on VIRTIO
+   depends on MEMORY_HOTPLUG_SPARSE
+   depends on MEMORY_HOTREMOVE
+   help
+	 This driver provides access to virtio-mem paravirtualized memory
+	 devices, allowing memory to be hotplugged and hotunplugged.
+
+	 This driver was only tested under x86-64, but should theoretically
+	 work on all architectures that support memory hotplug and
+	 hotremove.
+
+If unsure, say M.
+
 config VIRTIO_INPUT
tristate "Virtio input driver"
depends on VIRTIO
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 3a2b5c5dcf46..906d5a00ac85 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
 virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
 obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
 obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
+obj-$(CONFIG_VIRTIO_MEM) += virtio_mem.o
diff --git a/drivers/virtio/virtio_mem.c 

[virtio-dev] [PATCH RFC v4 07/13] virtio-mem: Allow to offline partially unplugged memory blocks

2019-12-12 Thread David Hildenbrand
Dropping the reference count of PageOffline() pages allows offlining
code to skip them. However, we also have to convert PG_reserved to
another flag - let's use PG_dirty - so has_unmovable_pages() will
properly handle them. PG_reserved pages get detected as unmovable right
away.

We need the flag to see if we are onlining pages the first time, or if
we allocated them via alloc_contig_range().

Properly take care of the offlining code also modifying the stats, and
add special handling in case the driver gets unloaded.

Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Igor Mammedov 
Cc: Dave Young 
Cc: Andrew Morton 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Stefan Hajnoczi 
Cc: Vlastimil Babka 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c | 64 -
 1 file changed, 63 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 5a142a371222..a12a0f9c076b 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -564,6 +564,53 @@ static void virtio_mem_notify_online(struct virtio_mem *vm, unsigned long mb_id,
virtio_mem_retry(vm);
 }
 
+static void virtio_mem_notify_going_offline(struct virtio_mem *vm,
+   unsigned long mb_id)
+{
+   const unsigned long nr_pages = PFN_DOWN(vm->subblock_size);
+   unsigned long pfn;
+   int sb_id, i;
+
+   for (sb_id = 0; sb_id < vm->nb_sb_per_mb; sb_id++) {
+   if (virtio_mem_mb_test_sb_plugged(vm, mb_id, sb_id, 1))
+   continue;
+   /*
+* Drop our reference to the pages so the memory can get
+* offlined and add the unplugged pages to the managed
+* page counters (so offlining code can correctly subtract
+* them again).
+*/
+   pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
+  sb_id * vm->subblock_size);
+   adjust_managed_page_count(pfn_to_page(pfn), nr_pages);
+   for (i = 0; i < nr_pages; i++)
+   page_ref_dec(pfn_to_page(pfn + i));
+   }
+}
+
+static void virtio_mem_notify_cancel_offline(struct virtio_mem *vm,
+unsigned long mb_id)
+{
+   const unsigned long nr_pages = PFN_DOWN(vm->subblock_size);
+   unsigned long pfn;
+   int sb_id, i;
+
+   for (sb_id = 0; sb_id < vm->nb_sb_per_mb; sb_id++) {
+   if (virtio_mem_mb_test_sb_plugged(vm, mb_id, sb_id, 1))
+   continue;
+   /*
+* Get the reference we dropped when going offline and
+* subtract the unplugged pages from the managed page
+* counters.
+*/
+   pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
+  sb_id * vm->subblock_size);
+   adjust_managed_page_count(pfn_to_page(pfn), -nr_pages);
+   for (i = 0; i < nr_pages; i++)
+   page_ref_inc(pfn_to_page(pfn + i));
+   }
+}
+
 /*
  * This callback will either be called synchronously from add_memory() or
  * asynchronously (e.g., triggered via user space). We have to be careful
@@ -611,6 +658,7 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
break;
	mutex_lock(&vm->hotplug_mutex);
vm->hotplug_active = true;
+   virtio_mem_notify_going_offline(vm, mb_id);
break;
case MEM_GOING_ONLINE:
	spin_lock_irq(&vm->removal_lock);
@@ -636,6 +684,12 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
	mutex_unlock(&vm->hotplug_mutex);
break;
case MEM_CANCEL_OFFLINE:
+   if (!vm->hotplug_active)
+   break;
+   virtio_mem_notify_cancel_offline(vm, mb_id);
+   vm->hotplug_active = false;
+		mutex_unlock(&vm->hotplug_mutex);
+   break;
case MEM_CANCEL_ONLINE:
if (!vm->hotplug_active)
break;
@@ -660,8 +714,11 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
struct page *page = pfn_to_page(pfn);
 
__SetPageOffline(page);
-   if (!onlined)
+   if (!onlined) {
SetPageDirty(page);
+   /* FIXME: remove after cleanups */
+   ClearPageReserved(page);
+   }
}
 }
 
@@ -1719,6 +1776,11 @@ static void virtio_mem_remove(struct virtio_device *vdev)
rc = virtio_mem_mb_remove(vm, mb_id);
BUG_ON(rc);
}
+   /*
+* After we unregistered our callbacks, user space can no longer
+* offline partially plugged online memory blocks. No 

[virtio-dev] [PATCH RFC v4 06/13] mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE

2019-12-12 Thread David Hildenbrand
virtio-mem wants to allow offlining memory blocks of which some parts
were unplugged (allocated via alloc_contig_range()), especially to later
offline and remove completely unplugged memory blocks. The important part
is that PageOffline() has to remain set until the section is offline, so
these pages will never get accessed (e.g., when dumping). The pages should
not be handed back to the buddy (which would require clearing PageOffline()
and result in issues if offlining fails and the pages are suddenly in the
buddy).

Let's make that possible by allowing any PageOffline() page to be
isolated when offlining. This way, we can reach the memory hotplug
notifier MEM_GOING_OFFLINE, where the driver can signal that it is fine
with offlining this page by dropping its reference count. PageOffline()
pages with a reference count of 0 can then be skipped when offlining the
pages (as if they were free, although they are not in the buddy).

Anybody who uses PageOffline() pages and does not agree to offline them
(e.g., Hyper-V balloon, XEN balloon, VMWare balloon for 2MB pages) will not
decrement the reference count, making offlining fail when trying to
migrate such an unmovable page. So there should be no observable change.
Same applies to balloon compaction users (movable PageOffline() pages), the
pages will simply be migrated.

Note 1: If offlining fails, a driver has to increment the reference
count again in MEM_CANCEL_OFFLINE.

Note 2: A driver that makes use of this has to be aware that re-onlining
the memory block has to be handled by hooking into onlining code
(online_page_callback_t), resetting the page PageOffline() and
not giving them to the buddy.

Cc: Andrew Morton 
Cc: Juergen Gross 
Cc: Konrad Rzeszutek Wilk 
Cc: Pavel Tatashin 
Cc: Alexander Duyck 
Cc: Vlastimil Babka 
Cc: Johannes Weiner 
Cc: Anthony Yznaga 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Mel Gorman 
Cc: Mike Rapoport 
Cc: Dan Williams 
Cc: Anshuman Khandual 
Cc: Qian Cai 
Cc: Pingfan Liu 
Signed-off-by: David Hildenbrand 
---
 include/linux/page-flags.h | 10 ++
 mm/memory_hotplug.c| 41 --
 mm/page_alloc.c| 24 ++
 mm/page_isolation.c|  9 +
 4 files changed, 74 insertions(+), 10 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1bf83c8fcaa7..ac1775082343 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -761,6 +761,16 @@ PAGE_TYPE_OPS(Buddy, buddy)
  * not onlined when onlining the section).
  * The content of these pages is effectively stale. Such pages should not
  * be touched (read/write/dump/save) except by their owner.
+ *
+ * If a driver wants to allow to offline unmovable PageOffline() pages without
+ * putting them back to the buddy, it can do so via the memory notifier by
+ * decrementing the reference count in MEM_GOING_OFFLINE and incrementing the
+ * reference count in MEM_CANCEL_OFFLINE. When offlining, the PageOffline()
+ * pages (now with a reference count of zero) are treated like free pages,
+ * allowing the containing memory block to get offlined. A driver that
+ * relies on this feature is aware that re-onlining the memory block will
+ * require to re-set the pages PageOffline() and not giving them to the
+ * buddy via online_page_callback_t.
  */
 PAGE_TYPE_OPS(Offline, offline)
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index fc617ad6f035..da01453a04e6 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1231,11 +1231,15 @@ int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn,
 
 /*
  * Scan pfn range [start,end) to find movable/migratable pages (LRU pages,
- * non-lru movable pages and hugepages). We scan pfn because it's much
- * easier than scanning over linked list. This function returns the pfn
- * of the first found movable page if it's found, otherwise 0.
+ * non-lru movable pages and hugepages).
+ *
+ * Returns:
+ * 0 in case a movable page is found and movable_pfn was updated.
+ * -ENOENT in case no movable page was found.
+ * -EBUSY in case a definitely unmovable page was found.
  */
-static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
+static int scan_movable_pages(unsigned long start, unsigned long end,
+ unsigned long *movable_pfn)
 {
unsigned long pfn;
 
@@ -1247,18 +1251,29 @@ static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
continue;
page = pfn_to_page(pfn);
if (PageLRU(page))
-   return pfn;
+   goto found;
if (__PageMovable(page))
-   return pfn;
+   goto found;
+
+   /*
+* Unmovable PageOffline() pages where somebody still holds
+* a reference count (after 

[virtio-dev] [PATCH RFC v4 04/13] mm: Export alloc_contig_range() / free_contig_range()

2019-12-12 Thread David Hildenbrand
A virtio-mem device wants to allocate memory from the memory region it
manages in order to unplug it in the hypervisor - similar to
a balloon driver. Also, it might want to plug previously unplugged
(allocated) memory and give it back to Linux. alloc_contig_range() /
free_contig_range() seem to be the perfect interface for this task.

In contrast to existing balloon devices, a virtio-mem device operates
on bigger chunks (e.g., 4MB) and only on physical memory it manages. It
tracks which chunks (subblocks) are still plugged, so it can go ahead
and try to alloc_contig_range()+unplug them on unplug request, or
plug+free_contig_range() unplugged chunks on plug requests.

A virtio-mem device will use alloc_contig_range() / free_contig_range()
only on ranges that belong to the same node/zone in at least
MAX(MAX_ORDER - 1, pageblock_order) order granularity - e.g., 4MB on
x86-64. The virtio-mem device added that memory, so the memory
exists and does not contain any holes. virtio-mem will only try to allocate
on ZONE_NORMAL, never on ZONE_MOVABLE, just like when allocating
gigantic pages (we don't put unmovable data into the movable zone).
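The plugged-subblock tracking described above can be sketched as a small userspace model (names are made up, this is not the driver): a bitmap records which subblocks of one memory block are plugged. In the kernel, unplug would additionally alloc_contig_range() the subblock, and plug would free_contig_range() it back to Linux.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace model of subblock bookkeeping for one memory block,
 * e.g. a 128 MiB block managed in 4 MiB subblocks (32 bits). */
struct mb_state {
    uint32_t plugged;   /* bit i set => subblock i is plugged */
};

static bool sb_is_plugged(const struct mb_state *mb, unsigned int sb)
{
    return mb->plugged & (1u << sb);
}

/* Unplug request: only subblocks still plugged can be handed back
 * to the hypervisor (alloc_contig_range() + unplug in the kernel). */
static bool sb_unplug(struct mb_state *mb, unsigned int sb)
{
    if (!sb_is_plugged(mb, sb))
        return false;
    mb->plugged &= ~(1u << sb);
    return true;
}

/* Plug request: re-plug a previously unplugged subblock and give the
 * memory back to Linux (plug + free_contig_range() in the kernel). */
static bool sb_plug(struct mb_state *mb, unsigned int sb)
{
    if (sb_is_plugged(mb, sb))
        return false;
    mb->plugged |= 1u << sb;
    return true;
}
```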

Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Oscar Salvador 
Cc: Mel Gorman 
Cc: Mike Rapoport 
Cc: Dan Williams 
Cc: Alexander Duyck 
Cc: Pavel Tatashin 
Cc: Alexander Potapenko 
Acked-by: Michal Hocko  # to export contig range allocator API
Signed-off-by: David Hildenbrand 
---
 mm/page_alloc.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 62dcd6b76c80..5334decc9e06 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8494,6 +8494,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
pfn_max_align_up(end), migratetype);
return ret;
 }
+EXPORT_SYMBOL(alloc_contig_range);
 
 static int __alloc_contig_pages(unsigned long start_pfn,
unsigned long nr_pages, gfp_t gfp_mask)
@@ -8609,6 +8610,7 @@ void free_contig_range(unsigned long pfn, unsigned int nr_pages)
}
WARN(count != 0, "%d pages are still in use!\n", count);
 }
+EXPORT_SYMBOL(free_contig_range);
 
 /*
  * The zone indicated has a new number of managed_pages; batch sizes and percpu
-- 
2.23.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH RFC v4 05/13] virtio-mem: Paravirtualized memory hotunplug part 2

2019-12-12 Thread David Hildenbrand
We can use alloc_contig_range() to try to unplug subblocks. Unplugged
blocks will be marked PG_offline; however, they don't have the PG_reserved
flag set. This way, we can differentiate these allocated subblocks from
subblocks that were never onlined and handle them properly in
virtio_mem_fake_online(). free_contig_range() is used to hand back
subblocks to Linux.

It is worth noting that there are no guarantees on how much memory can
actually get unplugged again. All device memory might be completely
fragmented with unmovable data, such that no subblock can get unplugged.
We might want to improve the unplugging capability in the future.

We are not touching the ZONE_MOVABLE. If memory is onlined to the
ZONE_MOVABLE, it can only get unplugged after that memory was offlined
manually by user space. In normal operation, virtio-mem memory is
suggested to be onlined to ZONE_NORMAL. In the future, we will try to
make unplug more likely to succeed.

Add a module parameter to control if online memory shall be touched.

Future work:
- Performance improvements:
-- Sense (lockless) if it makes sense to try alloc_contig_range() at all
   before directly trying to isolate and taking locks.
-- Try to unplug bigger chunks if possible first.
-- Identify free areas first, that don't have to be evacuated.
- Make unplug more likely to succeed:
-- There are various ideas to limit fragmentation on memory block
   granularity. (e.g., ZONE_PREFER_MOVABLE and smart balancing)
-- Allocate memmap from added memory. This way, less unmovable data can
   end up on the memory blocks.
- OOM handling, e.g., via an OOM handler.
- Defragmentation
-- Will require a new virtio-mem CMD to exchange plugged<->unplugged blocks

Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Igor Mammedov 
Cc: Dave Young 
Cc: Andrew Morton 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Stefan Hajnoczi 
Cc: Vlastimil Babka 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/Kconfig  |   1 +
 drivers/virtio/virtio_mem.c | 156 
 2 files changed, 143 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 294720d53057..75a760f32ec7 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -71,6 +71,7 @@ config VIRTIO_MEM
depends on VIRTIO
depends on MEMORY_HOTPLUG_SPARSE
depends on MEMORY_HOTREMOVE
+   select CONTIG_ALLOC
help
 This driver provides access to virtio-mem paravirtualized memory
 devices, allowing to hotplug and hotunplug memory.
diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index f1af05def5df..5a142a371222 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -21,6 +21,10 @@
 
 #include 
 
+static bool unplug_online = true;
+module_param(unplug_online, bool, 0644);
+MODULE_PARM_DESC(unplug_online, "Try to unplug online memory");
+
 enum virtio_mem_mb_state {
/* Unplugged, not added to Linux. Can be reused later. */
VIRTIO_MEM_MB_STATE_UNUSED = 0,
@@ -646,23 +650,35 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
 }
 
 /*
- * Set a range of pages PG_offline.
+ * Set a range of pages PG_offline. Remember pages that were never onlined
+ * (via generic_online_page()) using PageDirty().
  */
 static void virtio_mem_set_fake_offline(unsigned long pfn,
-   unsigned int nr_pages)
+   unsigned int nr_pages, bool onlined)
 {
-   for (; nr_pages--; pfn++)
-   __SetPageOffline(pfn_to_page(pfn));
+   for (; nr_pages--; pfn++) {
+   struct page *page = pfn_to_page(pfn);
+
+   __SetPageOffline(page);
+   if (!onlined)
+   SetPageDirty(page);
+   }
 }
 
 /*
- * Clear PG_offline from a range of pages.
+ * Clear PG_offline from a range of pages. If the pages were never onlined
+ * (via generic_online_page()), clear PageDirty().
  */
 static void virtio_mem_clear_fake_offline(unsigned long pfn,
- unsigned int nr_pages)
+ unsigned int nr_pages, bool onlined)
 {
-   for (; nr_pages--; pfn++)
-   __ClearPageOffline(pfn_to_page(pfn));
+   for (; nr_pages--; pfn++) {
+   struct page *page = pfn_to_page(pfn);
+
+   __ClearPageOffline(page);
+   if (!onlined)
+   ClearPageDirty(page);
+   }
 }
 
 /*
@@ -681,10 +697,26 @@ static void virtio_mem_fake_online(unsigned long pfn, unsigned int nr_pages)
BUG_ON(!IS_ALIGNED(pfn, 1 << order));
BUG_ON(!IS_ALIGNED(nr_pages, 1 << order));
 
-   virtio_mem_clear_fake_offline(pfn, nr_pages);
+   for (i = 0; i < nr_pages; i += 1 << order) {
+   struct page *page = pfn_to_page(pfn + i);
 
-   for (i = 0; i < nr_pages; i += 1 << order)
- 

[virtio-dev] [PATCH RFC v4 01/13] ACPI: NUMA: export pxm_to_node

2019-12-12 Thread David Hildenbrand
Will be needed by virtio-mem to identify the node from a pxm.

Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: linux-a...@vger.kernel.org
Signed-off-by: David Hildenbrand 
---
 drivers/acpi/numa/srat.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index eadbf90e65d1..d5847fa7ac69 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -35,6 +35,7 @@ int pxm_to_node(int pxm)
return NUMA_NO_NODE;
return pxm_to_node_map[pxm];
 }
+EXPORT_SYMBOL(pxm_to_node);
 
 int node_to_pxm(int node)
 {
-- 
2.23.0





[virtio-dev] [PATCH RFC v4 00/13] virtio-mem: paravirtualized memory

2019-12-12 Thread David Hildenbrand
This series is based on latest linux-next. The patches are located at:
https://github.com/davidhildenbrand/linux.git virtio-mem-rfc-v4

The basic idea of virtio-mem is to provide a flexible,
cross-architecture memory hot(un)plug solution that avoids many limitations
imposed by existing technologies, architectures, and interfaces. More
details can be found below and in linked material.

This RFC is limited to x86-64, however, should theoretically work on any
architecture that supports virtio and implements memory hot(un)plug under
Linux - like s390x, powerpc64 and arm64. On x86-64, it is currently
possible to add/remove memory to/from the system in >= 4MB granularity.
Memory hotplug works very reliably. For memory unplug, there are no
guarantees on how much memory can actually get unplugged; it depends on
the setup (especially: fragmentation of (unmovable) memory). I have
plans to improve that in the future.

--
1. virtio-mem
--

The basic idea behind virtio-mem was presented at KVM Forum 2018. The
slides can be found at [1]. The previous RFC can be found at [2]. The
first RFC can be found at [3]. However, the concept evolved over time. The
KVM Forum slides roughly match the current design.

Patch #2 ("virtio-mem: Paravirtualized memory hotplug") contains quite some
information, especially in "include/uapi/linux/virtio_mem.h":

  Each virtio-mem device manages a dedicated region in physical address
  space. Each device can belong to a single NUMA node, multiple devices
  for a single NUMA node are possible. A virtio-mem device is like a
  "resizable DIMM" consisting of small memory blocks that can be plugged
  or unplugged. The device driver is responsible for (un)plugging memory
  blocks on demand.

  Virtio-mem devices can only operate on their assigned memory region in
  order to (un)plug memory. A device cannot (un)plug memory belonging to
  other devices.

  The "region_size" corresponds to the maximum amount of memory that can
  be provided by a device. The "size" corresponds to the amount of memory
  that is currently plugged. "requested_size" corresponds to a request
  from the device to the device driver to (un)plug blocks. The
  device driver should try to (un)plug blocks in order to reach the
  "requested_size". It is impossible to plug more memory than requested.

  The "usable_region_size" represents the memory region that can actually
  be used to (un)plug memory. It is always at least as big as the
  "requested_size" and will grow dynamically. It will only shrink when
  explicitly triggered (VIRTIO_MEM_REQ_UNPLUG).

  Memory in the usable region can usually be read, however, there are no
  guarantees. It can happen that the device cannot process a request,
  because it is busy. The device driver has to retry later.

  Usually, during system resets all memory will get unplugged, so the
  device driver can start with a clean state. However, in specific
  scenarios (if the device is busy) it can happen that the device still
  has memory plugged. The device driver can request to unplug all memory
  (VIRTIO_MEM_REQ_UNPLUG) - which might take a while to succeed if the
  device is busy.
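The size/requested_size semantics quoted above can be sketched as a toy model (field names are illustrative, not the spec's config layout): the driver moves "size" toward "requested_size" one block at a time, and it is impossible to plug more memory than requested.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the resize request semantics, not the real config space. */
struct vmem_model {
    uint64_t region_size;     /* maximum the device can provide */
    uint64_t size;            /* currently plugged */
    uint64_t requested_size;  /* target requested by the device */
};

/* One (un)plug step toward requested_size; returns the signed change
 * in plugged bytes (0 when the target has been reached). */
static int64_t vmem_step(struct vmem_model *d, uint64_t block)
{
    if (d->size + block <= d->requested_size &&
        d->size + block <= d->region_size) {
        d->size += block;               /* plug one block */
        return (int64_t)block;
    }
    if (d->size >= d->requested_size + block) {
        d->size -= block;               /* unplug one block */
        return -(int64_t)block;
    }
    return 0;                           /* at target: plugging more is impossible */
}
```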

--
2. Linux Implementation
--

This RFC reuses quite some existing MM infrastructure; however, it has
to expose some additional functionality.

Memory blocks (e.g., 128MB) are added/removed on demand. Within these
memory blocks, subblocks (e.g., 4MB) are plugged/unplugged. The sizes
depend on the target architecture, MAX_ORDER + pageblock_order, and
the block size of a virtio-mem device.

add_memory()/try_remove_memory() is used to add/remove memory blocks.
virtio-mem will not online memory blocks itself. This has to be done by
user space, or configured into the kernel
(CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE). virtio-mem will only unplug memory
that was onlined to the ZONE_NORMAL. Memory is suggested to be onlined to
the ZONE_NORMAL for now.

The memory hotplug notifier is used to properly synchronize against
onlining/offlining of memory blocks and to track the states of memory
blocks (including the zone memory blocks are onlined to).

The set_online_page() callback is used to keep unplugged subblocks
of a memory block fake-offline when onlining the memory block.
generic_online_page() is used to fake-online plugged subblocks. This
handling is similar to the Hyper-V balloon driver.

PG_offline is used to mark unplugged subblocks as offline, so e.g.,
dumping tools (makedumpfile) will skip these pages. This is similar to
other balloon drivers like virtio-balloon and Hyper-V.

Memory offlining code is extended to allow drivers to drop their reference
to PG_offline pages when MEM_GOING_OFFLINE, so these pages can be skipped
when offlining memory 

Re: [virtio-dev] Re: guest / host buffer sharing ...

2019-12-12 Thread Gerd Hoffmann
On Thu, Dec 12, 2019 at 09:26:32PM +0900, David Stevens wrote:
> > > > Second I think it is a bad idea
> > > > from the security point of view.  When explicitly exporting buffers it
> > > > is easy to restrict access to the actual exports.
> > >
> > > Restricting access to actual exports could perhaps help catch bugs.
> > > However, I don't think it provides any security guarantees, since the
> > > guest can always just export every buffer before using it.
> >
> > Probably not on the guest/host boundary.
> >
> > It's important for security inside the guest though.  You don't want
> > process A being able to access process B private resources via buffer
> > sharing support, by guessing implicit buffer identifiers.
> 
> At least for the linux guest implementation, I wouldn't think the
> uuids would be exposed from the kernel. To me, it seems like something
> that should be handled internally by the virtio drivers.

That would be one possible use case, yes.  The exporting driver attaches
a uuid to the dma-buf.  The importing driver can see the attached uuid
and use it (if supported, otherwise run with the scatter list).  That
will be transparent to userspace: apps will just export/import dma-bufs
as usual and not even notice the uuid.

I can see other valid use cases though:  A wayland proxy could use
virtio-gpu buffer exports for shared memory and send the buffer uuid
to the host over some stream protocol (vsock, tcp, ...).  For that to
work we have to export the uuid to userspace, for example using an ioctl
on the dma-buf file handle.

> If you use some other guest with untrusted
> userspace drivers, or if you're pulling the uuids out of the kernel to
> give to some non-virtio transport, then I can see it being a concern.

I strongly prefer a design where we don't have to worry about that
concern in the first place instead of discussing whether we should be
worried or not.

> > Without buffer sharing support the driver importing a virtio-gpu dma-buf
> > can send the buffer scatter list to the host.  So both virtio-gpu and
> > the other device would actually access the same guest pages, but they
> > are not aware that the buffer is shared between devices.
> 
> With the uuid approach, how should this case be handled? Should it be
> equivalent to exporting and importing the buffer which was created
> first? Should the spec say it's undefined behavior that might work as
> expected but might not, depending on the device implementation? Does
> the spec even need to say anything about it?

Using the uuid is an optional optimization.  I'd expect the workflow to be
roughly this:

  (1) exporting driver exports a dma-buf as usual, additionally attaches
  a uuid to it and notifies the host (using device-specific commands).
  (2) importing driver will ask the host to use the buffer referenced by
  the given uuid.
  (3) if (2) fails for some reason use the dma-buf scatter list instead.

Of course only virtio drivers would try step (2), other drivers (when
sharing buffers between intel gvt device and virtio-gpu for example)
would go straight to (3).
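A minimal sketch of that fallback flow (hypothetical names; a real driver would use device-specific commands, and the "host" table stands in for whatever the VMM keeps):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_EXPORTS 8

/* Host-side table of uuids it has been notified about. */
static const char *host_exports[MAX_EXPORTS];
static int nr_exports;

/* (1) exporting driver attaches a uuid and notifies the host. */
static void export_buffer(const char *uuid)
{
    host_exports[nr_exports++] = uuid;
}

static bool host_knows(const char *uuid)
{
    for (int i = 0; i < nr_exports; i++)
        if (strcmp(host_exports[i], uuid) == 0)
            return true;
    return false;
}

enum import_path { IMPORT_BY_UUID, IMPORT_BY_SGLIST };

/* (2) importer asks the host to use the buffer by uuid; (3) on failure
 * it falls back to sending the dma-buf scatter list.  Non-virtio
 * exporters attach no uuid and go straight to the scatter list. */
static enum import_path import_buffer(const char *uuid)
{
    if (uuid != NULL && host_knows(uuid))
        return IMPORT_BY_UUID;
    return IMPORT_BY_SGLIST;
}
```

Either path ends up with the same guest pages being accessed; the uuid path just lets the host look the buffer up directly instead of re-mapping the scatter list.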

> > With buffer sharing virtio-gpu would attach a uuid to the dma-buf, and
> > the importing driver can send the uuid (instead of the scatter list) to
> > the host.  So the device can simply lookup the buffer on the host side
> > and use it directly.  Another advantage is that this enables some more
> > use cases like sharing buffers between devices which are not backed by
> > guest ram.
> 
> Not just buffers not backed by guest ram, but things like fences. I
> would suggest the uuids represent 'exported resources' rather than
> 'exported buffers'.

Hmm, I can't see how this is useful.  Care to outline how you envision
this to work in a typical use case?

cheers,
  Gerd





Re: [virtio-dev] Re: guest / host buffer sharing ...

2019-12-12 Thread David Stevens
> > > Second I think it is a bad idea
> > > from the security point of view.  When explicitly exporting buffers it
> > > is easy to restrict access to the actual exports.
> >
> > Restricting access to actual exports could perhaps help catch bugs.
> > However, I don't think it provides any security guarantees, since the
> > guest can always just export every buffer before using it.
>
> Probably not on the guest/host boundary.
>
> It's important for security inside the guest though.  You don't want
> process A being able to access process B private resources via buffer
> sharing support, by guessing implicit buffer identifiers.

At least for the linux guest implementation, I wouldn't think the
uuids would be exposed from the kernel. To me, it seems like something
that should be handled internally by the virtio drivers. Especially
since the 'export' process would be very much a virtio-specific
action, so it's likely that it wouldn't fit nicely into existing
userspace software. If you use some other guest with untrusted
userspace drivers, or if you're pulling the uuids out of the kernel to
give to some non-virtio transport, then I can see it being a concern.

> > > Instead of using a dedicated buffer sharing device we can also use
> > > virtio-gpu (or any other driver which supports dma-buf exports) to
> > > manage buffers.

Ah, okay. I misunderstood the original statement. I read the sentence
as 'we can use virtio-gpu in place of the dedicated buffer sharing
device', rather than 'every device can manage its own buffers'. I can
agree with the second meaning.

> Without buffer sharing support the driver importing a virtio-gpu dma-buf
> can send the buffer scatter list to the host.  So both virtio-gpu and
> the other device would actually access the same guest pages, but they
> are not aware that the buffer is shared between devices.

With the uuid approach, how should this case be handled? Should it be
equivalent to exporting and importing the buffer which was created
first? Should the spec say it's undefined behavior that might work as
expected but might not, depending on the device implementation? Does
the spec even need to say anything about it?

> With buffer sharing virtio-gpu would attach a uuid to the dma-buf, and
> the importing driver can send the uuid (instead of the scatter list) to
> the host.  So the device can simply lookup the buffer on the host side
> and use it directly.  Another advantage is that this enables some more
> use cases like sharing buffers between devices which are not backed by
> guest ram.

Not just buffers not backed by guest ram, but things like fences. I
would suggest the uuids represent 'exported resources' rather than
'exported buffers'.

> Well, security-wise you want to have buffer identifiers which can't be
> easily guessed.  And guessing a uuid is pretty much impossible due to
> the namespace being huge.

I guess this depends on what you're passing around within the guest.
If you're passing around the raw uuids, sure. But I would argue it's
better to pass around unforgeable identifiers (e.g. fds), and to
restrict the uuids to when talking directly to the virtio transport.
But I guess there are likely situations where that's not possible.

-David




Re: [virtio-dev] [RFC RESEND] virtio-video: Add virtio video device specification

2019-12-12 Thread Dmitry Sepp
Hi Keiichi,

Thank you for your comment.
What do you think about selection/crop rectangles? Should this be defined as a
must-have? Or do we just assume the device always sticks to the stream
resolution?

Regards,
Dmitry.

On Donnerstag, 12. Dezember 2019 06:39:11 CET Keiichi Watanabe wrote:
> On Tue, Dec 10, 2019 at 10:16 PM Dmitry Sepp
> 
>  wrote:
> > Hi,
> > 
> > Just to start, let's consider this v4l2 control:
> > V4L2_CID_MPEG_VIDEO_FRAME_RC_ENABLE.
> > As I can see, this control is referenced as a mandatory one in the
> > Chromium
> > sources [1].
> > 
> > So could someone from the Chromium team please explain why it is
> > mandatory?
> > (YouTube?) In fact, almost no encoders implement this control. Do we need
> > it in virtio-video?
> 
> That control is used to encode videos in constant bitrate (CBR) mode,
> which is critical for
> real-time use cases like video conferencing.
> 
> However, that Chromium source code just requires *host-side* drivers
> to have the v4l2
> control. Also, from our experience, I guess it's rare that a guest app
> wants to use CQP instead of CBR.
> So, I suppose we can omit this control in the virtio spec for now by
> assuming that guests always use CBR mode.
> 
> Best regards,
> Keiichi
> 
> > [1]
> > https://chromium.googlesource.com/chromium/src/media/+/refs/heads/master/
> > gpu/v4l2/v4l2_video_encode_accelerator.cc#1500
> > 
> > Regards,
> > Dmitry.
> > 
> > On Montag, 9. Dezember 2019 22:12:28 CET Enrico Granata wrote:
> > > +Changyeon Jo  for his awareness
> > > 
> > > Thanks,
> > > - Enrico
> > > 
> > > 
> > > On Mon, Dec 9, 2019 at 6:20 AM Dmitry Sepp 
> > > 
> > > wrote:
> > > > Hello,
> > > > 
> > > > I'd like to invite everyone to share ideas regarding required encoder
> > > > features
> > > > in this separate sub-tree.
> > > > 
> > > > In general, encoder devices are more complex compared to decoders. So
> > > > the
> > > > question I'd like to raise is in what way we define the minimal subset
> > > > of
> > > > features to be implemented by the virtio-video.
> > > > 
> > > > We may look at the following to define the set of features:
> > > > 1. USB video, 2.3.6 Encoding Unit [1].
> > > > 2. Android 10 Compatibility Definition [2].
> > > > 
> > > > Would be nice to hear about any specific requirements from the
> > > > Chromium
> > > > team as
> > > > well.
> > > > 
> > > > [1] https://www.usb.org/sites/default/files/USB_Video_Class_1_5.zip
> > > > [2]
> > > > https://source.android.com/compatibility/android-cdd#5_2_video_encodin
> > > > g
> > > > 
> > > > Thank you.
> > > > 
> > > > Best regards,
> > > > Dmitry.
> > > > 
> > > > On Mittwoch, 4. Dezember 2019 10:16:20 CET Gerd Hoffmann wrote:
> > > > >   Hi,
> > > > >   
> > > > > > 1. Focus on only decoder/encoder functionalities first.
> > > > > > 
> > > > > > As Tomasz said earlier in this thread, it'd be too complicated to
> > > > 
> > > > support
> > > > 
> > > > > > camera usage at the same time. So, I'd suggest to make it just a
> > > > 
> > > > generic
> > > > 
> > > > > > mem-to-mem video processing device protocol for now.
> > > > > > If we finally decide to support camera in this protocol, we can
> > > > > > add it
> > > > > > later.
> > > > > 
> > > > > Agree.
> > > > > 
> > > > > > 2. Only one feature bit can be specified for one device.
> > > > > > 
> > > > > > I'd like to have a decoder device and encoder device separately.
> > > > > > It'd be natural to assume it because a decoder and an encoder are
> > > > 
> > > > provided
> > > > 
> > > > > > as different hardware.
> > > > > 
> > > > > Hmm, modern GPUs support both encoding and decoding ...
> > > > > 
> > > > > I don't think we should bake that restriction into the
> > > > > specification.
> > > > > It probably makes sense to use one virtqueue per function though,
> > > > > that
> > > > > will simplify dispatching in both host and guest.
> > > > > 
> > > > > > 3. Separate buffer allocation functionalities from virtio-video
> > > > 
> > > > protocol.
> > > > 
> > > > > > To support various ways of guest/host buffer sharing, we might
> > > > > > want to
> > > > > > have a dedicated buffer sharing device as we're discussing in
> > > > > > another
> > > > > > thread. Until we reach consensus there, it'd be good not to have
> > > > > > buffer
> > > > > > allocation
> > > > > > functionalities in virtio-video.
> > > > > 
> > > > > I think virtio-video should be able to work as stand-alone device,
> > > > > so we need some way to allocate buffers ...
> > > > > 
> > > > > Buffer sharing with other devices can be added later.
> > > > > 
> > > > > > > +The virtio video device is a virtual video streaming device
> > > > > > > that
> > > > > > > supports the +following functions: encoder, decoder, capture,
> > > > > > > output.
> > > > > > > +
> > > > > > > +\subsection{Device ID}\label{sec:Device Types / Video Device /
> > > > 
> > > > Device
> > > > 
> > > > > > > ID}
> > > > > > > +
> > > > > > > +TBD.
> > > > > > 
> > > > > > I'm wondering how and when 

Re: [virtio-dev] Re: guest / host buffer sharing ...

2019-12-12 Thread Gerd Hoffmann
  Hi,

> > First the addressing is non-trivial, especially with the "transport
> > specific device address" in the tuple.
> 
> There is complexity here, but I think it would also be present in the
> buffer sharing device case. With a buffer sharing device, the same
> identifying information would need to be provided from the exporting
> driver to the buffer sharing driver, so the buffer sharing device
> would be able to identify the right device in the vmm.

No.  The idea is that the buffer sharing device will allocate and manage
the buffers (including identifiers), i.e. it will only export buffers,
never import.

> > Second I think it is a bad idea
> > from the security point of view.  When explicitly exporting buffers it
> > is easy to restrict access to the actual exports.
> 
> Restricting access to actual exports could perhaps help catch bugs.
> However, I don't think it provides any security guarantees, since the
> guest can always just export every buffer before using it.

Probably not on the guest/host boundary.

It's important for security inside the guest though.  You don't want
process A being able to access process B private resources via buffer
sharing support, by guessing implicit buffer identifiers.

With explicit buffer exports that opportunity doesn't exist in the first
place.  Anything not exported can't be accessed via buffer sharing,
period.  And to access the exported buffers you need to know the uuid,
which in turn allows the guest implement any access restrictions it
wants.

> > Instead of using a dedicated buffer sharing device we can also use
> > virtio-gpu (or any other driver which supports dma-buf exports) to
> > manage buffers.
> 
> I don't think adding generic buffer management to virtio-gpu (or any
> specific device type) is a good idea,

There isn't much to add btw.  virtio-gpu has buffer management, buffers
are called "resources" in virtio-gpu terminology.  You can already
export them as dma-bufs (just landed in 5.5-rc1) and import them into
other drivers.

Without buffer sharing support the driver importing a virtio-gpu dma-buf
can send the buffer scatter list to the host.  So both virtio-gpu and
the other device would actually access the same guest pages, but they
are not aware that the buffer is shared between devices.

With buffer sharing virtio-gpu would attach a uuid to the dma-buf, and
the importing driver can send the uuid (instead of the scatter list) to
the host.  So the device can simply lookup the buffer on the host side
and use it directly.  Another advantage is that this enables some more
use cases like sharing buffers between devices which are not backed by
guest ram.

> since that device would then
> become a requirement for buffer sharing between unrelated devices.

No.  When we drop the buffer sharing device idea (which is quite
likely), then any device can create buffers.  If virtio-gpu is involved
anyway, for example because you want show the images from the
virtio-camera device on the virtio-gpu display, it makes sense to use
virtio-gpu of course.  But any other device can create and export
buffers in a similar way.  Without a buffer sharing device there is no
central instance managing the buffers.  A virtio-video spec (video
encoder/decoder) is in discussion at the moment, it will probably get
resource management similar to virtio-gpu for the video frames, and it
will be able to export/import those buffers (probably not in the first
revision, but it is on the radar).

> > With no central instance (buffer sharing device) being there managing
> > the buffer identifiers I think using uuids as identifiers would be a
> > good idea, to avoid clashes.  Also good for security because it's pretty
> > much impossible to guess buffer identifiers then.
> 
> Using uuids to identify buffers would work. The fact that it provides
> a single way to refer to both guest and host allocated buffers is
> nice. And it could also directly apply to sharing resources other than
> buffers (e.g. fences). Although unless we're positing that there are
> different levels of trust within the guest, I don't think uuids really
> provide much security.

Well, security-wise you want to have buffer identifiers which can't be
easily guessed.  And guessing a uuid is pretty much impossible due to
the namespace being huge.

> If we're talking about uuids, they could also be used to simplify my
> proposed implicit addressing scheme. Each device could be assigned a
> uuid, which would simplify the shared resource identifier to
> (device-uuid, shmid, offset).

See above for the security aspects of implicit vs. explicit buffer
identifiers.

cheers,
  Gerd

