Re: [Xen-devel] [PATCH v2] xen/grant-table: correctly initialize grant table version 1
Thanks so much for posting this. On 2013-1-6 19:14, Matt Wilson wrote: Commit 85ff6acb075a484780b3d763fdf41596d8fc0970 (xen/granttable: Grant tables V2 implementation) changed the GREFS_PER_GRANT_FRAME macro from a constant to a conditional expression. The expression depends on grant_table_version being appropriately set. Unfortunately, at init time grant_table_version will be 0. The GREFS_PER_GRANT_FRAME conditional expression checks for "grant_table_version == 1", and therefore returns the number of grant references per frame for v2. This causes gnttab_init() to allocate fewer pages for gnttab_list, as a frame can old half the number of v2 entries than v1 entries. After gnttab_resume() is called, grant_table_version is appropriately set. nr_init_grefs will then be miscalculated and gnttab_free_count will hold a value larger than the actual number of free gref entries. If a guest is heavily utilizing improperly initialized v1 grant tables, memory corruption can occur. One common manifestation is corruption of the vmalloc list, resulting in a poisoned pointer derefrence when accessing /proc/meminfo or /proc/vmallocinfo: [ 40.770064] BUG: unable to handle kernel paging request at 20021407 [ 40.770083] IP: [] get_vmalloc_info+0x70/0x110 [ 40.770102] PGD 0 [ 40.770107] Oops: [#1] SMP [ 40.770114] CPU 10 This patch introduces a static variable, grefs_per_grant_frame, to cache the calculated value. gnttab_init() now calls gnttab_request_version() early so that grant_table_version and grefs_per_grant_frame can be appropriately set. A few BUG_ON()s have been added to prevent this type of bug from reoccurring in the future. Signed-off-by: Matt Wilson Reviewed-and-Tested-by: Steven Noonan Cc: Ian Campbell Cc: Konrad Rzeszutek Wilk Cc: Annie Li Cc: xen-de...@lists.xen.org Cc: linux-kernel@vger.kernel.org Cc: sta...@vger.kernel.org # v3.3 and newer --- Changes since v1: * introduced a new gnttab_setup() function and moved all of the initialization code from gnttab_resume() there. --- drivers/xen/grant-table.c | 52 ++-- 1 files changed, 31 insertions(+), 21 deletions(-) diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c index 043bf07..53715de 100644 --- a/drivers/xen/grant-table.c +++ b/drivers/xen/grant-table.c @@ -55,10 +55,6 @@ /* External tools reserve first few grant table entries. */ #define NR_RESERVED_ENTRIES 8 #define GNTTAB_LIST_END 0x -#define GREFS_PER_GRANT_FRAME \ -(grant_table_version == 1 ? \ -(PAGE_SIZE / sizeof(struct grant_entry_v1)) : \ -(PAGE_SIZE / sizeof(union grant_entry_v2))) static grant_ref_t **gnttab_list; static unsigned int nr_grant_frames; @@ -153,6 +149,7 @@ static struct gnttab_ops *gnttab_interface; static grant_status_t *grstatus; static int grant_table_version; +static int grefs_per_grant_frame; static struct gnttab_free_callback *gnttab_free_callback_list; @@ -766,12 +763,14 @@ static int grow_gnttab_list(unsigned int more_frames) unsigned int new_nr_grant_frames, extra_entries, i; unsigned int nr_glist_frames, new_nr_glist_frames; + BUG_ON(grefs_per_grant_frame == 0); + new_nr_grant_frames = nr_grant_frames + more_frames; - extra_entries = more_frames * GREFS_PER_GRANT_FRAME; + extra_entries = more_frames * grefs_per_grant_frame; - nr_glist_frames = (nr_grant_frames * GREFS_PER_GRANT_FRAME + RPP - 1) / RPP; + nr_glist_frames = (nr_grant_frames * grefs_per_grant_frame + RPP - 1) / RPP; new_nr_glist_frames = - (new_nr_grant_frames * GREFS_PER_GRANT_FRAME + RPP - 1) / RPP; + (new_nr_grant_frames * grefs_per_grant_frame + RPP - 1) / RPP; for (i = nr_glist_frames; i< new_nr_glist_frames; i++) { gnttab_list[i] = (grant_ref_t *)__get_free_page(GFP_ATOMIC); if (!gnttab_list[i]) @@ -779,12 +778,12 @@ static int grow_gnttab_list(unsigned int more_frames) } - for (i = GREFS_PER_GRANT_FRAME * nr_grant_frames; -i< GREFS_PER_GRANT_FRAME * new_nr_grant_frames - 1; i++) + for (i = grefs_per_grant_frame * nr_grant_frames; +i< grefs_per_grant_frame * new_nr_grant_frames - 1; i++) gnttab_entry(i) = i + 1; gnttab_entry(i) = gnttab_free_head; - gnttab_free_head = GREFS_PER_GRANT_FRAME * nr_grant_frames; + gnttab_free_head = grefs_per_grant_frame * nr_grant_frames; gnttab_free_count += extra_entries; nr_grant_frames = new_nr_grant_frames; @@ -904,7 +903,8 @@ EXPORT_SYMBOL_GPL(gnttab_unmap_refs); static unsigned nr_status_frames(unsigned nr_grant_frames) { - return (nr_grant_frames * GREFS_PER_GRANT_FRAME + SPP - 1) / SPP; + BUG_ON(grefs_per_grant_frame == 0); + return (nr_grant_frames * grefs_per_grant_frame + SPP - 1) / SPP; } static int gnttab_map_frames_v
Re: [Xen-devel] [PATCH v2] xen/grant-table: correctly initialize grant table version 1
On 2013-1-9 20:02, Ian Campbell wrote: On Wed, 2013-01-09 at 02:40 +, ANNIE LI wrote: @@ -1080,18 +1081,18 @@ static void gnttab_request_version(void) panic("we need grant tables version 2, but only version 1 is available"); } else { grant_table_version = 1; + grefs_per_grant_frame = PAGE_SIZE / sizeof(struct grant_entry_v1); gnttab_interface =&gnttab_v1_ops; } - printk(KERN_INFO "Grant tables using version %d layout.\n", - grant_table_version); } Is it better to keep printk here? In your last patch, you removed it because gnttab_request_version and gnttab_resume are all called in gnttab_init. and gnttab_resume also contains calling of gnttab_request_version. But in this patch, gnttab_setup is used, and does not have this issue now. Yes, I think we want to print this at both start of day and resume? Right. Either by adding a print to gnttab_resume() Only adding a print into gnttab_resume() would miss this print at start of day. In gnttab_init, gnttab_request_version and gnttab_setup are called, not gnttab_resume. or by keeping the existing one here in preference to moving it to gnttab_setup(). I'd prefer the latter to avoid the duplication, unless I'm mistaken and request_version is called in more than those two locations. Yes, I'd like the latter. Request_version is only called in those two locations. Thanks Annie Ian. ___ Xen-devel mailing list xen-de...@lists.xen.org http://lists.xen.org/xen-devel -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Xen-devel] [PATCH v2] xen/grant-table: correctly initialize grant table version 1
On 2013-1-10 17:16, Matt Wilson wrote: On Wed, Jan 09, 2013 at 12:02:09PM +, Ian Campbell wrote: On Wed, 2013-01-09 at 02:40 +, ANNIE LI wrote: @@ -1080,18 +1081,18 @@ static void gnttab_request_version(void) panic("we need grant tables version 2, but only version 1 is available"); } else { grant_table_version = 1; + grefs_per_grant_frame = PAGE_SIZE / sizeof(struct grant_entry_v1); gnttab_interface =&gnttab_v1_ops; } - printk(KERN_INFO "Grant tables using version %d layout.\n", - grant_table_version); } Is it better to keep printk here? In your last patch, you removed it because gnttab_request_version and gnttab_resume are all called in gnttab_init. and gnttab_resume also contains calling of gnttab_request_version. But in this patch, gnttab_setup is used, and does not have this issue now. Yes, I think we want to print this at both start of day and resume? Either by adding a print to gnttab_resume() or by keeping the existing one here in preference to moving it to gnttab_setup(). I'd prefer the latter to avoid the duplication, unless I'm mistaken and request_version is called in more than those two locations. In this version the printk() is in gnttab_setup(), which is called both at start of day (gnttab_init()) and resume (gnttab_resume()). I can move it back to gnttab_request_version() if you'd like, but this patch should be behaving as expected already. I was thinking to change less from original source code, this patch moves printk without specific purpose and shows same behavior as original code. It is OK if you'd like to keep it. Thanks Annie Matt ___ Xen-devel mailing list xen-de...@lists.xen.org http://lists.xen.org/xen-devel -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/1] xen-blkback: stop blkback thread of every queue in xen_blkif_disconnect
If there is inflight I/O in any non-last queue, blkback returns -EBUSY directly, and never stops thread of remaining queue and processs them. When removing vbd device with lots of disk I/O load, some queues with inflight I/O still have blkback thread running even though the corresponding vbd device or guest is gone. And this could cause some problems, for example, if the backend device type is file, some loop devices and blkback thread always lingers there forever after guest is destroyed, and this causes failure of umounting repositories unless rebooting the dom0. So stop all threads properly and return -EBUSY if any queue has inflight I/O. Signed-off-by: Annie Li Reviewed-by: Herbert van den Bergh Reviewed-by: Bhavesh Davda Reviewed-by: Adnan Misherfi --- drivers/block/xen-blkback/xenbus.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c index 792da68..2adb859 100644 --- a/drivers/block/xen-blkback/xenbus.c +++ b/drivers/block/xen-blkback/xenbus.c @@ -244,6 +244,7 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif) { struct pending_req *req, *n; unsigned int j, r; + bool busy = false; for (r = 0; r < blkif->nr_rings; r++) { struct xen_blkif_ring *ring = &blkif->rings[r]; @@ -261,8 +262,10 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif) * don't have any discard_io or other_io requests. So, checking * for inflight IO is enough. */ - if (atomic_read(&ring->inflight) > 0) - return -EBUSY; + if (atomic_read(&ring->inflight) > 0) { + busy = true; + continue; + } if (ring->irq) { unbind_from_irqhandler(ring->irq, ring); @@ -300,6 +303,9 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif) WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages)); ring->active = false; } + if (busy) + return -EBUSY; + blkif->nr_ring_pages = 0; /* * blkif->rings was allocated in connect_ring, so we should free it in -- 1.9.3
Re: [Xen-devel] [PATCH net-next v2 0/9] xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy
On 2013/12/13 7:48, Zoltan Kiss wrote: A long known problem of the upstream netback implementation that on the TX path (from guest to Dom0) it copies the whole packet from guest memory into Dom0. That simply became a bottleneck with 10Gb NICs, and generally it's a huge perfomance penalty. The classic kernel version of netback used grant mapping, and to get notified when the page can be unmapped, it used page destructors. Unfortunately that destructor is not an upstreamable solution. Ian Campbell's skb fragment destructor patch series [1] tried to solve this problem, however it seems to be very invasive on the network stack's code, and therefore haven't progressed very well. This patch series use SKBTX_DEV_ZEROCOPY flags to tell the stack it needs to know when the skb is freed up. That is the way KVM solved the same problem, and based on my initial tests it can do the same for us. Avoiding the extra copy boosted up TX throughput from 6.8 Gbps to 7.9 (I used a slower Interlagos box, both Dom0 and guest on upstream kernel, on the same NUMA node, running iperf 2.0.5, and the remote end was a bare metal box on the same 10Gb switch) Sounds good. Is the TX throughput gotten between one vm and one bare metal box? or between multiple vms and bare metal? Do you have any test results with netperf? Thanks Annie Based on my investigations the packet get only copied if it is delivered to Dom0 stack, which is due to this [2] patch. That's a bit unfortunate, but luckily it doesn't cause a major regression for this usecase. In the future we should try to eliminate that copy somehow. There are a few spinoff tasks which will be addressed in separate patches: - grant copy the header directly instead of map and memcpy. This should help us avoiding TLB flushing - use something else than ballooned pages - fix grant map to use page->index properly I will run some more extensive tests, but some basic XenRT tests were already passed with good results. I've tried to broke it down to smaller patches, with mixed results, so I welcome suggestions on that part as well: 1: Introduce TX grant map definitions 2: Change TX path from grant copy to mapping 3: Remove old TX grant copy definitons and fix indentations 4: Change RX path for mapped SKB fragments 5: Add stat counters for zerocopy 6: Handle guests with too many frags 7: Add stat counters for frag_list skbs 8: Timeout packets in RX path 9: Aggregate TX unmap operations v2: I've fixed some smaller things, see the individual patches. I've added a few new stat counters, and handling the important use case when an older guest sends lots of slots. Instead of delayed copy now we timeout packets on the RX path, based on the assumption that otherwise packets should get stucked anywhere else. Finally some unmap batching to avoid too much TLB flush [1] http://lwn.net/Articles/491522/ [2] https://lkml.org/lkml/2012/7/20/363 Signed-off-by: Zoltan Kiss ___ Xen-devel mailing list xen-de...@lists.xen.org http://lists.xen.org/xen-devel -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 1/1] xen-blkback: stop blkback thread of every queue in xen_blkif_disconnect
In xen_blkif_disconnect, before checking inflight I/O, following code stops the blkback thread, if (ring->xenblkd) { kthread_stop(ring->xenblkd); wake_up(&ring->shutdown_wq); } If there is inflight I/O in any non-last queue, blkback returns -EBUSY directly, and above code would not be called to stop thread of remaining queue and processs them. When removing vbd device with lots of disk I/O load, some queues with inflight I/O still have blkback thread running even though the corresponding vbd device or guest is gone. And this could cause some problems, for example, if the backend device type is file, some loop devices and blkback thread always lingers there forever after guest is destroyed, and this causes failure of umounting repositories unless rebooting the dom0. This patch allows thread of every queue has the chance to get stopped. Otherwise, only thread of queue previous to(including) first busy one get stopped, blkthread of remaining queue will still run. So stop all threads properly and return -EBUSY if any queue has inflight I/O. Signed-off-by: Annie Li Reviewed-by: Herbert van den Bergh Reviewed-by: Bhavesh Davda Reviewed-by: Adnan Misherfi Acked-by: Roger Pau Monné --- v2: enhance patch description and add Acked-by --- drivers/block/xen-blkback/xenbus.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c index 792da68..2adb859 100644 --- a/drivers/block/xen-blkback/xenbus.c +++ b/drivers/block/xen-blkback/xenbus.c @@ -244,6 +244,7 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif) { struct pending_req *req, *n; unsigned int j, r; + bool busy = false; for (r = 0; r < blkif->nr_rings; r++) { struct xen_blkif_ring *ring = &blkif->rings[r]; @@ -261,8 +262,10 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif) * don't have any discard_io or other_io requests. So, checking * for inflight IO is enough. */ - if (atomic_read(&ring->inflight) > 0) - return -EBUSY; + if (atomic_read(&ring->inflight) > 0) { + busy = true; + continue; + } if (ring->irq) { unbind_from_irqhandler(ring->irq, ring); @@ -300,6 +303,9 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif) WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages)); ring->active = false; } + if (busy) + return -EBUSY; + blkif->nr_ring_pages = 0; /* * blkif->rings was allocated in connect_ring, so we should free it in -- 1.9.3
Re: [Xen-devel] [PATCH 1/1] xen-blkback: stop blkback thread of every queue in xen_blkif_disconnect
On 8/23/2017 5:20 PM, Konrad Rzeszutek Wilk wrote: ..snip.. Acked-by: Roger Pau Monné Forgot to add, this needs to be backported to stable branches, so: Annie, could you resend the patch with the tags and an update to the description to me please? Done Thanks Annie Cc: sta...@vger.kernel.org Roger. ___ Xen-devel mailing list xen-de...@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 1/1] xen-blkback: stop blkback thread of every queue in xen_blkif_disconnect
On 8/18/2017 5:14 AM, Roger Pau Monné wrote: On Thu, Aug 17, 2017 at 06:43:46PM -0400, Annie Li wrote: If there is inflight I/O in any non-last queue, blkback returns -EBUSY directly, and never stops thread of remaining queue and processs them. When removing vbd device with lots of disk I/O load, some queues with inflight I/O still have blkback thread running even though the corresponding vbd device or guest is gone. And this could cause some problems, for example, if the backend device type is file, some loop devices and blkback thread always lingers there forever after guest is destroyed, and this causes failure of umounting repositories unless rebooting the dom0. So stop all threads properly and return -EBUSY if any queue has inflight I/O. Signed-off-by: Annie Li Reviewed-by: Herbert van den Bergh Reviewed-by: Bhavesh Davda Reviewed-by: Adnan Misherfi --- drivers/block/xen-blkback/xenbus.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c index 792da68..2adb859 100644 --- a/drivers/block/xen-blkback/xenbus.c +++ b/drivers/block/xen-blkback/xenbus.c @@ -244,6 +244,7 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif) { struct pending_req *req, *n; unsigned int j, r; + bool busy = false; for (r = 0; r < blkif->nr_rings; r++) { struct xen_blkif_ring *ring = &blkif->rings[r]; @@ -261,8 +262,10 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif) * don't have any discard_io or other_io requests. So, checking * for inflight IO is enough. */ - if (atomic_read(&ring->inflight) > 0) - return -EBUSY; + if (atomic_read(&ring->inflight) > 0) { + busy = true; + continue; + } I guess I'm missing something, but I don't see how this is solving the problem described in the description. If the problem is that xen_blkif_disconnect returns without cleaning all the queues, this patch keeps the current behavior, just that it will try to remove more queues before returning, as opposed to returning when finding the first busy queue. Before checking inflight, following code stops the blkback thread, if (ring->xenblkd) { kthread_stop(ring->xenblkd); wake_up(&ring->shutdown_wq); } This patch allows thread of every queue has the chance to get stopped. Otherwise, only thread of queue before(including) first busy one get stopped, threads of remaining queue will still run, and these blkthread and corresponding loop device will linger forever even after guest is destroyed. Thanks Annie