Re: [Qemu-devel] Memory use with >100 virtio devices
On 24/08/2017 14:30, David Gibson wrote:
>> Ideas what to tweak or what valgrind tool to try?
>
> valgrind probably isn't that useful at this point.  I think we need to
> instrument bits of the code to find what the O(n^2) algo is and fix it.
>
> Seems to me checking if the address_spaces list is growing to O(n^2)
> entries would be a good place to start.

The address spaces are O(n) (and so are the FlatViews and the dispatch
tries), but each of them has O(n) size.  Eventually we use O(n^2) memory,
but we build them O(n) times---which is expensive and also means, due to
RCU, that there can be a short window where usage is between O(n^2) and
O(n^3).

The scheme I suggested elsewhere in the thread should cut one "n" by
sharing a single FlatView and dispatch trie across all AddressSpaces.

Paolo
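A rough sketch of the direction described above -- one FlatView (and its
dispatch) shared by every AddressSpace with the same root -- purely for
illustration: the table and helper names (flat_views, get_or_build_flatview)
are invented here, and generate_memory_topology()/flatview_ref() stand in
for whatever memory.c actually provides.

/* Illustration only: share one FlatView per root MemoryRegion so that n
 * AddressSpaces with identical roots stop paying for n separate
 * O(n)-sized FlatViews and dispatch tries.  Names are placeholders, not
 * the actual QEMU implementation. */
static GHashTable *flat_views;   /* root MemoryRegion -> FlatView */

static FlatView *get_or_build_flatview(MemoryRegion *root)
{
    FlatView *fv;

    if (!flat_views) {
        flat_views = g_hash_table_new(NULL, NULL);   /* direct pointer keys */
    }
    fv = g_hash_table_lookup(flat_views, root);
    if (fv) {
        flatview_ref(fv);                    /* reuse the shared copy */
        return fv;
    }
    fv = generate_memory_topology(root);     /* build FlatView + dispatch once */
    g_hash_table_insert(flat_views, root, fv);
    return fv;
}

With something along these lines, a topology change rebuilds each distinct
root once instead of once per AddressSpace, cutting both the transient
memory use and the rebuild time by one factor of n.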
Re: [Qemu-devel] Memory use with >100 virtio devices
On Thu, Aug 24, 2017 at 07:48:57PM +1000, Alexey Kardashevskiy wrote: > On 21/08/17 15:50, Alexey Kardashevskiy wrote: > > On 21/08/17 14:31, David Gibson wrote: > >> On Fri, Aug 18, 2017 at 02:18:53PM +0100, Stefan Hajnoczi wrote: > >>> On Fri, Aug 18, 2017 at 03:39:20PM +1000, Alexey Kardashevskiy wrote: > ==94451== 4 of 10 > ==94451== max-live:314,649,600 in 150 blocks > ==94451== tot-alloc: 314,649,600 in 150 blocks (avg size 2097664.00) > ==94451== deaths: none (none of these blocks were freed) > ==94451== acc-ratios: 0.00 rd, 0.00 wr (0 b-read, 0 b-written) > ==94451==at 0x4895600: memalign (in > /usr/lib/valgrind/vgpreload_exp-dhat-ppc64le-linux.so) > ==94451==by 0x48957E7: posix_memalign (in > /usr/lib/valgrind/vgpreload_exp-dhat-ppc64le-linux.so) > ==94451==by 0xB744AB: qemu_try_memalign (oslib-posix.c:106) > ==94451==by 0xA92053: qemu_try_blockalign (io.c:2493) > ==94451==by 0xA34DDF: qcow2_do_open (qcow2.c:1365) > ==94451==by 0xA35627: qcow2_open (qcow2.c:1526) > ==94451==by 0x9FB94F: bdrv_open_driver (block.c:1109) > ==94451==by 0x9FC413: bdrv_open_common (block.c:1365) > ==94451==by 0x9FF823: bdrv_open_inherit (block.c:2542) > ==94451==by 0x9FFC17: bdrv_open (block.c:2626) > ==94451==by 0xA71027: blk_new_open (block-backend.c:267) > ==94451==by 0x6D3E6B: blockdev_init (blockdev.c:588) > >>> > >>> This allocation is unnecessary. Most qcow2 files are not encrypted so > >>> s->cluster_data does not need to be allocated upfront. > >>> > >>> I'll send a patch. > >> > >> Is that sufficient to explain the problem, I can't quickly see how big > >> that unnecessary allocation is - but would it account for the 10s of > >> gigabytes usage we're seeing here? > >> > >> I'm suspecting we accidentally have a O(n^2) or worse space complexity > >> going on here. > >> > > > > No, it is a small fraction only. See "[PATCH] qcow2: allocate > > cluster_cache/cluster_data on demand" thread for more details. > > > The information was lost there so I'll continue in this thread. > > I run QEMU again, with 2GB of RAM, -initrd+-kernel, pseries, 64 PCI > bridges, -S, no KVM, some virtio-block devices; I run it under "valgrind > --tool=exp-dhat" and exited via "c-a x" as soon as possible. > > The summary of each run is: > > 50 virtio-block devices: > guest_insns: 2,728,740,444 > max_live: 1,214,121,770 in 226,958 blocks > tot_alloc:1,384,726,690 in 310,930 blocks > > > 150 virtio-block devices: > guest_insns: 17,576,279,582 > max_live: 7,454,182,031 in 1,286,128 blocks > tot_alloc:7,958,747,994 in 1,469,719 blocks > > 250 virtio-block devices: > guest_insns: 46,100,928,249 > max_live: 19,423,868,479 in 3,264,833 blocks > tot_alloc:20,262,409,839 in 3,548,220 blocks > > 350 virtio-block devices: > guest_insns: 88,046,403,555 > max_live: 36,994,652,991 in 6,140,203 blocks > tot_alloc:38,167,153,779 in 6,523,206 blocks > > > Memory usage 1) grows a lot 2) grows out of proportion Yup, looks to be growing O(n^2) as I suspected. > 3) QEMU becomes incredibly slow. Not surprising. If memory is growing as O(n^2) chances are good we're also traversing some structure that is growing as O(n^2) which would certainly be slow. > With the hack (below) and 350 virtio-block devices, the summary is: > guest_insns: 7,873,805,573 > max_live: 2,577,738,019 in 2,567,682 blocks > tot_alloc:3,750,238,807 in 2,950,685 blocks > insns per allocated byte: 2 > > > I am also attaching 2 snapshots from the valgrind's "massif" tool, with and > without the hack. > > Ideas what to tweak or what valgrind tool to try? 
valgrind probably isn't that useful at this point.  I think we need to
instrument bits of the code to find what the O(n^2) algo is and fix it.

Seems to me checking if the address_spaces list is growing to O(n^2)
entries would be a good place to start.

> The hack is basically excluding virtio-pci-cfg-as from the address_spaces
> list (yeah, it breaks QEMU, this is just a hint):
>
> diff --git a/memory.c b/memory.c
> index 02c95d1..118ac7f 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -2589,6 +2589,7 @@ void address_space_init(AddressSpace *as,
> MemoryRegion *root, const char *name)
>      as->ioeventfd_nb = 0;
>      as->ioeventfds = NULL;
>      QTAILQ_INIT(&as->listeners);
> +    if (strcmp(name, "virtio-pci-cfg-as"))
>      QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link);
>      as->name = g_strdup(name ? name : "anonymous");
>      address_space_init_dispatch(as);
>
> 57 85,481,838,874 37,075,366,088 36,993,595,913 81,770,175 0
> 99.78% (36,993,595,913B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
> ->95.82% (35,527,301,159B) 0x59E7392: g_realloc (in /lib/powerpc64le-linux-gnu/libglib-2.0.so.0.5000.2)
> |
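As a starting point for that instrumentation, a minimal sketch (assuming the
2017-era memory.c internals -- the global address_spaces QTAILQ and a
FlatView "nr" range counter; field names may need adjusting) that could be
called at the end of memory_region_transaction_commit():

/* Debug-only sketch: print how many AddressSpaces exist and how many
 * FlatRanges they hold in total each time the topology is committed.
 * If both numbers keep growing with the number of devices, the O(n^2)
 * growth is confirmed.  Field names assume the 2017-era memory.c. */
static void dump_topology_stats(void)
{
    AddressSpace *as;
    unsigned n_as = 0, n_ranges = 0;

    QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
        n_as++;
        if (as->current_map) {
            n_ranges += as->current_map->nr;   /* FlatRanges in this AS's view */
        }
    }
    fprintf(stderr, "topology commit: %u address spaces, %u flat ranges total\n",
            n_as, n_ranges);
}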
Re: [Qemu-devel] Memory use with >100 virtio devices
On 21/08/17 15:50, Alexey Kardashevskiy wrote:
> On 21/08/17 14:31, David Gibson wrote:
>> On Fri, Aug 18, 2017 at 02:18:53PM +0100, Stefan Hajnoczi wrote:
>>> On Fri, Aug 18, 2017 at 03:39:20PM +1000, Alexey Kardashevskiy wrote:
>>>> ==94451== 4 of 10
>>>> ==94451== max-live:    314,649,600 in 150 blocks
>>>> ==94451== tot-alloc:   314,649,600 in 150 blocks (avg size 2097664.00)
>>>> ==94451== deaths:      none (none of these blocks were freed)
>>>> ==94451== acc-ratios:  0.00 rd, 0.00 wr  (0 b-read, 0 b-written)
>>>> ==94451==    at 0x4895600: memalign (in /usr/lib/valgrind/vgpreload_exp-dhat-ppc64le-linux.so)
>>>> ==94451==    by 0x48957E7: posix_memalign (in /usr/lib/valgrind/vgpreload_exp-dhat-ppc64le-linux.so)
>>>> ==94451==    by 0xB744AB: qemu_try_memalign (oslib-posix.c:106)
>>>> ==94451==    by 0xA92053: qemu_try_blockalign (io.c:2493)
>>>> ==94451==    by 0xA34DDF: qcow2_do_open (qcow2.c:1365)
>>>> ==94451==    by 0xA35627: qcow2_open (qcow2.c:1526)
>>>> ==94451==    by 0x9FB94F: bdrv_open_driver (block.c:1109)
>>>> ==94451==    by 0x9FC413: bdrv_open_common (block.c:1365)
>>>> ==94451==    by 0x9FF823: bdrv_open_inherit (block.c:2542)
>>>> ==94451==    by 0x9FFC17: bdrv_open (block.c:2626)
>>>> ==94451==    by 0xA71027: blk_new_open (block-backend.c:267)
>>>> ==94451==    by 0x6D3E6B: blockdev_init (blockdev.c:588)
>>>
>>> This allocation is unnecessary.  Most qcow2 files are not encrypted so
>>> s->cluster_data does not need to be allocated upfront.
>>>
>>> I'll send a patch.
>>
>> Is that sufficient to explain the problem, I can't quickly see how big
>> that unnecessary allocation is - but would it account for the 10s of
>> gigabytes usage we're seeing here?
>>
>> I'm suspecting we accidentally have a O(n^2) or worse space complexity
>> going on here.
>
> No, it is a small fraction only. See "[PATCH] qcow2: allocate
> cluster_cache/cluster_data on demand" thread for more details.

The information was lost there so I'll continue in this thread.

I ran QEMU again, with 2GB of RAM, -initrd+-kernel, pseries, 64 PCI
bridges, -S, no KVM, some virtio-block devices; I ran it under "valgrind
--tool=exp-dhat" and exited via "c-a x" as soon as possible.

The summary of each run is:

50 virtio-block devices:
guest_insns:  2,728,740,444
max_live:     1,214,121,770 in 226,958 blocks
tot_alloc:    1,384,726,690 in 310,930 blocks

150 virtio-block devices:
guest_insns:  17,576,279,582
max_live:     7,454,182,031 in 1,286,128 blocks
tot_alloc:    7,958,747,994 in 1,469,719 blocks

250 virtio-block devices:
guest_insns:  46,100,928,249
max_live:     19,423,868,479 in 3,264,833 blocks
tot_alloc:    20,262,409,839 in 3,548,220 blocks

350 virtio-block devices:
guest_insns:  88,046,403,555
max_live:     36,994,652,991 in 6,140,203 blocks
tot_alloc:    38,167,153,779 in 6,523,206 blocks

Memory usage 1) grows a lot, 2) grows out of proportion, and 3) QEMU
becomes incredibly slow.

With the hack (below) and 350 virtio-block devices, the summary is:
guest_insns:  7,873,805,573
max_live:     2,577,738,019 in 2,567,682 blocks
tot_alloc:    3,750,238,807 in 2,950,685 blocks
insns per allocated byte: 2

I am also attaching 2 snapshots from valgrind's "massif" tool, with and
without the hack.

Ideas what to tweak or what valgrind tool to try?
The hack is basically excluding virtio-pci-cfg-as from the address_spaces
list (yeah, it breaks QEMU, this is just a hint):

diff --git a/memory.c b/memory.c
index 02c95d1..118ac7f 100644
--- a/memory.c
+++ b/memory.c
@@ -2589,6 +2589,7 @@ void address_space_init(AddressSpace *as,
MemoryRegion *root, const char *name)
     as->ioeventfd_nb = 0;
     as->ioeventfds = NULL;
     QTAILQ_INIT(&as->listeners);
+    if (strcmp(name, "virtio-pci-cfg-as"))
     QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link);
     as->name = g_strdup(name ? name : "anonymous");
     address_space_init_dispatch(as);

--
Alexey

57 85,481,838,874 37,075,366,088 36,993,595,913 81,770,175 0
99.78% (36,993,595,913B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->95.82% (35,527,301,159B) 0x59E7392: g_realloc (in /lib/powerpc64le-linux-gnu/libglib-2.0.so.0.5000.2)
| ->94.64% (35,089,885,088B) 0x59E7756: g_realloc_n (in /lib/powerpc64le-linux-gnu/libglib-2.0.so.0.5000.2)
| | ->90.60% (33,590,452,224B) 0x3A785A: phys_map_node_reserve (exec.c:251)
| | | ->90.60% (33,590,452,224B) 0x3A7CE2: phys_page_set (exec.c:307)
| | | ->90.60% (33,590,452,224B) 0x3AAF26: register_multipage (exec.c:1345)
| | | ->90.60% (33,590,452,224B) 0x3AB31E: mem_add (exec.c:1376)
| | | ->90.55% (33,573,015,552B) 0x437F52: address_space_update_topology_pass (memory.c:855)
| | | | ->90.55% (33,573,015,552B) 0x4382C2: address_space_update_topology (memory.c:889)
| | | | ->90.55% (33,573,015,552B) 0x438502: memory_region_transaction_commit (memory.c:925)
| | | | ->54.09% (20,055,085,056B) 0x43D07E:
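For context on why filtering out "virtio-pci-cfg-as" has such a large
effect: every virtio-pci device registers its own config-window
AddressSpace, so 350 devices add 350 address spaces, each of which gets a
full FlatView/dispatch rebuild on every topology commit.  The sketch below
paraphrases hw/virtio/virtio-pci.c of that period; treat the proxy field
names as approximate rather than verbatim.

/* Sketch (not verbatim): where the per-device "virtio-pci-cfg-as" comes
 * from in virtio_pci_realize().  Each proxy aliases its modern BAR and
 * wraps the alias in its own AddressSpace so DMA issued through the PCI
 * config window can be translated. */
static void virtio_pci_cfg_as_sketch(VirtIOPCIProxy *proxy)
{
    memory_region_init_alias(&proxy->modern_cfg, OBJECT(proxy),
                             "virtio-pci-cfg", &proxy->modern_bar, 0,
                             memory_region_size(&proxy->modern_bar));
    /* this is the AddressSpace the hack above keeps off the global list */
    address_space_init(&proxy->modern_as, &proxy->modern_cfg,
                       "virtio-pci-cfg-as");
}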
Re: [Qemu-devel] Memory use with >100 virtio devices
On 21/08/17 14:31, David Gibson wrote: > On Fri, Aug 18, 2017 at 02:18:53PM +0100, Stefan Hajnoczi wrote: >> On Fri, Aug 18, 2017 at 03:39:20PM +1000, Alexey Kardashevskiy wrote: >>> ==94451== 4 of 10 >>> ==94451== max-live:314,649,600 in 150 blocks >>> ==94451== tot-alloc: 314,649,600 in 150 blocks (avg size 2097664.00) >>> ==94451== deaths: none (none of these blocks were freed) >>> ==94451== acc-ratios: 0.00 rd, 0.00 wr (0 b-read, 0 b-written) >>> ==94451==at 0x4895600: memalign (in >>> /usr/lib/valgrind/vgpreload_exp-dhat-ppc64le-linux.so) >>> ==94451==by 0x48957E7: posix_memalign (in >>> /usr/lib/valgrind/vgpreload_exp-dhat-ppc64le-linux.so) >>> ==94451==by 0xB744AB: qemu_try_memalign (oslib-posix.c:106) >>> ==94451==by 0xA92053: qemu_try_blockalign (io.c:2493) >>> ==94451==by 0xA34DDF: qcow2_do_open (qcow2.c:1365) >>> ==94451==by 0xA35627: qcow2_open (qcow2.c:1526) >>> ==94451==by 0x9FB94F: bdrv_open_driver (block.c:1109) >>> ==94451==by 0x9FC413: bdrv_open_common (block.c:1365) >>> ==94451==by 0x9FF823: bdrv_open_inherit (block.c:2542) >>> ==94451==by 0x9FFC17: bdrv_open (block.c:2626) >>> ==94451==by 0xA71027: blk_new_open (block-backend.c:267) >>> ==94451==by 0x6D3E6B: blockdev_init (blockdev.c:588) >> >> This allocation is unnecessary. Most qcow2 files are not encrypted so >> s->cluster_data does not need to be allocated upfront. >> >> I'll send a patch. > > Is that sufficient to explain the problem, I can't quickly see how big > that unnecessary allocation is - but would it account for the 10s of > gigabytes usage we're seeing here? > > I'm suspecting we accidentally have a O(n^2) or worse space complexity > going on here. > No, it is a small fraction only. See "[PATCH] qcow2: allocate cluster_cache/cluster_data on demand" thread for more details. -- Alexey signature.asc Description: OpenPGP digital signature
Re: [Qemu-devel] Memory use with >100 virtio devices
On Fri, Aug 18, 2017 at 02:18:53PM +0100, Stefan Hajnoczi wrote: > On Fri, Aug 18, 2017 at 03:39:20PM +1000, Alexey Kardashevskiy wrote: > > ==94451== 4 of 10 > > ==94451== max-live:314,649,600 in 150 blocks > > ==94451== tot-alloc: 314,649,600 in 150 blocks (avg size 2097664.00) > > ==94451== deaths: none (none of these blocks were freed) > > ==94451== acc-ratios: 0.00 rd, 0.00 wr (0 b-read, 0 b-written) > > ==94451==at 0x4895600: memalign (in > > /usr/lib/valgrind/vgpreload_exp-dhat-ppc64le-linux.so) > > ==94451==by 0x48957E7: posix_memalign (in > > /usr/lib/valgrind/vgpreload_exp-dhat-ppc64le-linux.so) > > ==94451==by 0xB744AB: qemu_try_memalign (oslib-posix.c:106) > > ==94451==by 0xA92053: qemu_try_blockalign (io.c:2493) > > ==94451==by 0xA34DDF: qcow2_do_open (qcow2.c:1365) > > ==94451==by 0xA35627: qcow2_open (qcow2.c:1526) > > ==94451==by 0x9FB94F: bdrv_open_driver (block.c:1109) > > ==94451==by 0x9FC413: bdrv_open_common (block.c:1365) > > ==94451==by 0x9FF823: bdrv_open_inherit (block.c:2542) > > ==94451==by 0x9FFC17: bdrv_open (block.c:2626) > > ==94451==by 0xA71027: blk_new_open (block-backend.c:267) > > ==94451==by 0x6D3E6B: blockdev_init (blockdev.c:588) > > This allocation is unnecessary. Most qcow2 files are not encrypted so > s->cluster_data does not need to be allocated upfront. > > I'll send a patch. Is that sufficient to explain the problem, I can't quickly see how big that unnecessary allocation is - but would it account for the 10s of gigabytes usage we're seeing here? I'm suspecting we accidentally have a O(n^2) or worse space complexity going on here. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson signature.asc Description: PGP signature
Re: [Qemu-devel] Memory use with >100 virtio devices
On Fri, Aug 18, 2017 at 03:39:20PM +1000, Alexey Kardashevskiy wrote: > ==94451== 4 of 10 > ==94451== max-live:314,649,600 in 150 blocks > ==94451== tot-alloc: 314,649,600 in 150 blocks (avg size 2097664.00) > ==94451== deaths: none (none of these blocks were freed) > ==94451== acc-ratios: 0.00 rd, 0.00 wr (0 b-read, 0 b-written) > ==94451==at 0x4895600: memalign (in > /usr/lib/valgrind/vgpreload_exp-dhat-ppc64le-linux.so) > ==94451==by 0x48957E7: posix_memalign (in > /usr/lib/valgrind/vgpreload_exp-dhat-ppc64le-linux.so) > ==94451==by 0xB744AB: qemu_try_memalign (oslib-posix.c:106) > ==94451==by 0xA92053: qemu_try_blockalign (io.c:2493) > ==94451==by 0xA34DDF: qcow2_do_open (qcow2.c:1365) > ==94451==by 0xA35627: qcow2_open (qcow2.c:1526) > ==94451==by 0x9FB94F: bdrv_open_driver (block.c:1109) > ==94451==by 0x9FC413: bdrv_open_common (block.c:1365) > ==94451==by 0x9FF823: bdrv_open_inherit (block.c:2542) > ==94451==by 0x9FFC17: bdrv_open (block.c:2626) > ==94451==by 0xA71027: blk_new_open (block-backend.c:267) > ==94451==by 0x6D3E6B: blockdev_init (blockdev.c:588) This allocation is unnecessary. Most qcow2 files are not encrypted so s->cluster_data does not need to be allocated upfront. I'll send a patch. Stefan signature.asc Description: PGP signature
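A minimal sketch of the on-demand idea (not the actual patch; field and
constant names follow the qcow2 driver of that era): the upfront
qemu_try_blockalign() in qcow2_do_open() would move into a helper run
before the first access to an encrypted cluster.

/* Sketch only: allocate the encryption bounce buffers lazily instead of
 * in qcow2_do_open().  The helper name is invented; s->cluster_data,
 * s->cluster_cache, s->cluster_size and QCOW_MAX_CRYPT_CLUSTERS are the
 * existing qcow2 fields/constants. */
static int qcow2_ensure_crypt_buffers(BlockDriverState *bs)
{
    BDRVQcow2State *s = bs->opaque;

    if (!s->cluster_data) {
        s->cluster_data = qemu_try_blockalign(bs->file->bs,
                                              QCOW_MAX_CRYPT_CLUSTERS
                                              * s->cluster_size);
        if (!s->cluster_data) {
            return -ENOMEM;
        }
        s->cluster_cache = g_malloc(s->cluster_size);
    }
    return 0;
}

For unencrypted images the buffers are then never allocated at all, which
removes the ~2MB-per-drive blocks seen in the trace above, though (as
discussed later in the thread) that is only a small part of the overall
growth.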
[Qemu-devel] Memory use with >100 virtio devices
Hi!

We have received a report that QEMU cannot handle hundreds of virtio
devices and crashes. I tried QEMU with 150 virtio-block devices, 1 CPU and
2GB RAM (the exact command line is at the end) and found that it took more
than 5.5GB resident and 9GB virtual memory. A bit weird, so I tried
valgrind, and from what I see, most memory was lost in an enormous number
of realloc()s.

~700 devices was my maximum on a 128GB host; QEMU with more than that
would be killed by the kernel's OOM killer.

It does not sound like a problem which needs urgent attention, but I am
still curious - is there a way to tell libc's allocator to be smarter?

==94451== Parent PID: 94138
==94451==
==94451==
==94451== SUMMARY STATISTICS
==94451==
==94451== guest_insns:  17,414,575,724
==94451==
==94451== max_live:     8,103,226,799 in 1,288,978 blocks
==94451==
==94451== tot_alloc:    9,256,734,636 in 1,474,121 blocks
==94451==
==94451== insns per allocated byte: 1
==94451==
==94451==
==94451== ORDERED BY decreasing "max-bytes-live": top 10 allocators
==94451==
==94451== 1 of 10
==94451== max-live:    3,805,483,008 in 77,462 blocks
==94451== tot-alloc:   3,805,483,008 in 77,462 blocks (avg size 49127.09)
==94451== deaths:      77,309, at avg age 10,068,510,919 (57.81% of prog lifetime)
==94451== acc-ratios:  0.29 rd, 0.24 wr  (1,135,549,612 b-read, 950,066,280 b-written)
==94451==    at 0x48920C4: malloc (in /usr/lib/valgrind/vgpreload_exp-dhat-ppc64le-linux.so)
==94451==    by 0x4895517: realloc (in /usr/lib/valgrind/vgpreload_exp-dhat-ppc64le-linux.so)
==94451==    by 0x59E7393: g_realloc (in /lib/powerpc64le-linux-gnu/libglib-2.0.so.0.5000.2)
==94451==    by 0x59E7757: g_realloc_n (in /lib/powerpc64le-linux-gnu/libglib-2.0.so.0.5000.2)
==94451==    by 0x3A785B: phys_map_node_reserve (exec.c:251)
==94451==    by 0x3A7CE3: phys_page_set (exec.c:307)
==94451==    by 0x3AAF27: register_multipage (exec.c:1345)
==94451==    by 0x3AB31F: mem_add (exec.c:1376)
==94451==    by 0x437F53: address_space_update_topology_pass (memory.c:855)
==94451==    by 0x4382C3: address_space_update_topology (memory.c:889)
==94451==    by 0x438503: memory_region_transaction_commit (memory.c:925)
==94451==    by 0x43D07F: memory_region_update_container_subregions (memory.c:2136)
==94451==
==94451== 2 of 10
==94451== max-live:    1,332,215,808 in 27,104 blocks
==94451== tot-alloc:   1,332,215,808 in 27,104 blocks (avg size 49152.00)
==94451== deaths:      27,104, at avg age 2,360,954,537 (13.55% of prog lifetime)
==94451== acc-ratios:  0.30 rd, 0.25 wr  (399,684,736 b-read, 333,887,488 b-written)
==94451==    at 0x48920C4: malloc (in /usr/lib/valgrind/vgpreload_exp-dhat-ppc64le-linux.so)
==94451==    by 0x4895517: realloc (in /usr/lib/valgrind/vgpreload_exp-dhat-ppc64le-linux.so)
==94451==    by 0x59E7393: g_realloc (in /lib/powerpc64le-linux-gnu/libglib-2.0.so.0.5000.2)
==94451==    by 0x59E7757: g_realloc_n (in /lib/powerpc64le-linux-gnu/libglib-2.0.so.0.5000.2)
==94451==    by 0x3A785B: phys_map_node_reserve (exec.c:251)
==94451==    by 0x3A7CE3: phys_page_set (exec.c:307)
==94451==    by 0x3AAF27: register_multipage (exec.c:1345)
==94451==    by 0x3AB31F: mem_add (exec.c:1376)
==94451==    by 0x437F53: address_space_update_topology_pass (memory.c:855)
==94451==    by 0x4382C3: address_space_update_topology (memory.c:889)
==94451==    by 0x438503: memory_region_transaction_commit (memory.c:925)
==94451==    by 0x43D3CF: memory_region_set_enabled (memory.c:2186)
==94451==
==94451== 3 of 10
==94451== max-live:    1,229,537,280 in 25,026 blocks
==94451== tot-alloc:   1,229,537,280 in 25,026 blocks (avg size 49130.39)
==94451== deaths:      25,026, at avg age 10,091,400,121 (57.94% of prog lifetime)
==94451== acc-ratios:  0.29 rd, 0.24 wr  (366,871,968 b-read, 306,940,392 b-written)
==94451==    at 0x48920C4: malloc (in /usr/lib/valgrind/vgpreload_exp-dhat-ppc64le-linux.so)
==94451==    by 0x4895517: realloc (in /usr/lib/valgrind/vgpreload_exp-dhat-ppc64le-linux.so)
==94451==    by 0x59E7393: g_realloc (in /lib/powerpc64le-linux-gnu/libglib-2.0.so.0.5000.2)
==94451==    by 0x59E7757: g_realloc_n (in /lib/powerpc64le-linux-gnu/libglib-2.0.so.0.5000.2)
==94451==    by 0x3A785B: phys_map_node_reserve (exec.c:251)
==94451==    by 0x3A7CE3: phys_page_set (exec.c:307)
==94451==    by 0x3AAF27: register_multipage (exec.c:1345)
==94451==    by 0x3AB31F: mem_add (exec.c:1376)
==94451==    by 0x437F53: address_space_update_topology_pass (memory.c:855)
==94451==    by 0x4382C3: address_space_update_topology (memory.c:889)
==94451==    by 0x438503: memory_region_transaction_commit (memory.c:925)
==94451==    by 0x43F017: address_space_init (memory.c:2596)
==94451==
==94451== 4 of 10
==94451== max-live:    314,649,600 in 150 blocks
==94451== tot-alloc:   314,649,600 in 150