** Changed in: linux (Ubuntu)
Assignee: Colin Ian King (colin-king) => (unassigned)
https://bugs.launchpad.net/bugs/1838575
Title:
passthrough devices cause >17min boot delay
As outlined in the past, conceptually there is nothing that qemu can do.
The kernel could in theory make memory zeroing concurrent and thereby scale
with the number of CPUs, but that effort was already started twice and has not
landed in the kernel yet.
Workarounds are known to shrink that overhead.
As qemu seems to be unable to do much, I'll set it to Triaged (we
understand what is going on) and Low (we can't do much about it).
** Changed in: qemu (Ubuntu)
Status: Incomplete => Triaged
** Changed in: qemu (Ubuntu)
Importance: Medium => Low
I modified the kernel to make a few functions non-inlined so they are better traceable:
vfio_dma_do_map
vfio_dma_do_unmap
mutex_lock
mutex_unlock
kzalloc
vfio_link_dma
vfio_pin_map_dma
vfio_pin_pages_remote
vfio_iommu_map
Then I ran tracing on this load, limited to the functions in my focus:
$ sudo tra
(systemtap)
probe module("vfio_iommu_type1").function("vfio_iommu_type1_ioctl") {
printf("New vfio_iommu_type1_ioctl\n");
start_stopwatch("vfioioctl");
}
probe module("vfio_iommu_type1").function("vfio_iommu_type1_ioctl").return {
timer=read_stopwatch_ns("vfioioctl")
printf("Complet
This is a silly but useful distribution check over log10 of the allocation
sizes (each line is a count of allocations, then the log10 bucket of the size):
Fast:
108 3
1293 4
12133 5
113330 6
27794 7
1119 8
Slow:
194 3
1738 4
17375 5
143411 6
55 7
3 8
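For reference, a sketch of how such a histogram can be produced, assuming a plain
file with one allocation size per line (hypothetical sizes.txt, not the exact
tooling used here):
# count entries per log10 bucket of the size in column 1
$ awk '{ b = int(log($1)/log(10)); h[b]++ } END { for (b in h) print h[b], b }' sizes.txt | sort -k2 -n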
I got no warnings about missed calls.
The iommu lock is taken in there early, and the iommu element is what is passed
from userspace; it represents the vfio container for this device (container->fd).
qemu:
if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0
kernel:
static long vfio_iommu_type1_ioctl(void *iommu_data,
                                   unsigned int cmd, unsigned long arg)
Each qemu version differs slightly on the road to this, but then seems to behave
the same.
This approach is slightly better for getting "in front" of the slow call that
maps all the memory.
$ virsh nodedev-detach pci_0000_21_00_1 --driver vfio
$ gdb /usr/bin/qemu-system-x86_64
(gdb) b vfio_dma_map
(gdb) command
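A sketch of the same idea driven non-interactively, so the breakpoint is already
in place before the guest memory gets mapped (assumes a qemu build with debug
symbols so the static vfio_dma_map is breakable; the qemu arguments are shortened):
$ gdb -ex 'break vfio_dma_map' -ex run --args \
    /usr/bin/qemu-system-x86_64 -m 131072 -smp 1 -no-user-config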
I could next build a test kernel with some debug around the vfio iommu dma map
to check how time below that call is spent.
I'm sure that data already is hidden in some of my trace data, but to
eventually change/experiment I need to build one anyway.
I expect to summarize this anyway and go into a discussion.
Reference:
The call from qemu that I think we see above (on x86) is at [1].
If the assumption is correct this time, the kernel side would be
vfio_iommu_type1_ioctl.
For debugging:
$ gdb qemu/x86_64-softmmu/qemu-system-x86_64
(gdb) catch syscall 16
(gdb) run -m 131072 -smp 1 -no-user-config ...
Many ioctls (as expected) but they are all fast and match what we knew from
strace.
Thread 1 "qemu-system-x86" hit Catchpoint 1 (call to syscall ioctl),
0x772fae0b in ioctl () at ../sysdeps/unix/syscall-template.S:78
78 in ../sysdeps/unix/syscall-template.S
(gdb) bt
#0 0x772
The above was through libvirt; now doing it directly with qemu to make it
easier to attach debugging:
$ virsh nodedev-detach pci_0000_21_00_1 --driver vfio
$ qemu/x86_64-softmmu/qemu-system-x86_64 -name guest=test-vfio-slowness
-m 131072 -smp 1 -no-user-config -drive
file=/var/lib/uvtool/libvirt
Just when I thought I understood the pattern.
Sixth run (again kill and restart)
6384 9.826097 <... ioctl resumed> , 0x7ffcc8ed6e20) = 0 <19.495688>
So for now let's summarize that it varies :-/
But it always seems slow.
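For reference, an strace invocation along these lines produces the per-call
timing seen above (the exact flags are my assumption; the <...> value is the
time spent in the syscall):
# follow threads, print relative timestamps and time spent in each ioctl
$ sudo strace -f -r -T -e trace=ioctl -p $(pidof qemu-system-x86_64)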
On x86 this looks pretty similar and is at the place we have seen before:
45397 0.73 readlink("/sys/bus/pci/devices/0000:21:00.1/iommu_group",
"../../../../kernel/iommu_groups/"..., 4096) = 34 <0.20>
45397 0.53 openat(AT_FDCWD, "/dev/vfio/45", O_RDWR|O_CLOEXEC) = 31
<0.33>
I built qemu from git head with:
$ export CFLAGS="-O0 -g"
$ ./configure --disable-user --disable-linux-user --disable-docs
--disable-guest-agent --disable-sdl --disable-gtk --disable-vnc --disable-xen
--disable-brlapi --enable-fdt --disable-bluez --disable-vde --disable-rbd
--disable-libiscsi --disab
Hmm, with strace showing almost a hang on a single one of those ioctl calls
you'd think that is easy to spot :-/
But this isn't as clear as expected:
sudo trace-cmd record -p function_graph -l vfio_pci_ioctl -O graph-time
Disabled all but one CPU to have less concurrency in the trace.
=> Not much better.
On this platform strace still confirms the same paths, and perf does as well
(slight arch differences, but still memory setup):
46.85% [kernel] [k] lruvec_lru_size
16.89% [kernel] [k] clear_user_page
5.74% [kernel] [k] inacti
As assumed, this really seems to be cross-arch and to apply to all sizes.
Here 16 CPUs, 128G on ppc64el:
#1: 54 seconds
#2: 7 seconds
#3: 23 seconds
Upped to 192GB this has:
#1: 75 seconds
#2: 5 seconds
#3: 23 seconds
As a note, in this case I checked there are ~7 seconds before it goes into this
phase.
You can do so even per-size via e.g.
/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
As discussed, the later the allocation, the higher the chance to fail, so
re-check the sysfs file after each change to see whether it actually got that much memory.
The default size is only a boot time parameter.
Bu
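So the pool can be grown at run time too; a small sketch (1210 pages of 1G to
match the boot-time example below; always re-read the file to see how many pages
were actually reserved):
$ echo 1210 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
$ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
$ grep -i huge /proc/meminfo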
** Changed in: linux (Ubuntu)
Importance: Undecided => Medium
** Changed in: linux (Ubuntu)
Assignee: (unassigned) => Colin Ian King (colin-king)
Naive question: can we tweak the hugepage file settings at run time via
/proc/sys/vm/nr_hugepages and not require the kernel parameters?
A trace into the early phase (first some other init with ~16 CPUs), then the
long phase of one thread blocking.
So we will see it "enter" the slow phase as well as iterating in it.
There seem to be two phases: one around alloc_pages_current and one around
slot_rmap_walk_next/rmap_get_first.
I'd
Summary:
As I mentioned before (on the other bug I referred to), the problem is that
with a PT device qemu needs to reset and map the VFIO devices.
So with >0 PT devices attached it needs an init that scales with the memory size
of the guest (see my fast results with PT but small guest memory).
As I
I mentioned in the last discussion around this that the one thing that could
be done is to make this single-threaded mem-init a multi-threaded action (in
the kernel). I doubt that we can make it omit the initialization. Even though
it is faster, even the 1G Huge Page setup could be more efficient.
To
I was bumping up to the config you had (but with one PT device).
- Host phys bits machine type for larger mappings
- more CPUs: 1 -> 32
Adding/removing a PT device in the configs above doesn't change a lot.
As assumed none of these increased the time tremendously.
Then I went to bump up the memory
Now to gain a good result, let's use 1G Huge Pages.
Kernel cmdline:
default_hugepagesz=1G hugepagesz=1G hugepages=1210
Gives:
HugePages_Total:    1210
HugePages_Free:     1210
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
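To make that cmdline persistent across reboots, a sketch assuming GRUB is the
bootloader on this Ubuntu host:
# append default_hugepagesz=1G hugepagesz=1G hugepages=1210 to GRUB_CMDLINE_LINUX_DEFAULT
$ sudoedit /etc/default/grub
$ sudo update-grub
$ sudo reboot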
Guest config extra:
Slightly ch
IMHO this is the same thing we discussed around March this year
=> https://bugs.launchpad.net/nvidia-dgx-2/+bug/1818891/comments/5.
In an associated mail thread we even discussed the pro/con of changes like
As we see above, this is about (transparent huge page) setup for all the memory.
We can disable h
T3: use 1.2 TB with one PT device - THP=off
#1: 476 seconds
#2: 31 seconds
#3: 20 seconds
ubuntu@akis:~$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
never
ubuntu@akis:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
Samples: 88K of event 'cyc
Since I knew memory often is more painful, I started with 512MB, 1 CPU, 1 PCI
passthrough.
Note: I installed debug symbols for glibc and qemu.
On init I find the guest's CPU thread initially rather busy (well, booting up):
80.66%  CPU 0/KVM  [kernel]
Passthrough is successful - lspci from guest:
I have seen slow boots before, but they scaled with the number of devices and
the amount of memory, reaching ~5min (for >1TB mem init) and ~2.5min for
16-device pass-through.
Nvidia has seen and more or less accepted those times (the ones I mentioned).
I discussed with Anish about how using HugePages help to reduc
The attached file is already the stdio output, sorry for the previous message.
> > sudo perf record -a -F $((250*100)) -e cycles:u --
> > /usr/bin/qemu-system-x86_64 -name guest="guest"
>
> I instead attached perf to the qemu process after it was spawned by
> libvirt so I didn't have to worry about producing a working qemu
> cmdline. I let it run for several seconds whil
** Changed in: qemu (Ubuntu)
Importance: Undecided => Medium
** Changed in: qemu (Ubuntu)
Assignee: (unassigned) => Rafael David Tinoco (rafaeldtinoco)
** Changed in: qemu (Ubuntu)
Status: New => Confirmed
** Attachment added: "sample xml w/ devices passed through"
https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1838575/+attachment/5280232/+files/config5-new.xml
Here's a perf report of a 'sudo perf record -p 6949 -a -F 25000 -e
cycles', after libvirt spawned qemu w/ pid 6949.
** Attachment added: "perf.report"
https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1838575/+attachment/5280233/+files/perf.report