On Thu, Jun 15, 2017 at 10:31:11AM +0800, Peter Xu wrote:
> On Wed, Jun 14, 2017 at 09:34:52PM +0300, Michael S. Tsirkin wrote:
> > On Mon, Jun 12, 2017 at 12:04:58PM +0800, Peter Xu wrote:
> > > On Mon, Jun 12, 2017 at 06:07:04AM +0300, Michael S. Tsirkin wrote:
> > > > On Mon, Jun 12, 2017 at 10:34:43AM +0800, Peter Xu wrote:
> > > > > On Sun, Jun 11, 2017 at 08:10:15PM +0800, David Gibson wrote:
> > > > > > On Sun, Jun 11, 2017 at 01:09:26PM +0300, Michael S. Tsirkin wrote:
> > > > > > > On Fri, Jun 09, 2017 at 09:58:47AM +0800, Peter Xu wrote:
> > > > > > > > > > The problem is that when I was fixing the problem that 
> > > > > > > > > > vhost had with
> > > > > > > > > > PT (a764040, "exec: abstract 
> > > > > > > > > > address_space_do_translate()"), I did
> > > > > > > > > > broke the IOTLB translation a bit (it was using page 
> > > > > > > > > > masks). IMHO we
> > > > > > > > > > need to fix it first for correctness (patch 1/2).
> > > > > > > > > > 
> > > > > > > > > > For patch 3, if we can have Jason's patch to allow dynamic
> > > > > > > > > > iommu_platform switching, that'll be the best, then I can 
> > > > > > > > > > rewrite
> > > > > > > > > > patch 3 with the switching logic rather than caching 
> > > > > > > > > > anything. But
> > > > > > > > > > IMHO that can be separated from patch 1/2 if you like.
> > > > > > > > > > 
> > > > > > > > > > Or do you have better suggestion on how should we fix it?
> > > > > > > > > > 
> > > > > > > > > > Thanks,
> > > > > > > > > 
> > > > > > > > > Can we drop masks completely and replace with length? I think 
> > > > > > > > > we
> > > > > > > > > should do that instead of trying to fix masks.
> > > > > > > > 
> > > > > > > > Do you mean to modify IOMMUTLBEntry.addr_mask into length?
> > > > > > > 
> > > > > > > I think it's better than alternatives.
> > > > > > > 
> > > > > > > > Again, I am not sure this is good... At least we need to get 
> > > > > > > > ack from
> > > > > > > > David since spapr should be the initial user of it, and 
> > > > > > > > possibly also
> > > > > > > > Alex since vfio should be assuming that (IIUC both in QEMU and 
> > > > > > > > kernel)
> > > > > > > > addr_mask is page masks rather than arbirary length.
> > > > > > > > 
> > > > > > > > (CC Alex)
> > > > > > > > 
> > > > > > > > Thanks,
> > > > > > > 
> > > > > > > Callbacks that need powers of two can easily split up the range.
> > > > > > 
> > > > > > I think I missed part of the thread.  What's the original use case 
> > > > > > for
> > > > > > non-power-of-two IOTLB entries?  It certainly won't happen on Power.
> > > > > 
> > > > > Currently address_space_get_iotlb_entry() didn't really follow the
> > > > > rule, addr_mask can be arbitary length. This series tried to fix it,
> > > > > while Michael was questioning about whether we should really fix that
> > > > > at all.
> > > > > 
> > > > > Michael,
> > > > > 
> > > > > Even if for performance's sake, I should still think we should fix it.
> > > > > Let's consider a most simple worst case: we have a single page mapped
> > > > > with IOVA range (2M page):
> > > > > 
> > > > >  [0x0, 0x200000)
> > > > > 
> > > > > And if guest access IOVA using the following patern:
> > > > > 
> > > > >  0x1fffff, 0x1ffffe, 0x1ffffd, ...
> > > > > 
> > > > > Then now we'll get this:
> > > > > 
> > > > > - request 0x1fffff, cache miss, will get iotlb [0x1fffff, 0x200000)
> > > > > - request 0x1ffffe, cache miss, will get iotlb [0x1ffffe, 0x200000)
> > > > > - request 0x1ffffd, cache miss, will get iotlb [0x1ffffd, 0x200000)
> > > > > - ...
> > > > 
> > > > We pass an offset too, do we not? So callee can figure out
> > > > the region starts at 0x0 and avoid 2nd and 3rd misses.
> > > 
> > > Here when you say "offset", do you mean the offset in
> > > MemoryRegionSection?
> > > 
> > > In address_space_get_iotlb_entry() we have this:
> > > 
> > >     section = address_space_do_translate(as, addr, &xlat, &plen,
> > >                                          is_write, false);
> > > 
> > > One thing to mention is that, imho we cannot really assume the xlat is
> > > valid on the whole "section" range - the section can be a huge GPA
> > > range, while the xlat may only be valid on a single 4K page. The only
> > > safe region we can use here is (xlat, xlat+plen). Outside that, we
> > > should know nothing valid.
> 
> [1]
> 
> > > 
> > > Please correct me if I didn't really catch the point..
> > 
> > IIUC section is the translation result. If so all of it is valid,
> > not just one page.
> 
> It should be only one page (I think that depends on how we implemented
> the translation logic). Please check this (gdb session on current
> master):
> 
> (gdb) b address_space_get_iotlb_entry 
> Breakpoint 2 at 0x55555576803a: file /root/git/qemu/exec.c, line 512.
> ... (until break)
> (gdb) bt
> #0  0x000055555576803a in address_space_get_iotlb_entry (as=0x55555841bbb0, 
> addr=4294959104, is_write=false) at /root/git/qemu/exec.c:512
> #1  0x00005555558322dd in vhost_device_iotlb_miss (dev=0x5555568b03a0, 
> iova=4294959104, write=0) at /root/git/qemu/hw/virtio/vhost.c:982
> #2  0x0000555555834ad9 in vhost_backend_handle_iotlb_msg (dev=0x5555568b03a0, 
> imsg=0x7fffffffd428) at /root/git/qemu/hw/virtio/vhost-backend.c:334
> #3  0x00005555558347be in vhost_kernel_iotlb_read (opaque=0x5555568b03a0) at 
> /root/git/qemu/hw/virtio/vhost-backend.c:204
> #4  0x0000555555c7e7c3 in aio_dispatch_handlers (ctx=0x5555568091b0) at 
> /root/git/qemu/util/aio-posix.c:399
> #5  0x0000555555c7e956 in aio_dispatch (ctx=0x5555568091b0) at 
> /root/git/qemu/util/aio-posix.c:430
> #6  0x0000555555c7a722 in aio_ctx_dispatch (source=0x5555568091b0, 
> callback=0x0, user_data=0x0) at /root/git/qemu/util/async.c:261
> #7  0x00007f98c62f3703 in g_main_context_dispatch () at 
> /lib64/libglib-2.0.so.0
> #8  0x0000555555c7d36c in glib_pollfds_poll () at 
> /root/git/qemu/util/main-loop.c:213
> #9  0x0000555555c7d468 in os_host_main_loop_wait (timeout=29311167) at 
> /root/git/qemu/util/main-loop.c:261
> #10 0x0000555555c7d521 in main_loop_wait (nonblocking=0) at 
> /root/git/qemu/util/main-loop.c:517
> #11 0x00005555558fe234 in main_loop () at /root/git/qemu/vl.c:1918
> #12 0x0000555555905fe8 in main (argc=21, argv=0x7fffffffda48, 
> envp=0x7fffffffdaf8) at /root/git/qemu/vl.c:4752
> (gdb) s
> 517         plen = (hwaddr)-1;
> (gdb)
> 520         section = address_space_do_translate(as, addr, &xlat, &plen,
> (gdb) n
> 524         if (section.mr == &io_mem_unassigned) {
> (gdb) p/x xlat
> $2 = 0x73900000
> (gdb) p/x addr
> $3 = 0xffffe000
> (gdb) p/x plen
> $4 = 0x1000
> (gdb) p section
> $5 = {mr = 0x5555569ad790, address_space = 0x5555562f9b40 
> <address_space_memory>, offset_within_region = 1048576, size = 
> 0x0000000000000000000000007ff00000, 
>   offset_within_address_space = 1048576, readonly = false}
> 
> Here $2-$4 shows that we are translating IOVA 0xffffe000 into
> [0x73900000, 0x73901000) range, while we can see the section is having
> size as 0x7ff00000,

> which is ranged from [0x100000000, 0x17ff00000).
> That's the upper RAM that the guest has, corresponds to what we can
> see in "info mtree -f" (the guest contains 4G RAM, this is upper 2G):
> 
>   0000000100000000-000000017fffffff (prio 0, ram): pc.ram @0000000080000000

Hmm I made a mistake here... It should be the lower 2G mem (not
upper), which ranges from:

  0000000000100000-000000007fffffff (prio 0, ram): pc.ram @0000000000100000

Though the main point still stands below...

> 
> Obvious, we cannot assume we have a linear mapping on this whole
> range. So I don't think we can really use section info (though the
> section can be used to obtain some offset information, or address
> space the range belongs) to build up the IOTLB, but only the plen.
> Then, we need to fix current codes.
> 
> And this is exactly what I meant above at [1].  Hope this clarifies a
> bit on the issue.  Thanks,
> 
> > 
> > > > 
> > > > 
> > > > > We'll all cache miss along the way until we access 0x0. While if we
> > > > > are with page mask, we'll get:
> > > > > 
> > > > > - request 0x1fffff, cache miss, will get iotlb [0x0, 0x200000)
> > > > > - request 0x1ffffe, cache hit
> > > > > - request 0x1ffffd, cache hit
> > > > > - ...
> > > > > 
> > > > > We'll only miss at the first IO.
> > > > 
> > > > I think we should send as much info as we can.
> > > > There should be a way to find full region start and length.
> > > 
> > > Thanks,
> > > 
> > > -- 
> > > Peter Xu
> 
> -- 
> Peter Xu
> 

-- 
Peter Xu

Reply via email to