On Tue, Feb 19, 2019 at 02:33:04PM +0800, Jason Wang wrote:
> 
> On 2019/2/14 下午12:04, Michael S. Tsirkin wrote:
> > On Thu, Feb 14, 2019 at 11:34:22AM +0800, Jason Wang wrote:
> > > On 2019/2/14 上午1:30, Michael S. Tsirkin wrote:
> > > > On Wed, Feb 13, 2019 at 06:33:50PM +0800, Jason Wang wrote:
> > > > > On 2019/2/13 上午1:35, Michael S. Tsirkin wrote:
> > > > > > On Tue, Feb 12, 2019 at 04:47:08PM +0000, David Riddoch wrote:
> > > > > > > > > > > I would prefer to have the write barrier before writing 
> > > > > > > > > > > the idx.
> > > > > > > > > > Well that's driver overhead for something the device might
> > > > > > > > > > never utilise in a given workload. If we are optimizing,
> > > > > > > > > > let's optimize for speed.
> > > > > > > > > I think doing the barrier before writing idx is best for 
> > > > > > > > > speed (see below).
> > > > > > > > I don't see it below:(
> > > > > > > Sorry, I'm not being clear.  If the write barrier is before the 
> > > > > > > idx, then a
> > > > > > > PV device can read the idx, do a single rmb and read a whole 
> > > > > > > bunch of
> > > > > > > descriptors.  As things stand today a PV device has to do an rmb 
> > > > > > > for each
> > > > > > > descriptor that it reads.
> > > > > > > 
> > > > > > > I'm not sure, but this may be what Jason meant when he said 
> > > > > > > "prefetch".
> > > > > Yes.
> > > > > 
> > > > > 
> > > > > > Oh. Right. So for PV it's easy to convert an rmb to a data 
> > > > > > dependency
> > > > > > instead.  It remains to be seen whether an extra cache miss per
> > > > > > batch is cheaper or more expensive than such a dependency
> > > > > > per descriptor.
> > > > > > 
> > > > > > I hope Jason can experiment and let us know.
> > > > > > 
> > > > > > 
> > > > > To clarify, I actually did try something like:
> > > > > 
> > > > > 
> > > > > if (desc_is_avail((last_avail + 32) % size))
> > > > >       [last_avail, last_avail + 32] is avail [1]
> > > > > else if (desc_is_avail((last_avail + 1) % size))
> > > > >       last_avail is avail
> > > > > else
> > > > >       no new
> > > > > 
> > > > > 
> > > > > And it looks like [1] depends on the driver to do:
> > > > > 
> > > > > for all descriptors
> > > > >       update desc (but do not make it available yet via the flag)
> > > > > wmb()
> > > > > for all descriptors
> > > > >       update desc.flag
> > > > > 
> > > > > 
> > > > > This means that if last_avail + 32 is avail, there is no need to check the
> > > > > flags of [last_avail, last_avail + 31], since the descriptor contents
> > > > > (except for the flags) are guaranteed to be correct.
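
To put that ordering in concrete terms, here is a rough, compilable sketch. The
descriptor layout, the helper names, the barrier macros and the batch size are
illustrative assumptions, not the actual virtio/vhost code, and a real
implementation would also need READ_ONCE/WRITE_ONCE-style accesses:

#include <stdbool.h>
#include <stdint.h>

#define wmb() __atomic_thread_fence(__ATOMIC_RELEASE)
#define rmb() __atomic_thread_fence(__ATOMIC_ACQUIRE)

#define BATCH 32

struct pdesc {
        uint64_t addr;
        uint32_t len;
        uint16_t id;
        uint16_t flags;         /* availability bits, written last */
};

/* Driver side: write the contents of a whole batch, one wmb(), then the flags. */
static void driver_publish_batch(struct pdesc *ring, unsigned int size,
                                 unsigned int head, uint16_t avail_flags,
                                 const struct pdesc *bufs)
{
        unsigned int i;

        for (i = 0; i < BATCH; i++) {
                struct pdesc *d = &ring[(head + i) % size];

                d->addr = bufs[i].addr;
                d->len  = bufs[i].len;
                d->id   = bufs[i].id;
        }
        wmb();                  /* contents before flags */
        for (i = 0; i < BATCH; i++)
                ring[(head + i) % size].flags = avail_flags;
}

/* Device side: if the descriptor BATCH ahead already shows the avail flags,
 * the contents (though not the flags) of [last_avail, last_avail + BATCH - 1]
 * are guaranteed to be written, so they can be read after a single rmb() --
 * assuming the driver publishes whole batches as above. */
static bool device_batch_ready(const struct pdesc *ring, unsigned int size,
                               unsigned int last_avail, uint16_t avail_flags)
{
        if (ring[(last_avail + BATCH) % size].flags != avail_flags)
                return false;
        rmb();                  /* flags before contents */
        return true;
}
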
> > > > We can try looking into that, sure. It's not just wmb though which is
> > > > anyway free on x86. There are cases of e.g. chained descriptors
> > > > to consider where we do not want device to see a partial chain.
> > > 
> > > If we see that the last descriptor is in a chain, it still does no harm, we
> > > just read more?
> > So it's possible to add a variant that sticks a wmb after each flag
> > update for sure. However if you are writing out batches that's an
> > upfront cost and what you gain is prefetch on the other side which is
> > speculative.
> 
> 
> Yes.
> 
> 
> > 
> > 
> > > > 
> > > > > I get ~10% PPS improvement.
> > > > > 
> > > > > Thanks
> > > > I suspect most of the gain isn't in the check at all though.
> > > 
> > > The gain was:
> > > 
> > > 1) batch reading descriptors, saving the number of userspace accesses
> > Yes, that needs to be fixed. The way the Linux vhost driver does userspace
> > accesses right now, with stac/clac and memory barriers within a loop, is
> > plain silly.  Designing a host/guest interface to work around that kind
> > of inefficiency makes no sense; it's purely an implementation issue.
> 
> 
> Actually, it has other advantages: consider the case where the driver is
> faster, batch reading may reduce cache contention on the descriptor ring
> (when the ring is almost full).

A shared index implies always taking a cache miss though.
By comparison, speculatively reading a cache line ahead
will sometimes bring you a cache hit.

> 
> > 
> > > 2) reducing the rmb()s; otherwise you need an rmb per descriptor
> > Well not really, you can
> > - validate a batch then do a single rmb
> > - replace rmb with a dependency
> > 
> > > > It's probably reading descriptors and maybe loop unrolling.
> > > 
> > > I don't quite understand why unrolling helps in this case.
> > The amount of work in this loop is tiny, but not tiny
> > enough for gcc to decide unrolling is free. Take a look:
> > 
> > 
> > #include <stdio.h>
> > 
> > struct desc {
> >          unsigned long long a;
> >          unsigned len;
> >          unsigned short type;
> >          unsigned short flags;
> > };
> > #ifdef UNROLL
> > __attribute__((optimize("unroll-loops")))
> > #endif
> > int foo(struct desc *desc)
> > {
> >       unsigned short aflags = 0xffff;
> >       unsigned short oflags = 0x0;
> > 
> >       int i;
> > 
> >          for (i=0; i < 4; ++i) {
> >               aflags &= desc[i].flags;
> >               oflags |= desc[i].flags;
> >          }
> > 
> >          return aflags == oflags;
> > }
> > 
> > $ gcc -Wall -DUNROLL -O2 -c l.c
> 
> 
> I think you mean without -DUNROLL.


Since there's a jump I think you are right.

> 
> > 
> > 0000000000000000 <foo>:
> > foo():
> >     0:   48 8d 47 0e             lea    0xe(%rdi),%rax
> >     4:   31 c9                   xor    %ecx,%ecx
> >     6:   48 83 c7 4e             add    $0x4e,%rdi
> >     a:   be ff ff ff ff          mov    $0xffffffff,%esi
> >     f:   0f b7 10                movzwl (%rax),%edx
> >    12:   48 83 c0 10             add    $0x10,%rax
> >    16:   21 d6                   and    %edx,%esi
> >    18:   09 d1                   or     %edx,%ecx
> >    1a:   48 39 f8                cmp    %rdi,%rax
> >    1d:   75 f0                   jne    f <foo+0xf>
> >    1f:   31 c0                   xor    %eax,%eax
> >    21:   66 39 ce                cmp    %cx,%si
> >    24:   0f 94 c0                sete   %al
> >    27:   c3                      retq
> > 
> > 
> > 
> > $ gcc -Wall -DUNROLL -O2 -c l.c
> > 
> > Disassembly of section .text:
> > 
> > 0000000000000000 <foo>:
> > foo():
> >     0:   0f b7 47 1e             movzwl 0x1e(%rdi),%eax
> >     4:   0f b7 77 2e             movzwl 0x2e(%rdi),%esi
> >     8:   0f b7 4f 0e             movzwl 0xe(%rdi),%ecx
> >     c:   89 c2                   mov    %eax,%edx
> >     e:   09 f0                   or     %esi,%eax
> >    10:   21 f2                   and    %esi,%edx
> >    12:   09 c8                   or     %ecx,%eax
> >    14:   21 ca                   and    %ecx,%edx
> >    16:   0f b7 4f 3e             movzwl 0x3e(%rdi),%ecx
> >    1a:   09 c8                   or     %ecx,%eax
> >    1c:   21 ca                   and    %ecx,%edx
> >    1e:   66 39 c2                cmp    %ax,%dx
> >    21:   0f 94 c0                sete   %al
> >    24:   0f b6 c0                movzbl %al,%eax
> >    27:   c3                      retq
> > 
> > 
> > See? gcc just does not unroll aggressively enough.
> 
> 
> Ok, so this is just for avoiding branches.

Which is important for a variety of reasons, e.g. Linux adds tricks
to block speculation, so branches might not be predicted even on Intel.
A pipeline stall on each read isn't good for speed.

> 
> > 
> > > > Something like:
> > > > 
> > > > __attribute__((optimize("unroll-loops")))
> > > > bool fast_get_descriptors(vq, desc)
> > > > {
> > > > 
> > > >         copy X descriptors
> > > >         u16 aflags = 0xffff;
> > > >         u16 oflags = 0x0;
> > > >         for (i = 0; i < X; ++i) {
> > > >                 aflags &= desc[i].flags;
> > > >                 oflags |= desc[i].flags;
> > > >         }
> > > >         /* all flags must be same */
> > > >         if (aflags != oflags)
> > > >                 return false;
> > > > 
> > > >         return is_avail(vq->wrapcount, aflags);
> > > > }
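
A compilable version of that sketch could look like the following. The
descriptor layout, the flag macros and the batch size X are assumptions for
illustration; the point is only the flag-accumulation check:

#include <stdbool.h>
#include <stdint.h>

/* Flag bits of a packed descriptor (values as in the packed ring proposal). */
#define DESC_F_AVAIL (1 << 7)
#define DESC_F_USED  (1 << 15)

struct desc {
        uint64_t addr;
        uint32_t len;
        uint16_t id;
        uint16_t flags;
};

/* A descriptor is available when AVAIL != USED and AVAIL matches the wrap counter. */
static bool is_avail(bool wrap_counter, uint16_t flags)
{
        bool avail = flags & DESC_F_AVAIL;
        bool used = flags & DESC_F_USED;

        return avail != used && avail == wrap_counter;
}

#define X 8     /* batch size: two cache lines of 16-byte descriptors */

__attribute__((optimize("unroll-loops")))
static bool fast_get_descriptors(const struct desc *ring, unsigned int head,
                                 unsigned int size, bool wrap_counter,
                                 struct desc *out)
{
        uint16_t aflags = 0xffff, oflags = 0;
        int i;

        /* Copy X descriptors in one go ... */
        for (i = 0; i < X; i++)
                out[i] = ring[(head + i) % size];

        /* ... then check that all X flag words are identical. */
        for (i = 0; i < X; i++) {
                aflags &= out[i].flags;
                oflags |= out[i].flags;
        }
        if (aflags != oflags)
                return false;

        return is_avail(wrap_counter, aflags);
}
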
> > > 
> > > This may work. It should be a little bit less efficient, but without any
> > > assumption about the driver, if I understand correctly.
> > > 
> > > 
> > > > 
> > > > I also think 32 is very aggressive, I would start with 8.
> > > 
> > > Maybe, or add some heuristic to predict it.
> > > 
> > > Thanks
> > The reason I say 8 is that it's two cache lines, which
> > is typically what the CPU pulls in anyway (1 + linear prefetch).
> > Anything more might cost more, as it might
> > put more pressure on the cache. Further,
> > static-sized short loops can be unrolled by the compiler if
> > you ask it nicely, which saves a bunch of branching,
> > speculation etc. Dynamic-sized ones can't;
> > you end up with extra math and branches in the loop.
> 
> 
> You're probably right here, but it needs some benchmarking to check.
> 
> Thanks
> 
> 
> > 
> > 
> > > > 
> > > > With above, you could try:
> > > > 
> > > >         if (fast_get_descriptors(vq, desc))
> > > >              [last_avail, last_avail + 32] is avail [1]
> > > >         else if (desc_is_avail((last_avail + 1) % size))
> > > >              last_avail is avail
> > > >         else
> > > >              no new
> > > > 
> > > > 
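
Plugged into that three-way check, the device-side path might then look roughly
like this, still only a sketch reusing the hypothetical fast_get_descriptors()
and is_avail() above; process_descs() is a made-up stand-in and wrap-counter
handling at the ring boundary is omitted:

        /* Try to take a whole batch first, then fall back to a single descriptor. */
        struct desc batch[X];

        if (fast_get_descriptors(ring, last_avail, size, wrap_counter, batch)) {
                /* [last_avail, last_avail + X - 1] are available: take the whole batch. */
                process_descs(batch, X);
                last_avail = (last_avail + X) % size;
        } else if (is_avail(wrap_counter, ring[last_avail].flags)) {
                /* Only the head descriptor is available: process one at a time. */
                process_descs(&ring[last_avail], 1);
                last_avail = (last_avail + 1) % size;
        } else {
                /* No new descriptors. */
        }
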
