On Wed, Aug 14, 2019 at 5:09 AM Eelco Chaudron <echau...@redhat.com> wrote:
>
>
>
> On 8 Aug 2019, at 17:38, Ilya Maximets wrote:
>
> <SNIP>
>
> >>>> I see a rather high number of afxdp_cq_skip, which should to my
> >>>> knowledge never happen?
> >>>
> >>> I tried to investigate this previously, but didn't find anything
> >>> suspicious.
> >>> So, to my knowledge, this should never happen either.
> >>> However, I only looked at the code without actually running it,
> >>> because I had no HW available for testing.
> >>>
> >>> While investigating and stress-testing virtual ports I found a few
> >>> issues with missing locking inside the kernel, so I don't trust the
> >>> kernel part of the XDP implementation. I suspect that there are
> >>> some other bugs in kernel/libbpf that can only be reproduced in
> >>> driver mode.
> >>>
> >>> This never happens for virtual ports in SKB mode, so I never saw
> >>> this coverage counter being non-zero.
> >>
> >> Did some quick debugging, as something else has come up that needs my
> >> attention :)
> >>
> >> But once I’m in a faulty state and send a single packet, causing
> >> afxdp_complete_tx() to be called, it tells me 2048 descriptors are
> >> ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess that
> >> there might be some ring management bug. Maybe consumer and producer
> >> are equal, meaning 0 buffers, but it returns max? I did not look at
> >> the kernel code, so this is just a wild guess :)
> >>
> >> (gdb) p tx_done
> >> $3 = 2048
> >>
> >> (gdb) p umem->cq
> >> $4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask =
> >> 2047, size = 2048, producer = 0x7f08486b8000, consumer =
> >> 0x7f08486b8040, ring = 0x7f08486b8080}
> >
> > Thanks for debugging!
> >
> > xsk_ring_cons__peek() just returns the difference between cached_prod
> > and cached_cons, but these values are too far apart:
> >
> > 3830466864 - 3578066899 = 252399965
> >
> > Since this value is greater than the requested number, it returns
> > the requested number (2048).
> >
> > So, the ring is broken. At least its 'cached' part is broken. It
> > would be good to look at the *consumer and *producer values to verify
> > the state of the actual ring.
> >
>
> I’ll try to find some more time next week to debug further.
>
> William, I noticed your email in xdp-newbies where you mention this
> problem of getting the wrong pointers. Did you ever follow up, or do
> any further troubleshooting on the above?

Yes, I posted here
https://www.spinics.net/lists/xdp-newbies/msg00956.html
"Question/Bug about AF_XDP idx_cq from xsk_ring_cons__peek?"

At that time I was thinking about reproducing the problem using the
xdpsock sample code from the kernel. But it turned out that my
reproduction code was not correct, so I was not able to show the case
we hit here in OVS.

Then I moved more of the OVS logic into xdpsock, but the problem did
not show up. As a result, I worked around it in OVS by marking addr
with UINT64_MAX and checking for "*addr == UINT64_MAX".
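
A minimal sketch of what such a workaround might look like, assuming
the idea is to overwrite each completion-queue address with a sentinel
once it has been handled, and to skip any entry that still carries the
sentinel. The function name sketch_complete_tx and its parameters are
hypothetical and are not taken from the OVS sources; the "skipped"
count is only presumed to correspond to what afxdp_cq_skip counts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: walk 'n' completion addresses.
 * Entries equal to UINT64_MAX are treated as already handled and are
 * skipped; every other entry is counted as recycled and then marked
 * with UINT64_MAX so that a second visit would skip it.
 * Returns the number of skipped entries. */
static size_t
sketch_complete_tx(uint64_t *addrs, size_t n, size_t *recycled)
{
    size_t skipped = 0;

    *recycled = 0;
    for (size_t i = 0; i < n; i++) {
        if (addrs[i] == UINT64_MAX) {
            /* Stale or duplicate entry: do not recycle it again. */
            skipped++;
            continue;
        }
        /* Real code would return the buffer to its pool here. */
        (*recycled)++;
        addrs[i] = UINT64_MAX;
    }
    return skipped;
}
```

With this shape, a ring that wrongly reports old entries as ready
results in skips rather than in the same buffer being recycled twice.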

I will debug again this week once I get my testbed back.
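
For reference while debugging, the arithmetic Ilya describes above
(cached_prod minus cached_cons, capped at the requested count) can be
sketched as below. This is a simplified stand-in, not the libbpf code:
struct ring_sketch and the sketch_* functions are hypothetical names,
and the refresh of cached_prod from *producer is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced view of a consumer ring: only the two cached
 * indices shown in the gdb output are modelled. */
struct ring_sketch {
    uint32_t cached_prod;
    uint32_t cached_cons;
};

/* Entries that appear available. The subtraction is done in unsigned
 * 32-bit arithmetic, so index wraparound is handled naturally. */
static uint32_t
sketch_available(const struct ring_sketch *r)
{
    return r->cached_prod - r->cached_cons;
}

/* Peek: report at most 'requested' entries. */
static uint32_t
sketch_peek(const struct ring_sketch *r, uint32_t requested)
{
    uint32_t entries = sketch_available(r);

    return entries > requested ? requested : entries;
}

/* Sanity check: a healthy ring cannot hold more than 'size' entries. */
static int
sketch_ring_is_sane(const struct ring_sketch *r, uint32_t size)
{
    return sketch_available(r) <= size;
}
```

Feeding in the values from the gdb output (cached_prod = 3830466864,
cached_cons = 3578066899, size = 2048) gives a difference of 252399965,
a peek result of 2048, and a failed sanity check, which matches the
behaviour reported in this thread.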

William
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
