On Sun, Sep 6, 2015 at 4:26 PM, Richard Purdie <richard.pur...@linuxfoundation.org> wrote: > On Sun, 2015-09-06 at 11:37 -0700, Peter Crosthwaite wrote: >> On Sun, Sep 6, 2015 at 7:21 AM, Richard Purdie >> <richard.pur...@linuxfoundation.org> wrote: >> > On Sat, 2015-09-05 at 13:30 -0700, Peter Crosthwaite wrote: >> >> On Fri, Sep 4, 2015 at 10:30 AM, Peter Maydell <peter.mayd...@linaro.org> >> >> wrote: >> >> > On 4 September 2015 at 18:20, Richard Purdie >> >> > <richard.pur...@linuxfoundation.org> wrote: >> >> >> On Fri, 2015-09-04 at 13:43 +0100, Richard Purdie wrote: >> >> >>> On Fri, 2015-09-04 at 12:31 +0100, Peter Maydell wrote: >> >> >>> > On 4 September 2015 at 12:24, Richard Purdie >> >> >>> > <richard.pur...@linuxfoundation.org> wrote: >> >> >>> > > So just based on that, yes, seems that the rx_fifo looks to be >> >> >>> > > overrunning. I can add the asserts but I think it would just >> >> >>> > > confirm >> >> >>> > > this. >> >> >>> > >> >> >>> > Yes, the point of adding assertions is to confirm a hypothesis. >> >> >>> >> >> >>> I've now confirmed that it does indeed trigger the assert in >> >> >>> smc91c111_receive(). >> >> >> >> >> >> I just tried an experiment where I put: >> >> >> >> >> >> if (s->rx_fifo_len >= NUM_PACKETS) >> >> >> return -1; >> >> >> >> >> >> into smc91c111_receive() and my reproducer stops reproducing the >> >> >> problem. >> >> >> >> Does it just stop the crash or does it eliminate the problem >> >> completely with a fully now-working network? >> > >> > It stops the crash, the network works great. >> > >> >> >> I also noticed can_receive() could also have a check on buffer >> >> >> availability. Would one of these changes be the correct fix here? >> >> > >> >> > The interesting question is why smc91c111_allocate_packet() doesn't >> >> > fail in this situation. We only have NUM_PACKETS worth of storage, >> >> > shared between the tx and rx buffers, so how could we both have >> >> > already filled the rx_fifo and have a spare packet for the allocate >> >> > function to return? >> >> >> >> Maybe this: >> >> >> >> case 5: /* Release. */ >> >> smc91c111_release_packet(s, s->packet_num); >> >> break; >> >> >> >> The guest is able to free an allocated packet without the accompanying >> >> pop of tx/rx fifo. This may suggest some sort of guest error? >> >> >> >> The fix depends on the behaviour of the real hardware. If that MMIO op >> >> is supposed to dequeue the corresponding queue entry then we may need >> >> to patch that logic to do search the queues and dequeue it. Otherwise >> >> we need to find out the genuine length of the rx queue, and clamp it >> >> without something like Richards patch. There are a few other bits and >> >> pieces that suggest the guest can have independent control of the >> >> queues and allocated buffers but i'm confused as to how the rx fifo >> >> length can get up to 10 in any case. >> > >> > I think I have a handle on what is going on. smc91c111_release_packet() >> > changes s->allocated() but not rx_fifo. can_receive() only looks at >> > s->allocated. We can trigger new network packets to arrive from >> > smc91c111_release_packet() which calls qemu_flush_queued_packets() >> > *before* we change rx_fifo and this can loop. >> > >> > The patch below which explicitly orders the qemu_flush_queued_packets() >> > call resolved the test case I was able to reproduce this problem in. >> > >> > So there are three ways to fix this, either can_receive() needs to check >> > both s->allocated() and rx_fifo, >> >> This is probably the winner for me. >> >> > or the code is more explicit about when >> > qemu_flush_queued_packets() is called (as per my patch below), or the >> > case 4 where smc91c111_release_packet() and then >> > smc91c111_pop_rx_fifo(s) is called is reversed. I also tested the latter >> > which also works, albeit with more ugly code. > > It seems can_receive isn't enough, we'd need to put some checks into > receive itself since once can_receive says "yes", multiple packets can > arrive to _receive without further checks of can_receive.
This doesn't sound right. There are other network controllers that rely of can_receive catching all cases properly. Is this a regression? Looking at logs, I see some refactoring of QEMU net framework around June timeframe, if you rewind to QEMU 2.3 (or earlier) does the bug go away? > I've either > messed up my previous test or been lucky. > > I tested an assert in _recieve() which confirms it can be called when > can_receive() says it isn't ready. > A backtrace of this would be handy. What is your replicator? I have core-image-minimal handy but it has no scp or sshd so all I can think of to stress network is wget, but that doesn't replicate. Regards, Peter > If we return -1 in _receive, the code will stop sending packets and all > works as it should, it recovers just fine. So I think that is looking > like the correct fix. I'd note that it already effectively has half this > check in the allocate_packet call, its just missing the rx_fifo_len one. > > Cheers, > > Richard >