Re: [Qemu-devel] Segfault using qemu-system-arm in smc91c111

Peter Crosthwaite Sun, 06 Sep 2015 17:49:37 -0700

On Sun, Sep 6, 2015 at 4:26 PM, Richard Purdie
<[email protected]> wrote:
> On Sun, 2015-09-06 at 11:37 -0700, Peter Crosthwaite wrote:
>> On Sun, Sep 6, 2015 at 7:21 AM, Richard Purdie
>> <[email protected]> wrote:
>> > On Sat, 2015-09-05 at 13:30 -0700, Peter Crosthwaite wrote:
>> >> On Fri, Sep 4, 2015 at 10:30 AM, Peter Maydell <[email protected]> 
>> >> wrote:
>> >> > On 4 September 2015 at 18:20, Richard Purdie
>> >> > <[email protected]> wrote:
>> >> >> On Fri, 2015-09-04 at 13:43 +0100, Richard Purdie wrote:
>> >> >>> On Fri, 2015-09-04 at 12:31 +0100, Peter Maydell wrote:
>> >> >>> > On 4 September 2015 at 12:24, Richard Purdie
>> >> >>> > <[email protected]> wrote:
>> >> >>> > > So just based on that, yes, seems that the rx_fifo looks to be
>> >> >>> > > overrunning. I can add the asserts but I think it would just 
>> >> >>> > > confirm
>> >> >>> > > this.
>> >> >>> >
>> >> >>> > Yes, the point of adding assertions is to confirm a hypothesis.
>> >> >>>
>> >> >>> I've now confirmed that it does indeed trigger the assert in
>> >> >>> smc91c111_receive().
>> >> >>
>> >> >> I just tried an experiment where I put:
>> >> >>
>> >> >>     if (s->rx_fifo_len >= NUM_PACKETS)
>> >> >>         return -1;
>> >> >>
>> >> >> into smc91c111_receive() and my reproducer stops reproducing the
>> >> >> problem.
>> >>
>> >> Does it just stop the crash or does it eliminate the problem
>> >> completely with a fully now-working network?
>> >
>> > It stops the crash, the network works great.
>> >
>> >> >> I also noticed can_receive() could also have a check on buffer
>> >> >> availability. Would one of these changes be the correct fix here?
>> >> >
>> >> > The interesting question is why smc91c111_allocate_packet() doesn't
>> >> > fail in this situation. We only have NUM_PACKETS worth of storage,
>> >> > shared between the tx and rx buffers, so how could we both have
>> >> > already filled the rx_fifo and have a spare packet for the allocate
>> >> > function to return?
>> >>
>> >> Maybe this:
>> >>
>> >>             case 5: /* Release.  */
>> >>                 smc91c111_release_packet(s, s->packet_num);
>> >>                 break;
>> >>
>> >> The guest is able to free an allocated packet without the accompanying
>> >> pop of tx/rx fifo. This may suggest some sort of guest error?
>> >>
>> >> The fix depends on the behaviour of the real hardware. If that MMIO op
>> >> is supposed to dequeue the corresponding queue entry then we may need
>> >> to patch that logic to do search the queues and dequeue it. Otherwise
>> >> we need to find out the genuine length of the rx queue, and clamp it
>> >> without something like Richards patch. There are a few other bits and
>> >> pieces that suggest the guest can have independent control of the
>> >> queues and allocated buffers but i'm confused as to how the rx fifo
>> >> length can get up to 10 in any case.
>> >
>> > I think I have a handle on what is going on. smc91c111_release_packet()
>> > changes s->allocated() but not rx_fifo. can_receive() only looks at
>> > s->allocated. We can trigger new network packets to arrive from
>> > smc91c111_release_packet() which calls qemu_flush_queued_packets()
>> > *before* we change rx_fifo and this can loop.
>> >
>> > The patch below which explicitly orders the qemu_flush_queued_packets()
>> > call resolved the test case I was able to reproduce this problem in.
>> >
>> > So there are three ways to fix this, either can_receive() needs to check
>> > both s->allocated() and rx_fifo,
>>
>> This is probably the winner for me.
>>
>> > or the code is more explicit about when
>> > qemu_flush_queued_packets() is called (as per my patch below), or the
>> > case 4 where smc91c111_release_packet() and then
>> > smc91c111_pop_rx_fifo(s) is called is reversed. I also tested the latter
>> > which also works, albeit with more ugly code.
>
> It seems can_receive isn't enough, we'd need to put some checks into
> receive itself since once can_receive says "yes", multiple packets can
> arrive to _receive without further checks of can_receive.


This doesn't sound right. There are other network controllers that
rely of can_receive catching all cases properly. Is this a regression?
Looking at logs, I see some refactoring of QEMU net framework around
June timeframe, if you rewind to QEMU 2.3 (or earlier) does the bug go
away?

> I've either
> messed up my previous test or been lucky.
>
> I tested an assert in _recieve() which confirms it can be called when
> can_receive() says it isn't ready.
>

A backtrace of this would be handy.

What is your replicator? I have core-image-minimal handy but it has no
scp or sshd so all I can think of to stress network is wget, but that
doesn't replicate.

Regards,
Peter

> If we return -1 in _receive, the code will stop sending packets and all
> works as it should, it recovers just fine. So I think that is looking
> like the correct fix. I'd note that it already effectively has half this
> check in the allocate_packet call, its just missing the rx_fifo_len one.
>
> Cheers,
>
> Richard
>

Re: [Qemu-devel] Segfault using qemu-system-arm in smc91c111

Reply via email to