On Thu, Jul 23, 2020 at 5:47 PM Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> On Thu, Jul 23, 2020 at 5:07 PM Hugh Dickins <hu...@google.com> wrote:
> >
> > I say that for full disclosure, so you don't wrack your brains
> > too much, when it may still turn out to be a screwup on my part.
>
> Sounds unlikely.
>
> If that patch applied even reasonably closely, I don't see how you'd
> see a list corruption that wasn't due to the patch.
>
> You'd have had to use the wrong spinlock by mistake due to munging it,
> or something crazy like that.
>
> The main list-handling change is
>
>  (a) open-coding of that finish_wait()
>
>  (b) slightly different heuristics for removal in the wakeup function
>
> where (a) was because my original version of finishing the wait needed
> to do that return code checking.
>
> So a normal "finish_wait()" just does
>
>         list_del_init(&wait->entry);
>
> where-as my open-coded one replaced that with
>
>         if (!list_empty(&wait->entry)) {
>                 list_del(&wait->entry);
>                 ret = -EINTR;
>         }
>
> and apart from that "set return to -EINTR because nobody woke us up",
> it also uses just a regular "list_del()" rather than a
> "list_del_init()". That causes the next/prev field to be poisoned
> rather than re-initialized. But that second change was because the
> list entry is on the stack, and we're not touching it any more and are
> about to return, so I removed the "init" part.
>
> Anyway, (a) really looks harmless. Unless the poisoning now triggers
> some racy debug test that had been hidden by the "init". Hmm.
>
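
For reference, roughly what the two list helpers do, paraphrased from
<linux/list.h> rather than quoted exactly from the source:

        static inline void list_del(struct list_head *entry)
        {
                /* unlink, then poison the stale pointers */
                __list_del_entry(entry);
                entry->next = LIST_POISON1;
                entry->prev = LIST_POISON2;
        }

        static inline void list_del_init(struct list_head *entry)
        {
                /* unlink, then point the entry back at itself */
                __list_del_entry(entry);
                INIT_LIST_HEAD(entry);
        }

so with plain list_del() any later use of the stale on-stack entry hits
the poison values (and the list debugging checks), instead of quietly
seeing an empty, self-pointing entry.
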
> In contrast, (b) means that the likely access patterns of irqs
> removing the wait entry from the list might be very different from
> before. The old "autoremove" semantics would only remove the entry
> from the list when try_to_wake_up() actually woke things up. Now, a
> successful bit state test _always_ removes it, which was kind of the point.
> But it might cause very different list handling patterns.
>
> All the actual list handling looks "obviously safe" because it's
> protected by the spinlock, though...
>
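
For comparison, the old "autoremove" behaviour referred to above is
autoremove_wake_function() in kernel/sched/wait.c, which (roughly, from
memory rather than an exact quote) does:

        int autoremove_wake_function(struct wait_queue_entry *wq_entry,
                                     unsigned mode, int sync, void *key)
        {
                int ret = default_wake_function(wq_entry, mode, sync, key);

                /* only dequeue the waiter if the wakeup actually succeeded */
                if (ret)
                        list_del_init(&wq_entry->entry);

                return ret;
        }

i.e. the entry only leaves the list when try_to_wake_up() reports
success, whereas the new wake function removes it whenever the bit state
looks right, so the list gets trimmed more eagerly.
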
> If you do get oopses with the new patch too, try to send me a copy,
> and maybe I'll stare at exactly where it happens and at the register contents and
> go "aah".

This new version is doing much better: many hours to go, but all
machines have got beyond the danger point where yesterday's version
was crashing - phew!

Hugh
