Quoting Nathan Lynch (2020-08-07 02:05:09) > Hi everyone, > > Michael Ellerman <m...@ellerman.id.au> writes: > > Greg Kurz <gr...@kaod.org> writes: > >> On Tue, 04 Aug 2020 23:35:10 +1000 > >> Michael Ellerman <m...@ellerman.id.au> wrote: > >>> Spinning forever seems like a bad idea, but as has been demonstrated at > >>> least twice now, continuing when we don't know the state of the other > >>> CPU can lead to straight up crashes. > >>> > >>> So I think I'm persuaded that it's preferable to have the kernel stuck > >>> spinning rather than oopsing. > >>> > >> > >> +1 > >> > >>> I'm 50/50 on whether we should have a cond_resched() in the loop. My > >>> first instinct is no, if we're stuck here for 20s a stack trace would be > >>> good. But then we will probably hit that on some big and/or heavily > >>> loaded machine. > >>> > >>> So possibly we should call cond_resched() but have some custom logic in > >>> the loop to print a warning if we are stuck for more than some > >>> sufficiently long amount of time. > >> > >> How long should that be ? > > > > Yeah good question. > > > > I guess step one would be seeing how long it can take on the 384 vcpu > > machine. And we can probably test on some other big machines. > > > > Hopefully Nathan can give us some idea of how long he's seen it take on > > large systems? I know he was concerned about the 20s timeout of the > > softlockup detector. > > Maybe I'm not quite clear what this is referring to, but I don't think > stop-self/query-stopped-state latency increases with processor count, at > least not on PowerVM. And IIRC I was concerned with the earlier patch's > potential for causing the softlockup watchdog to rightly complain by > polling the stopped state without ever scheduling away. > > The fact that smp_query_cpu_stopped() kind of collapses the two distinct > results from the query-cpu-stopped-state RTAS call into one return value > may make it harder than necessary to reason about the questions around > cond_resched() and whether to warn. > > Sorry to pull this stunt but I have had some code sitting in a neglected > branch that I think gets the logic around this right. > > What we should have is a simple C wrapper for the RTAS call that reflects the > architected inputs and outputs: > > ================================================================ > (-- rtas.c --) > > /** > * rtas_query_cpu_stopped_state() - Call RTAS query-cpu-stopped-state. > * @hwcpu: Identifies the processor thread to be queried. > * @status: Pointer to status, valid only on success. > * > * Determine whether the given processor thread is in the stopped > * state. If successful and @status is non-NULL, the thread's status > * is stored to @status. > * > * Return: > * * 0 - Success > * * -1 - Hardware error > * * -2 - Busy, try again later > */ > int rtas_query_cpu_stopped_state(unsigned int hwcpu, unsigned int *status) > { > unsigned int cpu_status; > int token; > int fwrc; > > token = rtas_token("query-cpu-stopped-state"); > > fwrc = rtas_call(token, 1, 2, &cpu_status, hwcpu); > if (fwrc != 0) > goto out; > > if (status != NULL) > *status = cpu_status; > out: > return fwrc; > } > ================================================================ > > > And then a utility function that waits for the remote thread to enter > stopped state, with higher-level logic for rescheduling and warning. The > fact that smp_query_cpu_stopped() currently does not handle a -2/busy > status is a bug, fixed below by using rtas_busy_delay(). Note the > justification for the explicit cond_resched() in the outer loop: > > ================================================================ > (-- rtas.h --) > > /* query-cpu-stopped-state CPU_status */ > #define RTAS_QCSS_STATUS_STOPPED 0 > #define RTAS_QCSS_STATUS_IN_PROGRESS 1 > #define RTAS_QCSS_STATUS_NOT_STOPPED 2 > > (-- pseries/hotplug-cpu.c --) > > /** > * wait_for_cpu_stopped() - Wait for a cpu to enter RTAS stopped state. > */ > static void wait_for_cpu_stopped(unsigned int cpu) > { > unsigned int status; > unsigned int hwcpu; > > hwcpu = get_hard_smp_processor_id(cpu); > > do { > int fwrc; > > /* > * rtas_busy_delay() will yield only if RTAS returns a > * busy status. Since query-cpu-stopped-state can > * yield RTAS_QCSS_STATUS_IN_PROGRESS or > * RTAS_QCSS_STATUS_NOT_STOPPED for an unbounded > * period before the target thread stops, we must take > * care to explicitly reschedule while polling. > */ > cond_resched(); > > do { > fwrc = rtas_query_cpu_stopped_state(hwcpu, &status); > } while (rtas_busy_delay(fwrc)); > > if (fwrc == 0) > continue; > > pr_err_ratelimited("query-cpu-stopped-state for " > "thread 0x%x returned %d\n", > hwcpu, fwrc); > goto out; > > } while (status == RTAS_QCSS_STATUS_NOT_STOPPED || > status == RTAS_QCSS_STATUS_IN_PROGRESS); > > if (status != RTAS_QCSS_STATUS_STOPPED) { > pr_err_ratelimited("query-cpu-stopped-state yielded unknown " > "status %d for thread 0x%x\n", > status, hwcpu); > } > out: > return; > } > > [...] > > static void pseries_cpu_die(unsigned int cpu) > { > wait_for_cpu_stopped(cpu); > paca_ptrs[cpu]->cpu_start = 0; > } > ================================================================ > > wait_for_cpu_stopped() should be able to accommodate a time-based > warning if necessary, but speaking as a likely recipient of any bug > reports that would arise here, I'm not convinced of the need and I > don't know what a good value would be. It's relatively easy to sample > the stack of a task that's apparently failing to make progress, plus I > probably would use 'perf probe' or similar to report the inputs and > outputs for the RTAS call.
I think if we make the timeout sufficiently high like 2 minutes or so it wouldn't hurt and if we did seem them it would probably point to an actual bug. But I don't have a strong feeling either way. > > I'm happy to make this a proper submission after I can clean it up and > retest it, or Michael R. is welcome to appropriate it, assuming it's > acceptable. > I've given it a shot with this patch and it seems to be holding up in testing. If we don't think the ~2 minutes warning message is needed I can clean it up to post: https://github.com/mdroth/linux/commit/354b8c97bf0dc1146e36aa72273f5b33fe90d09e I'd likely break the refactoring patches out to a separate patch under Nathan's name since it fixes a separate bug potentially.