>
> What accounting in particular is upset? Is it things like
> select_idle_sibling() that thinks the thread is idle and tries to place
> tasks there?
>
The major issue that we saw was that certain workloads cause the idle
CPU to never wake up and schedule again, even when there are runnable
threads on it. If I remember correctly, this happened when the sibling
had only one CPU-intensive task and did not enter pick_next_task() for
a long time. There were other situations as well that caused this
prolonged idle state on the CPU. One was when pick_next_task() was
called on the sibling but it always won there because vruntime was not
progressing on the idle CPU.

Having a coresched idle makes sure that the idle thread is not
overloaded. Also, vruntime moves forward, and task vruntime comparison
across CPUs works when we normalize.

> It should be possible to change idle_cpu() to not report a forced-idle
> CPU as idle.

I agree. If we can identify all the places where the idle thread is
considered special, and also account for vruntime progress during
forced idle, this should be a better approach than a coresched idle
thread per CPU.

>
> (also; it should be possible to optimize select_idle_sibling() for the
> core-sched case specifically)
>
We haven't seen this because most of our micro test cases did not have
more threads than CPUs. Thanks for pointing this out; we shall cook up
some tests to observe this behavior.

Thanks,
Vineeth