On Wed, Aug 12, 2015 at 03:33:23PM +0200, Jiri Olsa wrote:
> hi,
> we see crashes on older kernel (2.6.32 based) in pick_next_task_fair:

This is a RHEL (6?) kernel, right? So not really anything like an actual
2.6.32. Lemme use a random RHEL6 kernel I found on github instead of
staring at a local v2.6.32...

  https://github.com/dduval/kernel-rhel6

>
> ...
> #8 [ffff8819335efa20] page_fault at ffffffff8152d375
>     [exception RIP: rb_next+1]
>     RIP: ffffffff81292ae1  RSP: ffff8819335efad8  RFLAGS: 00010046
>     RAX: 0000000000000000  RBX: ffff8810b8a96928  RCX: 0000000000000000
>     RDX: 0000000000000000  RSI: ffff880ff7602ae0  RDI: 0000000000000010
>     RBP: ffff8819335efb28   R8: 0000000000000000   R9: 0000000000000000
>     R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
>     R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> #9 [ffff8819335efae0] pick_next_task_fair at ffffffff8106c511
> #10 [ffff8819335efb30] schedule at ffffffff81529746
> #11 [ffff8819335efc00] futex_wait_queue_me at ffffffff810b226a
> #12 [ffff8819335efc40] futex_wait at ffffffff810b33a0
> #13 [ffff8819335efdb0] do_futex at ffffffff810b4c91
> #14 [ffff8819335efef0] sys_futex at ffffffff810b56cb
> #15 [ffff8819335eff80] system_call_fastpath at ffffffff8100b072
> ...
>
> - pick_next_task_fair calls pick_next_entity
> - pick_next_entity calls __pick_first_entity and gets NULL from
>   cfs_rq->rb_leftmost
> - cfs_rq->skip is NULL so it gets through (cfs_rq->skip == se) condition
>   and calls __pick_next_entity(se) which fails on rb_next(&se->run_node)

So that code is a whole lot simpler than what we have upstream, but
indeed comparable to the 'simple' branch.

> it seems that upstream could fail also via:
> - pick_next_task_fair calls pick_next_entity with curr == NULL (simple case)
> - pick_next_entity calls __pick_first_entity and gets NULL from
>   cfs_rq->rb_leftmost

This _should_ be impossible:

 - if !cfs_rq->nr_running, we'll never get here;

 - if cfs_rq->nr_running == 1 && prev->sched_class == &fair_sched_class,
   put_prev_entity() will have put prev back in the rb-tree and we have
   a leftmost;

 - if cfs_rq->nr_running > 1, there are entries in the tree (only curr is
   kept out of it), therefore leftmost must not be NULL.

> would the attached patch make sense, or am I missing some rb_leftmost
> rules/behaviour?

If there is _anything_ in the tree, there must be a leftmost. If the tree
is empty, we should not be trying to find a task in it, due to the
nr_running checks done before we attempt to do so.

