On 6/20/17 5:41 PM, Ben Greear wrote:
> On 06/20/2017 11:05 AM, Michal Kubecek wrote:
>> On Tue, Jun 20, 2017 at 07:12:27AM -0700, Ben Greear wrote:
>>> On 06/14/2017 03:25 PM, David Ahern wrote:
>>>> On 6/14/17 4:23 PM, Ben Greear wrote:
>>>>> On 06/13/2017 07:27 PM, David Ahern wrote:
>>>>>
>>>>>> Let's try a targeted debug patch. See attached
>>>>>
>>>>> I had to change it to pr_err so it would go to our serial console
>>>>> since the system locked hard on crash,
>>>>> and that appears to be enough to change the timing where we can no
>>>>> longer
>>>>> reproduce the problem.
>>>>
>>>>
>>>> ok, let's figure out which one is doing that. There are 3 debug
>>>> statements. I suspect fib6_del_route is the one setting the state to
>>>> FWS_U. Can you remove the debug prints in fib6_repair_tree and
>>>> fib6_walk_continue and try again?
>>>
>>> We cannot reproduce with just that one printf in the kernel either.  It
>>> must change the timing too much to trigger the bug.
>>
>> You might try trace_printk() which should have less impact (don't forget
>> to enable /proc/sys/kernel/ftrace_dump_on_oops).
> 
> We cannot reproduce with trace_printk() either.

I think that suggests the walker state is set to FWS_U in
fib6_del_route, and that it is the FWS_U case in fib6_walk_continue that
triggers the fault: a NULL parent (pn = fn->parent) gets dereferenced.
So those are the 2 areas of code that are interacting.

I'm on a road trip through the end of this week with little time to
focus on this problem. I'll get back to you with another suggestion when
I can.
