On Thu, Oct 22, 2015 at 11:38 PM, Stéphane Lesimple
<[email protected]> wrote:
>>>> [ ... thread cleanup ... ]
>>>>
>>>> Don't hesitate to ask if you need me to debug or even ftrace something.
>>>
>>>
>>> Thanks, Stéphane. I haven't seen that crash yet (I've been running
>>> tests for 2 consecutive days now).
>>> Can you please try the following patch, which applies on top of mine,
>>> and enable ftrace before running balance:
>>>
>>> Debug patch:  https://friendpaste.com/5s3dItRpcpq3dH1E4KUJor
>>>
>>> Enable ftrace:
>>>
>>>     $ echo > /sys/kernel/debug/tracing/trace
>>>     $ echo "nop" > /sys/kernel/debug/tracing/current_tracer
>>>     $ echo 100000 > /sys/kernel/debug/tracing/buffer_size_kb
>>>       # if you can use a larger buffer size, even better
>>>     $ echo > /sys/kernel/debug/tracing/set_ftrace_filter
>>>     $ echo 1 > /sys/kernel/debug/tracing/tracing_on
>>>
>>>     $ run balance... wait until it either finishes with an IO error
>>>       or the patch's printk message shows up in dmesg/syslog
>>>
>>>     $ echo 0 > /sys/kernel/debug/tracing/tracing_on
>>>
>>>     $ cat /sys/kernel/debug/tracing/trace > some_file.txt
>>>
>>> Then send us some_file.txt for debugging; hopefully it will give some
>>> useful information. Note that it might produce tons of messages,
>>> depending on how long it takes for you to hit the BUG_ON.
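>>>
>>> If it's more convenient, the whole sequence above can be wrapped in a
>>> small script, roughly like this (untested sketch; run it as root, the
>>> /mnt/btrfs mount point is a placeholder, and if balance hangs instead
>>> of returning after the error, just run the last two commands by hand):
>>>
>>>     #!/bin/sh
>>>     # rough wrapper around the steps above
>>>     cd /sys/kernel/debug/tracing || exit 1
>>>     echo > trace                    # clear any previous trace data
>>>     echo nop > current_tracer       # select the nop tracer
>>>     echo 100000 > buffer_size_kb    # larger is better if RAM allows
>>>     echo > set_ftrace_filter        # clear the function filter
>>>     echo 1 > tracing_on
>>>     btrfs balance start /mnt/btrfs  # blocks until balance finishes/fails
>>>     echo 0 > tracing_on
>>>     cat trace > /root/some_file.txt # or wherever is convenient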
>>>
>>> Thanks a lot for this.
>>
>>
>> I'm compiling it now (using your v2 of the friendpaste diff).
>>
>> I took the liberty of adding a tracing_off() right before the
>> return -EIO, so that the trace tail ends exactly at the right place.
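>>
>> For reference, a rough userspace approximation of the same idea, for
>> anyone who'd rather not touch the kernel side, is to poll dmesg for the
>> patch's message and flip tracing off from a shell. It stops the trace
>> up to a second later than the in-kernel tracing_off(), but it should
>> still be close enough. Untested sketch:
>>
>>     $ until dmesg | grep -q 'ref_mod != 1'; do sleep 1; done; \
>>           echo 0 > /sys/kernel/debug/tracing/tracing_on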
>>
>> The last time I tried to use ftrace to diagnose the bug we're trying
>> to fix, the system crashed so hard that it was complicated to get the
>> trace contents written anywhere before the system became unusable.
>> If that happens again, I'll work around it by using
>> /sys/kernel/debug/tracing/trace_pipe to stream the trace live to
>> another machine over the LAN.
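>>
>> Something along these lines should do (untested; the hostname and port
>> are placeholders, and the exact listen syntax depends on the netcat
>> flavour installed):
>>
>>     # on the receiving machine on the LAN
>>     $ nc -l -p 5555 > ftrace-live.txt
>>
>>     # on the machine being traced
>>     $ cat /sys/kernel/debug/tracing/trace_pipe | nc otherbox 5555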
>>
>> This series of bugs is so easy to trigger on my system that we'll
>> hopefully get something useful out of the trace. I guess that's a
>> good thing!
>
>
> So, this time it took a little over an hour to get the crash, but it did
> reach the -EIO condition eventually.
> The ftrace log (2M gzipped) is available here:
> http://www.speed47.net/tmp2/btrfs-4.3rc6p7463161-ftrace1.log.gz
>
> The associated kernel log is as follows:
>
> [ 2880.178589] INFO: task btrfs-transacti:7358 blocked for more than 120 seconds.
> [ 2880.178600]       Not tainted 4.3.0-rc6p7463161+ #3
> [ 2880.178603] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 3088.829429] Out of memory: Kill process 9449 (df-complex2simp) score 246 or sacrifice child
> [ 3088.829435] Killed process 9449 (df-complex2simp) total-vm:964732kB, anon-rss:943764kB, file-rss:0kB
> [ 3600.197642] INFO: task btrfs-transacti:7358 blocked for more than 120 seconds.
> [ 3600.197657]       Not tainted 4.3.0-rc6p7463161+ #3
> [ 3600.197660] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 3840.204146] INFO: task btrfs-transacti:7358 blocked for more than 120 seconds.
> [ 3840.204180]       Not tainted 4.3.0-rc6p7463161+ #3
> [ 3840.204219] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 3993.671982] Out of memory: Kill process 11357 (df-complex2simp) score 227 or sacrifice child
> [ 3993.671989] Killed process 11357 (df-complex2simp) total-vm:891608kB, anon-rss:870704kB, file-rss:60kB
> [ 4080.210324] INFO: task btrfs-transacti:7358 blocked for more than 120 seconds.
> [ 4080.210336]       Not tainted 4.3.0-rc6p7463161+ #3
> [ 4080.210339] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 4320.215635] INFO: task btrfs-transacti:7358 blocked for more than 120 seconds.
> [ 4320.215662]       Not tainted 4.3.0-rc6p7463161+ #3
> [ 4320.215667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 4560.221119] INFO: task btrfs-transacti:7358 blocked for more than 120 seconds.
> [ 4560.221146]       Not tainted 4.3.0-rc6p7463161+ #3
> [ 4560.221148] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 4800.226884] INFO: task btrfs-transacti:7358 blocked for more than 120 seconds.
> [ 4800.226898]       Not tainted 4.3.0-rc6p7463161+ #3
> [ 4800.226902] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 4890.116131] Out of memory: Kill process 13377 (df-complex2simp) score 207 or sacrifice child
> [ 4890.116138] Killed process 13377 (df-complex2simp) total-vm:834976kB, anon-rss:793272kB, file-rss:48kB
> [ 5785.793580] Out of memory: Kill process 15285 (df-complex2simp) score 201 or sacrifice child
> [ 5785.793586] Killed process 15285 (df-complex2simp) total-vm:802208kB, anon-rss:772172kB, file-rss:4kB
> [ 6480.269728] INFO: task btrfs-transacti:7358 blocked for more than 120 seconds.
> [ 6480.269738]       Not tainted 4.3.0-rc6p7463161+ #3
> [ 6480.269740] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 7081.967354] BTRFS: here, ref_mod != 1, bytenr 12090260504576, ref_mod 2, seq 0 action 1
> [ 7081.967784] BTRFS: error (device dm-3) in btrfs_run_delayed_refs:2872: errno=-5 IO failure
>
> The OOM conditions are unrelated: they come from an rrdtool cron script
> that apparently uses too much memory from time to time. I'll disable it
> for further tests to keep that noise out of the way.
>
> If I can be of some more help, just ask.

That's great, Stéphane.
I'll take a look at the logs tomorrow morning and then get back to you.

Thanks a lot for that.

>
> Thanks,
>
> --
> Stéphane.
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
