On Fri, Oct 23, 2015 at 12:03 AM, Filipe Manana <fdman...@kernel.org> wrote:
> On Thu, Oct 22, 2015 at 11:38 PM, Stéphane Lesimple
> <stephane_bt...@lesimple.fr> wrote:
>>>>> [ ... thread cleanup ... ]
>>>>>
>>>>> Don't hesitate to ask if you need me to debug or even ftrace something.
>>>>
>>>>
>>>> Thanks Stéphane. I haven't seen that crash yet (still running tests
>>>> for 2 consecutive days now).
>>>> Can you please try the following patch, which works on top of mine,
>>>> and enable ftrace before running balance:
>>>>
>>>> Debug patch:  https://friendpaste.com/5s3dItRpcpq3dH1E4KUJor
>>>>
>>>> Enable ftrace:
>>>>
>>>>     $ echo > /sys/kernel/debug/tracing/trace
>>>>     $ echo "nop" > /sys/kernel/debug/tracing/current_tracer
>>>>     $ echo 100000 > /sys/kernel/debug/tracing/buffer_size_kb   # if
>>>> you can use larger buffer size, even better
>>>>     $ echo > /sys/kernel/debug/tracing/set_ftrace_filter
>>>>     $ echo 1 > /sys/kernel/debug/tracing/tracing_on
>>>>
>>>>     $ run balance... wait until it finishes with IO error or the
>>>> patch's printk message shows up in dmesg/syslog
>>>>
>>>>     $ echo 0 > /sys/kernel/debug/tracing/tracing_on
>>>>
>>>>     $ cat /sys/kernel/debug/tracing/trace > some_file.txt
>>>>
>>>> Then send is some_file.txt for debugging, hopefully it will give some
>>>> useful information. Note that it might produce tons of messages,
>>>> depending on how long it takes for you to hit the BUG_ON.
>>>>
>>>> Thanks a lot for this.
>>>
>>>
>>> I'm compiling it now (using your v2 of the friendpaste diff).
>>>
>>> I took the liberty to add a tracing_off() right before the return -EIO
>>> so that the trace tail ends exactly at the right place.
>>>
>>> Last time I tried to use ftrace to diagnose the bug we're trying to
>>> fix, the system crashes so hard that usually it's complicated to get
>>> the trace contents written somewhere before the system is unusable.
>>> But I'll eventually work around it by using
>>> /sys/kernel/debug/tracing/trace_pipe to send the trace live to another
>>> machine over the LAN.
>>>
>>> This series of bugs are so easy to trigger on my system that we'll
>>> hopefully get something useful out of the trace. I guess that's a good
>>> thing !
>>
>>
>> So, this time it took a little over an hour to get the crash, but it did
>> reach the -EIO condition eventually.
>> The ftrace log (2M gzipped) is available here :
>> http://www.speed47.net/tmp2/btrfs-4.3rc6p7463161-ftrace1.log.gz
>>
>> The associated kernel log is as follows :
>>
>> [ 2880.178589] INFO: task btrfs-transacti:7358 blocked for more than 120
>> seconds.
>> [ 2880.178600]       Not tainted 4.3.0-rc6p7463161+ #3
>> [ 2880.178603] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [ 3088.829429] Out of memory: Kill process 9449 (df-complex2simp) score 246
>> or sacrifice child
>> [ 3088.829435] Killed process 9449 (df-complex2simp) total-vm:964732kB,
>> anon-rss:943764kB, file-rss:0kB
>> [ 3600.197642] INFO: task btrfs-transacti:7358 blocked for more than 120
>> seconds.
>> [ 3600.197657]       Not tainted 4.3.0-rc6p7463161+ #3
>> [ 3600.197660] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [ 3840.204146] INFO: task btrfs-transacti:7358 blocked for more than 120
>> seconds.
>> [ 3840.204180]       Not tainted 4.3.0-rc6p7463161+ #3
>> [ 3840.204219] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [ 3993.671982] Out of memory: Kill process 11357 (df-complex2simp) score 227
>> or sacrifice child
>> [ 3993.671989] Killed process 11357 (df-complex2simp) total-vm:891608kB,
>> anon-rss:870704kB, file-rss:60kB
>> [ 4080.210324] INFO: task btrfs-transacti:7358 blocked for more than 120
>> seconds.
>> [ 4080.210336]       Not tainted 4.3.0-rc6p7463161+ #3
>> [ 4080.210339] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [ 4320.215635] INFO: task btrfs-transacti:7358 blocked for more than 120
>> seconds.
>> [ 4320.215662]       Not tainted 4.3.0-rc6p7463161+ #3
>> [ 4320.215667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [ 4560.221119] INFO: task btrfs-transacti:7358 blocked for more than 120
>> seconds.
>> [ 4560.221146]       Not tainted 4.3.0-rc6p7463161+ #3
>> [ 4560.221148] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [ 4800.226884] INFO: task btrfs-transacti:7358 blocked for more than 120
>> seconds.
>> [ 4800.226898]       Not tainted 4.3.0-rc6p7463161+ #3
>> [ 4800.226902] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [ 4890.116131] Out of memory: Kill process 13377 (df-complex2simp) score 207
>> or sacrifice child
>> [ 4890.116138] Killed process 13377 (df-complex2simp) total-vm:834976kB,
>> anon-rss:793272kB, file-rss:48kB
>> [ 5785.793580] Out of memory: Kill process 15285 (df-complex2simp) score 201
>> or sacrifice child
>> [ 5785.793586] Killed process 15285 (df-complex2simp) total-vm:802208kB,
>> anon-rss:772172kB, file-rss:4kB
>> [ 6480.269728] INFO: task btrfs-transacti:7358 blocked for more than 120
>> seconds.
>> [ 6480.269738]       Not tainted 4.3.0-rc6p7463161+ #3
>> [ 6480.269740] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [ 7081.967354] BTRFS: here, ref_mod != 1, bytenr 12090260504576, ref_mod 2,
>> seq 0 action 1
>> [ 7081.967784] BTRFS: error (device dm-3) in btrfs_run_delayed_refs:2872:
>> errno=-5 IO failure
>>
>> The OOM conditions are unrelated, this is an rrdtool cron script that
>> apparently takes too much memory from time to time, I'll disable it for
>> further tests to get noise out of the way.
>>
>> If I can be of some more help, just ask.
>
> That's great Stéphane.
> I'll take a look tomorrow morning at the logs and then get back to you.
>
> Thanks a lot for that.

Stéphane, can you do the balance again but with the following new debug patch:

https://friendpaste.com/3NXzDFGcK4Hxi0yiypMgRO

(I added your tracing_off(); call)

It applies on top of my fix.
Also, are you by chance using qgroups (btrfs quotas)? If you are I
think I know what's the problem, but either way please run with this
new debug patch, so that I can confirm my ideas.

Many thanks, much appreciated!

>
>>
>> Thanks,
>>
>> --
>> Stéphane.
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to