Re: Is btrfs related to OOM death problems on my 8GB server with both 3.15.1 and 3.14?
Hi,

For this same problem, I once dropped to single user mode after 2 weeks of utilisation, killed everything, unmounted all filesystems except root (which is ext4), rmmod'ed all modules, and see for yourself: http://i39.tinypic.com/2rrrjtl.jpg

For those who want it textually: 15 days uptime, 10 user-mode processes (systemd, top and bash), 2 GB memory used out of 4 GB total, 17 MB cache, 32 KB buffers.

On Sun, Jul 6, 2014 at 10:58 AM, Marc MERLIN m...@merlins.org wrote:
> On Sat, Jul 05, 2014 at 07:43:18AM -0700, Marc MERLIN wrote:
> > On Sat, Jul 05, 2014 at 09:47:09AM -0400, Andrew E. Mileski wrote:
> > > On 2014-07-03 9:19 PM, Marc MERLIN wrote:
> > > > I upgraded my server from 3.14 to 3.15.1 last week, and since then it's
> > > > been running out of memory and deadlocking (panic= doesn't even work). I
> > > > downgraded back to 3.14, but I already had the problem once since then.
> > >
> > > I didn't see any mention of the btrfs utility version in this thread (I may
> > > be blind though). My server was suffering from frequent panics upon scrub /
> > > defrag / balance, until I updated the btrfs utility. That resolved all my
> > > issues.
> >
> > Really? The userland tool should only send ioctls to the kernel; I really
> > can't see how it would cause the kernel code to panic or not.
> > gargamel:~# btrfs --version
> > Btrfs v3.14.1
> > which is the latest in debian unstable.
> >
> > As an update, after 1.7 days of scrubbing, the system has started getting
> > sluggish, I'm getting synchronization problems/crashes in some of my tools
> > that talk to serial ports (likely due to mini deadlocks in the kernel), and
> > I'm now getting a few btrfs hangs.
>
> Predictably, it died yesterday afternoon after going into memory death (it was
> answering pings, but userspace was dead, and even sysrq-o did not respond; I
> had to power cycle the power outlet). This happened just before my 3rd scrub
> finished, so I'm now 2 out of 2: running scrub on my 3 filesystems kills the
> system halfway through the 3rd scrub.
> This is the last memory log that reached the disk:
> http://marc.merlins.org/tmp/btrfs-oom2.txt
>
> Do those logs point to any possible culprit, or can a kernel memory leak not
> be traced back to its source because the kernel loses track of who requested
> the memory that leaked?
>
> Excerpt here:
> Sat Jul 5 14:25:04 PDT 2014
>              total       used       free     shared    buffers     cached
> Mem:       7894792    7712384     182408          0         28     227480
> -/+ buffers/cache:    7484876     409916
> Swap:     15616764     463732   15153032
> Userspace is using 345MB according to ps
>
> Sat Jul 5 14:25:04 PDT 2014
> MemTotal:        7894792 kB
> MemFree:          184556 kB
> MemAvailable:     269568 kB
> Buffers:              28 kB
> Cached:           228164 kB
> SwapCached:        18296 kB
> Active:           178196 kB
> Inactive:         187016 kB
> Active(anon):      70068 kB
> Inactive(anon):    71100 kB
> Active(file):     108128 kB
> Inactive(file):   115916 kB
> Unevictable:        5624 kB
> Mlocked:            5624 kB
> SwapTotal:      15616764 kB
> SwapFree:       15152768 kB
> Dirty:             17716 kB
> Writeback:           516 kB
> AnonPages:        140588 kB
> Mapped:            21940 kB
> Shmem:               688 kB
> Slab:             181708 kB
> SReclaimable:      59808 kB
> SUnreclaim:       121900 kB
> KernelStack:        4728 kB
> PageTables:         8480 kB
> NFS_Unstable:          0 kB
> Bounce:                0 kB
> WritebackTmp:          0 kB
> CommitLimit:    19564160 kB
> Committed_AS:    1633204 kB
> VmallocTotal:   34359738367 kB
> VmallocUsed:      358996 kB
> VmallocChunk:   34359281468 kB
> HardwareCorrupted:     0 kB
> AnonHugePages:         0 kB
> HugePages_Total:       0
> HugePages_Free:        0
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> DirectMap4k:      144920 kB
> DirectMap2M:     7942144 kB
>
> Thanks,
> Marc
> --
> A mouse is a device used to point at the xterm you want to type in - A.S.R.
> Microsoft is to operating systems what McDonalds is to gourmet cooking
> Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Is btrfs related to OOM death problems on my 8GB server with both 3.15.1 and 3.14?
On Sat, Jul 05, 2014 at 07:43:18AM -0700, Marc MERLIN wrote:
> On Sat, Jul 05, 2014 at 09:47:09AM -0400, Andrew E. Mileski wrote:
> > On 2014-07-03 9:19 PM, Marc MERLIN wrote:
> > > I upgraded my server from 3.14 to 3.15.1 last week, and since then it's
> > > been running out of memory and deadlocking (panic= doesn't even work). I
> > > downgraded back to 3.14, but I already had the problem once since then.
> >
> > I didn't see any mention of the btrfs utility version in this thread (I may
> > be blind though). My server was suffering from frequent panics upon scrub /
> > defrag / balance, until I updated the btrfs utility. That resolved all my
> > issues.
>
> Really? The userland tool should only send ioctls to the kernel; I really
> can't see how it would cause the kernel code to panic or not.
> gargamel:~# btrfs --version
> Btrfs v3.14.1
> which is the latest in debian unstable.
>
> As an update, after 1.7 days of scrubbing, the system has started getting
> sluggish, I'm getting synchronization problems/crashes in some of my tools
> that talk to serial ports (likely due to mini deadlocks in the kernel), and
> I'm now getting a few btrfs hangs.

Predictably, it died yesterday afternoon after going into memory death (it was answering pings, but userspace was dead, and even sysrq-o did not respond; I had to power cycle the power outlet). This happened just before my 3rd scrub finished, so I'm now 2 out of 2: running scrub on my 3 filesystems kills the system halfway through the 3rd scrub.

This is the last memory log that reached the disk:
http://marc.merlins.org/tmp/btrfs-oom2.txt

Do those logs point to any possible culprit, or can a kernel memory leak not be traced back to its source because the kernel loses track of who requested the memory that leaked?
Excerpt here:

Sat Jul 5 14:25:04 PDT 2014
             total       used       free     shared    buffers     cached
Mem:       7894792    7712384     182408          0         28     227480
-/+ buffers/cache:    7484876     409916
Swap:     15616764     463732   15153032
Userspace is using 345MB according to ps

Sat Jul 5 14:25:04 PDT 2014
MemTotal:        7894792 kB
MemFree:          184556 kB
MemAvailable:     269568 kB
Buffers:              28 kB
Cached:           228164 kB
SwapCached:        18296 kB
Active:           178196 kB
Inactive:         187016 kB
Active(anon):      70068 kB
Inactive(anon):    71100 kB
Active(file):     108128 kB
Inactive(file):   115916 kB
Unevictable:        5624 kB
Mlocked:            5624 kB
SwapTotal:      15616764 kB
SwapFree:       15152768 kB
Dirty:             17716 kB
Writeback:           516 kB
AnonPages:        140588 kB
Mapped:            21940 kB
Shmem:               688 kB
Slab:             181708 kB
SReclaimable:      59808 kB
SUnreclaim:       121900 kB
KernelStack:        4728 kB
PageTables:         8480 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    19564160 kB
Committed_AS:    1633204 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      358996 kB
VmallocChunk:   34359281468 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      144920 kB
DirectMap2M:     7942144 kB

Thanks,
Marc
--
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
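As a reading aid for the excerpt above: summing the usual /proc/meminfo consumers and subtracting from MemTotal shows roughly how much memory nothing in the report accounts for. A back-of-the-envelope sketch (values in kB copied from the excerpt; rough bookkeeping, not exact kernel accounting):

```python
# Estimate memory not accounted for by the usual /proc/meminfo consumers.
# Values (kB) are taken from the excerpt above; the sum is approximate.
meminfo = {
    "MemTotal": 7894792,
    "MemFree": 184556,
    "Buffers": 28,
    "Cached": 228164,
    "SwapCached": 18296,
    "AnonPages": 140588,
    "Slab": 181708,
    "KernelStack": 4728,
    "PageTables": 8480,
}

accounted = sum(v for k, v in meminfo.items() if k != "MemTotal")
unaccounted = meminfo["MemTotal"] - accounted
print(f"unaccounted: {unaccounted} kB (~{unaccounted / 1024 / 1024:.1f} GiB)")
```

That leaves roughly 6.8 GiB of the 8 GB machine unexplained, which is consistent with the kernel-side leak discussed later in the thread.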
Re: Is btrfs related to OOM death problems on my 8GB server with both 3.15.1 and 3.14?
On 2014-07-03 9:19 PM, Marc MERLIN wrote:
> I upgraded my server from 3.14 to 3.15.1 last week, and since then it's been
> running out of memory and deadlocking (panic= doesn't even work). I downgraded
> back to 3.14, but I already had the problem once since then.

I didn't see any mention of the btrfs utility version in this thread (I may be blind though). My server was suffering from frequent panics upon scrub / defrag / balance, until I updated the btrfs utility. That resolved all my issues.

My server has two BTRFS filesystems of ~25 TiB each, and 4 GB of RAM (15% used). A hardware upgrade is imminent; just waiting for the backup to complete.

~~ Andrew E. Mileski
Re: Is btrfs related to OOM death problems on my 8GB server with both 3.15.1 and 3.14?
On Sat, Jul 05, 2014 at 09:47:09AM -0400, Andrew E. Mileski wrote:
> On 2014-07-03 9:19 PM, Marc MERLIN wrote:
> > I upgraded my server from 3.14 to 3.15.1 last week, and since then it's been
> > running out of memory and deadlocking (panic= doesn't even work). I
> > downgraded back to 3.14, but I already had the problem once since then.
>
> I didn't see any mention of the btrfs utility version in this thread (I may be
> blind though). My server was suffering from frequent panics upon scrub /
> defrag / balance, until I updated the btrfs utility. That resolved all my
> issues.

Really? The userland tool should only send ioctls to the kernel; I really can't see how it would cause the kernel code to panic or not.

gargamel:~# btrfs --version
Btrfs v3.14.1
which is the latest in debian unstable.

As an update, after 1.7 days of scrubbing, the system has started getting sluggish, I'm getting synchronization problems/crashes in some of my tools that talk to serial ports (likely due to mini deadlocks in the kernel), and I'm now getting a few btrfs hangs.

A lot of my RAM seems gone now (1GB free but only 500MB actually used by user space), but the system is still up for now, just somewhat sluggish. I've been told meminfo, slabinfo and zoneinfo may be useful to debug this, so I'll attach them. I see an interesting line about nf_conntrack and vmalloc in there. When my scrub is done, I'll reboot and disable nf_conntrack just to see what happens, and I'll also wait a bit to see how stable the RAM is before I start more scrubs, to see if scrubs do create memory leaks. slabinfo (attached) doesn't seem to show that btrfs owns a lot of memory, but maybe I don't know how to read it.

gargamel:~# free
             total       used       free     shared    buffers     cached
Mem:       7894792    7370872     523920          0         28     564272
-/+ buffers/cache:    6806572    1088220
Swap:     15616764     483792   15132972

That one could be interesting, although it may just be a symptom of the memory pressure:
[144323.739212] nf_conntrack: falling back to vmalloc.
[144323.757139] nf_conntrack: falling back to vmalloc.
[144323.778092] nf_conntrack: falling back to vmalloc.
[144323.797785] nf_conntrack: falling back to vmalloc.

Is it possible for btrfs to be hung on getting memory allocations? Is a 2-minute wait even possible for the right memory block to become available?

[149171.560084] INFO: task btrfs:17366 blocked for more than 120 seconds.
[149171.582499]       Not tainted 3.14.0-amd64-i915-preempt-20140216 #2
[149171.604251] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
[149171.630523] btrfs           D 88006897c980     0 17366  18362 0x0080
[149171.654267]  880083a51720 0086 880083a51fd8 88006897c450
[149171.679265]  000141c0 88006897c450 88021f3941c0 88006897c450
[149171.704141]  880083a517c0 810fe7a2 0002 880083a51730
[149171.729093] Call Trace:
[149171.738754]  [810fe7a2] ? wait_on_page_read+0x3c/0x3c
[149171.759582]  [8160d2a1] schedule+0x73/0x75
[149171.776663]  [8160d446] io_schedule+0x60/0x7a
[149171.794400]  [810fe7b0] sleep_on_page+0xe/0x12
[149171.812292]  [8160d6d8] __wait_on_bit+0x48/0x7a
[149171.830401]  [810fe750] wait_on_page_bit+0x7a/0x7c
[149171.849307]  [8108514a] ? autoremove_wake_function+0x34/0x34
[149171.871229]  [81246320] read_extent_buffer_pages+0x1bf/0x204
[149171.892813]  [81223dc0] ? free_root_pointers+0x5b/0x5b
[149171.912642]  [81224ac2] btree_read_extent_buffer_pages.constprop.45+0x66/0x100
[149171.938737]  [81225a17] read_tree_block+0x2f/0x47
[149171.957308]  [8120eb66] read_block_for_search.isra.26+0x24a/0x287
[149171.980128]  [812103a7] btrfs_search_slot+0x4f4/0x6bb
[149171.999679]  [8127174d] scrub_stripe+0x43d/0xc9e
[149172.017930]  [81085116] ? finish_wait+0x65/0x65
[149172.035901]  [81272084] scrub_chunk.isra.9+0xd6/0x10d
[149172.055614]  [8127232f] scrub_enumerate_chunks+0x274/0x418
[149172.076546]  [81085100] ? finish_wait+0x4f/0x65
[149172.094506]  [81272a73] btrfs_scrub_dev+0x254/0x3cb
[149172.113493]  [8116e31e] ? __mnt_want_write+0x62/0x78
[149172.132714]  [81256314] btrfs_ioctl+0x1114/0x24bd
[149172.15]  [81015f07] ? paravirt_sched_clock+0x9/0xd
[149172.170794]  [810164ad] ? sched_clock+0x9/0xb
[149172.188107]  [8107b2fe] ? sched_clock_cpu+0x1a/0xc5
[149172.207380]  [811409bd] ? cache_alloc+0x1c/0x29b
[149172.226446]  [81140d2b] ? kmem_cache_alloc_node+0xef/0x179
[149172.247033]  [8160f97b] ? _raw_spin_unlock+0x17/0x2a
[149172.266018]  [81163ffe] do_vfs_ioctl+0x3d2/0x41d
[149172.283939]  [8116c1f0] ? __fget+0x6f/0x79
[149172.300268]  [81164099] SyS_ioctl+0x50/0x7b
[149172.316818]  [8161642d] system_call_fastpath+0x1a/0x1f
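On reading slabinfo: each line's trailing "slabdata" columns give active_slabs / num_slabs / sharedavail, and a cache's memory footprint is roughly num_slabs x pagesperslab x page size. A sketch of that arithmetic, using a made-up sample line in the slabinfo v2.x format (the cache name and numbers are illustrative only, not from Marc's attachment):

```python
# Estimate a slab cache's memory footprint from one /proc/slabinfo line.
# Format (v2.x): name active_objs num_objs objsize objperslab pagesperslab
#                : tunables ... : slabdata active_slabs num_slabs sharedavail
PAGE = 4096  # x86-64 page size in bytes

def slab_bytes(line: str) -> tuple:
    f = line.split()
    name = f[0]
    pagesperslab = int(f[5])
    num_slabs = int(f[f.index("slabdata") + 2])
    return name, num_slabs * pagesperslab * PAGE

# Hypothetical sample line for illustration:
line = ("btrfs_extent_buffer 1000 1092 280 14 1"
        " : tunables 0 0 0 : slabdata 78 78 0")
print(slab_bytes(line))
```

Summing that over all lines of a real slabinfo, and comparing with SUnreclaim in meminfo, is one way to check whether "the kernel's memory is in slab caches" or leaking somewhere else entirely.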
Re: Is btrfs related to OOM death problems on my 8GB server with both 3.15.1 and 3.14?
On 2014-07-05 10:43 AM, Marc MERLIN wrote:
> On Sat, Jul 05, 2014 at 09:47:09AM -0400, Andrew E. Mileski wrote:
> > On 2014-07-03 9:19 PM, Marc MERLIN wrote:
> > > I upgraded my server from 3.14 to 3.15.1 last week, and since then it's
> > > been running out of memory and deadlocking (panic= doesn't even work). I
> > > downgraded back to 3.14, but I already had the problem once since then.
> >
> > I didn't see any mention of the btrfs utility version in this thread (I may
> > be blind though). My server was suffering from frequent panics upon scrub /
> > defrag / balance, until I updated the btrfs utility. That resolved all my
> > issues.
>
> Really? The userland tool should only send ioctls to the kernel; I really
> can't see how it would cause the kernel code to panic or not.
> gargamel:~# btrfs --version
> Btrfs v3.14.1
> which is the latest in debian unstable.

It reached the point that I got so frustrated that I built the utilities from git, as an up-to-date package wasn't available at the time. I'm currently using v3.14.2, but v3.14.1 was the version that resolved my server's panics.

I've not looked into the code very deeply yet, so I can't theorize on why it resolved my server's panics. I can only state with complete certainty that it did. I too was somewhat surprised.

However, it seems we've probably now ruled it out in your case.

~~ Andrew E. Mileski
Re: Is btrfs related to OOM death problems on my 8GB server with both 3.15.1 and 3.14?
On Fri, Jul 04, 2014 at 02:33:06PM +1000, Russell Coker wrote:
> On Thu, 3 Jul 2014 18:19:38 Marc MERLIN wrote:
> > I upgraded my server from 3.14 to 3.15.1 last week, and since then it's been
> > running out of memory and deadlocking (panic= doesn't even work). I
> > downgraded back to 3.14, but I already had the problem once since then.
>
> Is there any correlation between such problems and BTRFS operations such as
> creating snapshots or running a scrub/balance?

So far I don't have timing data for when it happened, but I'm running new code to do better logging, which will hopefully give me a hint of when this is happening. However, since it takes 6 to 48 hours to trigger, it'll take a little while before I can report back.

Thanks,
Marc
--
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
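One possible shape for such a logging loop (a hypothetical sketch, not Marc's actual code): append a timestamped /proc/meminfo snapshot and fsync after each write, so the last sample before a hard lockup still reaches the disk. The log path is a made-up default.

```python
# Hypothetical periodic memory logger: append timestamped /proc/meminfo
# snapshots, fsyncing each one so the final sample survives a hard lockup.
import os
import time

LOG = os.environ.get("MEMLOG", "/tmp/memlog.txt")  # assumed path

def snapshot(log_path):
    with open("/proc/meminfo") as f:
        meminfo = f.read()
    with open(log_path, "a") as out:
        out.write(time.strftime("%a %b %d %H:%M:%S %Z %Y") + "\n")
        out.write(meminfo + "\n")
        out.flush()
        os.fsync(out.fileno())  # force the sample to disk before a crash

if __name__ == "__main__":
    snapshot(LOG)  # run once; schedule from cron, or loop with time.sleep(60)
```

After a crash, the last block written is the one worth reading, much like the btrfs-oom2.txt capture above.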
Re: Is btrfs related to OOM death problems on my 8GB server with both 3.15.1 and 3.14?
Hi,

(2014/07/04 13:33), Russell Coker wrote:
> On Thu, 3 Jul 2014 18:19:38 Marc MERLIN wrote:
> > I upgraded my server from 3.14 to 3.15.1 last week, and since then it's been
> > running out of memory and deadlocking (panic= doesn't even work). I
> > downgraded back to 3.14, but I already had the problem once since then.
>
> Is there any correlation between such problems and BTRFS operations such as
> creating snapshots or running a scrub/balance?

Were you running scrub, Marc?

http://marc.merlins.org/tmp/btrfs-oom.txt:
===
...
[90621.895922] [ 8034] 0 8034 1315 164 5 46 0 btrfs-scrub
...
===

In this case, you would be hitting a kernel memory leak. However, I can't identify the root cause from this log.

Marc, did you change
- software and its settings,
- operations,
- hardware configuration,
or anything else, just before detecting the first OOM?

You have 8GB RAM and there is plenty of swap space.
===
[90621.895719] 2021665 pages RAM
...
[90621.895718] Free swap = 15230536kB
===

Here is the available memory for each OOM-killer invocation.

1st OOM:
===
[90622.074758] Out of memory: Kill process 11452 (mh) score 2 or sacrifice child
[90622.074760] Killed process 11452 (mh) total-vm:66208kB, anon-rss:0kB, file-rss:872kB
[90622.425826] rfx-xpl-static invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
                                                                   ~~
It failed to acquire an order=0 (2^0=1) page, so this is not a kernel-memory-fragmentation case. Since __GFP_IO (0x40) and __GFP_FS (0x80) are set in gfp_mask, it can swap out anon/file pages to swap/filesystems to free up memory.
[90622.425829] rfx-xpl-static cpuset=/ mems_allowed=0
[90622.425832] CPU: 2 PID: 748 Comm: rfx-xpl-static Not tainted 3.14.0-amd64-i915-preempt-20140216 #2
[90622.425833] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3806 08/20/2012
[90622.425834]  8801414a79d8 8160a06d 8801434b2050
[90622.425838]  8801414a7a68 81607078 8160dd00
[90622.425841]  8801414a7a08 810501b4 8801414a7a48 8109cb05
[90622.425844] Call Trace:
[90622.425846]  [8160a06d] dump_stack+0x4e/0x7a
[90622.425851]  [81607078] dump_header+0x7f/0x206
[90622.425854]  [8160dd00] ? mutex_unlock+0x16/0x18
[90622.425857]  [810501b4] ? put_online_cpus+0x6c/0x6e
[90622.425861]  [8109cb05] ? rcu_oom_notify+0xb3/0xc6
[90622.425865]  [81101a7f] oom_kill_process+0x6e/0x30e
[90622.425869]  [811022b6] out_of_memory+0x42e/0x461
[90622.425872]  [81106dfe] __alloc_pages_nodemask+0x673/0x854
[90622.425876]  [8113b654] alloc_pages_vma+0xd1/0x116
[90622.425880]  [81130ab7] read_swap_cache_async+0x74/0x13b
[90622.425883]  [81130cc1] swapin_readahead+0x143/0x152
[90622.425886]  [810fede9] ? find_get_page+0x69/0x75
[90622.425889]  [81122adf] handle_mm_fault+0x56b/0x9b0
[90622.425892]  [81612de6] __do_page_fault+0x381/0x3cd
[90622.425895]  [81078cfc] ? wake_up_state+0x12/0x12
[90622.425899]  [8115d770] ? path_put+0x1e/0x21
[90622.425903]  [81612e57] do_page_fault+0x25/0x27
[90622.425906]  [816103f8] page_fault+0x28/0x30
[90622.425910] Mem-Info:
[90622.425910] Node 0 DMA per-cpu:
[90622.425913] CPU0: hi:   0, btch:  1 usd:  0
[90622.425914] CPU1: hi:   0, btch:  1 usd:  0
[90622.425915] CPU2: hi:   0, btch:  1 usd:  0
[90622.425916] CPU3: hi:   0, btch:  1 usd:  0
[90622.425916] Node 0 DMA32 per-cpu:
[90622.425919] CPU0: hi: 186, btch: 31 usd: 24
[90622.425920] CPU1: hi: 186, btch: 31 usd:  1
[90622.425921] CPU2: hi: 186, btch: 31 usd:  0
[90622.425922] CPU3: hi: 186, btch: 31 usd:  0
[90622.425923] Node 0 Normal per-cpu:
[90622.425924] CPU0: hi: 186, btch: 31 usd:  0
[90622.425925] CPU1: hi: 186, btch: 31 usd:  0
[90622.425926] CPU2: hi: 186, btch: 31 usd:  0
[90622.425928] CPU3: hi: 186, btch: 31 usd:  0
[90622.425932] active_anon:57 inactive_anon:92 isolated_anon:0
[90622.425932]  active_file:987 inactive_file:1232 isolated_file:0
[90622.425932]  unevictable:1389 dirty:590 writeback:1 unstable:0
[90622.425932]  free:25102 slab_reclaimable:9147 slab_unreclaimable:30944

There are few anon/file pages, in other words, few reclaimable pages. The system would be almost full of kernel memory. As I said, a kernel memory leak would be happening here.

[90622.425932]  mapped:771 shmem:104 pagetables:1487 bounce:0
[90622.425932]  free_cma:0
[90622.425933] Node 0 DMA free:15360kB min:128kB low:160kB high:192kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB
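Two parts of that analysis can be sanity-checked mechanically. A sketch below decodes gfp_mask=0x200da and converts the report's page counts to KiB; the GFP flag bit values are my assumption for a 3.14-era include/linux/gfp.h, and x86-64 pages are 4 KiB:

```python
# Decode gfp_mask=0x200da with assumed 3.14-era flag values, then convert
# the OOM report's page counts to KiB (4 KiB pages on x86-64).
FLAGS = {
    0x02: "__GFP_HIGHMEM",
    0x08: "__GFP_MOVABLE",
    0x10: "__GFP_WAIT",
    0x40: "__GFP_IO",
    0x80: "__GFP_FS",
    0x20000: "__GFP_HARDWALL",
}
mask = 0x200DA
set_flags = [name for bit, name in FLAGS.items() if mask & bit]
# These six bits account for the whole mask: an ordinary swappable
# user-page allocation that is allowed to do IO and FS reclaim.
assert sum(FLAGS) == mask

PAGE_KIB = 4
pages = {"free": 25102, "slab_reclaimable": 9147, "slab_unreclaimable": 30944}
kib = {name: count * PAGE_KIB for name, count in pages.items()}
print(set_flags)
print(kib)
```

Note that slab here totals only about 157 MiB, far short of the ~7 GiB missing, which would be consistent with the leak living outside the slab allocator (e.g. vmalloc or raw page allocations).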
Re: Is btrfs related to OOM death problems on my 8GB server with both 3.15.1 and 3.14?
Thank you for your answer. I'll put the conclusion and question at the top for easier reading:

So, should I understand that
1) I have enough RAM in my system, but all of it disappears, apparently claimed by the kernel and not released
2) this could be a kernel memory leak in btrfs or somewhere else, and there is no good way to know

If so, in a case like this, is there additional output I can capture to figure out how the memory is lost and help find out which part of the kernel is eating the memory without releasing it? While btrfs is likely to blame, for now it's really just a guess and it would be good to confirm.

On Fri, Jul 04, 2014 at 03:23:41PM +0900, Satoru Takeuchi wrote:
> > Is there any correlation between such problems and BTRFS operations such as
> > creating snapshots or running a scrub/balance?
>
> Were you running scrub, Marc?

Yes, I was, due to the other problem I was discussing on the list. I wanted to know if scrub would find any problem (it did not). I think I'll now try to read every file of every filesystem to see what btrfs does (this will take a while; that's around 100 million files).

But the last few times I had this OOM problem with 3.15.1, it was happening within 6 hours sometimes, and I was not starting scrub every time the system booted, so scrub may be partially responsible but it's not the core problem. (Also, I run scrub on this system every few weeks and it hadn't OOM-crashed in the past.)

> Marc, did you change
> - software and its settings,
> - operations,
> - hardware configuration,
> or anything else, just before detecting the first OOM?

Those are 3 good questions; I asked myself the same thing. From what I remember, though, all I did was go from 3.14 to 3.15. However, this machine has many cronjobs, it does rsyncs to and from remote systems, it has btrfs send/receive going to and from it, and snapshots every hour. Those are not new, but if any of them changed in a small way, I guess they could trigger bugs.

> You have 8GB RAM and there is plenty of swap space.

Correct.
> ===
> [90621.895719] 2021665 pages RAM
> ...
> [90621.895718] Free swap = 15230536kB
> ===
>
> Here is the available memory for each OOM-killer invocation.
>
> 1st OOM:
> ===
> [90622.074758] Out of memory: Kill process 11452 (mh) score 2 or sacrifice child
> [90622.074760] Killed process 11452 (mh) total-vm:66208kB, anon-rss:0kB, file-rss:872kB
> [90622.425826] rfx-xpl-static invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
>
> It failed to acquire an order=0 (2^0=1) page, so this is not a
> kernel-memory-fragmentation case. Since __GFP_IO (0x40) and __GFP_FS (0x80)
> are set in gfp_mask, it can swap out anon/file pages to swap/filesystems to
> free up memory.

Thanks for explaining.

> [90622.425932] active_anon:57 inactive_anon:92 isolated_anon:0
> [90622.425932]  active_file:987 inactive_file:1232 isolated_file:0
> [90622.425932]  unevictable:1389 dirty:590 writeback:1 unstable:0
> [90622.425932]  free:25102 slab_reclaimable:9147 slab_unreclaimable:30944
>
> There are few anon/file pages, in other words, few reclaimable pages. The
> system would be almost full of kernel memory. As I said, a kernel memory
> leak would be happening here.
>
> [90622.425932]  mapped:771 shmem:104 pagetables:1487 bounce:0
> [90622.425932]  free_cma:0
> [90622.425933] Node 0 DMA free:15360kB min:128kB low:160kB high:192kB
> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
> unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15980kB
> managed:15360kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB
> slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB
> unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
> all_unreclaimable? yes
>
> "all_unreclaimable? == yes" means page reclaim did its best and there is
> nothing more it can do.

Understood. I moved my question that was here to the top.

Thank you,
Marc
--
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: Is btrfs related to OOM death problems on my 8GB server with both 3.15.1 and 3.14?
Marc MERLIN posted on Fri, 04 Jul 2014 07:24:16 -0700 as excerpted:
> So, should I understand that
> 1) I have enough RAM in my system, but all of it disappears, apparently
> claimed by the kernel and not released
> 2) this could be a kernel memory leak in btrfs or somewhere else, and there
> is no good way to know

Generally speaking, if you're running out of RAM and swap is unused, it's because something is leaking memory that cannot be swapped. The kernel is the most likely suspect, as kernel memory doesn't swap, especially so if the OOM-killer kills everything it can and the problem is still there.

So it's almost certainly the kernel, and btrfs is a good candidate, but I'm not a dev, and at least from that vantage point I didn't see enough information to conclusively pin it on btrfs. If the devs can't either, then either turning on additional debugging options or further debugging patches would seem to be in order.

So the above seems correct from here, yes.

--
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Is btrfs related to OOM death problems on my 8GB server with both 3.15.1 and 3.14?
I upgraded my server from 3.14 to 3.15.1 last week, and since then it's been running out of memory and deadlocking (panic= doesn't even work). I downgraded back to 3.14, but I already had the problem once since then.

OOM comes in, even though I have 0 swap used and, AFAIK, not all my RAM is gone; it then fails to kill enough stuff and eventually it dies like this:

[80943.542209] Swap cache stats: add 814596, delete 814595, find 2567491/2808869
[80943.565106] Free swap  = 15612448kB
[80943.577607] Total swap = 15616764kB
[80943.589766] 2021665 pages RAM
[80943.600281] 0 pages HighMem/MovableOnly
[80943.613284] 28468 pages reserved
[80943.624330] 0 pages hwpoisoned
[80943.634824] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[80943.659669] [  918]     0   918     8550        5     236    -1000 udevd
[80943.684789] [ 8022]     0  8022    30740        5      89    -1000 auditd
[80943.710154] [ 8253]     0  8253    18130        6     123    -1000 sshd
[80943.735024] [12001]     0 12001     8540        5     241    -1000 udevd
[80943.760152] [18969]     0 18969     8540        5     223    -1000 udevd
[80943.785293] Kernel panic - not syncing: Out of memory and no killable processes...

Here is my more recent capture on 3.14, when I was able to catch it before the panic and dump a bunch of sysrq data: http://marc.merlins.org/tmp/btrfs-oom.txt

Things to note in that log:
[90621.895715] 2962 total pagecache pages
[90621.895716] 5 pages in swap cache
[90621.895717] Swap cache stats: add 145004, delete 144999, find 3314901/3316382
[90621.895718] Free swap  = 15230536kB
[90621.895718] Total swap = 15616764kB
[90621.895719] 2021665 pages RAM
[90621.895720] 0 pages HighMem/MovableOnly
[90621.895720] 28468 pages reserved

I'm not a VM person, so I don't know how to read this, but am I out of RAM but not out of swap (since clearly none was used), or am I out of a specific memory region that is causing me problems?
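For what it's worth, the page counts in that log can be converted to sizes with a few lines of arithmetic (my calculation, not from the log; x86-64 pages are 4 KiB), which shows "2021665 pages RAM" is the machine's full ~7.7 GiB, with swap essentially untouched:

```python
# Convert the OOM log's page counts above to KiB (4 KiB x86-64 pages).
PAGE_KIB = 4
total_pages = 2021665
reserved_pages = 28468

total_kib = total_pages * PAGE_KIB        # total RAM seen by the kernel
reserved_kib = reserved_pages * PAGE_KIB  # pages reserved at boot
print(total_kib, reserved_kib)            # 8086660 KiB (~7.7 GiB), 113872 KiB
```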
I'm not 100% certain btrfs is to blame, but it somehow became suspect when upgrading to 3.15 and getting btrfs problems then caused my 3.14.0 kernel, which had been running fine for 3 months, to also die with the same OOM problems. Then again, I understand it could be a red herring. Suggestions either way are appreciated :)

I tried raising this:
gargamel:~# echo 100 > /proc/sys/vm/swappiness
But so far I have too much unused RAM for any swap to be touched.

gargamel:~# free
             total       used       free     shared    buffers     cached
Mem:       7894792    4938596    2956196          0       2396    2909204
-/+ buffers/cache:    2026996    5867796
Swap:     15616764          0   15616764

The log is too big to paste here, but you can grep it for:
[90817.715833] SysRq : Show Memory
[90893.571151] SysRq : Show backtrace of all active CPUs
[90921.781599] SysRq : Show Blocked State
[91075.976611] SysRq : Show State
[91406.972046] SysRq : Terminate All Tasks
[91410.771584] SysRq : Emergency Remount R/O
[91413.222483] SysRq : Emergency Sync
[91430.316955] SysRq : Power Off
^^^ note the kernel was wedged enough that Power Off didn't work, apparently because it failed to swap:
[91447.490142] CPU: 3 PID: 48 Comm: kswapd0 Not tainted 3.14.0-amd64-i915-preempt-20140216 #2
[91447.490143] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3806 08/20/2012
[91447.490145] task: 8802126b6490 ti: 8802126e4000 task.ti: 8802126e4000
[91447.490146] RIP: 0010:[810898f5]  [810898f5] do_raw_spin_lock+0x23/0x27

Right after OOM started kicking in, the console showed apparent deadlocks in btrfs. But is it possible that btrfs is then also eating all my memory somehow? You can find the long details here: http://marc.merlins.org/tmp/btrfs-oom.txt

[90801.680821] INFO: task btrfs-transacti:3433 blocked for more than 120 seconds.
[90801.712345]       Not tainted 3.14.0-amd64-i915-preempt-20140216 #2
[90801.734394] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
[90801.882691] btrfs-transacti D 88021387e800     0  3433      2 0x
[90801.904863]  88020b20de10 0046 88020b20dfd8 88021387e2d0
[90801.928448]  000141c0 88021387e2d0 880211e94800 880029c00dc0
[90801.952015]  8802009d1f28 8802009d1ed0 88020b20de20
[90801.975701] Call Trace:
[90801.984443]  [8160d2a1] schedule+0x73/0x75
[90802.000438]  [8122b575] btrfs_commit_transaction+0x330/0x849
[90802.021140]  [81085116] ? finish_wait+0x65/0x65
[90802.038438]  [81227c48] transaction_kthread+0xf8/0x1ab
[90802.057571]  [81227b50] ? btrfs_cleanup_transaction+0x43f/0x43f
[90802.079092]  [8106bc62] kthread+0xae/0xb6
[90802.094838]  [8106bbb4] ? __kthread_parkme+0x61/0x61
Re: Is btrfs related to OOM death problems on my 8GB server with both 3.15.1 and 3.14?
On Thu, 3 Jul 2014 18:19:38 Marc MERLIN wrote:
> I upgraded my server from 3.14 to 3.15.1 last week, and since then it's been
> running out of memory and deadlocking (panic= doesn't even work). I
> downgraded back to 3.14, but I already had the problem once since then.

Is there any correlation between such problems and BTRFS operations such as creating snapshots or running a scrub/balance?

Back in ~3.10 days, I had serious problems with BTRFS memory use when removing multiple snapshots or balancing, but at about 3.13 they all seemed to get fixed. I usually didn't have a kernel panic when I had such problems (although I sometimes had a system lock up so solidly that I couldn't even determine what its problem was). Usually the OOM handler started killing big processes, such as chromium, when it shouldn't have needed to.

Note that I haven't verified that BTRFS memory use is reasonable in all such situations, merely that it doesn't use enough to kill my systems.

--
My Main Blog      http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/