I've encountered the same issue; however, in my case it seems to have been
caused by a kernel bug that was present between 4.4.0-58 and
4.4.0-63 (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1655842).
Seeing that you are running 4.4.0-62, I would suggest upgrading and
checking whether the error persists.
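For reference, a quick way to confirm where you stand (the apt commands are the typical Ubuntu 16.04 route; verify against your own package setup):

```shell
# Confirm the running kernel; the Launchpad bug above reportedly affects
# kernels 4.4.0-58 through 4.4.0-63.
uname -r

# If you land in that range, the usual Ubuntu 16.04 upgrade path is:
#   sudo apt-get update && sudo apt-get install linux-image-generic
#   sudo reboot
```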

Edvin Ekström,

On 2017-04-26 09:09, Amudhan P wrote:
> I did volume start force, and the self-heal daemon is now up on the node
> that was down.
>
> But bitrot has now triggered a crawling process on all nodes. Why is it
> crawling the disks again if the process was already running?
>
> [output from bitd.log]
> [2017-04-13 06:01:23.930089] I
> [glusterfsd-mgmt.c:1778:mgmt_getspec_cbk] 0-glusterfs: No change in
> volfile, continuing
> [2017-04-26 06:51:46.998935] I [MSGID: 100030]
> [glusterfsd.c:2460:main] 0-/usr/local/sbin/glusterfs: Started running
> /usr/local/sbin/glusterfs version 3.10.1 (args:
> /usr/local/sbin/glusterfs -s localhost --volfile-id gluster/bitd -p
> /var/lib/glusterd/bitd/run/bitd.pid -l /var/log/glusterfs/bitd.log -S
> /var/run/gluster/02f1dd346d47b9006f9bf64e347338fd.socket
> --global-timer-wheel)
> [2017-04-26 06:51:47.002732] I [MSGID: 101190]
> [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started
> thread with index 1
>
>
> On Tue, Apr 25, 2017 at 11:01 PM, Amudhan P <amudha...@gmail.com> wrote:
>
>     Yes, I have enabled the bitrot process, and it is currently running
>     the signer process on some nodes.
>
>     Disabling and re-enabling bitrot doesn't make a difference; it will
>     just start the crawl process again, right?
>
>
>     On Tuesday, April 25, 2017, Atin Mukherjee <amukh...@redhat.com> wrote:
>     >
>     >
>     > On Tue, Apr 25, 2017 at 9:22 PM, Amudhan P <amudha...@gmail.com> wrote:
>     >>
>     >> Hi Pranith,
>     >> If I restart the glusterd service on that node alone, will it
>     work? Because I feel that doing volume start force will trigger the
>     bitrot process to crawl disks on all nodes.
>     >
>     > Have you enabled bitrot? If not, the process will not exist. As a
>     workaround, you can always disable this option before executing
>     volume start force. Please note that volume start force doesn't
>     affect any running processes.
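The workaround above, sketched as a CLI sequence (the volume name is a placeholder; check the 3.10 docs for the exact bitrot subcommand forms on your build):

```shell
# Disable bitrot first so start-force does not respawn the bitd/scrub daemons,
# then force-start the volume to bring the missing self-heal daemon back up.
gluster volume bitrot myvol disable   # 'myvol' is a placeholder volume name
gluster volume start myvol force      # per Atin: does not affect running processes
gluster volume bitrot myvol enable    # re-enable once shd is confirmed up
```

As noted elsewhere in this thread, re-enabling bitrot can kick off a fresh crawl, so time the last step accordingly.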
>     >  
>     >>
>     >> Yes, rebalance fix-layout is in progress.
>     >> regards
>     >> Amudhan
>     >>
>     >> On Tue, Apr 25, 2017 at 9:15 PM, Pranith Kumar Karampuri
>     <pkara...@redhat.com> wrote:
>     >>>
>     >>> You can restart the process using:
>     >>> gluster volume start <volname> force
>     >>>
>     >>> Did shd on this node heal a lot of data? Based on the kind of
>     memory usage it showed, it seems like there is a leak.
>     >>>
>     >>>
>     >>> Sunil,
>     >>>        Could you check whether there are any leaks in this
>     particular version that we might have missed in our testing?
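To help answer that, a minimal memory-sampling sketch (the pid-file path mentioned in the comments is the usual glusterd default; SHD_PID falls back to the current shell here purely so the sketch runs anywhere):

```shell
# Sketch: sample a daemon's resident set size over time to spot a leak.
# On a stock install the self-heal daemon pid is usually in:
#   /var/lib/glusterd/glustershd/run/glustershd.pid
# SHD_PID defaults to the current shell only so the sketch runs anywhere.
SHD_PID=${SHD_PID:-$$}
for i in 1 2 3; do
    printf '%s rss_kb=%s\n' "$(date +%T)" "$(ps -o rss= -p "$SHD_PID" | tr -d ' ')"
    sleep 1
done
```

With SHD_PID pointed at glustershd and the sleep raised to, say, 60 seconds, a steadily climbing rss_kb between samples suggests a leak rather than a one-off heal workload.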
>     >>>
>     >>>> On Tue, Apr 25, 2017 at 8:37 PM, Amudhan P <amudha...@gmail.com> wrote:
>     >>>>
>     >>>> Hi,
>     >>>> On one of my nodes, the glustershd process was killed due to
>     OOM; this happened on only one node in a 40-node cluster.
>     >>>> The node runs Ubuntu 16.04.2.
>     >>>> dmesg output:
>     >>>> [Mon Apr 24 17:21:38 2017] nrpe invoked oom-killer:
>     gfp_mask=0x26000c0, order=2, oom_score_adj=0
>     >>>> [Mon Apr 24 17:21:38 2017] nrpe cpuset=/ mems_allowed=0
>     >>>> [Mon Apr 24 17:21:38 2017] CPU: 0 PID: 12626 Comm: nrpe Not
>     tainted 4.4.0-62-generic #83-Ubuntu
>     >>>> [Mon Apr 24 17:21:38 2017]  0000000000000286 00000000fc26b170
>     ffff88048bf27af0 ffffffff813f7c63
>     >>>> [Mon Apr 24 17:21:38 2017]  ffff88048bf27cc8 ffff88082a663c00
>     ffff88048bf27b60 ffffffff8120ad4e
>     >>>> [Mon Apr 24 17:21:38 2017]  ffff88087781a870 ffff88087781a860
>     ffffea0011285a80 0000000100000001
>     >>>> [Mon Apr 24 17:21:38 2017] Call Trace:
>     >>>> [Mon Apr 24 17:21:38 2017]  [<ffffffff813f7c63>]
>     dump_stack+0x63/0x90
>     >>>> [Mon Apr 24 17:21:38 2017]  [<ffffffff8120ad4e>]
>     dump_header+0x5a/0x1c5
>     >>>> [Mon Apr 24 17:21:38 2017]  [<ffffffff811926c2>]
>     oom_kill_process+0x202/0x3c0
>     >>>> [Mon Apr 24 17:21:38 2017]  [<ffffffff81192ae9>]
>     out_of_memory+0x219/0x460
>     >>>> [Mon Apr 24 17:21:38 2017]  [<ffffffff81198a5d>]
>     __alloc_pages_slowpath.constprop.88+0x8fd/0xa70
>     >>>> [Mon Apr 24 17:21:38 2017]  [<ffffffff81198e56>]
>     __alloc_pages_nodemask+0x286/0x2a0
>     >>>> [Mon Apr 24 17:21:38 2017]  [<ffffffff81198f0b>]
>     alloc_kmem_pages_node+0x4b/0xc0
>     >>>> [Mon Apr 24 17:21:38 2017]  [<ffffffff8107ea5e>]
>     copy_process+0x1be/0x1b70
>     >>>> [Mon Apr 24 17:21:38 2017]  [<ffffffff8122d013>] ?
>     __fd_install+0x33/0xe0
>     >>>> [Mon Apr 24 17:21:38 2017]  [<ffffffff81713d01>] ?
>     release_sock+0x111/0x160
>     >>>> [Mon Apr 24 17:21:38 2017]  [<ffffffff810805a0>]
>     _do_fork+0x80/0x360
>     >>>> [Mon Apr 24 17:21:38 2017]  [<ffffffff8122429c>] ?
>     SyS_select+0xcc/0x110
>     >>>> [Mon Apr 24 17:21:38 2017]  [<ffffffff81080929>]
>     SyS_clone+0x19/0x20
>     >>>> [Mon Apr 24 17:21:38 2017]  [<ffffffff818385f2>]
>     entry_SYSCALL_64_fastpath+0x16/0x71
>     >>>> [Mon Apr 24 17:21:38 2017] Mem-Info:
>     >>>> [Mon Apr 24 17:21:38 2017] active_anon:553952
>     inactive_anon:206987 isolated_anon:0
>     >>>>                             active_file:3410764
>     inactive_file:3460179 isolated_file:0
>     >>>>                             unevictable:4914 dirty:212868
>     writeback:0 unstable:0
>     >>>>                             slab_reclaimable:386621
>     slab_unreclaimable:31829
>     >>>>                             mapped:6112 shmem:211
>     pagetables:6178 bounce:0
>     >>>>                             free:82623 free_pcp:213 free_cma:0
>     >>>> [Mon Apr 24 17:21:38 2017] Node 0 DMA free:15880kB min:32kB
>     low:40kB high:48kB active_anon:0kB inactive_anon:0k
>     >>>> B active_file:0kB inactive_file:0kB unevictable:0kB
>     isolated(anon):0kB isolated(file):0kB present:15964kB manag
>     >>>> ed:15880kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB
>     shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB
>     >>>>  kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB
>     free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:
>     >>>> 0kB pages_scanned:0 all_unreclaimable? yes
>     >>>> [Mon Apr 24 17:21:38 2017] lowmem_reserve[]: 0 1868 31944
>     31944 31944
>     >>>> [Mon Apr 24 17:21:38 2017] Node 0 DMA32 free:133096kB
>     min:3948kB low:4932kB high:5920kB active_anon:170764kB in
>     >>>> active_anon:206296kB active_file:394236kB
>     inactive_file:525288kB unevictable:980kB isolated(anon):0kB isolated(
>     >>>> file):0kB present:2033596kB managed:1952976kB mlocked:980kB
>     dirty:1552kB writeback:0kB mapped:3904kB shmem:724k
>     >>>> B slab_reclaimable:502176kB slab_unreclaimable:8916kB
>     kernel_stack:1952kB pagetables:1408kB unstable:0kB bounce
>     >>>> :0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
>     writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
>     >>>> [Mon Apr 24 17:21:38 2017] lowmem_reserve[]: 0 0 30076 30076
>     30076
>     >>>> [Mon Apr 24 17:21:38 2017] Node 0 Normal free:181516kB
>     min:63600kB low:79500kB high:95400kB active_anon:2045044
>     >>>> kB inactive_anon:621652kB active_file:13248820kB
>     inactive_file:13315428kB unevictable:18676kB isolated(anon):0kB
>     isolated(file):0kB present:31322112kB managed:30798036kB
>     mlocked:18676kB dirty:849920kB writeback:0kB mapped:20544kB
>     shmem:120kB slab_reclaimable:1044308kB slab_unreclaimable:118400kB
>     kernel_stack:33792kB pagetables:23304kB unstable:0kB bounce:0kB
>     free_pcp:852kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB
>     pages_scanned:0 all_unreclaimable? no
>     >>>> [Mon Apr 24 17:21:38 2017] lowmem_reserve[]: 0 0 0 0 0
>     >>>> [Mon Apr 24 17:21:38 2017] Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB
>     0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB
>     >>>>  1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15880kB
>     >>>> [Mon Apr 24 17:21:38 2017] Node 0 DMA32: 18416*4kB (UME)
>     7480*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*
>     >>>> 512kB 0*1024kB 0*2048kB 0*4096kB = 133504kB
>     >>>> [Mon Apr 24 17:21:38 2017] Node 0 Normal: 44972*4kB (UMEH)
>     13*8kB (EH) 13*16kB (H) 13*32kB (H) 8*64kB (H) 2*128
>     >>>> kB (H) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 181384kB
>     >>>> [Mon Apr 24 17:21:38 2017] Node 0 hugepages_total=0
>     hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
>     >>>> [Mon Apr 24 17:21:38 2017] Node 0 hugepages_total=0
>     hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
>     >>>> [Mon Apr 24 17:21:38 2017] 6878703 total pagecache pages
>     >>>> [Mon Apr 24 17:21:38 2017] 2484 pages in swap cache
>     >>>> [Mon Apr 24 17:21:38 2017] Swap cache stats: add 3533870,
>     delete 3531386, find 3743168/4627884
>     >>>> [Mon Apr 24 17:21:38 2017] Free swap  = 14976740kB
>     >>>> [Mon Apr 24 17:21:38 2017] Total swap = 15623164kB
>     >>>> [Mon Apr 24 17:21:38 2017] 8342918 pages RAM
>     >>>> [Mon Apr 24 17:21:38 2017] 0 pages HighMem/MovableOnly
>     >>>> [Mon Apr 24 17:21:38 2017] 151195 pages reserved
>     >>>> [Mon Apr 24 17:21:38 2017] 0 pages cma reserved
>     >>>> [Mon Apr 24 17:21:38 2017] 0 pages hwpoisoned
>     >>>> [Mon Apr 24 17:21:38 2017] [ pid ]   uid  tgid total_vm    
>      rss nr_ptes nr_pmds swapents oom_score_adj name
>     >>>> [Mon Apr 24 17:21:38 2017] [  566]     0   566    15064    
>      460      33       3     1108             0 systemd
>     >>>> -journal
>     >>>> [Mon Apr 24 17:21:38 2017] [  602]     0   602    23693    
>      182      16       3        0             0 lvmetad
>     >>>> [Mon Apr 24 17:21:38 2017] [  613]     0   613    11241    
>      589      21       3      264         -1000 systemd
>     >>>> -udevd
>     >>>> [Mon Apr 24 17:21:38 2017] [ 1381]   100  1381    25081    
>      440      19       3       25             0 systemd
>     >>>> -timesyn
>     >>>> [Mon Apr 24 17:21:38 2017] [ 1447]     0  1447     1100    
>      307       7       3        0             0 acpid
>     >>>> [Mon Apr 24 17:21:38 2017] [ 1449]     0  1449     7252    
>      374      21       3       47             0 cron
>     >>>> [Mon Apr 24 17:21:38 2017] [ 1451]     0  1451    77253    
>      994      19       3       10             0 lxcfs
>     >>>> [Mon Apr 24 17:21:38 2017] [ 1483]     0  1483     6511    
>      413      18       3       42             0 atd
>     >>>> [Mon Apr 24 17:21:38 2017] [ 1505]     0  1505     7157    
>      286      18       3       36             0 systemd
>     >>>> -logind
>     >>>> [Mon Apr 24 17:21:38 2017] [ 1508]   104  1508    64099    
>      376      27       4      712             0 rsyslog
>     >>>> d
>     >>>> [Mon Apr 24 17:21:38 2017] [ 1510]   107  1510    10723    
>      497      25       3       45          -900 dbus-da
>     >>>> emon
>     >>>> [Mon Apr 24 17:21:38 2017] [ 1521]     0  1521    68970    
>      178      38       3      170             0 account
>     >>>> s-daemon
>     >>>> [Mon Apr 24 17:21:38 2017] [ 1526]     0  1526     6548    
>      785      16       3       63             0 smartd
>     >>>> [Mon Apr 24 17:21:38 2017] [ 1528]     0  1528    54412    
>      146      31       5     1806             0 snapd
>     >>>> [Mon Apr 24 17:21:38 2017] [ 1578]     0  1578     3416    
>      335      11       3       24             0 mdadm
>     >>>> [Mon Apr 24 17:21:38 2017] [ 1595]     0  1595    16380    
>      470      35       3      157         -1000 sshd
>     >>>> [Mon Apr 24 17:21:38 2017] [ 1610]     0  1610    69295    
>      303      40       4       57             0 polkitd
>     >>>> [Mon Apr 24 17:21:38 2017] [ 1618]     0  1618     1306      
>     31       8       3        0             0 iscsid
>     >>>> [Mon Apr 24 17:21:38 2017] [ 1619]     0  1619     1431    
>      877       8       3        0           -17 iscsid
>     >>>> [Mon Apr 24 17:21:38 2017] [ 1624]     0  1624   126363    
>     8027     122       4    22441             0 gluster
>     >>>> d
>     >>>> [Mon Apr 24 17:21:38 2017] [ 1688]     0  1688     4884    
>      430      15       3       46             0 irqbala
>     >>>> nce
>     >>>> [Mon Apr 24 17:21:38 2017] [ 1699]     0  1699     3985    
>      348      13       3        0             0 agetty
>     >>>> [Mon Apr 24 17:21:38 2017] [ 7001]     0  7001   500631  
>      27874     145       5     3356             0 gluster
>     >>>> fsd
>     >>>> [Mon Apr 24 17:21:38 2017] [ 8136]     0  8136   500631  
>      28760     141       5     2390             0 gluster
>     >>>> fsd
>     >>>> [Mon Apr 24 17:21:38 2017] [ 9280]     0  9280   533529  
>      27752     135       5     3200             0 gluster
>     >>>> fsd
>     >>>> [Mon Apr 24 17:21:38 2017] [12626]   111 12626     5991    
>      420      16       3      113             0 nrpe
>     >>>> [Mon Apr 24 17:21:38 2017] [14342]     0 14342   533529  
>      28377     135       5     2176             0 gluster
>     >>>> fsd
>     >>>> [Mon Apr 24 17:21:38 2017] [14361]     0 14361   534063  
>      29190     136       5     1972             0 gluster
>     >>>> fsd
>     >>>> [Mon Apr 24 17:21:38 2017] [14380]     0 14380   533529  
>      28104     136       6     2437             0 glusterfsd
>     >>>> [Mon Apr 24 17:21:38 2017] [14399]     0 14399   533529  
>      27552     131       5     2808             0 glusterfsd
>     >>>> [Mon Apr 24 17:21:38 2017] [14418]     0 14418   533529  
>      29588     138       5     2697             0 glusterfsd
>     >>>> [Mon Apr 24 17:21:38 2017] [14437]     0 14437   517080  
>      28671     146       5     2170             0 glusterfsd
>     >>>> [Mon Apr 24 17:21:38 2017] [14456]     0 14456   533529  
>      28083     139       5     3359             0 glusterfsd
>     >>>> [Mon Apr 24 17:21:38 2017] [14475]     0 14475   533529  
>      28054     134       5     2954             0 glusterfsd
>     >>>> [Mon Apr 24 17:21:38 2017] [14494]     0 14494   533529  
>      28594     135       5     2311             0 glusterfsd
>     >>>> [Mon Apr 24 17:21:38 2017] [14513]     0 14513   533529  
>      28911     138       5     2833             0 glusterfsd
>     >>>> [Mon Apr 24 17:21:38 2017] [14532]     0 14532   533529  
>      28259     134       6     3145             0 glusterfsd
>     >>>> [Mon Apr 24 17:21:38 2017] [14551]     0 14551   533529  
>      27875     138       5     2267             0 glusterfsd
>     >>>> [Mon Apr 24 17:21:38 2017] [14570]     0 14570   484716  
>      28247     142       5     2875             0 glusterfsd
>     >>>> [Mon Apr 24 17:21:38 2017] [27646]     0 27646  3697561  
>     202086    2830      17    16528             0 glusterfs
>     >>>> [Mon Apr 24 17:21:38 2017] [27655]     0 27655   787371  
>      29588     197       6    25472             0 glusterfs
>     >>>> [Mon Apr 24 17:21:38 2017] [27665]     0 27665   689585    
>      605     108       6     7008             0 glusterfs
>     >>>> [Mon Apr 24 17:21:38 2017] [29878]     0 29878   193833  
>      36054     241       4    41182             0 glusterfs
>     >>>> [Mon Apr 24 17:21:38 2017] Out of memory: Kill process 27646
>     (glusterfs) score 17 or sacrifice child
>     >>>> [Mon Apr 24 17:21:38 2017] Killed process 27646 (glusterfs)
>     total-vm:14790244kB, anon-rss:795040kB, file-rss:13304kB
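As a sanity check on the dump: the rss column in the OOM task table is counted in 4 KiB pages, and for the killed glusterfs process (pid 27646) it matches the kill line's anon-rss + file-rss exactly:

```shell
# The oom task table's rss column counts 4 KiB pages.
echo $((202086 * 4))         # -> 808344 kB resident for pid 27646
echo $((795040 + 13304))     # -> 808344 kB: anon-rss + file-rss from the kill line
echo $((202086 * 4 / 1024))  # -> 789, i.e. roughly 789 MB
echo $((3697561 * 4))        # -> 14790244 kB, matching total-vm:14790244kB
```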
>     >>>> /var/log/glusterfs/glusterd.log
>     >>>> [2017-04-24 11:53:51.359603] I [MSGID: 106006]
>     [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify]
>     0-management: glustershd has disconnected from glusterd.
>     >>>> What could have gone wrong?
>     >>>> regards
>     >>>> Amudhan
>     >>>>
>     >>>> _______________________________________________
>     >>>> Gluster-users mailing list
>     >>>> Gluster-users@gluster.org
>     >>>> http://lists.gluster.org/mailman/listinfo/gluster-users
>     >>>
>     >>>
>     >>>
>     >>> --
>     >>> Pranith
>     >>
>     >>
>     >
>     >
>
>
>
>
