Can you check your volume file contents? Maybe it really can't find (or access) a specific volfile?

Best Regards,
Strahil Nikolov

On Fri, Mar 24, 2023 at 8:07, Diego Zuccato <[email protected]> wrote:
> In glfsheal-Connection.log I see many lines like:
> [2023-03-13 23:04:40.241481 +0000] E [MSGID: 104021] [glfs-mgmt.c:586:glfs_mgmt_getspec_cbk] 0-gfapi: failed to get the volume file [{from server}, {errno=2}, {error=File o directory non esistente}]
> (errno=2 is ENOENT; the Italian locale string means "No such file or directory")
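> A quick way to check whether the volfiles actually exist and are fetchable (an untested sketch, assuming the stock Debian layout where glusterd keeps the generated volfiles under /var/lib/glusterd/vols/<volname>/):
> -8<--
> # are the generated volfiles present on the management node?
> ls -l /var/lib/glusterd/vols/cluster_data/*.vol
> # can the client volfile be fetched the way gfapi does?
> # (long-standing, though undocumented, CLI command)
> gluster system:: getspec cluster_data | head
> -8<--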
> And *lots* of gfid-mismatch errors in glustershd.log.
> Couldn't find anything that would prevent heals from starting. :(
>
> Diego
>
> On 21/03/2023 20:39, Strahil Nikolov wrote:
> > I have no clue. Have you checked for errors in the logs? Maybe you might find something useful.
> >
> > Best Regards,
> > Strahil Nikolov
> >
> > On Tue, Mar 21, 2023 at 9:56, Diego Zuccato <[email protected]> wrote:
> > > Killed glfsheal; after a day there were 218 processes, then they got killed by OOM during the weekend. Now there are no processes active.
> > > Trying to run "heal info" reports lots of files quite quickly but does not spawn any glfsheal process. And neither does restarting glusterd.
> > > Is there some way to selectively run glfsheal to fix one brick at a time?
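> > > From the ps output quoted below, glfsheal seems to be invoked per volume rather than per brick ("/usr/libexec/glusterfs/glfsheal cluster_data info-summary --xml"), so I suspect a per-brick run isn't possible; the per-volume triggers I know of, as a hedged sketch:
> > > -8<--
> > > # heal only entries already marked as pending:
> > > gluster volume heal cluster_data
> > > # force a full crawl and heal of the whole volume:
> > > gluster volume heal cluster_data full
> > > -8<--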
> > >
> > > Diego
> > >
> > > On 21/03/2023 01:21, Strahil Nikolov wrote:
> > > > Theoretically it might help. If possible, try to resolve any pending heals.
> > > >
> > > > Best Regards,
> > > > Strahil Nikolov
> > > >
> > > > On Thu, Mar 16, 2023 at 15:29, Diego Zuccato <[email protected]> wrote:
> > > > > In Debian stopping glusterd does not stop the brick processes: to stop everything (and free the memory) I have to
> > > > >
> > > > > systemctl stop glusterd
> > > > > killall glusterfs{,d}
> > > > > killall glfsheal
> > > > > systemctl start glusterd
> > > > >
> > > > > [this behaviour hangs a simple reboot of a machine running glusterd... not nice]
> > > > >
> > > > > For now I just restarted glusterd w/o killing the bricks:
> > > > >
> > > > > root@str957-clustor00:~# ps aux|grep glfsheal|wc -l ; systemctl restart glusterd ; ps aux|grep glfsheal|wc -l
> > > > > 618
> > > > > 618
> > > > >
> > > > > No change, neither in glfsheal processes nor in free memory :(
> > > > > Should I "killall glfsheal" before OOM kicks in?
> > > > >
> > > > > Diego
> > > > >
> > > > > On 16/03/2023 12:37, Strahil Nikolov wrote:
> > > > > > Can you restart the glusterd service (first check that it was not modified to kill the bricks)?
> > > > > >
> > > > > > Best Regards,
> > > > > > Strahil Nikolov
> > > > > >
> > > > > > On Thu, Mar 16, 2023 at 8:26, Diego Zuccato <[email protected]> wrote:
> > > > > > > OOM is just a matter of time.
> > > > > > >
> > > > > > > Today mem use is up to 177G/187G and:
> > > > > > > # ps aux|grep glfsheal|wc -l
> > > > > > > 551
> > > > > > > (well, one is actually the grep process, so "only" 550 glfsheal processes).
> > > > > > >
> > > > > > > I'll take the last 5:
> > > > > > > root 3266352 0.5 0.0 600292 93044 ? Sl 06:55 0:07 /usr/libexec/glusterfs/glfsheal cluster_data info-summary --xml
> > > > > > > root 3267220 0.7 0.0 600292 91964 ? Sl 07:00 0:07 /usr/libexec/glusterfs/glfsheal cluster_data info-summary --xml
> > > > > > > root 3268076 1.0 0.0 600160 88216 ? Sl 07:05 0:08 /usr/libexec/glusterfs/glfsheal cluster_data info-summary --xml
> > > > > > > root 3269492 1.6 0.0 600292 91248 ? Sl 07:10 0:07 /usr/libexec/glusterfs/glfsheal cluster_data info-summary --xml
> > > > > > > root 3270354 4.4 0.0 600292 93260 ? Sl 07:15 0:07 /usr/libexec/glusterfs/glfsheal cluster_data info-summary --xml
> > > > > > >
> > > > > > > -8<--
> > > > > > > root@str957-clustor00:~# ps -o ppid= 3266352
> > > > > > > 3266345
> > > > > > > root@str957-clustor00:~# ps -o ppid= 3267220
> > > > > > > 3267213
> > > > > > > root@str957-clustor00:~# ps -o ppid= 3268076
> > > > > > > 3268069
> > > > > > > root@str957-clustor00:~# ps -o ppid= 3269492
> > > > > > > 3269485
> > > > > > > root@str957-clustor00:~# ps -o ppid= 3270354
> > > > > > > 3270347
> > > > > > > root@str957-clustor00:~# ps aux|grep 3266345
> > > > > > > root 3266345 0.0 0.0 430536 10764 ? Sl 06:55 0:00 gluster volume heal cluster_data info summary --xml
> > > > > > > root 3271532 0.0 0.0 6260 2500 pts/1 S+ 07:21 0:00 grep 3266345
> > > > > > > root@str957-clustor00:~# ps aux|grep 3267213
> > > > > > > root 3267213 0.0 0.0 430536 10644 ? Sl 07:00 0:00 gluster volume heal cluster_data info summary --xml
> > > > > > > root 3271599 0.0 0.0 6260 2480 pts/1 S+ 07:22 0:00 grep 3267213
> > > > > > > root@str957-clustor00:~# ps aux|grep 3268069
> > > > > > > root 3268069 0.0 0.0 430536 10704 ? Sl 07:05 0:00 gluster volume heal cluster_data info summary --xml
> > > > > > > root 3271626 0.0 0.0 6260 2516 pts/1 S+ 07:22 0:00 grep 3268069
> > > > > > > root@str957-clustor00:~# ps aux|grep 3269485
> > > > > > > root 3269485 0.0 0.0 430536 10756 ? Sl 07:10 0:00 gluster volume heal cluster_data info summary --xml
> > > > > > > root 3271647 0.0 0.0 6260 2480 pts/1 S+ 07:22 0:00 grep 3269485
> > > > > > > root@str957-clustor00:~# ps aux|grep 3270347
> > > > > > > root 3270347 0.0 0.0 430536 10672 ? Sl 07:15 0:00 gluster volume heal cluster_data info summary --xml
> > > > > > > root 3271666 0.0 0.0 6260 2568 pts/1 S+ 07:22 0:00 grep 3270347
> > > > > > > -8<--
> > > > > > >
> > > > > > > So every glfsheal has a "gluster volume heal cluster_data info summary --xml" parent: it seems each heal-info query spawns a glfsheal that never exits.
> > > > > > > I can't rule out a metadata corruption (or at least a desync), but it shouldn't happen...
> > > > > > >
> > > > > > > Diego
> > > > > > >
> > > > > > > On 15/03/2023 20:11, Strahil Nikolov wrote:
> > > > > > > > If you don't experience any OOM, you can focus on the heals.
> > > > > > > >
> > > > > > > > 284 glfsheal processes seems odd.
> > > > > > > > Can you check the ppid for 2-3 randomly picked ones?
> > > > > > > > ps -o ppid= <pid>
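> > > > > > > > With a few hundred of them, a small loop (plain pgrep/ps; an untested sketch) aggregates all the parents at once:
> > > > > > > > -8<--
> > > > > > > > # count the distinct parent command lines of all glfsheal processes
> > > > > > > > for p in $(pgrep -x glfsheal); do
> > > > > > > >     ps -o args= -p "$(ps -o ppid= -p "$p")"
> > > > > > > > done | sort | uniq -c | sort -rn
> > > > > > > > -8<--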
> > > > > > > >
> > > > > > > > Best Regards,
> > > > > > > > Strahil Nikolov
> > > > > > > >
> > > > > > > > On Wed, Mar 15, 2023 at 9:54, Diego Zuccato <[email protected]> wrote:
> > > > > > > > > I enabled it yesterday and that greatly reduced memory pressure.
> > > > > > > > > Current volume info:
> > > > > > > > > -8<--
> > > > > > > > > Volume Name: cluster_data
> > > > > > > > > Type: Distributed-Replicate
> > > > > > > > > Volume ID: a8caaa90-d161-45bb-a68c-278263a8531a
> > > > > > > > > Status: Started
> > > > > > > > > Snapshot Count: 0
> > > > > > > > > Number of Bricks: 45 x (2 + 1) = 135
> > > > > > > > > Transport-type: tcp
> > > > > > > > > Bricks:
> > > > > > > > > Brick1: clustor00:/srv/bricks/00/d
> > > > > > > > > Brick2: clustor01:/srv/bricks/00/d
> > > > > > > > > Brick3: clustor02:/srv/bricks/00/q (arbiter)
> > > > > > > > > [...]
> > > > > > > > > Brick133: clustor01:/srv/bricks/29/d
> > > > > > > > > Brick134: clustor02:/srv/bricks/29/d
> > > > > > > > > Brick135: clustor00:/srv/bricks/14/q (arbiter)
> > > > > > > > > Options Reconfigured:
> > > > > > > > > performance.quick-read: off
> > > > > > > > > cluster.entry-self-heal: on
> > > > > > > > > cluster.data-self-heal-algorithm: full
> > > > > > > > > cluster.metadata-self-heal: on
> > > > > > > > > cluster.shd-max-threads: 2
> > > > > > > > > network.inode-lru-limit: 500000
> > > > > > > > > performance.md-cache-timeout: 600
> > > > > > > > > performance.cache-invalidation: on
> > > > > > > > > features.cache-invalidation-timeout: 600
> > > > > > > > > features.cache-invalidation: on
> > > > > > > > > features.quota-deem-statfs: on
> > > > > > > > > performance.readdir-ahead: on
> > > > > > > > > cluster.granular-entry-heal: enable
> > > > > > > > > features.scrub: Active
> > > > > > > > > features.bitrot: on
> > > > > > > > > cluster.lookup-optimize: on
> > > > > > > > > performance.stat-prefetch: on
> > > > > > > > > performance.cache-refresh-timeout: 60
> > > > > > > > > performance.parallel-readdir: on
> > > > > > > > > performance.write-behind-window-size: 128MB
> > > > > > > > > cluster.self-heal-daemon: enable
> > > > > > > > > features.inode-quota: on
> > > > > > > > > features.quota: on
> > > > > > > > > transport.address-family: inet
> > > > > > > > > nfs.disable: on
> > > > > > > > > performance.client-io-threads: off
> > > > > > > > > client.event-threads: 1
> > > > > > > > > features.scrub-throttle: normal
> > > > > > > > > diagnostics.brick-log-level: ERROR
> > > > > > > > > diagnostics.client-log-level: ERROR
> > > > > > > > > config.brick-threads: 0
> > > > > > > > > cluster.lookup-unhashed: on
> > > > > > > > > config.client-threads: 1
> > > > > > > > > cluster.use-anonymous-inode: off
> > > > > > > > > diagnostics.brick-sys-log-level: CRITICAL
> > > > > > > > > features.scrub-freq: monthly
> > > > > > > > > cluster.data-self-heal: on
> > > > > > > > > cluster.brick-multiplex: on
> > > > > > > > > cluster.daemon-log-level: ERROR
> > > > > > > > > -8<--
> > > > > > > > >
> > > > > > > > > htop reports that memory usage is up to 143G; there are 602 tasks and 5232 threads (~20 running) on clustor00, 117G/49 tasks/1565 threads on clustor01 and 126G/45 tasks/1574 threads on clustor02.
> > > > > > > > > I see quite a lot (284!) of glfsheal processes running on clustor00 (a "gluster v heal cluster_data info summary" has been running on clustor02 since yesterday, still with no output). Shouldn't there be just one per brick?
> > > > > > > > >
> > > > > > > > > Diego
> > > > > > > > >
> > > > > > > > > On 15/03/2023 08:30, Strahil Nikolov wrote:
> > > > > > > > > > Do you use brick multiplexing?
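> > > > > > > > > > It's a cluster-wide option, so (as far as I know) you query it with the special volume name "all":
> > > > > > > > > > -8<--
> > > > > > > > > > gluster volume get all cluster.brick-multiplex
> > > > > > > > > > -8<--
> > > > > > > > > > When it's enabled, all bricks of a node share a single glusterfsd process, which usually reduces memory usage considerably.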
> > > > > > > > > >
> > > > > > > > > > Best Regards,
> > > > > > > > > > Strahil Nikolov
> > > > > > > > > >
> > > > > > > > > > On Tue, Mar 14, 2023 at 16:44, Diego Zuccato <[email protected]> wrote:
> > > > > > > > > > > Hello all.
> > > > > > > > > > >
> > > > > > > > > > > Our Gluster 9.6 cluster is showing increasing problems.
> > > > > > > > > > > Currently it's composed of 3 servers (2x Intel Xeon 4210 [20 cores dual thread, total 40 threads], 192GB RAM, 30x HGST HUH721212AL5200 [12TB]), configured in replica 3 arbiter 1, using Debian packages from the latest Gluster 9.x repository.
> > > > > > > > > > >
> > > > > > > > > > > Seems 192GB of RAM is not enough to handle 30 data bricks + 15 arbiters, and I often had to reload glusterfsd because glusterfs processes got killed by OOM.
> > > > > > > > > > > On top of that, performance has been quite bad, especially when we reached about 20M files, and one of the servers has had mobo issues that resulted in memory errors which corrupted some bricks' filesystems (XFS; it required "xfs_repair -L" to fix).
> > > > > > > > > > > Now I'm getting lots of "stale file handle" errors and other errors (like directories that seem empty from the client but still contain files on some bricks), and auto healing seems unable to complete.
> > > > > > > > > > >
> > > > > > > > > > > Since I can't keep up with manually fixing all the issues, I'm thinking about a backup+destroy+recreate strategy.
> > > > > > > > > > >
> > > > > > > > > > > I think that if I reduce the number of bricks per server to just 5 (RAID1 of 6x12TB disks) I might resolve the RAM issues - at the cost of longer heal times in case a disk fails. Am I right, or is it useless?
> > > > > > > > > > > Other recommendations? Servers have space for another 6 disks: maybe those could be used for some SSDs to speed up access?
> > > > > > > > > > >
> > > > > > > > > > > TIA.
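> > > > > > > > > > >
> > > > > > > > > > > P.S. For the directories that look empty on the client but still hold files on some bricks, comparing the GFID xattr directly on the bricks shows whether the replicas have diverged (run as root on each server; the path below is only an example):
> > > > > > > > > > > -8<--
> > > > > > > > > > > getfattr -d -m . -e hex /srv/bricks/00/d/path/to/suspect-dir
> > > > > > > > > > > # trusted.gfid must be identical on all three bricks of the replica set
> > > > > > > > > > > -8<--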
>
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786
________

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-users mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-users
