Hi Strahil, I believe we are using the standard MTU of 1500 (would need to check with the network people to be sure). Does it make a difference?
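For what it's worth, here is a rough sketch of how we could check this locally rather than waiting on the network people (assuming "eth0" is the interface carrying the Gluster traffic and "gfs2" is one of the peer servers - both names are placeholders):

# Show the MTU currently set on the storage interface
ip link show eth0 | grep -o 'mtu [0-9]*'

# Jumbo frames (MTU 9000) only help if every NIC and switch on the path
# supports them; a non-fragmenting ping verifies the path end to end.
# 8972 = 9000 bytes minus 28 bytes of IP + ICMP headers.
ping -M do -c 3 -s 8972 gfs2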
I'm afraid I don't know about the scheduler - where do I find that? Thank you for the suggestions about turning off performance.read-ahead and performance.readdir-ahead. On Tue, 7 Jan 2020 at 18:08, Strahil <hunter86...@yahoo.com> wrote: > Hi David, > > It's difficult to find anything structured (but it's the same for Linux > and other tech). I use Red Hat's documentation, guides online (cross-checking > the options with the official documentation) and experience shared on the > mailing list. > > I don't see anything (in /var/lib/gluster/groups) that will match your > profile, but I think that you should try with performance.read-ahead and > performance.readdir-ahead 'off'. I have found a bug report (I didn't read the > whole thing) that might be interesting for you: > > https://bugzilla.redhat.com/show_bug.cgi?id=1601166 > > Also, the Arbiter is very important in order to avoid split brain situations > (but based on my experience, issues can still occur), and it's best for the > Arbiter's brick to be an SSD, as it needs to process the metadata as fast as > possible. With v7, there is an option to have an Arbiter even > in the cloud (remote arbiter) that is used only when one data brick is down. > > Please report the issue with the cache - it should not be like that. > > Are you using Jumbo frames (MTU 9000)? > What is your brick's I/O scheduler? > > Best Regards, > Strahil Nikolov > On Jan 7, 2020 01:34, David Cunningham <dcunning...@voisonics.com> wrote: > > Hi Strahil, > > We may have had a heal since the GFS arbiter node wasn't accessible from > the GFS clients, only from the other GFS servers. Unfortunately we haven't > been able to reproduce the problem seen in production while testing, so we are > unsure whether making the GFS arbiter node directly available to clients > has fixed the issue. > > The load on GFS is mainly: > 1. There are a small number of files around 5MB in size which are read > often and change infrequently. > 2. There are a large number of directories which are opened frequently to > read the list of contents. > 3. There are a large number of new files around 5MB in size written > frequently and read infrequently. > > We haven't touched the tuning options as we don't really feel qualified to > tell what needs changing from the defaults. Do you know of any suitable > guides to get started? > > For some reason performance.cache-size is reported as both 32MB and 128MB. > Is it worth reporting even for version 5.6? > > Here is the "gluster volume info" taken on the first node. Note that the > third node (the arbiter) is currently taken out of the cluster: > Volume Name: gvol0 > Type: Replicate > Volume ID: fb5af69e-1c3e-4164-8b23-c1d7bec9b1b6 > Status: Started > Snapshot Count: 0 > Number of Bricks: 1 x 2 = 2 > Transport-type: tcp > Bricks: > Brick1: gfs1:/nodirectwritedata/gluster/gvol0 > Brick2: gfs2:/nodirectwritedata/gluster/gvol0 > Options Reconfigured: > diagnostics.client-log-level: INFO > performance.client-io-threads: off > nfs.disable: on > transport.address-family: inet > > Thanks for your help and advice. > > > On Sat, 28 Dec 2019 at 17:46, Strahil <hunter86...@yahoo.com> wrote: > > Hi David, > > It seems that I have misread your quorum options, so just ignore that from > my previous e-mail. 
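As a minimal sketch of the scheduler check and the read-ahead changes discussed above (assuming the brick sits on a block device named /dev/sdb - a placeholder - and the volume is gvol0, as in the info output quoted above):

# The active I/O scheduler of the brick's block device is shown in brackets
cat /sys/block/sdb/queue/scheduler
# e.g. output: noop [deadline] cfq

# Turn off the two read-ahead translators suggested above
gluster volume set gvol0 performance.read-ahead off
gluster volume set gvol0 performance.readdir-ahead off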
> > Best Regards, > Strahil Nikolov > On Dec 27, 2019 15:38, Strahil <hunter86...@yahoo.com> wrote: > > Hi David, > > Gluster supports live rolling upgrade, so there is no need to redeploy at > all - but the migration notes should be checked as some features must be > disabled first. > Also, the gluster client should remount in order to bump the gluster > op-version. > > What kind of workload do you have? > I'm asking as there are predefined (and recommended) settings located at > /var/lib/gluster/groups. > You can check the options for each group and cross-check the meaning of the > options in the docs before activating a setting. > > I still have a vague feeling that, during that peak of network > bandwidth, there was a heal going on. Have you checked that? > > Also, sharding is very useful when you work with large files, as the > heal is reduced to the size of the shard. > > N.B.: Once sharding is enabled, DO NOT DISABLE it - as you will lose > your data. > > Using Gluster v7.1 (soon on CentOS & Debian) allows using the latest > features and optimizations, while support from the Gluster dev community is > quite active. > > P.S: I'm wondering how 'performance.cache-size' can be both 32 MB and 128 > MB. Please double-check this (maybe I'm reading it wrong on my smartphone) > and if needed raise a bug on bugzilla.redhat.com > > P.S2: Please provide 'gluster volume info', as 'cluster.quorum-type' -> > 'none' is not normal for replicated volumes (arbiters are used in replica > volumes). > > According to the documentation (https://docs.gluster.org/en/latest/Administrator%20Guide/arbiter-volumes-and-quorum/): > > *Note:* Enabling the arbiter feature *automatically* configures client-quorum > to 'auto'. This setting is *not* to be changed. > > Here is my output (Hyperconverged Virtualization Cluster -> oVirt): > # gluster volume info engine | grep quorum > cluster.quorum-type: auto > cluster.server-quorum-type: server > > Changing quorum is riskier than other options, so you need to take the > necessary precautions. I think we all know what will happen if the > cluster is out of quorum and you change the quorum settings to more > stringent ones :D > > P.S3: If you decide to reset your gluster volume to the defaults, you can > create a new volume (same type as the current one), then get the options for > that volume, put them in a file, and bulk apply them via 'gluster volume > set <Original Volume> group custom-group', where the file is located > on every gluster server in the '/var/lib/gluster/groups' directory. > Last, get rid of the sample volume. > > Best Regards, > Strahil Nikolov > On Dec 27, 2019 03:22, David Cunningham <dcunning...@voisonics.com> wrote: > > Hi Strahil, > > Our volume options are as below. Thanks for the suggestion to upgrade to > version 6 or 7. We could do that by simply removing the current > installation and installing the new one (since it's not live right now). We > might have to convince the customer that it's likely to succeed though, as > at the moment I think they believe that GFS is not going to work for them. 
> > Option Value > > ------ ----- > > cluster.lookup-unhashed on > > cluster.lookup-optimize on > > cluster.min-free-disk 10% > > cluster.min-free-inodes 5% > > cluster.rebalance-stats off > > cluster.subvols-per-directory (null) > > cluster.readdir-optimize off > > cluster.rsync-hash-regex (null) > > cluster.extra-hash-regex (null) > > cluster.dht-xattr-name trusted.glusterfs.dht > > cluster.randomize-hash-range-by-gfid off > > cluster.rebal-throttle normal > > cluster.lock-migration off > > cluster.force-migration off > > cluster.local-volume-name (null) > > cluster.weighted-rebalance on > > cluster.switch-pattern (null) > > cluster.entry-change-log on > > cluster.read-subvolume (null) > > cluster.read-subvolume-index -1 > > cluster.read-hash-mode 1 > > cluster.background-self-heal-count 8 > > cluster.metadata-self-heal on > > cluster.data-self-heal on > > cluster.entry-self-heal on > > cluster.self-heal-daemon on > > cluster.heal-timeout 600 > > cluster.self-heal-window-size 1 > > cluster.data-change-log on > > cluster.metadata-change-log on > > cluster.data-self-heal-algorithm (null) > > cluster.eager-lock on > > disperse.eager-lock on > > disperse.other-eager-lock on > > disperse.eager-lock-timeout 1 > > disperse.other-eager-lock-timeout 1 > > cluster.quorum-type none > > cluster.quorum-count (null) > > cluster.choose-local true > > cluster.self-heal-readdir-size 1KB > > cluster.post-op-delay-secs 1 > > cluster.ensure-durability on > > cluster.consistent-metadata no > > cluster.heal-wait-queue-length 128 > > cluster.favorite-child-policy none > > cluster.full-lock yes > > cluster.stripe-block-size 128KB > > cluster.stripe-coalesce true > > diagnostics.latency-measurement off > > diagnostics.dump-fd-stats off > > diagnostics.count-fop-hits off > > diagnostics.brick-log-level INFO > > diagnostics.client-log-level INFO > > diagnostics.brick-sys-log-level CRITICAL > > diagnostics.client-sys-log-level CRITICAL > > diagnostics.brick-logger (null) > > diagnostics.client-logger (null) > > diagnostics.brick-log-format (null) > > diagnostics.client-log-format (null) > > diagnostics.brick-log-buf-size 5 > > diagnostics.client-log-buf-size 5 > > diagnostics.brick-log-flush-timeout 120 > > diagnostics.client-log-flush-timeout 120 > > diagnostics.stats-dump-interval 0 > > diagnostics.fop-sample-interval 0 > > diagnostics.stats-dump-format json > > diagnostics.fop-sample-buf-size 65535 > > diagnostics.stats-dnscache-ttl-sec 86400 > > performance.cache-max-file-size 0 > > performance.cache-min-file-size 0 > > performance.cache-refresh-timeout 1 > > performance.cache-priority > > performance.cache-size 32MB > > performance.io-thread-count 16 > > performance.high-prio-threads 16 > > performance.normal-prio-threads 16 > > performance.low-prio-threads 16 > > performance.least-prio-threads 1 > > performance.enable-least-priority on > > performance.iot-watchdog-secs (null) > > performance.iot-cleanup-disconnected-reqsoff > > performance.iot-pass-through false > > performance.io-cache-pass-through false > > performance.cache-size 128MB > > performance.qr-cache-timeout 1 > > performance.cache-invalidation false > > performance.ctime-invalidation false > > performance.flush-behind on > > performance.nfs.flush-behind on > > performance.write-behind-window-size 1MB > > performance.resync-failed-syncs-after-fsyncoff > > performance.nfs.write-behind-window-size1MB > > performance.strict-o-direct off > > performance.nfs.strict-o-direct off > > performance.strict-write-ordering off > > 
performance.nfs.strict-write-ordering off > > performance.write-behind-trickling-writeson > > performance.aggregate-size 128KB > > performance.nfs.write-behind-trickling-writeson > > performance.lazy-open yes > > performance.read-after-open yes > > performance.open-behind-pass-through false > > performance.read-ahead-page-count 4 > > performance.read-ahead-pass-through false > > performance.readdir-ahead-pass-through false > > performance.md-cache-pass-through false > > performance.md-cache-timeout 1 > > performance.cache-swift-metadata true > > performance.cache-samba-metadata false > > performance.cache-capability-xattrs true > > performance.cache-ima-xattrs true > > performance.md-cache-statfs off > > performance.xattr-cache-list > > performance.nl-cache-pass-through false > > features.encryption off > > encryption.master-key (null) > > encryption.data-key-size 256 > > encryption.block-size 4096 > > network.frame-timeout 1800 > > network.ping-timeout 42 > > network.tcp-window-size (null) > > network.remote-dio disable > > client.event-threads 2 > > client.tcp-user-timeout 0 > > client.keepalive-time 20 > > client.keepalive-interval 2 > > client.keepalive-count 9 > > network.tcp-window-size (null) > > network.inode-lru-limit 16384 > > auth.allow * > > auth.reject (null) > > transport.keepalive 1 > > server.allow-insecure on > > server.root-squash off > > server.anonuid 65534 > > server.anongid 65534 > > server.statedump-path /var/run/gluster > > server.outstanding-rpc-limit 64 > > server.ssl (null) > > auth.ssl-allow * > > server.manage-gids off > > server.dynamic-auth on > > client.send-gids on > > server.gid-timeout 300 > > server.own-thread (null) > > server.event-threads 1 > > server.tcp-user-timeout 0 > > server.keepalive-time 20 > > server.keepalive-interval 2 > > server.keepalive-count 9 > > transport.listen-backlog 1024 > > ssl.own-cert (null) > > ssl.private-key (null) > > ssl.ca-list (null) > > ssl.crl-path (null) > > ssl.certificate-depth (null) > > ssl.cipher-list (null) > > ssl.dh-param (null) > > ssl.ec-curve (null) > > transport.address-family inet > > performance.write-behind on > > performance.read-ahead on > > performance.readdir-ahead on > > performance.io-cache on > > performance.quick-read on > > performance.open-behind on > > performance.nl-cache off > > performance.stat-prefetch on > > performance.client-io-threads off > > performance.nfs.write-behind on > > performance.nfs.read-ahead off > > performance.nfs.io-cache off > > performance.nfs.quick-read off > > performance.nfs.stat-prefetch off > > performance.nfs.io-threads off > > performance.force-readdirp true > > performance.cache-invalidation false > > features.uss off > > features.snapshot-directory .snaps > > features.show-snapshot-directory off > > features.tag-namespaces off > > network.compression off > > network.compression.window-size -15 > > network.compression.mem-level 8 > > network.compression.min-size 0 > > network.compression.compression-level -1 > > network.compression.debug false > > features.default-soft-limit 80% > > features.soft-timeout 60 > > features.hard-timeout 5 > > features.alert-time 86400 > > features.quota-deem-statfs off > > geo-replication.indexing off > > geo-replication.indexing off > > geo-replication.ignore-pid-check off > > geo-replication.ignore-pid-check off > > features.quota off > > features.inode-quota off > > features.bitrot disable > > debug.trace off > > debug.log-history no > > debug.log-file no > > debug.exclude-ops (null) > > debug.include-ops (null) > > 
debug.error-gen off > > debug.error-failure (null) > > debug.error-number (null) > > debug.random-failure off > > debug.error-fops (null) > > nfs.disable on > > features.read-only off > > features.worm off > > features.worm-file-level off > > features.worm-files-deletable on > > features.default-retention-period 120 > > features.retention-mode relax > > features.auto-commit-period 180 > > storage.linux-aio off > > storage.batch-fsync-mode reverse-fsync > > storage.batch-fsync-delay-usec 0 > > storage.owner-uid -1 > > storage.owner-gid -1 > > storage.node-uuid-pathinfo off > > storage.health-check-interval 30 > > storage.build-pgfid off > > storage.gfid2path on > > storage.gfid2path-separator : > > storage.reserve 1 > > storage.health-check-timeout 10 > > storage.fips-mode-rchecksum off > > storage.force-create-mode 0000 > > storage.force-directory-mode 0000 > > storage.create-mask 0777 > > storage.create-directory-mask 0777 > > storage.max-hardlinks 100 > > storage.ctime off > > storage.bd-aio off > > config.gfproxyd off > > cluster.server-quorum-type off > > cluster.server-quorum-ratio 0 > > changelog.changelog off > > changelog.changelog-dir {{ brick.path > }}/.glusterfs/changelogs > changelog.encoding ascii > > changelog.rollover-time 15 > > changelog.fsync-interval 5 > > changelog.changelog-barrier-timeout 120 > > changelog.capture-del-path off > > features.barrier disable > > features.barrier-timeout 120 > > features.trash off > > features.trash-dir .trashcan > > features.trash-eliminate-path (null) > > features.trash-max-filesize 5MB > > features.trash-internal-op off > > cluster.enable-shared-storage disable > > cluster.write-freq-threshold 0 > > cluster.read-freq-threshold 0 > > cluster.tier-pause off > > cluster.tier-promote-frequency 120 > > cluster.tier-demote-frequency 3600 > > cluster.watermark-hi 90 > > cluster.watermark-low 75 > > cluster.tier-mode cache > > cluster.tier-max-promote-file-size 0 > > cluster.tier-max-mb 4000 > > cluster.tier-max-files 10000 > > cluster.tier-query-limit 100 > > cluster.tier-compact on > > cluster.tier-hot-compact-frequency 604800 > > cluster.tier-cold-compact-frequency 604800 > > features.ctr-enabled off > > features.record-counters off > > features.ctr-record-metadata-heat off > > features.ctr_link_consistency off > > features.ctr_lookupheal_link_timeout 300 > > features.ctr_lookupheal_inode_timeout 300 > > features.ctr-sql-db-cachesize 12500 > > features.ctr-sql-db-wal-autocheckpoint 25000 > > features.selinux on > > locks.trace off > > locks.mandatory-locking off > > cluster.disperse-self-heal-daemon enable > > cluster.quorum-reads no > > client.bind-insecure (null) > > features.shard off > > features.shard-block-size 64MB > > features.shard-lru-limit 16384 > > features.shard-deletion-rate 100 > > features.scrub-throttle lazy > > features.scrub-freq biweekly > > features.scrub false > > features.expiry-time 120 > > features.cache-invalidation off > > features.cache-invalidation-timeout 60 > > features.leases off > > features.lease-lock-recall-timeout 60 > > disperse.background-heals 8 > > disperse.heal-wait-qlength 128 > > cluster.heal-timeout 600 > > dht.force-readdirp on > > disperse.read-policy gfid-hash > > cluster.shd-max-threads 1 > > cluster.shd-wait-qlength 1024 > > cluster.locking-scheme full > > cluster.granular-entry-heal no > > features.locks-revocation-secs 0 > > features.locks-revocation-clear-all false > > features.locks-revocation-max-blocked 0 > > features.locks-monkey-unlocking false > > features.locks-notify-contention no 
> > features.locks-notify-contention-delay 5 > > disperse.shd-max-threads 1 > > disperse.shd-wait-qlength 1024 > > disperse.cpu-extensions auto > > disperse.self-heal-window-size 1 > > cluster.use-compound-fops off > > performance.parallel-readdir off > > performance.rda-request-size 131072 > > performance.rda-low-wmark 4096 > > performance.rda-high-wmark 128KB > > performance.rda-cache-limit 10MB > > performance.nl-cache-positive-entry false > > performance.nl-cache-limit 10MB > > performance.nl-cache-timeout 60 > > cluster.brick-multiplex off > > cluster.max-bricks-per-process 0 > > disperse.optimistic-change-log on > > disperse.stripe-cache 4 > > cluster.halo-enabled False > > cluster.halo-shd-max-latency 99999 > > cluster.halo-nfsd-max-latency 5 > > cluster.halo-max-latency 5 > > cluster.halo-max-replicas 99999 > > cluster.halo-min-replicas 2 > > cluster.daemon-log-level INFO > > debug.delay-gen off > > delay-gen.delay-percentage 10% > > delay-gen.delay-duration 100000 > > delay-gen.enable > > disperse.parallel-writes on > > features.sdfs on > > features.cloudsync off > > features.utime off > > ctime.noatime on > > feature.cloudsync-storetype (null) > > > Thanks again. > > > On Wed, 25 Dec 2019 at 05:51, Strahil <hunter86...@yahoo.com> wrote: > > Hi David, > > On Dec 24, 2019 02:47, David Cunningham <dcunning...@voisonics.com> wrote: > > > > Hello, > > > > In testing we found that actually the GFS client having access to all 3 > nodes made no difference to performance. Perhaps that's because the 3rd > node that wasn't accessible from the client before was the arbiter node? > It makes sense, as no data is being generated towards the arbiter. > > Presumably we shouldn't have an arbiter node listed under > backupvolfile-server when mounting the filesystem? Since it doesn't store > all the data surely it can't be used to serve the data. > > I have my arbiter defined as the last backup and no issues so far. At least > the admin can easily identify the bricks from the mount options. > > > We did have direct-io-mode=disable already as well, so that wasn't a > factor in the performance problems. > > Have you checked that the client version is not too old? > Also, you can check the cluster's operation version: > # gluster volume get all cluster.max-op-version > # gluster volume get all cluster.op-version > > The cluster's op-version should be at the max-op-version. > > Two options come to mind: > A) Upgrade to the latest Gluster v6 or even v7 (I know it won't be easy) and > then set the op-version to the highest possible. > # gluster volume get all cluster.max-op-version > # gluster volume get all cluster.op-version > > B) Deploy an NFS Ganesha server and connect the client over NFS v4.2 (and > control the parallel connections from Ganesha). > > Can you provide your Gluster volume's options? > 'gluster volume get <VOLNAME> all' > > > Thanks again for any advice. > > > > > > > > On Mon, 23 Dec 2019 at 13:09, David Cunningham <dcunning...@voisonics.com> wrote: > >> > >> Hi Strahil, > >> > >> Thanks for that. We do have one backup server specified, but will add > the second backup as well. > >> > >> > >> On Sat, 21 Dec 2019 at 11:26, Strahil <hunter86...@yahoo.com> wrote: > >>> > >>> Hi David, > >>> > >>> Also consider using the mount option to specify backup servers via > 'backupvolfile-server=server2:server3' (you can define more, but I don't > think replica volumes greater than 3 are useful, maybe in some special > cases). 
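Following that suggestion, a sketch of what the client's fstab entry might look like with both remaining servers listed after the primary (the existing options come from the mount line quoted later in the thread; "gfs3" being the arbiter node's hostname, and it being reachable from the client, are assumptions):

# /etc/fstab - primary server plus both backup volfile servers
gfs1:/gvol0 /mnt/glusterfs/ glusterfs defaults,direct-io-mode=disable,_netdev,backupvolfile-server=gfs2:gfs3,fetch-attempts=10 0 0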
> >>> > In such a way, when the primary is lost, your client can reach a backup > one without disruption. > >>> > P.S.: The client may 'hang' - if the primary server got rebooted > ungracefully - as the communication must time out before FUSE addresses the > next server. There is a special script for killing gluster processes in > '/usr/share/gluster/scripts' which can be used for setting up a systemd > service to do that for you on shutdown. > >>> > Best Regards, > >>> Strahil Nikolov > >>> > >>> On Dec 20, 2019 23:49, David Cunningham <dcunning...@voisonics.com> > wrote: > >>>> > >>>> Hi Strahil, > >>>> > >>>> Ah, that is an important point. One of the nodes is not accessible > from the client, and we assumed that it only needed to reach the GFS node > that was mounted, so we didn't think anything of it. > >>>> > >>>> We will try making all nodes accessible, as well as > "direct-io-mode=disable". > >>>> > >>>> Thank you. > >>>> > >>>> > >>>> On Sat, 21 Dec 2019 at 10:29, Strahil Nikolov <hunter86...@yahoo.com> > wrote: > >>>>> > >>>>> Actually I haven't clarified myself. > >>>>> FUSE mounts on the client side connect directly to all bricks > that make up the volume. > >>>>> If for some reason (bad routing, a firewall block) the client can > reach only 2 out of 3 bricks, this can constantly > cause healing to happen (as one of the bricks is never updated), which will > degrade performance and cause excessive network usage. > >>>>> As your attachment is from one of the gluster nodes, this could be > the case. > >>>>> > >>>>> Best Regards, > >>>>> Strahil Nikolov > >>>>> > >>>>> On Friday, 20 December 2019 at 01:49:56 GMT+2, David > Cunningham <dcunning...@voisonics.com> wrote: > >>>>> > >>>>> > >>>>> Hi Strahil, > >>>>> > >>>>> The chart attached to my original email is taken from the GFS server. > >>>>> > >>>>> I'm not sure what you mean by accessing all bricks simultaneously. > We've mounted it from the client like this: > >>>>> gfs1:/gvol0 /mnt/glusterfs/ glusterfs > defaults,direct-io-mode=disable,_netdev,backupvolfile-server=gfs2,fetch-attempts=10 > 0 0 > >>>>> > >>>>> Should we do something different to access all bricks simultaneously? > >>>>> > >>>>> Thanks for your help! > >>>>> > >>>>> > >>>>> On Fri, 20 Dec 2019 at 11:47, Strahil Nikolov <hunter86...@yahoo.com> > wrote: > >>>>>> > >>>>>> I'm not sure if you measured the traffic from the client side > (tcpdump on a client machine) or from the server side. > >>>>>> > >>>>>> In both cases, please verify that the client accesses all bricks > simultaneously, as not doing so can cause unnecessary heals. > >>>>>> > >>>>>> Have you thought about upgrading to v6? There are some enhancements > in v6 which could be beneficial. > >>>>>> > >>>>>> Yet, it is indeed strange that so much traffic is generated with > FUSE. > >>>>>> > >>>>>> Another approach is to test with NFS-Ganesha, which supports pNFS and > can natively speak with Gluster; that can bring you closer to the > previous setup and also provide some extra performance. 
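For reference, a sketch of the client-side traffic capture described above (assuming "eth0" is the client's interface and gfs1/gfs2/gfs3 are the Gluster servers; adjust the names to match the real setup):

# Capture only Gluster-related traffic on the client for later inspection
tcpdump -i eth0 -nn -w /tmp/gluster-client.pcap 'host gfs1 or host gfs2 or host gfs3'

# A quick per-source packet count can then be pulled from the capture:
tcpdump -nn -r /tmp/gluster-client.pcap | awk '{print $3}' | sort | uniq -c | sort -rn | head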
> >>>>>> > >>>>>> > >>>>>> Best Regards, > >>>>>> Strahil Nikolov > >>>>>> > >>>>>> > >>>>>> > >> > >> > >> -- > >> David Cunningham, Voisonics Limited > >> http://voisonics.com/ > >> USA: +1 213 221 1092 > >> New Zealand: +64 (0)28 2558 3782 > > > > > > > > -- > > David Cunningham, Voisonics Limited > > http://voisonics.com/ > > USA: +1 213 221 1092 > > New Zealand: +64 (0)28 2558 3782 > > Best Regards, > Strahil Nikolov > > > > -- > David Cunningham, Voisonics Limited > http://voisonics.com/ > USA: +1 213 221 1092 > New Zealand: +64 (0)28 2558 3782 > > > > -- > David Cunningham, Voisonics Limited > http://voisonics.com/ > USA: +1 213 221 1092 > New Zealand: +64 (0)28 2558 3782 > > -- David Cunningham, Voisonics Limited http://voisonics.com/ USA: +1 213 221 1092 New Zealand: +64 (0)28 2558 3782