Re: [Gluster-users] Lots of connections on clients - appropriate values for various thread parameters
Hi Raghavendra, i'll try to gather the information you need, hopefully this weekend. One thing i've done this week: deactivate performance.quick-read (https://bugzilla.redhat.com/show_bug.cgi?id=1673058), which (according to munin) ended in a massive drop in network traffic and a slightly lower iowait. Maybe that has helped already. We'll see. performance.nl-cache is deactivated due to unreadable files/directories; we have a highly concurrent workload. There are some nginx backend webservers that check if a requested file exists in the glusterfs filesystem; i counted the log entries, this can be up to 5 million entries a day; about 2/3 of the files are found in the filesystem, they get delivered to the frontend; if not: the nginx's send the request via round robin to 3 backend tomcats, and they have to check whether a directory exists or not (and then create it and the requested files). So it happens that tomcatA creates a directory and a file in it, and within (milli)seconds tomcatB+C create additional files in this dir. Deactivating nl-cache helped to solve this issue, after having conversation with Nithya and Ravishankar. Just wanted to explain that. Thx so far, Hubert Am Fr., 29. März 2019 um 06:29 Uhr schrieb Raghavendra Gowdappa : > > +Gluster-users > > Sorry about the delay. There is nothing suspicious about per thread CPU > utilization of glusterfs process. However looking at the volume profile > attached I see huge number of lookups. I think if we cutdown the number of > lookups probably we'll see improvements in performance. I need following > information: > > * dump of fuse traffic under heavy load (use --dump-fuse option while > mounting) > * client volume profile for the duration of heavy load - > https://docs.gluster.org/en/latest/Administrator%20Guide/Performance%20Testing/ > * corresponding brick volume profile > > Basically I need to find out > * whether these lookups are on existing files or non-existent files > * whether they are on directories or files > * why/whether md-cache or kernel attribute cache or nl-cache will help to cut > down lookups. > > regards, > Raghavendra > > On Mon, Mar 25, 2019 at 12:13 PM Hu Bert wrote: >> >> Hi Raghavendra, >> >> sorry, this took a while. The last weeks the weather was bad -> less >> traffic, but this weekend there was a massive peak. I made 3 profiles >> with top, but at first look there's nothing special here. >> >> I also made a gluster profile (on one of the servers) at a later >> moment. Maybe that helps. I also added some munin graphics from 2 of >> the clients and 1 graphic of server network, just to show how massive >> the problem is. >> >> Just wondering if the high io wait is related to the high network >> traffic bug (https://bugzilla.redhat.com/show_bug.cgi?id=1673058); if >> so, i could deactivate performance.quick-read and check if there is >> less iowait. If that helps: wonderful - and yearningly awaiting >> updated packages (e.g. v5.6). If not: maybe we have to switch from our >> normal 10TB hdds (raid10) to SSDs if the problem is based on slow >> hardware in the use case of small files (images). >> >> >> Thx, >> Hubert >> >> Am Mo., 4. März 2019 um 16:59 Uhr schrieb Raghavendra Gowdappa >> : >> > >> > Were you seeing high Io-wait when you captured the top output? I guess not >> > as you mentioned the load increases during weekend. Please note that this >> > data has to be captured when you are experiencing problems. 
>> > >> > On Mon, Mar 4, 2019 at 8:02 PM Hu Bert wrote: >> >> >> >> Hi, >> >> sending the link directly to you and not the list, you can distribute >> >> if necessary. the command ran for about half a minute. Is that enough? >> >> More? Less? >> >> >> >> https://download.outdooractive.com/top.output.tar.gz >> >> >> >> Am Mo., 4. März 2019 um 15:21 Uhr schrieb Raghavendra Gowdappa >> >> : >> >> > >> >> > >> >> > >> >> > On Mon, Mar 4, 2019 at 7:47 PM Raghavendra Gowdappa >> >> > wrote: >> >> >> >> >> >> >> >> >> >> >> >> On Mon, Mar 4, 2019 at 4:26 PM Hu Bert wrote: >> >> >>> >> >> >>> Hi Raghavendra, >> >> >>> >> >> >>> at the moment iowait and cpu consumption is quite low, the main >> >> >>> problems appear during the weekend (high traffic, especially on >> >> >>> sunday), so either we have to wait until next sunday or use a time >> >> >>> machine ;-) >> >> >>> >> >> >>> I made a screenshot of top (https://abload.de/img/top-hvvjt2.jpg) and >> >> >>> a text output (https://pastebin.com/TkTWnqxt), maybe that helps. Seems >> >> >>> like processes like glfs_fuseproc (>204h) and glfs_epoll (64h for each >> >> >>> process) consume a lot of CPU (uptime 24 days). Is that already >> >> >>> helpful? >> >> >> >> >> >> >> >> >> Not much. The TIME field just says the amount of time the thread has >> >> >> been executing. Since its a long standing mount, we can expect such >> >> >> large values. But, the value itself doesn't indicate whether the >> >> >> thread itself was overloaded at any (some) interval(s
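For reference, the option changes described at the top of this message map to volume settings roughly like the following (a sketch; "volname" stands in for the real volume name, and performance.quick-read was turned off here as a workaround for the bug tracked in BZ 1673058):

  gluster volume set volname performance.quick-read off
  gluster volume set volname performance.nl-cache off
  gluster volume get volname performance.quick-read    # verify the new value took effect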
[Gluster-users] Upgrade testing to gluster 6
Hello Gluster users,

As you are all aware, glusterfs-6 is out. We would like to inform you that we have spent a significant amount of time testing glusterfs-6 in upgrade scenarios. We have done upgrade testing to glusterfs-6 from various releases such as 3.12, 4.1 and 5.3. As glusterfs-6 has got a lot of changes in, we wanted to test those portions. There were xlators (and the respective options to enable/disable them) added and deprecated in glusterfs-6 relative to various versions [1].

We had to check the following upgrade scenarios for all such options identified in [1]:

1) option never enabled and upgraded
2) option enabled and then upgraded
3) option enabled, then disabled, and then upgraded

We weren't able to manually check all the combinations for all the options, so the options involving enabling and disabling xlators were prioritized. Below are the results of the ones tested.

Never enabled and upgraded: checked upgrades from 3.12, 4.1 and 5.3 to 6; the upgrade works.

Enabled and upgraded: tested for tier, which is deprecated. This is not a recommended upgrade; as expected the volume won't be consumable and will have a few more issues as well. Tested with 3.12, 4.1 and 5.3 to 6 upgrades.

Enabled, then disabled before upgrade: tested for tier with 3.12, and the upgrade went fine.

There is one common issue to note in every upgrade: the node being upgraded goes into a disconnected state. You have to flush the iptables rules and then restart glusterd on all nodes to fix this (a rough sketch of the commands follows at the end of this note).

The testing for enabling the new options is still pending. The new options won't cause as many issues as the deprecated ones, so this was put at the end of the priority list. It would be nice to get contributions for this. For the disable testing, tier was used as it covers most of the xlators that were removed. All of these tests were done on a replica 3 volume.

Note: this is only upgrade testing of the newly added and removed xlators; it does not involve the normal functional tests for each xlator. If you have any questions, please feel free to reach us.

[1] https://docs.google.com/spreadsheets/d/1nh7T5AXaV6kc5KgILOy2pEqjzC3t_R47f1XUXSVFetI/edit?usp=sharing

Regards,
Hari and Sanju.
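For readers hitting the disconnected-peer state mentioned above, the workaround amounts to roughly the following (a sketch, assuming an iptables-managed firewall and a systemd-managed glusterd; adjust for your distribution and re-apply your real firewall ruleset afterwards):

  # on every node, after upgrading the affected node:
  iptables -F
  systemctl restart glusterd
  gluster peer status    # all peers should report "Connected" again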
Re: [Gluster-users] Lots of connections on clients - appropriate values for various thread parameters
+Gluster-users Sorry about the delay. There is nothing suspicious about per thread CPU utilization of glusterfs process. However looking at the volume profile attached I see huge number of lookups. I think if we cutdown the number of lookups probably we'll see improvements in performance. I need following information: * dump of fuse traffic under heavy load (use --dump-fuse option while mounting) * client volume profile for the duration of heavy load - https://docs.gluster.org/en/latest/Administrator%20Guide/Performance%20Testing/ * corresponding brick volume profile Basically I need to find out * whether these lookups are on existing files or non-existent files * whether they are on directories or files * why/whether md-cache or kernel attribute cache or nl-cache will help to cut down lookups. regards, Raghavendra On Mon, Mar 25, 2019 at 12:13 PM Hu Bert wrote: > Hi Raghavendra, > > sorry, this took a while. The last weeks the weather was bad -> less > traffic, but this weekend there was a massive peak. I made 3 profiles > with top, but at first look there's nothing special here. > > I also made a gluster profile (on one of the servers) at a later > moment. Maybe that helps. I also added some munin graphics from 2 of > the clients and 1 graphic of server network, just to show how massive > the problem is. > > Just wondering if the high io wait is related to the high network > traffic bug (https://bugzilla.redhat.com/show_bug.cgi?id=1673058); if > so, i could deactivate performance.quick-read and check if there is > less iowait. If that helps: wonderful - and yearningly awaiting > updated packages (e.g. v5.6). If not: maybe we have to switch from our > normal 10TB hdds (raid10) to SSDs if the problem is based on slow > hardware in the use case of small files (images). > > > Thx, > Hubert > > Am Mo., 4. März 2019 um 16:59 Uhr schrieb Raghavendra Gowdappa > : > > > > Were you seeing high Io-wait when you captured the top output? I guess > not as you mentioned the load increases during weekend. Please note that > this data has to be captured when you are experiencing problems. > > > > On Mon, Mar 4, 2019 at 8:02 PM Hu Bert wrote: > >> > >> Hi, > >> sending the link directly to you and not the list, you can distribute > >> if necessary. the command ran for about half a minute. Is that enough? > >> More? Less? > >> > >> https://download.outdooractive.com/top.output.tar.gz > >> > >> Am Mo., 4. März 2019 um 15:21 Uhr schrieb Raghavendra Gowdappa > >> : > >> > > >> > > >> > > >> > On Mon, Mar 4, 2019 at 7:47 PM Raghavendra Gowdappa < > rgowd...@redhat.com> wrote: > >> >> > >> >> > >> >> > >> >> On Mon, Mar 4, 2019 at 4:26 PM Hu Bert > wrote: > >> >>> > >> >>> Hi Raghavendra, > >> >>> > >> >>> at the moment iowait and cpu consumption is quite low, the main > >> >>> problems appear during the weekend (high traffic, especially on > >> >>> sunday), so either we have to wait until next sunday or use a time > >> >>> machine ;-) > >> >>> > >> >>> I made a screenshot of top (https://abload.de/img/top-hvvjt2.jpg) > and > >> >>> a text output (https://pastebin.com/TkTWnqxt), maybe that helps. > Seems > >> >>> like processes like glfs_fuseproc (>204h) and glfs_epoll (64h for > each > >> >>> process) consume a lot of CPU (uptime 24 days). Is that already > >> >>> helpful? > >> >> > >> >> > >> >> Not much. The TIME field just says the amount of time the thread has > been executing. Since its a long standing mount, we can expect such large > values. 
But, the value itself doesn't indicate whether the thread itself > was overloaded at any (some) interval(s). > >> >> > >> >> Can you please collect output of following command and send back the > collected data? > >> >> > >> >> # top -bHd 3 > top.output > >> > > >> > > >> > Please collect this on problematic mounts and bricks. > >> > > >> >> > >> >>> > >> >>> > >> >>> Hubert > >> >>> > >> >>> Am Mo., 4. März 2019 um 11:31 Uhr schrieb Raghavendra Gowdappa > >> >>> : > >> >>> > > >> >>> > what is the per thread CPU usage like on these clients? With > highly concurrent workloads we've seen single thread that reads requests > from /dev/fuse (fuse reader thread) becoming bottleneck. Would like to know > what is the cpu usage of this thread looks like (you can use top -H). > >> >>> > > >> >>> > On Mon, Mar 4, 2019 at 3:39 PM Hu Bert > wrote: > >> >>> >> > >> >>> >> Good morning, > >> >>> >> > >> >>> >> we use gluster v5.3 (replicate with 3 servers, 2 volumes, raid10 > as > >> >>> >> brick) with at the moment 10 clients; 3 of them do heavy I/O > >> >>> >> operations (apache tomcats, read+write of (small) images). These > 3 > >> >>> >> clients have a quite high I/O wait (stats from yesterday) as can > be > >> >>> >> seen here: > >> >>> >> > >> >>> >> client: https://abload.de/img/client1-cpu-dayulkza.png > >> >>> >> server: https://abload.de/img/server1-cpu-dayayjdq.png > >> >>> >> > >> >>> >> The iowait in the graphics differ a lot. I checked netstat for > the > >> >>> >> diffe
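For completeness, the data Raghavendra asks for at the top of this message can be captured roughly as follows (a sketch; the server name, volume name, mount point and output paths are placeholders, the exact client invocation may differ per version, and the fuse dump file grows very quickly under heavy load):

  # mount through the glusterfs binary so fuse traffic can be dumped
  glusterfs --volfile-server=server1 --volfile-id=volname --dump-fuse=/var/log/glusterfs/fuse-dump.bin /mnt/volname

  # brick-side profile covering the heavy-load window
  gluster volume profile volname start
  gluster volume profile volname info > brick-profile-during-load.txt

  # client-side io-stats dump, as described in the Performance Testing guide linked above
  setfattr -n trusted.io-stats-dump -v /tmp/client-profile.txt /mnt/volname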
Re: [Gluster-users] Inconsistent issues with a client
Hi,

If you know which directories are problematic, please check and see if the permissions on them are correct on the individual bricks. Please also provide the following:

- *gluster volume info* for the volume
- the gluster version you are running

regards,
Nithya

On Wed, 27 Mar 2019 at 19:10, Tami Greene wrote:
> The system is a 5 server, 20 brick distributed system with a hardware-configured RAID 6 underneath and xfs as the filesystem. This client is a data collection node which transfers data to specific directories within one of the gluster volumes.
>
> I have a client with submounted directories (glustervolume/project) rather than the entire volume. Some files can be transferred with no problem, but others produce an error about the transport endpoint not being connected. The transfer is handled by an rsync script triggered as a cron job.
>
> When remotely connected to this client, user access to these files does not always behave as the permissions are set: 2770 for directories and 440 for files. Owners are not always able to move the files, processes run as the owners are not always able to move files, and root is not always allowed to move or delete these files.
>
> This process seemed to work smoothly before adding another server and 4 storage bricks to the volume; logs indicate there were intermittent issues at least a month before the last server was added. A new collection device has been streaming to this one machine, and the issue started the day before.
>
> Is there another level of permissions and ownership that I am not aware of that needs to be sync'd?
>
> --
> Tami
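To compare what the client sees with what is actually stored, the requested information and the brick-side permissions can be gathered roughly as follows (a sketch; the volume name, brick path and problem directory are placeholders to replace with the real ones):

  gluster volume info glustervolume
  glusterfs --version

  # on each server, against the brick directory that backs the problem path:
  stat /path/to/brick/project/problem-dir
  getfattr -d -m . -e hex /path/to/brick/project/problem-dir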
[Gluster-users] [Event CfP Announce] DevConf events India and US in the month of August 2019
Two editions of DevConf have their CfPs open:

[1] DevConf India: https://devconf.info/in (event dates 02-03 Aug 2019, Bengaluru)
[2] DevConf USA: https://devconf.info/us/ (event dates 15-17 Aug 2019, Boston)

The DevConf events are well curated to get a good mix of developers and users. This note is to raise awareness and encourage submission of talks around Gluster, containerized storage and similar topics.
Re: [Gluster-users] Geo-replication status always on 'Created'
Hi, In my glusterd.log i am seeing this error message , is this related to the patch i applied? or do i need to open a new thread? I [MSGID: 106327] [glusterd-geo-rep.c:4483:glusterd_read_status_file] 0-management: Using passed config template(/var/lib/glusterd/geo-replication/vol_75a5fd373d88ba687f591f3353fa05cf_172.16.201.35_vol_e783a730578e45ed9d51b9a80df6c33f/gsyncd.conf). [2019-03-28 10:39:29.493554] E [MSGID: 106293] [glusterd-geo-rep.c:679:glusterd_query_extutil_generic] 0-management: reading data from child failed [2019-03-28 10:39:29.493589] E [MSGID: 106305] [glusterd-geo-rep.c:4377:glusterd_fetch_values_from_config] 0-management: Unable to get configuration data for vol_75a5fd373d88ba687f591f3353fa05cf(master), 172.16.201.35: :vol_e783a730578e45ed9d51b9a80df6c33f(slave) [2019-03-28 10:39:29.493617] E [MSGID: 106328] [glusterd-geo-rep.c:4517:glusterd_read_status_file] 0-management: Unable to fetch config values for vol_75a5fd373d88ba687f591f3353fa05cf(master), 172.16.201.35::vol_e783a730578e45ed9d51b9a80df6c33f(slave). Trying default config template [2019-03-28 10:39:29.553846] E [MSGID: 106328] [glusterd-geo-rep.c:4525:glusterd_read_status_file] 0-management: Unable to fetch config values for vol_75a5fd373d88ba687f591f3353fa05cf(master), 172.16.201.35::vol_e783a730578e45ed9d51b9a80df6c33f(slave) [2019-03-28 10:39:29.553836] E [MSGID: 106293] [glusterd-geo-rep.c:679:glusterd_query_extutil_generic] 0-management: reading data from child failed [2019-03-28 10:39:29.553844] E [MSGID: 106305] [glusterd-geo-rep.c:4377:glusterd_fetch_values_from_config] 0-management: Unable to get configuration data for vol_75a5fd373d88ba687f591f3353fa05cf(master), 172.16.201.35: :vol_e783a730578e45ed9d51b9a80df6c33f(slave) also while do a status call, i am not seeing one of the nodes which was reporting 'Passive' before ( did not change any configuration ) , any ideas how to troubleshoot this? thanks for your help. Maurya On Tue, Mar 26, 2019 at 8:34 PM Aravinda wrote: > Please check error message in gsyncd.log file in > /var/log/glusterfs/geo-replication/ > > On Tue, 2019-03-26 at 19:44 +0530, Maurya M wrote: > > Hi Arvind, > > Have patched my setup with your fix: re-run the setup, but this time > > getting a different error where it failed to commit the ssh-port on > > my other 2 nodes on the master cluster, so manually copied the : > > [vars] > > ssh-port = > > > > into gsyncd.conf > > > > and status reported back is as shown below : Any ideas how to > > troubleshoot this? 
> > > > MASTER NODE MASTER VOL MASTER > > BRICK > >SLAVE USERSLAVE > > SLAVE NODE STATUS > > CRAWL STATUSLAST_SYNCED > > --- > > --- > > --- > > --- > > -- > > 172.16.189.4 vol_75a5fd373d88ba687f591f3353fa05cf > > /var/lib/heketi/mounts/vg_aee3df7b0bb2451bc00a73358c5196a2/brick_116f > > b9427fb26f752d9ba8e45e183cb1/brickroot > > 172.16.201.35::vol_e783a730578e45ed9d51b9a80df6c33f172.16.201.4 > > PassiveN/A N/A > > 172.16.189.35vol_75a5fd373d88ba687f591f3353fa05cf > > /var/lib/heketi/mounts/vg_05708751110fe60b3e7da15bdcf6d4d4/brick_266b > > b08f0d466d346f8c0b19569736fb/brickroot > > 172.16.201.35::vol_e783a730578e45ed9d51b9a80df6c33fN/A > > Faulty N/A N/A > > 172.16.189.66vol_75a5fd373d88ba687f591f3353fa05cf > > /var/lib/heketi/mounts/vg_4b92a2b687e59b7311055d3809b77c06/brick_dfa4 > > 4c9380cdedac708e27e2c2a443a0/brickroot > > 172.16.201.35::vol_e783a730578e45ed9d51b9a80df6c33fN/A > > Initializing...N/A N/A > > > > > > > > > > On Tue, Mar 26, 2019 at 1:40 PM Aravinda wrote: > > > I got chance to investigate this issue further and identified a > > > issue > > > with Geo-replication config set and sent patch to fix the same. > > > > > > BUG: https://bugzilla.redhat.com/show_bug.cgi?id=1692666 > > > Patch: https://review.gluster.org/22418 > > > > > > On Mon, 2019-03-25 at 15:37 +0530, Maurya M wrote: > > > > ran this command : ssh -p -i /var/lib/glusterd/geo- > > > > replication/secret.pem root@gluster volume info -- > > > xml > > > > > > > > attaching the output. > > > > > > > > > > > > > > > > On Mon, Mar 25, 2019 at 2:13 PM Aravinda > > > wrote: > > > > > Geo-rep is running `ssh -i /var/lib/glusterd/geo- > > > > > replication/secret.pem > > > > > root@ gluster volume info --xml` and parsing its > > > output. > > > > > Please try to to run the command from the same node and let us > > > know > > > > > the > > > > > output. > > > > > > > > > > > > > > > On Mon, 2019-03-25 at 11:43 +0530, Maurya M wrote: > > > > > > Now the error is on the sam
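For reference, the ssh-port value that had to be copied into gsyncd.conf by hand can usually be pushed through the geo-replication config interface instead; a rough sketch, assuming the ssh-port config key (the same one that ends up under [vars] in gsyncd.conf) is accepted by the CLI in this build, and with <PORT> standing in for the non-standard ssh port:

  gluster volume geo-replication vol_75a5fd373d88ba687f591f3353fa05cf 172.16.201.35::vol_e783a730578e45ed9d51b9a80df6c33f config ssh-port <PORT>
  gluster volume geo-replication vol_75a5fd373d88ba687f591f3353fa05cf 172.16.201.35::vol_e783a730578e45ed9d51b9a80df6c33f status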
Re: [Gluster-users] Gluster GEO replication fault after write over nfs-ganesha
On 3/27/19 7:39 PM, Alexey Talikov wrote:
> I have two clusters with dispersed volumes (2+1) with GEO replication. It works fine as long as I use glusterfs-fuse, but as soon as even one file is written over nfs-ganesha, replication goes to Faulty and recovers after I remove this file (sometimes after a stop/start). I think nfs-ganesha writes the file in some way that produces a problem with replication.

I am not much familiar with geo-rep and not sure what/why exactly failed here. Request Kotresh (cc'ed) to take a look and provide his insights on the issue.

Thanks,
Soumya

> OSError: [Errno 61] No data available: '.gfid/9c9514ce-a310-4a1c-a87b-a800a32a99f8'
>
> but if I check over glusterfs mounted with aux-gfid-mount:
>
> getfattr -n trusted.glusterfs.pathinfo -e text /mnt/TEST/.gfid/9c9514ce-a310-4a1c-a87b-a800a32a99f8
> getfattr: Removing leading '/' from absolute path names
> # file: mnt/TEST/.gfid/9c9514ce-a310-4a1c-a87b-a800a32a99f8
> trusted.glusterfs.pathinfo="( ( ))"
>
> File exists
>
> Details available here: https://github.com/nfs-ganesha/nfs-ganesha/issues/408
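For anyone else needing to map a gfid from the geo-replication log back to a real path, the check shown above relies on an aux-gfid mount; roughly (a sketch; the server, volume and mount point are placeholders, and <GFID> is the gfid reported in the log):

  mount -t glusterfs -o aux-gfid-mount server1:/TEST /mnt/TEST
  getfattr -n trusted.glusterfs.pathinfo -e text /mnt/TEST/.gfid/<GFID>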
Re: [Gluster-users] glusterfs 4.1.7 + nfs-ganesha 2.7.1 freeze during write
On 2/8/19 11:53 AM, Soumya Koduri wrote:

On 2/8/19 3:20 AM, Maurits Lamers wrote:

Hi,

[2019-02-07 10:11:24.812606] E [MSGID: 104055] [glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall event_type(1) and gfid(yøêÙ MzîSL4_@) failed
[2019-02-07 10:11:24.819376] E [MSGID: 104055] [glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall event_type(1) and gfid(eTnôEU«H.
[2019-02-07 10:11:24.833299] E [MSGID: 104055] [glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall event_type(1) and gfid(gÇLÁèFà»0bЯk) failed
[2019-02-07 10:25:01.642509] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-2: server [node1]:49152 has not responded in the last 42 seconds, disconnecting.
[2019-02-07 10:25:01.642805] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-1: server [node2]:49152 has not responded in the last 42 seconds, disconnecting.
[2019-02-07 10:25:01.642946] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-4: server [node3]:49152 has not responded in the last 42 seconds, disconnecting.
[2019-02-07 10:25:02.643120] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-3: server 127.0.1.1:49152 has not responded in the last 42 seconds, disconnecting.
[2019-02-07 10:25:02.643314] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-0: server [node4]:49152 has not responded in the last 42 seconds, disconnecting.

Strange that the synctask failed. Could you please turn off the features.cache-invalidation volume option and check if the issue still persists.

Turning the cache invalidation option off seems to have solved the freeze. Still testing, but it looks promising.

If that's the case, please turn the cache invalidation option back on and collect a couple of stack traces (using gstack) when the system freezes again.

FYI - I have got a chance to reproduce and RCA the issue [1], and have posted a fix for review upstream [2].

Thanks,
Soumya

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1693575
[2] https://review.gluster.org/22436

Thanks,
Soumya

cheers
Maurits
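For anyone following along, toggling the option and collecting the requested stack traces looks roughly like this (a sketch; gv0 is the volume name appearing in the logs above, and the PID must be that of the ganesha.nfsd process on the affected node):

  gluster volume set gv0 features.cache-invalidation off    # workaround that avoided the freeze
  gluster volume set gv0 features.cache-invalidation on     # re-enable before trying to reproduce
  gstack <pid-of-ganesha.nfsd> > /tmp/ganesha.gstack.1       # repeat a few times while the system is frozen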
Re: [Gluster-users] Prioritise local bricks for IO?
On Wed, 27 Mar 2019 at 20:27, Poornima Gurusiddaiah wrote:
> This feature is not under active development as it was not used widely. AFAIK it's not a supported feature.
> +Nithya +Raghavendra for further clarifications.

This is not actively supported - there has been no work done on this feature for a long time.

Regards,
Nithya

> Regards,
> Poornima
>
> On Wed, Mar 27, 2019 at 12:33 PM Lucian wrote:
>> Oh, that's just what the doctor ordered!
>> Hope it works, thanks
>>
>> On 27 March 2019 03:15:57 GMT, Vlad Kopylov wrote:
>>> I don't remember if it still works
>>> NUFA
>>> https://github.com/gluster/glusterfs-specs/blob/master/done/Features/nufa.md
>>>
>>> v
>>>
>>> On Tue, Mar 26, 2019 at 7:27 AM Nux! wrote:
>>>> Hello,
>>>> I'm trying to set up a distributed backup storage (no replicas), but I'd like to prioritise the local bricks for any IO done on the volume. This will be a backup store, so in other words, I'd like the files to be written locally if there is space, so as to save the NICs for other traffic. Does anyone know how this might be achievable, if at all?
>>>>
>>>> --
>>>> Sent from the Delta quadrant using Borg technology!
>>>> Nux!
>>>> www.nux.ro
>>
>> --
>> Sent from my Android device with K-9 Mail. Please excuse my brevity.
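For anyone who still wants to experiment with NUFA despite the caveats above, it has historically been toggled through a volume option; a rough sketch, assuming the cluster.nufa option exists in your release (check with 'gluster volume set help' first) and that the volume is a plain distribute volume named backupvol:

  gluster volume set backupvol cluster.nufa on
  gluster volume info backupvol | grep -i nufa    # confirm the option is set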
Re: [Gluster-users] POSIX locks and disconnections between clients and bricks
On Thu, Mar 28, 2019 at 2:37 PM Xavi Hernandez wrote: > On Thu, Mar 28, 2019 at 3:05 AM Raghavendra Gowdappa > wrote: > >> >> >> On Wed, Mar 27, 2019 at 8:38 PM Xavi Hernandez >> wrote: >> >>> On Wed, Mar 27, 2019 at 2:20 PM Pranith Kumar Karampuri < >>> pkara...@redhat.com> wrote: >>> On Wed, Mar 27, 2019 at 6:38 PM Xavi Hernandez wrote: > On Wed, Mar 27, 2019 at 1:13 PM Pranith Kumar Karampuri < > pkara...@redhat.com> wrote: > >> >> >> On Wed, Mar 27, 2019 at 5:13 PM Xavi Hernandez >> wrote: >> >>> On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa < >>> rgowd...@redhat.com> wrote: >>> On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez < jaher...@redhat.com> wrote: > Hi Raghavendra, > > On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa < > rgowd...@redhat.com> wrote: > >> All, >> >> Glusterfs cleans up POSIX locks held on an fd when the >> client/mount through which those locks are held disconnects from >> bricks/server. This helps Glusterfs to not run into a stale lock >> problem >> later (For eg., if application unlocks while the connection was still >> down). However, this means the lock is no longer exclusive as other >> applications/clients can acquire the same lock. To communicate that >> locks >> are no longer valid, we are planning to mark the fd (which has POSIX >> locks) >> bad on a disconnect so that any future operations on that fd will >> fail, >> forcing the application to re-open the fd and re-acquire locks it >> needs [1]. >> > > Wouldn't it be better to retake the locks when the brick is > reconnected if the lock is still in use ? > There is also a possibility that clients may never reconnect. That's the primary reason why bricks assume the worst (client will not reconnect) and cleanup the locks. >>> >>> True, so it's fine to cleanup the locks. I'm not saying that locks >>> shouldn't be released on disconnect. The assumption is that if the >>> client >>> has really died, it will also disconnect from other bricks, who will >>> release the locks. So, eventually, another client will have enough >>> quorum >>> to attempt a lock that will succeed. In other words, if a client gets >>> disconnected from too many bricks simultaneously (loses Quorum), then >>> that >>> client can be considered as bad and can return errors to the >>> application. >>> This should also cause to release the locks on the remaining connected >>> bricks. >>> >>> On the other hand, if the disconnection is very short and the client >>> has not died, it will keep enough locked files (it has quorum) to avoid >>> other clients to successfully acquire a lock. In this case, if the >>> brick is >>> reconnected, all existing locks should be reacquired to recover the >>> original state before the disconnection. >>> >>> > BTW, the referenced bug is not public. Should we open another bug > to track this ? > I've just opened up the comment to give enough context. I'll open a bug upstream too. > > >> >> Note that with AFR/replicate in picture we can prevent errors to >> application as long as Quorum number of children "never ever" lost >> connection with bricks after locks have been acquired. I am using >> the term >> "never ever" as locks are not healed back after re-connection and >> hence >> first disconnect would've marked the fd bad and the fd remains so >> even >> after re-connection happens. So, its not just Quorum number of >> children >> "currently online", but Quorum number of children "never having >> disconnected with bricks after locks are acquired". >> > > I think this requisite is not feasible. 
In a distributed file > system, sooner or later all bricks will be disconnected. It could be > because of failures or because an upgrade is done, but it will happen. > > The difference here is how long are fd's kept open. If > applications open and close files frequently enough (i.e. the fd is > not > kept open more time than it takes to have more than Quorum bricks > disconnected) then there's no problem. The problem can only appear on > applications that open files for a long time and also use posix > locks. In > this case, the only good solution I see is to retake the locks on > brick > reconnection. > A
Re: [Gluster-users] POSIX locks and disconnections between clients and bricks
On Thu, Mar 28, 2019 at 3:05 AM Raghavendra Gowdappa wrote: > > > On Wed, Mar 27, 2019 at 8:38 PM Xavi Hernandez > wrote: > >> On Wed, Mar 27, 2019 at 2:20 PM Pranith Kumar Karampuri < >> pkara...@redhat.com> wrote: >> >>> >>> >>> On Wed, Mar 27, 2019 at 6:38 PM Xavi Hernandez >>> wrote: >>> On Wed, Mar 27, 2019 at 1:13 PM Pranith Kumar Karampuri < pkara...@redhat.com> wrote: > > > On Wed, Mar 27, 2019 at 5:13 PM Xavi Hernandez > wrote: > >> On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa < >> rgowd...@redhat.com> wrote: >> >>> >>> >>> On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez >>> wrote: >>> Hi Raghavendra, On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa < rgowd...@redhat.com> wrote: > All, > > Glusterfs cleans up POSIX locks held on an fd when the > client/mount through which those locks are held disconnects from > bricks/server. This helps Glusterfs to not run into a stale lock > problem > later (For eg., if application unlocks while the connection was still > down). However, this means the lock is no longer exclusive as other > applications/clients can acquire the same lock. To communicate that > locks > are no longer valid, we are planning to mark the fd (which has POSIX > locks) > bad on a disconnect so that any future operations on that fd will > fail, > forcing the application to re-open the fd and re-acquire locks it > needs [1]. > Wouldn't it be better to retake the locks when the brick is reconnected if the lock is still in use ? >>> >>> There is also a possibility that clients may never reconnect. >>> That's the primary reason why bricks assume the worst (client will not >>> reconnect) and cleanup the locks. >>> >> >> True, so it's fine to cleanup the locks. I'm not saying that locks >> shouldn't be released on disconnect. The assumption is that if the client >> has really died, it will also disconnect from other bricks, who will >> release the locks. So, eventually, another client will have enough quorum >> to attempt a lock that will succeed. In other words, if a client gets >> disconnected from too many bricks simultaneously (loses Quorum), then >> that >> client can be considered as bad and can return errors to the application. >> This should also cause to release the locks on the remaining connected >> bricks. >> >> On the other hand, if the disconnection is very short and the client >> has not died, it will keep enough locked files (it has quorum) to avoid >> other clients to successfully acquire a lock. In this case, if the brick >> is >> reconnected, all existing locks should be reacquired to recover the >> original state before the disconnection. >> >> >>> BTW, the referenced bug is not public. Should we open another bug to track this ? >>> >>> I've just opened up the comment to give enough context. I'll open a >>> bug upstream too. >>> >>> > > Note that with AFR/replicate in picture we can prevent errors to > application as long as Quorum number of children "never ever" lost > connection with bricks after locks have been acquired. I am using the > term > "never ever" as locks are not healed back after re-connection and > hence > first disconnect would've marked the fd bad and the fd remains so even > after re-connection happens. So, its not just Quorum number of > children > "currently online", but Quorum number of children "never having > disconnected with bricks after locks are acquired". > I think this requisite is not feasible. In a distributed file system, sooner or later all bricks will be disconnected. 
It could be because of failures or because an upgrade is done, but it will happen. The difference here is how long are fd's kept open. If applications open and close files frequently enough (i.e. the fd is not kept open more time than it takes to have more than Quorum bricks disconnected) then there's no problem. The problem can only appear on applications that open files for a long time and also use posix locks. In this case, the only good solution I see is to retake the locks on brick reconnection. >>> >>> Agree. But lock-healing should be done only by HA layers like AFR/EC >>> as only they know whether there are enough online bricks to have >>> prevented >>> any conflicting lock. Protocol/client itself doesn't have enough >>> information