Re: [Gluster-users] [Gluster-devel] A question of GlusterFS dentries!
+1 for "no-rewinddir-support" option in DHT. We are seeing very slow directory listing specially with 1500+ brick volume, 'ls' takes 20+ second with 1000+ files. On Wed, Nov 2, 2016 at 7:08 AM, Raghavendra Gowdappawrote: > > > - Original Message - >> From: "Keiviw" >> To: gluster-de...@gluster.org >> Sent: Tuesday, November 1, 2016 12:41:02 PM >> Subject: [Gluster-devel] A question of GlusterFS dentries! >> >> Hi, >> In GlusterFS distributed volumes, listing a non-empty directory was slow. >> Then I read the dht codes and found the reasons. But I was confused that >> GlusterFS dht travesed all the bricks(in the volume) sequentially,why not >> use multi-thread to read dentries from multiple bricks simultaneously. >> That's a question that's always puzzled me, Couly you please tell me >> something about this??? > > readdir across subvols is sequential mostly because we have to support > rewinddir(3). We need to maintain the mapping of offset and dentry across > multiple invocations of readdir. In other words if someone did a rewinddir to > an offset corresponding to earlier dentry, subsequent readdirs should return > same set of dentries what the earlier invocation of readdir returned. For > example, in an hypothetical scenario, readdir returned following dentries: > > 1. a, off=10 > 2. b, off=2 > 3. c, off=5 > 4. d, off=15 > 5. e, off=17 > 6. f, off=13 > > Now if we did rewinddir to off 5 and issue readdir again we should get > following dentries: > (c, off=5), (d, off=15), (e, off=17), (f, off=13) > > Within a subvol backend filesystem provides rewinddir guarantee for the > dentries present on that subvol. However, across subvols it is the > responsibility of DHT to provide the above guarantee. Which means we > should've some well defined order in which we send readdir calls (Note that > order is not well defined if we do a parallel readdir across all subvols). > So, DHT has sequential readdir which is a well defined order of reading > dentries. > > To give an example if we have another subvol - subvol2 - (in addiction to the > subvol above - say subvol1) with following listing: > 1. g, off=16 > 2. h, off=20 > 3. i, off=3 > 4. j, off=19 > > With parallel readdir we can have many ordering like - (a, b, g, h, i, c, d, > e, f, j), (g, h, a, b, c, i, j, d, e, f) etc. Now if we do (with readdir done > parallely): > > 1. A complete listing of the directory (which can be any one of 10P1 = 10 > ways - I hope math is correct here). > 2. Do rewinddir (20) > > We cannot predict what are the set of dentries that come _after_ offset 20. > However, if we do a readdir sequentially across subvols there is only one > directory listing i.e, (a, b, c, d, e, f, g, h, i, j). So, its easier to > support rewinddir. > > If there is no POSIX requirement for rewinddir support, I think a parallel > readdir can easily be implemented (which improves performance too). But > unfortunately rewinddir is still a POSIX requirement. This also opens up > another possibility of a "no-rewinddir-support" option in DHT, which if > enabled results in parallel readdirs across subvols. What I am not sure is > how many users still use rewinddir? If there is a critical mass which wants > performance with a tradeoff of no rewinddir support this can be a good > feature. > > +gluster-users to get an opinion on this. 
>
> regards,
> Raghavendra
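For anyone not familiar with the rewinddir(3) contract referenced above, here is a minimal standalone C sketch (plain POSIX, not GlusterFS code, error handling kept minimal) of the guarantee DHT has to preserve across bricks: telldir() hands back an opaque position, and seekdir()/rewinddir() must replay the same entries from that position on later readdir() calls.

/* Demonstrates the offset-to-dentry mapping that rewinddir/seekdir rely on.
 * Build with: gcc posix_readdir_demo.c -o posix_readdir_demo */

#include <dirent.h>
#include <stdio.h>

int main(void)
{
    DIR *dir = opendir(".");
    if (!dir)
        return 1;

    struct dirent *de;
    long pos = -1;
    int n = 0;

    /* First pass: remember the position just before the third entry. */
    while ((de = readdir(dir)) != NULL) {
        if (n++ == 2)
            break;
        pos = telldir(dir);
    }

    /* Jump back: subsequent readdirs must return the same entries
     * that followed 'pos' in the first pass. */
    seekdir(dir, pos);
    while ((de = readdir(dir)) != NULL)
        printf("%s\n", de->d_name);

    /* rewinddir(3) resets the stream to the start of the directory. */
    rewinddir(dir);

    closedir(dir);
    return 0;
}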
Re: [Gluster-users] [Gluster-devel] A question of GlusterFS dentries!
- Original Message - > From: "Raghavendra Gowdappa"> To: "Keiviw" > Cc: gluster-de...@gluster.org, "gluster-users" > Sent: Wednesday, November 2, 2016 9:38:46 AM > Subject: Re: [Gluster-devel] A question of GlusterFS dentries! > > > > - Original Message - > > From: "Keiviw" > > To: gluster-de...@gluster.org > > Sent: Tuesday, November 1, 2016 12:41:02 PM > > Subject: [Gluster-devel] A question of GlusterFS dentries! > > > > Hi, > > In GlusterFS distributed volumes, listing a non-empty directory was slow. > > Then I read the dht codes and found the reasons. But I was confused that > > GlusterFS dht travesed all the bricks(in the volume) sequentially,why not > > use multi-thread to read dentries from multiple bricks simultaneously. > > That's a question that's always puzzled me, Couly you please tell me > > something about this??? > > readdir across subvols is sequential mostly because we have to support > rewinddir(3). We need to maintain the mapping of offset and dentry across > multiple invocations of readdir. In other words if someone did a rewinddir > to an offset corresponding to earlier dentry, subsequent readdirs should > return same set of dentries what the earlier invocation of readdir returned. > For example, in an hypothetical scenario, readdir returned following > dentries: > > 1. a, off=10 > 2. b, off=2 > 3. c, off=5 > 4. d, off=15 > 5. e, off=17 > 6. f, off=13 > > Now if we did rewinddir to off 5 and issue readdir again we should get > following dentries: > (c, off=5), (d, off=15), (e, off=17), (f, off=13) > > Within a subvol backend filesystem provides rewinddir guarantee for the > dentries present on that subvol. However, across subvols it is the > responsibility of DHT to provide the above guarantee. Which means we > should've some well defined order in which we send readdir calls (Note that > order is not well defined if we do a parallel readdir across all subvols). > So, DHT has sequential readdir which is a well defined order of reading > dentries. > > To give an example if we have another subvol - subvol2 - (in addiction to the s/addiction/addition/ > subvol above - say subvol1) with following listing: > 1. g, off=16 > 2. h, off=20 > 3. i, off=3 > 4. j, off=19 > > With parallel readdir we can have many ordering like - (a, b, g, h, i, c, d, > e, f, j), (g, h, a, b, c, i, j, d, e, f) etc. Now if we do (with readdir > done parallely): > > 1. A complete listing of the directory (which can be any one of 10P1 = 10 I think it is 10P10 = 3628800. But again it is not completely random selection as readdir on a single subvol still gives one ordering, so the value is much less. The point here is that there can be many possible listings with parallel readdir. > ways - I hope math is correct here). > 2. Do rewinddir (20) > > We cannot predict what are the set of dentries that come _after_ offset 20. > However, if we do a readdir sequentially across subvols there is only one > directory listing i.e, (a, b, c, d, e, f, g, h, i, j). So, its easier to > support rewinddir. > > If there is no POSIX requirement for rewinddir support, I think a parallel > readdir can easily be implemented (which improves performance too). But > unfortunately rewinddir is still a POSIX requirement. This also opens up > another possibility of a "no-rewinddir-support" option in DHT, which if > enabled results in parallel readdirs across subvols. What I am not sure is > how many users still use rewinddir? 
> If there is a critical mass which wants performance with a tradeoff of no
> rewinddir support, this can be a good feature.
>
> +gluster-users to get an opinion on this.
>
> regards,
> Raghavendra
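To put a concrete number on the "much less" above (a quick check, not from the original thread): since each subvolume's own order is fixed, the possible parallel-readdir listings are exactly the interleavings of subvol1's 6 entries with subvol2's 4 entries, i.e. C(10,4) = 10!/(6! * 4!) = 210 listings. That is far fewer than 10! = 3,628,800, but still 210 possibilities against the single listing that sequential readdir guarantees.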
Re: [Gluster-users] [Gluster-devel] A question of GlusterFS dentries!
- Original Message - > From: "Keiviw"> To: gluster-de...@gluster.org > Sent: Tuesday, November 1, 2016 12:41:02 PM > Subject: [Gluster-devel] A question of GlusterFS dentries! > > Hi, > In GlusterFS distributed volumes, listing a non-empty directory was slow. > Then I read the dht codes and found the reasons. But I was confused that > GlusterFS dht travesed all the bricks(in the volume) sequentially,why not > use multi-thread to read dentries from multiple bricks simultaneously. > That's a question that's always puzzled me, Couly you please tell me > something about this??? readdir across subvols is sequential mostly because we have to support rewinddir(3). We need to maintain the mapping of offset and dentry across multiple invocations of readdir. In other words if someone did a rewinddir to an offset corresponding to earlier dentry, subsequent readdirs should return same set of dentries what the earlier invocation of readdir returned. For example, in an hypothetical scenario, readdir returned following dentries: 1. a, off=10 2. b, off=2 3. c, off=5 4. d, off=15 5. e, off=17 6. f, off=13 Now if we did rewinddir to off 5 and issue readdir again we should get following dentries: (c, off=5), (d, off=15), (e, off=17), (f, off=13) Within a subvol backend filesystem provides rewinddir guarantee for the dentries present on that subvol. However, across subvols it is the responsibility of DHT to provide the above guarantee. Which means we should've some well defined order in which we send readdir calls (Note that order is not well defined if we do a parallel readdir across all subvols). So, DHT has sequential readdir which is a well defined order of reading dentries. To give an example if we have another subvol - subvol2 - (in addiction to the subvol above - say subvol1) with following listing: 1. g, off=16 2. h, off=20 3. i, off=3 4. j, off=19 With parallel readdir we can have many ordering like - (a, b, g, h, i, c, d, e, f, j), (g, h, a, b, c, i, j, d, e, f) etc. Now if we do (with readdir done parallely): 1. A complete listing of the directory (which can be any one of 10P1 = 10 ways - I hope math is correct here). 2. Do rewinddir (20) We cannot predict what are the set of dentries that come _after_ offset 20. However, if we do a readdir sequentially across subvols there is only one directory listing i.e, (a, b, c, d, e, f, g, h, i, j). So, its easier to support rewinddir. If there is no POSIX requirement for rewinddir support, I think a parallel readdir can easily be implemented (which improves performance too). But unfortunately rewinddir is still a POSIX requirement. This also opens up another possibility of a "no-rewinddir-support" option in DHT, which if enabled results in parallel readdirs across subvols. What I am not sure is how many users still use rewinddir? If there is a critical mass which wants performance with a tradeoff of no rewinddir support this can be a good feature. +gluster-users to get an opinion on this. regards, Raghavendra > > > > > > > ___ > Gluster-devel mailing list > gluster-de...@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-users mailing list Gluster-users@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users
[Gluster-users] Improving IOPS
And after having posted about the dangers of premature optimisation ... any suggestions for improving IOPS?

As per earlier suggestions I tried setting server.event-threads and client.event-threads to 4, but it made no real difference.

NB: the limiting factor on my cluster is the network (2 * 1G).

--
Lindsay Mathieson
[Gluster-users] Shared Heal Times
Just an update - after resetting all the heal "optimisations" :) I had set, heals are in general much faster and back to normal.

I've done several rolling upgrades with the servers since, rebooting each one in turn. Usually around 300 64MB shards will need healing after each boot. It spends about 2-3 minutes doing some fairly intensive CPU work, then another 10 minutes to complete the heal. All up, around 15 minutes per server. I'm more than satisfied with that.

So no real problem other than PEBKAC. Moral of the story - as always, tuning settings for optimisation almost never works.

thanks,

--
Lindsay Mathieson
[Gluster-users] Following up on Community Bootstrap Challenge
On Mon, Oct 24, 2016 at 10:20 AM, Amye Scavarda wrote:
> Notes from our Gluster Developer Summit 2016 in Berlin!
>
> Videos
> Slides
> Flickr Group
> Public Etherpad
> Bootstrapping Challenge
>
> All of the videos from Gluster Developer Summit are now live on our
> YouTube channel, and slides are available in our Slideshare accounts. We've
> also created a Flickr group, please add your photos of the event!
>
> https://www.youtube.com/user/GlusterCommunity
> http://www.slideshare.net/GlusterCommunity
> https://www.flickr.com/groups/glusterdevelopersummit2016/
>
> We've also got a public etherpad for our comments from the event:
> https://public.pad.fsfe.org/p/gluster-developer-summit-2016
>
> Please feel free to add to this and help keep our momentum from this
> event! I'm looking for the community maintainers to take a strong hand in
> here to be able to tell us what they're focusing on from this event
> over the next three months.
>
> One thing that we didn't get to that I wanted to do was a Community Bootstrap
> Challenge, so let's do this as a hangout after the Community Meeting on
> November 2nd. I'll send out a separate email on this describing the event,
> and we'll all join in at 1pm UTC.

As we're still working on a 3.9 release, and this would fit perfectly within a 3.9 release plan, I'll post about this again more directly as we get there. Watch for more!
- amye

> Anything I missed?
>
> Happy to take suggestions and comments about what else we'd want to see in
> a Gluster Developer Summit!
>
> -- amye
>
> --
> Amye Scavarda | a...@redhat.com | Gluster Community Lead

Editing to add: As we're still working on a 3.9 release, and that would be a fantastic Community Bootstrap Challenge, I'm moving this around a bit. Rest assured, we'll do a hangout around this. For

--
Amye Scavarda | a...@redhat.com | Gluster Community Lead
[Gluster-users] gluster refusing to start
Hi,

For some reason I cannot get gluster to run on 2 of 3 nodes. Here is my fault-finding so far; I am out of ideas at the moment. Googling "polkitd[3969]: Unregistered Authentication Agent for unix-process:7541:985551 (system bus name :1.78, object path /org/freedesktop/PolicyKit1/Auth" isn't getting me far so far.

===
[root@glusterp1 ~]# rpm -qa |grep gluster
glusterfs-cli-3.8.5-1.el7.x86_64
glusterfs-libs-3.8.5-1.el7.x86_64
vdsm-gluster-4.18.13-1.el7.centos.noarch
centos-release-gluster38-1.0-1.el7.centos.noarch
glusterfs-fuse-3.8.5-1.el7.x86_64
glusterfs-client-xlators-3.8.5-1.el7.x86_64
glusterfs-server-3.8.5-1.el7.x86_64
glusterfs-3.8.5-1.el7.x86_64
glusterfs-geo-replication-3.8.5-1.el7.x86_64
glusterfs-api-3.8.5-1.el7.x86_64
[root@glusterp1 ~]# systemctl start glusterd.service
Job for glusterd.service failed because the control process exited with error code. See "systemctl status glusterd.service" and "journalctl -xe" for details.
[root@glusterp1 ~]# setenforce 1
[root@glusterp1 ~]# systemctl start glusterd.service
Job for glusterd.service failed because the control process exited with error code. See "systemctl status glusterd.service" and "journalctl -xe" for details.
[root@glusterp1 ~]# systemctl status glusterd.service
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2016-11-02 12:41:43 NZDT; 9s ago
  Process: 7760 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=1/FAILURE)

Nov 02 12:41:41 glusterp1.ods.graywitch.co.nz systemd[1]: Starting GlusterFS, a clustered file-system server...
Nov 02 12:41:43 glusterp1.ods.graywitch.co.nz systemd[1]: glusterd.service: control process exited, code=exited status=1
Nov 02 12:41:43 glusterp1.ods.graywitch.co.nz systemd[1]: Failed to start GlusterFS, a clustered file-system server.
Nov 02 12:41:43 glusterp1.ods.graywitch.co.nz systemd[1]: Unit glusterd.service entered failed state.
Nov 02 12:41:43 glusterp1.ods.graywitch.co.nz systemd[1]: glusterd.service failed.
[root@glusterp1 ~]# journalctl -xe
Nov 02 12:41:30 glusterp1.ods.graywitch.co.nz systemd[1]: glusterd.service: control process exited, code=exited status=1
Nov 02 12:41:30 glusterp1.ods.graywitch.co.nz systemd[1]: Failed to start GlusterFS, a clustered file-system server.
-- Subject: Unit glusterd.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit glusterd.service has failed.
--
-- The result is failed.
Nov 02 12:41:30 glusterp1.ods.graywitch.co.nz systemd[1]: Unit glusterd.service entered failed state.
Nov 02 12:41:30 glusterp1.ods.graywitch.co.nz systemd[1]: glusterd.service failed.
Nov 02 12:41:30 glusterp1.ods.graywitch.co.nz polkitd[3969]: Unregistered Authentication Agent for unix-process:7541:985551 (system bus name :1.78, object path /org/freedesktop/PolicyKit1/Auth
Nov 02 12:41:38 glusterp1.ods.graywitch.co.nz dbus-daemon[1005]: dbus[1005]: avc: received setenforce notice (enforcing=1)
Nov 02 12:41:38 glusterp1.ods.graywitch.co.nz dbus[1005]: avc: received setenforce notice (enforcing=1)
Nov 02 12:41:38 glusterp1.ods.graywitch.co.nz dbus[1005]: [system] Reloaded configuration
Nov 02 12:41:38 glusterp1.ods.graywitch.co.nz dbus-daemon[1005]: dbus[1005]: [system] Reloaded configuration
Nov 02 12:41:41 glusterp1.ods.graywitch.co.nz polkitd[3969]: Registered Authentication Agent for unix-process:7755:986850 (system bus name :1.79 [/usr/bin/pkttyagent --notify-fd 5 --fallback],
Nov 02 12:41:41 glusterp1.ods.graywitch.co.nz systemd[1]: Starting GlusterFS, a clustered file-system server...
-- Subject: Unit glusterd.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit glusterd.service has begun starting up.
Nov 02 12:41:43 glusterp1.ods.graywitch.co.nz systemd[1]: glusterd.service: control process exited, code=exited status=1
Nov 02 12:41:43 glusterp1.ods.graywitch.co.nz systemd[1]: Failed to start GlusterFS, a clustered file-system server.
-- Subject: Unit glusterd.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit glusterd.service has failed.
--
-- The result is failed.
Nov 02 12:41:43 glusterp1.ods.graywitch.co.nz systemd[1]: Unit glusterd.service entered failed state.
Nov 02 12:41:43 glusterp1.ods.graywitch.co.nz systemd[1]: glusterd.service failed.
Nov 02 12:41:43 glusterp1.ods.graywitch.co.nz polkitd[3969]: Unregistered Authentication Agent for unix-process:7755:986850 (system bus name :1.79, object path /org/freedesktop/PolicyKit1/Auth
[root@glusterp1 ~]#
===
Re: [Gluster-users] strange memory consumption with libgfapi
On 11/01/2016 10:04 AM, Pavel Cernohorsky wrote:
> For those who are interested, a colleague of mine found out the problem is
> this line:
>
> itable = inode_table_new (131072, new_subvol);
>
> in glfs-master.c (graph_setup function). That hard-coded number is huge! And
> looking at the history of the Gluster sources, it seems that this number used
> to be a number of bytes, but it became a number of inodes, and someone forgot
> to change this hard-coded value! Anybody from Red Hat here interested in
> fixing it?

Of course. Although fixing bugs in Community GlusterFS is not limited to just Red Hat employees. Everyone who finds a bug is strongly encouraged to file a bug report at
https://bugzilla.redhat.com/enter_bug.cgi?product=GlusterFS
(You are required to create an account to submit a bug.)

In this case I have already opened a bug for this. You can follow its status at
https://bugzilla.redhat.com/show_bug.cgi?id=1390614

And if you have the ability to fix it, you are strongly encouraged to submit your proposed fix to review.gluster.org. A HOWTO for submitting patches is at
http://gluster.readthedocs.io/en/latest/Developer-guide/Simplified-Development-Workflow/

Regards,

--
Kaleb
Re: [Gluster-users] strange memory consumption with libgfapi
For those who are interested, a colleague of mine found out the problem is this line:

itable = inode_table_new (131072, new_subvol);

in glfs-master.c (graph_setup function). That hard-coded number is huge! And looking at the history of the Gluster sources, it seems that this number used to be a number of bytes, but it became a number of inodes, and someone forgot to change this hard-coded value! Anybody from Red Hat here interested in fixing it?

Kind regards,
Pavel

On 10/25/2016 09:28 AM, Oleksandr Natalenko wrote:
> Hello.
>
> 25.10.2016 09:11, Pavel Cernohorsky wrote:
>> Unfortunately it is not possible to use valgrind properly, because libgfapi
>> seems to leak just by initializing and deinitializing (tested with different
>> code).
>
> Use Valgrind with the Massif tool. That would definitely help.
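For anyone wanting to reproduce the init/deinit memory observation mentioned above, here is a minimal libgfapi sketch (the volume name "myvol", the host "gluster-server", and the loop count are placeholders to adapt to your own setup; this only exercises glfs_init/glfs_fini, it is not a fix for the hard-coded inode table size). Build with: gcc repro.c -o repro -lgfapi, then watch the process RSS (for example with top, or run it under Valgrind's Massif) while it loops.

/* Repeatedly bring a gfapi client graph up and down so the memory that
 * survives glfs_fini() can be observed. */

#include <stdio.h>
#include <glusterfs/api/glfs.h>

int main(void)
{
    for (int i = 0; i < 10; i++) {
        glfs_t *fs = glfs_new("myvol");                 /* placeholder volume name */
        if (!fs)
            return 1;

        glfs_set_volfile_server(fs, "tcp", "gluster-server", 24007);
        glfs_set_logging(fs, "/dev/null", 1);

        if (glfs_init(fs) != 0) {                       /* builds the client graph,
                                                           which allocates the inode table */
            fprintf(stderr, "glfs_init failed\n");
            glfs_fini(fs);
            return 1;
        }

        glfs_fini(fs);                                  /* tear down; compare RSS across
                                                           iterations to spot growth */
    }
    return 0;
}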