Re: [Gluster-users] [Gluster-devel] A question of GlusterFS dentries!

2016-11-01 Thread Serkan Çoban
+1 for "no-rewinddir-support" option in DHT.
We are seeing very slow directory listings, especially with a 1500+ brick
volume; 'ls' takes 20+ seconds with 1000+ files.

On Wed, Nov 2, 2016 at 7:08 AM, Raghavendra Gowdappa
 wrote:
>
>
> - Original Message -
>> From: "Keiviw" 
>> To: gluster-de...@gluster.org
>> Sent: Tuesday, November 1, 2016 12:41:02 PM
>> Subject: [Gluster-devel] A question of GlusterFS dentries!
>>
>> Hi,
>> In GlusterFS distributed volumes, listing a non-empty directory is slow.
>> Then I read the DHT code and found the reasons. But I was confused that
>> GlusterFS DHT traverses all the bricks (in the volume) sequentially; why not
>> use multiple threads to read dentries from multiple bricks simultaneously?
>> That's a question that has always puzzled me. Could you please tell me
>> something about this?
>
> readdir across subvols is sequential mostly because we have to support 
> rewinddir(3). We need to maintain the mapping of offset and dentry across 
> multiple invocations of readdir. In other words, if someone did a rewinddir to 
> an offset corresponding to an earlier dentry, subsequent readdirs should return 
> the same set of dentries that the earlier invocation of readdir returned. For 
> example, in a hypothetical scenario, readdir returned the following dentries:
>
> 1. a, off=10
> 2. b, off=2
> 3. c, off=5
> 4. d, off=15
> 5. e, off=17
> 6. f, off=13
>
> Now if we do a rewinddir to offset 5 and issue readdir again, we should get the 
> following dentries:
> (c, off=5), (d, off=15), (e, off=17), (f, off=13)
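
[Aside: a minimal userspace sketch of the offset guarantee being described,
using the POSIX telldir(3)/seekdir(3) pair -- the "rewinddir to an offset"
above corresponds to seekdir in practice. The mount path and the entry name
"c" are illustrative only:

    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        DIR *dp = opendir("/mnt/glustervol/testdir");  /* hypothetical mount */
        if (!dp)
            return 1;

        struct dirent *de;
        long pos, saved = -1;

        /* First pass: remember the position of entry "c". */
        while (1) {
            pos = telldir(dp);      /* position of the entry about to be read */
            de = readdir(dp);
            if (!de)
                break;
            if (saved == -1 && strcmp(de->d_name, "c") == 0)
                saved = pos;        /* seeking here must yield c, d, e, f again */
            printf("%s\n", de->d_name);
        }

        /* Second pass: the filesystem must replay the same tail of entries. */
        if (saved != -1) {
            seekdir(dp, saved);
            while ((de = readdir(dp)) != NULL)
                printf("again: %s\n", de->d_name);
        }

        closedir(dp);
        return 0;
    }

DHT has to provide the same replay guarantee when the entries are spread
across subvols, which is what the rest of the mail explains.]
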
>
> Within a subvol, the backend filesystem provides the rewinddir guarantee for the 
> dentries present on that subvol. However, across subvols it is the 
> responsibility of DHT to provide the above guarantee. This means we need 
> some well-defined order in which we send readdir calls (note that the 
> order is not well defined if we do a parallel readdir across all subvols). 
> So DHT does a sequential readdir, which gives a well-defined order of reading 
> dentries.
>
> To give an example if we have another subvol - subvol2 - (in addiction to the 
> subvol above - say subvol1) with following listing:
> 1. g, off=16
> 2. h, off=20
> 3. i, off=3
> 4. j, off=19
>
> With parallel readdir we can have many orderings, like (a, b, g, h, i, c, d, 
> e, f, j), (g, h, a, b, c, i, j, d, e, f), etc. Now if we do (with readdir done 
> in parallel):
>
> 1. A complete listing of the directory (which can be any one of 10P1 = 10 
> ways - I hope math is correct here).
> 2. Do rewinddir (20)
>
> We cannot predict the set of dentries that come _after_ offset 20. 
> However, if we do a readdir sequentially across subvols there is only one 
> directory listing, i.e., (a, b, c, d, e, f, g, h, i, j). So it's easier to 
> support rewinddir.
>
> If there were no POSIX requirement for rewinddir support, I think a parallel 
> readdir could easily be implemented (which would improve performance too). But 
> unfortunately rewinddir is still a POSIX requirement. This also opens up 
> another possibility of a "no-rewinddir-support" option in DHT, which if 
> enabled results in parallel readdirs across subvols. What I am not sure of is 
> how many users still use rewinddir. If there is a critical mass that wants 
> performance with the tradeoff of no rewinddir support, this could be a good 
> feature.
>
> +gluster-users to get an opinion on this.
>
> regards,
> Raghavendra
>
>>
>>
>>
>>
>>
>>
>> ___
>> Gluster-devel mailing list
>> gluster-de...@gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] [Gluster-devel] A question of GlusterFS dentries!

2016-11-01 Thread Raghavendra Gowdappa


- Original Message -
> From: "Raghavendra Gowdappa" 
> To: "Keiviw" 
> Cc: gluster-de...@gluster.org, "gluster-users" 
> Sent: Wednesday, November 2, 2016 9:38:46 AM
> Subject: Re: [Gluster-devel] A question of GlusterFS dentries!
> 
> 
> 
> - Original Message -
> > From: "Keiviw" 
> > To: gluster-de...@gluster.org
> > Sent: Tuesday, November 1, 2016 12:41:02 PM
> > Subject: [Gluster-devel] A question of GlusterFS dentries!
> > 
> > Hi,
> > In GlusterFS distributed volumes, listing a non-empty directory is slow.
> > Then I read the DHT code and found the reasons. But I was confused that
> > GlusterFS DHT traverses all the bricks (in the volume) sequentially; why not
> > use multiple threads to read dentries from multiple bricks simultaneously?
> > That's a question that has always puzzled me. Could you please tell me
> > something about this?
> 
> readdir across subvols is sequential mostly because we have to support
> rewinddir(3). We need to maintain the mapping of offset and dentry across
> multiple invocations of readdir. In other words, if someone did a rewinddir
> to an offset corresponding to an earlier dentry, subsequent readdirs should
> return the same set of dentries that the earlier invocation of readdir returned.
> For example, in a hypothetical scenario, readdir returned the following
> dentries:
> 
> 1. a, off=10
> 2. b, off=2
> 3. c, off=5
> 4. d, off=15
> 5. e, off=17
> 6. f, off=13
> 
> Now if we do a rewinddir to offset 5 and issue readdir again, we should get
> the following dentries:
> (c, off=5), (d, off=15), (e, off=17), (f, off=13)
> 
> Within a subvol, the backend filesystem provides the rewinddir guarantee for the
> dentries present on that subvol. However, across subvols it is the
> responsibility of DHT to provide the above guarantee. This means we need
> some well-defined order in which we send readdir calls (note that the
> order is not well defined if we do a parallel readdir across all subvols).
> So DHT does a sequential readdir, which gives a well-defined order of reading
> dentries.
> 
> To give an example if we have another subvol - subvol2 - (in addiction to the

s/addiction/addition/

> subvol above - say subvol1) with following listing:
> 1. g, off=16
> 2. h, off=20
> 3. i, off=3
> 4. j, off=19
> 
> With parallel readdir we can have many orderings, like (a, b, g, h, i, c, d,
> e, f, j), (g, h, a, b, c, i, j, d, e, f), etc. Now if we do (with readdir
> done in parallel):
> 
> 1. A complete listing of the directory (which can be any one of 10P1 = 10

I think it is 10P10 = 3628800. But again, it is not a completely random selection, 
as readdir on a single subvol still gives one ordering, so the value is much 
less. The point here is that there can be many possible listings with parallel 
readdir.
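
(A back-of-envelope refinement: if each subvol's own order is preserved, the
possible merged listings are just the interleavings of a 6-entry and a
4-entry sequence, i.e. at most 10! / (6! * 4!) = 210 -- far fewer than
10P10 = 3628800, consistent with "the value is much less" above.)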

> ways - I hope math is correct here).
> 2. Do rewinddir (20)
> 
> We cannot predict the set of dentries that come _after_ offset 20.
> However, if we do a readdir sequentially across subvols there is only one
> directory listing, i.e., (a, b, c, d, e, f, g, h, i, j). So it's easier to
> support rewinddir.
> 
> If there were no POSIX requirement for rewinddir support, I think a parallel
> readdir could easily be implemented (which would improve performance too). But
> unfortunately rewinddir is still a POSIX requirement. This also opens up
> another possibility of a "no-rewinddir-support" option in DHT, which if
> enabled results in parallel readdirs across subvols. What I am not sure of is
> how many users still use rewinddir. If there is a critical mass that wants
> performance with the tradeoff of no rewinddir support, this could be a good
> feature.
> 
> +gluster-users to get an opinion on this.
> 
> regards,
> Raghavendra
> 
> > 
> > 
> > 
> > 
> > 
> > 
> > ___
> > Gluster-devel mailing list
> > gluster-de...@gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] [Gluster-devel] A question of GlusterFS dentries!

2016-11-01 Thread Raghavendra Gowdappa


- Original Message -
> From: "Keiviw" 
> To: gluster-de...@gluster.org
> Sent: Tuesday, November 1, 2016 12:41:02 PM
> Subject: [Gluster-devel] A question of GlusterFS dentries!
> 
> Hi,
> In GlusterFS distributed volumes, listing a non-empty directory is slow.
> Then I read the DHT code and found the reasons. But I was confused that
> GlusterFS DHT traverses all the bricks (in the volume) sequentially; why not
> use multiple threads to read dentries from multiple bricks simultaneously?
> That's a question that has always puzzled me. Could you please tell me
> something about this?

readdir across subvols is sequential mostly because we have to support 
rewinddir(3). We need to maintain the mapping of offset and dentry across 
multiple invocations of readdir. In other words, if someone did a rewinddir to 
an offset corresponding to an earlier dentry, subsequent readdirs should return 
the same set of dentries that the earlier invocation of readdir returned. For 
example, in a hypothetical scenario, readdir returned the following dentries:

1. a, off=10
2. b, off=2
3. c, off=5
4. d, off=15
5. e, off=17
6. f, off=13

Now if we do a rewinddir to offset 5 and issue readdir again, we should get the 
following dentries:
(c, off=5), (d, off=15), (e, off=17), (f, off=13)

Within a subvol, the backend filesystem provides the rewinddir guarantee for the 
dentries present on that subvol. However, across subvols it is the 
responsibility of DHT to provide the above guarantee. This means we need some 
well-defined order in which we send readdir calls (note that the order is not 
well defined if we do a parallel readdir across all subvols). So DHT does a 
sequential readdir, which gives a well-defined order of reading dentries.
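
To make that concrete, here is a toy sketch (not DHT's actual implementation
or d_off encoding; all names are made up) of a sequential merged readdir where
the directory offset packs (subvol index, local offset), so seeking back to
any previously returned offset deterministically replays the same tail:

    #include <stdint.h>
    #include <stdio.h>

    #define NSUBVOLS 2

    /* Toy stand-in for per-subvol listings; a real implementation would
     * issue readdir to each child subvolume instead. */
    static const char *subvol[NSUBVOLS][6] = {
        { "a", "b", "c", "d", "e", "f" },
        { "g", "h", "i", "j" },
    };
    static const int subvol_count[NSUBVOLS] = { 6, 4 };

    /* Pack (subvol index, local offset) into one 64-bit directory offset. */
    static uint64_t pack(uint64_t s, uint64_t local) { return (s << 32) | local; }

    /* List entries starting from a packed offset, walking subvols in a fixed
     * order. Because the order is fixed, seeking back to any offset returned
     * earlier always replays the same tail of entries. */
    static void list_from(uint64_t off)
    {
        uint64_t s = off >> 32, local = off & 0xffffffffULL;
        for (; s < NSUBVOLS; s++, local = 0)
            for (uint64_t i = local; i < (uint64_t)subvol_count[s]; i++)
                printf("%s off=%llu\n", subvol[s][i],
                       (unsigned long long)pack(s, i + 1));
    }

    int main(void)
    {
        list_from(0);             /* full listing: a..f then g..j */
        puts("-- seek back to subvol2, entry g --");
        list_from(pack(1, 0));    /* deterministically resumes at g, h, i, j */
        return 0;
    }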

To give an example if we have another subvol - subvol2 - (in addiction to the 
subvol above - say subvol1) with following listing:
1. g, off=16
2. h, off=20
3. i, off=3
4. j, off=19

With parallel readdir we can have many orderings, like (a, b, g, h, i, c, d, e, 
f, j), (g, h, a, b, c, i, j, d, e, f), etc. Now if we do (with readdir done 
in parallel):

1. A complete listing of the directory (which can be any one of 10P1 = 10 ways 
- I hope math is correct here).
2. Do rewinddir (20)

We cannot predict the set of dentries that come _after_ offset 20. 
However, if we do a readdir sequentially across subvols there is only one 
directory listing, i.e., (a, b, c, d, e, f, g, h, i, j). So it's easier to 
support rewinddir.

If there were no POSIX requirement for rewinddir support, I think a parallel 
readdir could easily be implemented (which would improve performance too). But 
unfortunately rewinddir is still a POSIX requirement. This also opens up 
another possibility of a "no-rewinddir-support" option in DHT, which if enabled 
results in parallel readdirs across subvols. What I am not sure of is how many 
users still use rewinddir. If there is a critical mass that wants performance 
with the tradeoff of no rewinddir support, this could be a good feature.
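
Purely to illustrate that hypothetical mode (this is not GlusterFS code; all
names are made up): with rewinddir out of the picture, each subvol could be
listed from its own worker thread and the results emitted in completion order,
which is where the speedup comes from -- and also why no stable offset
ordering survives:

    #include <pthread.h>
    #include <stdio.h>

    #define NSUBVOLS 2

    /* Hypothetical per-subvol listings; in reality these would come from
     * readdirp calls issued to each child subvolume. */
    static const char *entries[NSUBVOLS][6] = {
        { "a", "b", "c", "d", "e", "f" },
        { "g", "h", "i", "j" },
    };

    static void *list_subvol(void *arg)
    {
        int idx = *(int *)arg;
        /* Workers run concurrently and may interleave in any order, so the
         * merged listing has no single well-defined offset sequence -- which
         * is exactly why rewinddir could not be honoured in this mode. */
        for (int i = 0; i < 6 && entries[idx][i]; i++)
            printf("subvol%d: %s\n", idx + 1, entries[idx][i]);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NSUBVOLS];
        int ids[NSUBVOLS];

        for (int i = 0; i < NSUBVOLS; i++) {
            ids[i] = i;
            pthread_create(&tid[i], NULL, list_subvol, &ids[i]);
        }
        for (int i = 0; i < NSUBVOLS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }

Run twice, the merged order can differ, so there is nothing stable to seek
back to.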

+gluster-users to get an opinion on this.

regards,
Raghavendra

> 
> 
> 
> 
> 
> 
> ___
> Gluster-devel mailing list
> gluster-de...@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


[Gluster-users] Improving IOPS

2016-11-01 Thread Lindsay Mathieson
And after having posted about the dangers of premature optimisation ... 
any suggestions for improving IOPS? As per earlier suggestions I tried 
setting server.event-threads and client.event-threads to 4, but it made 
no real difference.



nb: the limiting factor on my cluster is the network (2 * 1G).


--
Lindsay Mathieson

___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


[Gluster-users] Shared Heal Times

2016-11-01 Thread Lindsay Mathieson
Just an update - after resetting all the heal "optimisations" :) that I set, 
heals are in general much faster and back to normal. I've done several 
rolling upgrades with the servers since, rebooting each one in turn. 
Usually around 300 64MB shards will need healing after each boot. It 
spends about 2-3 minutes doing some fairly intensive CPU work, then another 
10 minutes to complete the heal. All up, around 15 minutes per server. 
I'm more than satisfied with that.



So no real problem other than PEBKAC.


Moral of the story - as always, tuning settings for optimisation almost 
never works.




thanks,

--
Lindsay Mathieson

___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


[Gluster-users] Following up on Community Bootstrap Challenge

2016-11-01 Thread Amye Scavarda
On Mon, Oct 24, 2016 at 10:20 AM, Amye Scavarda  wrote:

> Notes from our Gluster Developer Summit 2016 in Berlin!
>
> Videos
> Slides
> Flickr Group
> Public Etherpad
> Bootstrapping Challenge
>
> All of the videos from Gluster Developer Summit are now live on our
> YouTube channel, and slides are available in our Slideshare accounts. We've
> also created a Flickr group, please add your photos of the event!
>
> https://www.youtube.com/user/GlusterCommunity
> http://www.slideshare.net/GlusterCommunity
> https://www.flickr.com/groups/glusterdevelopersummit2016/
>
> We've also got a public etherpad for our comments from the event:
> https://public.pad.fsfe.org/p/gluster-developer-summit-2016
>
> Please feel free to add to this and help keep our momentum from this
> event! I'm looking for the community maintainers to take a strong hand in
> here to be able to tell us what they're focusing on this from this event
> over the next three months.
>
> One thing that we didn't get to that I wanted to do was a Community Bootstrap
> Challenge, so let's do this as a hangout after the Community Meeting on
> November 2nd. I'll send out a separate email on this describing the event,
> and we'll all join in at 1pm UTC.
>
> As we're still working on a 3.9 release, and this would fit perfectly
within a 3.9 release plan, I'll post about this again more directly as we
get there.
Watch for more!

- amye


> Anything I missed?
>
> Happy to take suggestions and comments about what else we'd want to see in
> a Gluster Developer Summit!
>
> -- amye
>
> --
> Amye Scavarda | a...@redhat.com | Gluster Community Lead
>

Editing to add:
As we're still working on a 3.9 release, and that would be a fantastic
Community Bootstrap Challenge, I'm moving this around a bit.
Rest assured, we'll do a hangout around this.

For

-- 
Amye Scavarda | a...@redhat.com | Gluster Community Lead
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] gluster refusing to start

2016-11-01 Thread Thing
Hi,

For some reason I cannot get gluster to run on 2 of 3 nodes.

Here is my fault-finding so far; I'm out of ideas at the moment. Googling
"polkitd[3969]: Unregistered Authentication Agent for
unix-process:7541:985551 (system bus name :1.78, object path
/org/freedesktop/PolicyKit1/Auth"

isn't getting me anywhere so far.

===

[root@glusterp1 ~]# rpm -qa |grep gluster

glusterfs-cli-3.8.5-1.el7.x86_64

glusterfs-libs-3.8.5-1.el7.x86_64

vdsm-gluster-4.18.13-1.el7.centos.noarch

centos-release-gluster38-1.0-1.el7.centos.noarch

glusterfs-fuse-3.8.5-1.el7.x86_64

glusterfs-client-xlators-3.8.5-1.el7.x86_64

glusterfs-server-3.8.5-1.el7.x86_64

glusterfs-3.8.5-1.el7.x86_64

glusterfs-geo-replication-3.8.5-1.el7.x86_64

glusterfs-api-3.8.5-1.el7.x86_64

[root@glusterp1 ~]# systemctl start glusterd.service

Job for glusterd.service failed because the control process exited with
error code. See "systemctl status glusterd.service" and "journalctl -xe"
for details.

[root@glusterp1 ~]# setenforce 1

[root@glusterp1 ~]# systemctl start glusterd.service

Job for glusterd.service failed because the control process exited with
error code. See "systemctl status glusterd.service" and "journalctl -xe"
for details.

[root@glusterp1 ~]# systemctl status glusterd.service

● glusterd.service - GlusterFS, a clustered file-system server

Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor
preset: disabled)

Active: failed (Result: exit-code) since Wed 2016-11-02 12:41:43 NZDT; 9s
ago

Process: 7760 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid
--log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=1/FAILURE)


Nov 02 12:41:41 glusterp1.ods.graywitch.co.nz systemd[1]: Starting
GlusterFS, a clustered file-system server...

Nov 02 12:41:43 glusterp1.ods.graywitch.co.nz systemd[1]: glusterd.service:
control process exited, code=exited status=1

Nov 02 12:41:43 glusterp1.ods.graywitch.co.nz systemd[1]: Failed to start
GlusterFS, a clustered file-system server.

Nov 02 12:41:43 glusterp1.ods.graywitch.co.nz systemd[1]: Unit
glusterd.service entered failed state.

Nov 02 12:41:43 glusterp1.ods.graywitch.co.nz systemd[1]: glusterd.service
failed.

[root@glusterp1 ~]# journalctl -xe

Nov 02 12:41:30 glusterp1.ods.graywitch.co.nz systemd[1]: glusterd.service:
control process exited, code=exited status=1

Nov 02 12:41:30 glusterp1.ods.graywitch.co.nz systemd[1]: Failed to start
GlusterFS, a clustered file-system server.

-- Subject: Unit glusterd.service has failed

-- Defined-By: systemd

-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- 

-- Unit glusterd.service has failed.

-- 

-- The result is failed.

Nov 02 12:41:30 glusterp1.ods.graywitch.co.nz systemd[1]: Unit
glusterd.service entered failed state.

Nov 02 12:41:30 glusterp1.ods.graywitch.co.nz systemd[1]: glusterd.service
failed.

Nov 02 12:41:30 glusterp1.ods.graywitch.co.nz polkitd[3969]: Unregistered
Authentication Agent for unix-process:7541:985551 (system bus name :1.78,
object path /org/freedesktop/PolicyKit1/Auth

Nov 02 12:41:38 glusterp1.ods.graywitch.co.nz dbus-daemon[1005]:
dbus[1005]: avc: received setenforce notice (enforcing=1)

Nov 02 12:41:38 glusterp1.ods.graywitch.co.nz dbus[1005]: avc: received
setenforce notice (enforcing=1)

Nov 02 12:41:38 glusterp1.ods.graywitch.co.nz dbus[1005]: [system] Reloaded
configuration

Nov 02 12:41:38 glusterp1.ods.graywitch.co.nz dbus-daemon[1005]:
dbus[1005]: [system] Reloaded configuration

Nov 02 12:41:41 glusterp1.ods.graywitch.co.nz polkitd[3969]: Registered
Authentication Agent for unix-process:7755:986850 (system bus name :1.79
[/usr/bin/pkttyagent --notify-fd 5 --fallback],

Nov 02 12:41:41 glusterp1.ods.graywitch.co.nz systemd[1]: Starting
GlusterFS, a clustered file-system server...

-- Subject: Unit glusterd.service has begun start-up

-- Defined-By: systemd

-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- 

-- Unit glusterd.service has begun starting up.

Nov 02 12:41:43 glusterp1.ods.graywitch.co.nz systemd[1]: glusterd.service:
control process exited, code=exited status=1

Nov 02 12:41:43 glusterp1.ods.graywitch.co.nz systemd[1]: Failed to start
GlusterFS, a clustered file-system server.

-- Subject: Unit glusterd.service has failed

-- Defined-By: systemd

-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- 

-- Unit glusterd.service has failed.

-- 

-- The result is failed.

Nov 02 12:41:43 glusterp1.ods.graywitch.co.nz systemd[1]: Unit
glusterd.service entered failed state.

Nov 02 12:41:43 glusterp1.ods.graywitch.co.nz systemd[1]: glusterd.service
failed.

Nov 02 12:41:43 glusterp1.ods.graywitch.co.nz polkitd[3969]: Unregistered
Authentication Agent for unix-process:7755:986850 (system bus name :1.79,
object path /org/freedesktop/PolicyKit1/Auth

[root@glusterp1 ~]#
===
___
Gluster-users mailing list
Gluster-users@gluster.org

Re: [Gluster-users] strange memory consumption with libgfapi

2016-11-01 Thread Kaleb S. KEITHLEY

On 11/01/2016 10:04 AM, Pavel Cernohorsky wrote:

For those who are interested, a colleague of mine found out that the problem is
this line:

itable = inode_table_new (131072, new_subvol);

in glfs-master.c (graph_setup function). That hard-coded number is huge!
And looking at the history of the Gluster sources, it seems that this number
used to be a number of bytes, but it became a number of inodes, and
someone forgot to change the hard-coded value accordingly!

Anybody from Red Hat here interested in fixing it?



Of course. Although fixing bugs in Community GlusterFS is not limited to 
just Red Hat employees.


Everyone who finds a bug is strongly encouraged to file a bug report at 
https://bugzilla.redhat.com/enter_bug.cgi?product=GlusterFS


(You are required to create an account to submit a bug.)

In this case I have already opened a bug for this. You can follow its 
status at https://bugzilla.redhat.com/show_bug.cgi?id=1390614


And if you have the ability to fix it, you are strongly encouraged to 
submit your proposed fix to review.gluster.org. A HOWTO for submitting 
patches is at 
http://gluster.readthedocs.io/en/latest/Developer-guide/Simplified-Development-Workflow/


Regards,

--

Kaleb


___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] strange memory consumption with libgfapi

2016-11-01 Thread Pavel Cernohorsky
For those who are interested, a colleague of mine found out that the problem is 
this line:


itable = inode_table_new (131072, new_subvol);

in glfs-master.c (graph_setup function). That hard-coded number is huge! 
And looking at the history of the Gluster sources, it seems that this number 
used to be a number of bytes, but it became a number of inodes, and 
someone forgot to change the hard-coded value accordingly!
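
Purely as a sketch of the kind of change being suggested (not the actual fix;
"inode_lru_limit" is a made-up per-instance tunable here and 2048 an arbitrary
default), graph_setup() could size the table from a setting instead of the
fixed count:

    /* glfs-master.c, graph_setup() -- hypothetical variant of the line above */
    uint32_t lru_limit = fs->inode_lru_limit > 0 ? fs->inode_lru_limit : 2048;
    itable = inode_table_new (lru_limit, new_subvol);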


Anybody from Red Hat here interested in fixing it?

Kind regards,

Pavel


On 10/25/2016 09:28 AM, Oleksandr Natalenko wrote:

Hello.

25.10.2016 09:11, Pavel Cernohorsky wrote:

Unfortunately it is not
possible to use valgrind properly, because libgfapi seems to leak just
by initializing and deinitializing (tested with different code).


Use Valgrind with Massif tool. That would definitely help.


___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users