Re: [Gluster-devel] [Gluster-users] A question of GlusterFS dentries!

2016-11-04 Thread Serkan Çoban
+1 for a "no-rewinddir-support" option in DHT.
We are seeing very slow directory listings, especially on our 1500+ brick
volume: 'ls' takes 20+ seconds on directories with 1000+ files.

On Wed, Nov 2, 2016 at 7:08 AM, Raghavendra Gowdappa
 wrote:
>
>
> - Original Message -
>> From: "Keiviw" 
>> To: gluster-devel@gluster.org
>> Sent: Tuesday, November 1, 2016 12:41:02 PM
>> Subject: [Gluster-devel] A question of GlusterFS dentries!
>>
>> Hi,
>> In GlusterFS distributed volumes, listing a non-empty directory is slow.
>> I read the DHT code and found the reason, but I was confused to see that
>> DHT traverses all the bricks in the volume sequentially. Why not use
>> multiple threads to read dentries from several bricks simultaneously?
>> That question has always puzzled me. Could you please tell me something
>> about this?
>
> readdir across subvols is sequential mostly because we have to support
> rewinddir(3). We need to maintain the mapping between offset and dentry
> across multiple invocations of readdir. In other words, if someone does a
> rewinddir to an offset corresponding to an earlier dentry, subsequent
> readdirs should return the same set of dentries that the earlier invocation
> of readdir returned. For example, in a hypothetical scenario, readdir
> returned the following dentries:
>
> 1. a, off=10
> 2. b, off=2
> 3. c, off=5
> 4. d, off=15
> 5. e, off=17
> 6. f, off=13
>
> Now if we do a rewinddir to off=5 and issue readdir again, we should get the
> following dentries:
> (c, off=5), (d, off=15), (e, off=17), (f, off=13)
>
> Within a subvol, the backend filesystem provides the rewinddir guarantee for
> the dentries present on that subvol. Across subvols, however, it is DHT's
> responsibility to provide the same guarantee, which means we need a
> well-defined order in which we send readdir calls (note that the order is
> not well defined if we do a parallel readdir across all subvols). So DHT
> does a sequential readdir, which gives a well-defined order for reading
> dentries.
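>
> To make the requirement concrete, here is a minimal, self-contained C
> sketch (illustrative only, not GlusterFS code; the mount path is just a
> placeholder). The mail uses "rewinddir" loosely for seeking back to a
> previously returned offset, which is what telldir(3)/seekdir(3) express:
> an application records an offset in the middle of a listing, seeks back to
> it later, and expects to see exactly the same tail of entries again. That
> is only possible if the offset-to-dentry mapping is stable across calls.
>
> #include <dirent.h>
> #include <stdio.h>
>
> int main(void)
> {
>         DIR *dir = opendir("/mnt/glustervol/somedir"); /* placeholder */
>         struct dirent *de;
>         long saved = -1;
>         int count = 0;
>
>         if (!dir)
>                 return 1;
>
>         while ((de = readdir(dir)) != NULL) {
>                 if (count++ == 2)
>                         saved = telldir(dir); /* offset after 3rd entry */
>                 printf("first pass: %s\n", de->d_name);
>         }
>
>         if (saved != -1) {
>                 /* Seeking back to a previously returned offset must yield
>                  * the same remaining entries as before, so the filesystem
>                  * (or DHT, across subvols) must map offsets to dentries
>                  * consistently across readdir invocations. */
>                 seekdir(dir, saved);
>                 while ((de = readdir(dir)) != NULL)
>                         printf("after seekdir: %s\n", de->d_name);
>         }
>
>         closedir(dir);
>         return 0;
> }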
>
> To give an example, say we have another subvol - subvol2 - (in addition to
> the subvol above, say subvol1) with the following listing:
> 1. g, off=16
> 2. h, off=20
> 3. i, off=3
> 4. j, off=19
>
> With parallel readdir we can get many orderings, like (a, b, g, h, i, c, d,
> e, f, j), (g, h, a, b, c, i, j, d, e, f), etc. Now suppose we do the
> following (with readdir done in parallel):
>
> 1. A complete listing of the directory. Since each subvol still returns its
> own entries in order, the combined listing can be any one of the C(10,4) =
> 210 possible interleavings of the two sequences.
> 2. A rewinddir to off=20.
>
> We cannot predict which set of dentries comes _after_ offset 20. However,
> if we do readdir sequentially across subvols there is only one possible
> directory listing, i.e. (a, b, c, d, e, f, g, h, i, j), so it is easier to
> support rewinddir.
>
> If there were no POSIX requirement for rewinddir support, I think a parallel
> readdir could easily be implemented (and it would improve performance too).
> But unfortunately rewinddir is still a POSIX requirement. This does open up
> another possibility: a "no-rewinddir-support" option in DHT which, if
> enabled, results in parallel readdirs across subvols. What I am not sure of
> is how many users still use rewinddir. If there is a critical mass that
> wants the performance at the cost of losing rewinddir support, this could
> be a good feature.
>
> +gluster-users to get an opinion on this.
>
> regards,
> Raghavendra
>


Re: [Gluster-devel] [Gluster-Maintainers] Gluster Test Thursday - Release 3.9

2016-11-04 Thread FNU Raghavendra Manjunath
Tested bitrot-related aspects. Created data, enabled bitrot, and created
more data. The files were signed by the bitrot daemon. Simulated corruption
by editing a file directly on the backend.
Triggered scrubbing (on demand) and found that the corrupted files were
marked bad by the scrubber.

Also ran general tests such as compiling the gluster code base on the mount
point and running dbench. The tests passed.

Still running some more tests. Will keep you updated.

Regards,
Raghavendra


On Fri, Nov 4, 2016 at 12:43 AM, Pranith Kumar Karampuri <
pkara...@redhat.com> wrote:

>
>
> On Thu, Nov 3, 2016 at 4:42 PM, Pranith Kumar Karampuri <
> pkara...@redhat.com> wrote:
>
>>
>>
>> On Thu, Nov 3, 2016 at 9:55 AM, Pranith Kumar Karampuri <
>> pkara...@redhat.com> wrote:
>>
>>>
>>>
>>> On Wed, Nov 2, 2016 at 7:00 PM, Krutika Dhananjay 
>>> wrote:
>>>
 Just finished testing VM storage use-case.

 *Volume configuration used:*

 [root@srv-1 ~]# gluster volume info

 Volume Name: rep
 Type: Replicate
 Volume ID: 2c603783-c1da-49b7-8100-0238c777b731
 Status: Started
 Snapshot Count: 0
 Number of Bricks: 1 x 3 = 3
 Transport-type: tcp
 Bricks:
 Brick1: srv-1:/bricks/rep1
 Brick2: srv-2:/bricks/rep2
 Brick3: srv-3:/bricks/rep4
 Options Reconfigured:
 nfs.disable: on
 performance.readdir-ahead: on
 transport.address-family: inet
 performance.quick-read: off
 performance.read-ahead: off
 performance.io-cache: off
 performance.stat-prefetch: off
 cluster.eager-lock: enable
 network.remote-dio: enable
 cluster.quorum-type: auto
 cluster.server-quorum-type: server
 features.shard: on
 cluster.granular-entry-heal: on
 cluster.locking-scheme: granular
 network.ping-timeout: 30
 server.allow-insecure: on
 storage.owner-uid: 107
 storage.owner-gid: 107
 cluster.data-self-heal-algorithm: full

 Used FUSE to mount the volume locally on each of the 3 nodes (no
 external clients).
 shard-block-size - 4MB.

 *TESTS AND RESULTS:*

 *What works:*

 * Created 3 vm images, one per hypervisor. Installed fedora 24 on all
 of them.
   Used virt-manager for ease of setting up the environment.
 Installation went fine. All green.

 * Rebooted the vms. Worked fine.

 * Killed brick-1. Ran dd on the three vms to create a 'src' file.
 Captured their md5sum value. Verified that
 the gfid indices and name indices are created under
 .glusterfs/indices/xattrop and .glusterfs/indices/entry-changes
 respectively as they should. Brought the brick back up. Waited until heal
 completed. Captured md5sum again. They matched.

 * Killed brick-2. Copied 'src' file from the step above into new file
 using dd. Captured md5sum on the newly created file.
 Checksum matched. Waited for heal to finish. Captured md5sum again.
 Everything matched.

 * Repeated the test above with brick-3 being killed and brought back up
 after a while. Worked fine.

 At the end I also captured md5sums from the backend of the shards on
 the three replicas. They all were found to be
 in sync. So far so good.

 *What did NOT work:*

 * Started dd again on all 3 vms to copy the existing files to new
 files. While dd was running, I ran replace-brick to replace the third brick
 with a new brick on the same node with a different path. This caused dd on
 all three vms to fail simultaneously with "Input/Output error". I tried to
 read from the files; even that failed. Rebooted the vms. By this time,
 /.shard was in split-brain as per heal-info, and the vms seem to have
 suffered corruption and are in an irrecoverable state.

 I checked the logs. The pattern is very similar to the one in the
 add-brick bug Lindsay reported here -
 https://bugzilla.redhat.com/show_bug.cgi?id=1387878. It seems like
 something is going wrong each time there is a graph switch.

 @Aravinda and Pranith:

 I will need some time to debug this, if the 3.9 release can wait until it
 is RC'd and fixed.
 Otherwise we will need to caution users that replace-brick, add-brick, etc.
 (or any form of graph switch for that matter) *might* cause vm corruption
 in 3.9.0, irrespective of whether they are using FUSE or gfapi.

 Let me know what your decision is.

>>>
>>> Since this bug is not a regression, let us document it as a known
>>> issue and do our best to get the fix into the next release.
>>>
>>> I am almost done with testing afr and ec.
>>>
>>> For afr, there were no leaks etc. in the tests I did, but I am seeing
>>> a performance drop in crawling-related tests.
>>>
>>> This is with 3.9.0rc2
>>> running directory_crawl_create ... done (252.91 secs)
>>> running directory_crawl ... done (104.83 secs)
>>> running 

Re: [Gluster-devel] quota-rename.t core in netbsd

2016-11-04 Thread Sanoj Unnikrishnan
yes!

I checked some of the newer logs now; it looks like the console message may
have been delayed in the earlier console logs.

I will also have to check the test case that runs prior to quota-rename.t
(./tests/basic/quota-nfs.t).
Thanks and Regards,
Sanoj

On Fri, Nov 4, 2016 at 7:30 PM, Emmanuel Dreyfus  wrote:

> Sanoj Unnikrishnan  wrote:
>
> > Ran the same steps as in quota-rename.t (manually though, multiple
> > times!); could not reproduce the issue.
>
> But running the test framework hits the bug reliably?
>
>
> --
> Emmanuel Dreyfus
> http://hcpnet.free.fr/pubz
> m...@netbsd.org
>

Re: [Gluster-devel] quota-rename.t core in netbsd

2016-11-04 Thread Emmanuel Dreyfus
Sanoj Unnikrishnan  wrote:

> Ran the same steps as in quota-rename.t (manually though, multiple
> times!); could not reproduce the issue.

But running the test framework hits the bug reliably?


-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org


Re: [Gluster-devel] github:gluster/container-storage - team create request

2016-11-04 Thread Michael Adam
On 2016-11-04 at 11:27 +0100, Michael Scherer wrote:
> On Thursday, November 3, 2016 at 19:04 +0100, Michael Adam wrote:
> > Hi all,
> > 
> > recently a new repo was created under the github/gluster org:
> > 
> > github.com/gluster/container-storage
> > 
> > This is supposed to become the home of gluster's container
> > storage project. This is the project which brings
> > gluster into the kubernetes/openshift container platform
> > as a provider of persistent storage volumes for the
> > application containers, with gluster's service interface
> > heketi (github.com/heketi/heketi) as the central hub
> > between kubernetes/openshift and glusterfs.
> > 
> > As of now, we only have the repo, and I hereby suggest
> > the creation of a container-storage-admin team
> > with admin powers on that repo, and I would request to be
> > made a member of that team.
> 
> We already have a ton of teams; any reason not to reuse an existing one?
> 
> In the end, this is starting to become a mess (and while that may be a lost
> battle, I am all for fighting against entropy), and we have to drain that
> swamp some day, so better to start now.

I don't technically need a new team for that.
I thought it would be a simple way to give more
fine-grained privileges to the new repo.

I don't care *how* I get the privileges,
so please feel free to do it better. What I need is:

  full admin and write access to the gluster/container-storage repo,
  including the right to manage other people's rights.

Thanks,

Michael



Re: [Gluster-devel] Possible problem introduced by http://review.gluster.org/15573

2016-11-04 Thread Raghavendra Talur
On Mon, Oct 24, 2016 at 12:39 PM, Xavier Hernandez 
wrote:

> Hi Soumya,
>
>
> On 21/10/16 16:15, Soumya Koduri wrote:
>
>>
>>
>> On 10/21/2016 06:35 PM, Soumya Koduri wrote:
>>
>>> Hi Xavi,
>>>
>>> On 10/21/2016 12:57 PM, Xavier Hernandez wrote:
>>>
 Looking at the code, I think that the added fd_unref() should only be
 called if the fop preparation fails. Otherwise the callback already
 unreferences the fd.

 Code flow:

 * glfs_fsync_async_common() takes an fd ref and calls STACK_WIND passing
 that fd.
 * Just after that a ref is released.
 * When glfs_io_async_cbk() is called another ref is released.

 Note that if fop preparation fails, a single fd_unref() is called, but
 on success two fd_unref() are called.

>>>
>>> Sorry for the inconvenience caused. I don't think patch #15573 has
>>> caused the problem; rather, it has highlighted another ref leak in the code.
>>>
>>> From the code I see that glfs_io_async_cbk() does an fd_unref on glfd->fd
>>> but not on the fd passed in STACK_WIND_COOKIE() of the fop.
>>>
>>> If I take any fop, for eg.,
>>> glfs_fsync_common() {
>>>
>>>fd = glfs_resolve_fd (glfd->fs, subvol, glfd);
>>>
>>>
>>> }
>>>
>>> Here in glfs_resolve_fd ()
>>>
>>> fd_t *
>>> __glfs_resolve_fd (struct glfs *fs, xlator_t *subvol, struct glfs_fd
>>> *glfd)
>>> {
>>> fd_t *fd = NULL;
>>>
>>> if (glfd->fd->inode->table->xl == subvol)
>>> return fd_ref (glfd->fd);
>>>
>>> Here we can see that we are taking an extra ref in addition to the
>>> ref already taken for glfd->fd. That means the caller of this function
>>> needs to fd_unref(fd) irrespective of the subsequent fd_unref (glfd->fd).
>>>
>>> fd = __glfs_migrate_fd (fs, subvol, glfd);
>>> if (!fd)
>>> return NULL;
>>>
>>>
>>> if (subvol == fs->active_subvol) {
>>> fd_unref (glfd->fd);
>>> glfd->fd = fd_ref (fd);
>>> }
>>>
>>> I think the issue is here during graph_switch(). You have mentioned as
>>> well that the crash happens post graph_switch. Maybe here we are missing
>>> an extra ref to be taken for fd in addition to glfd->fd. I need to look
>>> through __glfs_migrate_fd() to confirm that. But these are my initial
>>> thoughts.
>>>
>>
>> Looking into this, I think we should fix glfs_io_async_cbk() not to
>> fd_unref(glfd->fd). glfd->fd should be active throughout the lifetime of
>> glfd (i.e., until it is closed). Thoughts?
>>
>
> I don't know the gfapi internals in depth, but at first sight I think this
> would be the right thing to do. Assuming that glfd keeps a reference to
> the fd until it is destroyed, and that a glfd reference is taken during the
> lifetime of each request that needs it, the fd_unref() in
> glfs_io_async_cbk() seems unnecessary. I think it was there just to release
> the fd acquired in glfs_resolve_fd(), but it's better to place it where
> it is now.
>
> Another question: do we really need to take an additional reference in
> glfs_resolve_fd()?
>

This answers the first question too: no, we do not need the additional ref
in glfs_resolve_fd() now that we have ref accounting in glfd. The confusion
arose because earlier there was no ref accounting for glfd and refs were
taken only on the fd_t.
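
To illustrate the ownership rule being discussed, here is a minimal,
self-contained C sketch (illustrative only, not gfapi code; the obj_t type
and function names are simplified stand-ins for fd_t/fd_ref/fd_unref and
the async fop path): the submitter takes one reference for the in-flight
operation and the completion callback releases exactly that one. If the
submit path also dropped a reference right after submitting, the creator's
reference would disappear and the object would be destroyed while still in
use.

#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-in for a ref-counted fd_t. */
typedef struct {
        int refcount;
} obj_t;

static obj_t *obj_ref(obj_t *o)
{
        o->refcount++;
        return o;
}

static void obj_unref(obj_t *o)
{
        if (--o->refcount == 0) {
                printf("object destroyed\n");
                free(o);
        }
}

/* Completion callback: owns exactly one reference and releases it. */
static void io_done_cbk(obj_t *o)
{
        printf("callback: work done (refcount=%d)\n", o->refcount);
        obj_unref(o);           /* drop the ref taken for this operation */
}

/* Submit path: takes one reference for the in-flight operation. */
static void submit_async(obj_t *o)
{
        obj_ref(o);             /* held while the operation is in flight */
        io_done_cbk(o);         /* completion runs inline for simplicity */
        /* NOTE: an extra obj_unref(o) here would be the double-unref
         * pattern described in this thread. */
}

int main(void)
{
        obj_t *o = calloc(1, sizeof(*o));
        o->refcount = 1;        /* creator's reference */

        submit_async(o);        /* balanced: +1 on submit, -1 in callback */
        obj_unref(o);           /* creator drops its reference last */
        return 0;
}

With this rule, a glfd that holds its own long-lived reference to the fd
needs no extra per-call ref, which matches the conclusion above.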


>
> Can an fd returned by this function outlive the associated glfd in some
> circumstances?
>
>> Also, could you please check if it is the second/subsequent fsync_async()
>> call which results in the crash?
>>
>
> I'll try to test it as soon as possible, but this is on a server that we
> need to put in production very soon and we have decided to go with fuse for
> now. We'll have a lot of work to do this week. Once I have some free time
> I'll build a test environment to check it, probably next week.
>
>
I have not been able to test this out completely. Theoretically, I don't
see any possibility where an fd can outlive the glfd that points to it.
I have sent a patch[1] that
a. fixes the crash,
b. handles the unref in failure cases,
c. still keeps the duplicate refs, with some explanation in the commit
   message, and
d. adds a simple test for the same.

I request that we take this patch in and make a release soon, as the crash
is affecting many community users, and do the cleanup in another patch.

[1] http://review.gluster.org/#/c/15768/

Thanks,
Raghavendra Talur

> Xavi
>
>
>
>> Thanks,
>> Soumya
>>
>>
>>> Please let me know your comments.
>>>
>>> Thanks,
>>> Soumya
>>>
>>>
>>>
 Xavi

 On 21/10/16 09:03, Xavier Hernandez wrote:

> Hi,
>
> I've just tried Gluster 3.8.5 with Proxmox using gfapi and I
> consistently see a crash each time an attempt to connect to the volume
> is made.
>
> The backtrace of the crash shows this:
>
> #0  pthread_spin_lock () at
> ../nptl/sysdeps/x86_64/pthread_spin_lock.S:24
> #1  0x7fe5345776a5 in