Re: [Gluster-devel] [Gluster-users] A question of GlusterFS dentries!
+1 for a "no-rewinddir-support" option in DHT. We are seeing very slow directory listings, especially with a 1500+ brick volume: 'ls' takes 20+ seconds with 1000+ files.

On Wed, Nov 2, 2016 at 7:08 AM, Raghavendra Gowdappa wrote:
>
> - Original Message -
>> From: "Keiviw"
>> To: gluster-devel@gluster.org
>> Sent: Tuesday, November 1, 2016 12:41:02 PM
>> Subject: [Gluster-devel] A question of GlusterFS dentries!
>>
>> Hi,
>> In GlusterFS distributed volumes, listing a non-empty directory was slow.
>> Then I read the dht code and found the reasons. But I was confused that
>> GlusterFS dht traversed all the bricks (in the volume) sequentially. Why
>> not use multiple threads to read dentries from multiple bricks
>> simultaneously? That's a question that has always puzzled me. Could you
>> please tell me something about this?
>
> readdir across subvols is sequential mostly because we have to support
> rewinddir(3). We need to maintain the mapping of offset and dentry across
> multiple invocations of readdir. In other words, if someone did a rewinddir
> to an offset corresponding to an earlier dentry, subsequent readdirs should
> return the same set of dentries that the earlier invocation of readdir
> returned. For example, in a hypothetical scenario, readdir returned the
> following dentries:
>
> 1. a, off=10
> 2. b, off=2
> 3. c, off=5
> 4. d, off=15
> 5. e, off=17
> 6. f, off=13
>
> Now if we rewind to off 5 and issue readdir again, we should get the
> following dentries:
> (c, off=5), (d, off=15), (e, off=17), (f, off=13)
>
> Within a subvol, the backend filesystem provides the rewinddir guarantee
> for the dentries present on that subvol. However, across subvols it is the
> responsibility of DHT to provide the above guarantee, which means we need
> some well-defined order in which we send readdir calls (note that the order
> is not well defined if we do a parallel readdir across all subvols). So
> DHT does sequential readdir, which is a well-defined order of reading
> dentries.
>
> To give an example, if we have another subvol - subvol2 - (in addition to
> the subvol above - say subvol1) with the following listing:
> 1. g, off=16
> 2. h, off=20
> 3. i, off=3
> 4. j, off=19
>
> With parallel readdir we can have many orderings like (a, b, g, h, i, c, d,
> e, f, j), (g, h, a, b, c, i, j, d, e, f) etc. Now suppose that (with
> readdir done in parallel) we do:
>
> 1. A complete listing of the directory (which can be any one of
> C(10,4) = 210 possible interleavings of the two subvol listings).
> 2. A rewind to offset 20.
>
> We cannot predict which set of dentries comes _after_ offset 20. However,
> if we do readdir sequentially across subvols there is only one directory
> listing, i.e. (a, b, c, d, e, f, g, h, i, j), so it is easier to support
> rewinddir.
>
> If there were no POSIX requirement for rewinddir support, I think a
> parallel readdir could easily be implemented (which would improve
> performance too). But unfortunately rewinddir is still a POSIX requirement.
> This also opens up another possibility: a "no-rewinddir-support" option in
> DHT, which if enabled results in parallel readdirs across subvols. What I
> am not sure of is how many users still use rewinddir. If there is a
> critical mass that wants performance with a tradeoff of no rewinddir
> support, this could be a good feature.
>
> +gluster-users to get an opinion on this.
>
> regards,
> Raghavendra

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-Maintainers] Gluster Test Thursday - Release 3.9
Tested Bitrot-related aspects. Created data, enabled bitrot, and created more data. The files were signed by the bitrot daemon. Simulated corruption by editing a file directly in the backend. Triggered scrubbing (on demand). Found that the corrupted files were marked bad by the scrubber. Also ran general tests such as compiling the gluster code base on the mount point, and dbench. The tests passed. Still running some more tests; will keep you updated.

Regards,
Raghavendra

On Fri, Nov 4, 2016 at 12:43 AM, Pranith Kumar Karampuri <pkara...@redhat.com> wrote:
>
> On Thu, Nov 3, 2016 at 4:42 PM, Pranith Kumar Karampuri <pkara...@redhat.com> wrote:
>>
>> On Thu, Nov 3, 2016 at 9:55 AM, Pranith Kumar Karampuri <pkara...@redhat.com> wrote:
>>>
>>> On Wed, Nov 2, 2016 at 7:00 PM, Krutika Dhananjay wrote:

Just finished testing the VM storage use-case.

*Volume configuration used:*

[root@srv-1 ~]# gluster volume info

Volume Name: rep
Type: Replicate
Volume ID: 2c603783-c1da-49b7-8100-0238c777b731
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: srv-1:/bricks/rep1
Brick2: srv-2:/bricks/rep2
Brick3: srv-3:/bricks/rep4
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
features.shard: on
cluster.granular-entry-heal: on
cluster.locking-scheme: granular
network.ping-timeout: 30
server.allow-insecure: on
storage.owner-uid: 107
storage.owner-gid: 107
cluster.data-self-heal-algorithm: full

Used FUSE to mount the volume locally on each of the 3 nodes (no external clients). shard-block-size - 4MB.

*TESTS AND RESULTS:*

*What works:*

* Created 3 vm images, one per hypervisor. Installed Fedora 24 on all of them.
Used virt-manager for ease of setting up the environment. Installation went fine. All green.

* Rebooted the vms. Worked fine.

* Killed brick-1. Ran dd on the three vms to create a 'src' file. Captured their md5sum values. Verified that the gfid indices and name indices were created under .glusterfs/indices/xattrop and .glusterfs/indices/entry-changes respectively, as they should be. Brought the brick back up. Waited until heal completed. Captured md5sums again. They matched.

* Killed brick-2. Copied the 'src' file from the step above into a new file using dd. Captured the md5sum of the newly created file. Checksums matched. Waited for heal to finish. Captured md5sums again. Everything matched.

* Repeated the test above with brick-3 being killed and brought back up after a while. Worked fine. At the end I also captured md5sums from the backend of the shards on the three replicas. They were all found to be in sync. So far so good.

*What did NOT work:*

* Started dd again on all 3 vms to copy the existing files to new files. While dd was running, I ran replace-brick to replace the third brick with a new brick on the same node with a different path. This caused dd on all three vms to fail simultaneously with "Input/Output error". I tried to read from the files; even that failed. Rebooted the vms. By this time, /.shard was in split-brain as per heal-info, and the vms seem to have suffered corruption and are in an irrecoverable state.

I checked the logs. The pattern is very similar to the one in the add-brick bug Lindsay reported here - https://bugzilla.redhat.com/show_bug.cgi?id=1387878. It seems like something goes wrong each time there is a graph switch.

@Aravinda and Pranith: I will need some time to debug this, if the 3.9 release can wait until it is RC'd and fixed.
Otherwise we will need to caution users that replace-brick, add-brick etc. (or any form of graph switch for that matter) *might* cause vm corruption in 3.9.0, irrespective of whether they are using FUSE or gfapi. Let me know what your decision is.

>>> Since this bug is not a regression, let us document it as a known
>>> issue. Let us do our best to get the fix into the next release.
>>>
>>> I am almost done with testing afr and ec.
>>>
>>> For afr, leaks etc. were not there in the tests I did,
>>> but I am seeing a performance drop for crawling-related tests.
>>>
>>> This is with 3.9.0rc2:
>>> running directory_crawl_create ... done (252.91 secs)
>>> running directory_crawl ... done (104.83 secs)
>>> running
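The checksum-based verification used in the kill-brick/heal tests above boils down to a simple dd-and-compare loop. A minimal sketch of just that data-integrity check (paths are made up, and the gluster-specific steps - killing a brick and waiting for 'gluster volume heal <vol> info' to drain - are elided as comments):

```shell
#!/bin/sh
set -e

# Stand-in for the mount point under test.
mnt=$(mktemp -d)

# Create a 'src' file with dd and record its checksum.
dd if=/dev/urandom of="$mnt/src" bs=1M count=4 2>/dev/null
sum_before=$(md5sum "$mnt/src" | awk '{print $1}')

# ... kill a brick here, do more I/O, bring the brick back,
# ... and wait until heal-info reports zero pending entries ...

# Copy src into a new file with dd, then re-checksum everything.
dd if="$mnt/src" of="$mnt/copy" bs=1M 2>/dev/null
sum_copy=$(md5sum "$mnt/copy" | awk '{print $1}')
sum_after=$(md5sum "$mnt/src" | awk '{print $1}')

# Both the original and the copy must still match the recorded checksum.
[ "$sum_before" = "$sum_after" ] || { echo "src changed!"; exit 1; }
[ "$sum_before" = "$sum_copy" ]  || { echo "copy differs!"; exit 1; }
echo "checksums match"
```

In the real test the checksums are captured from inside the vms and, at the end, from the shard files on each replica's backend.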
Re: [Gluster-devel] quota-rename.t core in netbsd
Yes! I checked some of the newer logs now; it looks like the console message may be delayed in the earlier console logs. I would have to check the TC prior to quota-rename.t as well (./tests/basic/quota-nfs.t).

Thanks and Regards,
Sanoj

On Fri, Nov 4, 2016 at 7:30 PM, Emmanuel Dreyfus wrote:
> Sanoj Unnikrishnan wrote:
>
> > Ran the same steps as in quota-rename.t (manually though, multiple
> > times!), could not reproduce the issue.
>
> But running the test framework hits the bug reliably?
>
> --
> Emmanuel Dreyfus
> http://hcpnet.free.fr/pubz
> m...@netbsd.org
Re: [Gluster-devel] quota-rename.t core in netbsd
Sanoj Unnikrishnan wrote:
> Ran the same steps as in quota-rename.t (manually though, multiple
> times!), could not reproduce the issue.

But running the test framework hits the bug reliably?

--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
Re: [Gluster-devel] github:gluster/container-storage - team create request
On 2016-11-04 at 11:27 +0100, Michael Scherer wrote:
> On Thursday, November 3, 2016 at 19:04 +0100, Michael Adam wrote:
> > Hi all,
> >
> > Recently a new repo was created under the github/gluster org:
> >
> > github.com/gluster/container-storage
> >
> > This is supposed to become the home of gluster's container
> > storage project. This is the project that brings
> > gluster into the kubernetes/openshift container platform
> > as a provider of persistent storage volumes for
> > application containers, with gluster's service interface
> > heketi (github.com/heketi/heketi) as the central hub
> > between kubernetes/openshift and glusterfs.
> >
> > As of now we only have the repo, so I hereby suggest
> > the creation of a container-storage-admin team
> > with admin powers on that repo, and I would request to be
> > made a member of that team.
>
> We already have a ton of teams; any reason not to reuse an existing one?
>
> In the end, this is starting to become a mess (and while that's a lost
> battle, I am all for fighting against entropy), and we have to drain that
> swamp some day, so better to start now.

I don't technically need a new team for that. I thought it would be a simple way to give more fine-grained privileges on the new repo. I don't care *how* I get the privileges, so please feel free to do it better. What I need: full admin and write access to the gluster-storage repo, including the right to manage other people's rights.

Thanks,
Michael
Re: [Gluster-devel] Possible problem introduced by http://review.gluster.org/15573
On Mon, Oct 24, 2016 at 12:39 PM, Xavier Hernandez wrote:
> Hi Soumya,
>
> On 21/10/16 16:15, Soumya Koduri wrote:
>>
>> On 10/21/2016 06:35 PM, Soumya Koduri wrote:
>>> Hi Xavi,
>>>
>>> On 10/21/2016 12:57 PM, Xavier Hernandez wrote:
>>>> Looking at the code, I think that the added fd_unref() should only be
>>>> called if the fop preparation fails. Otherwise the callback already
>>>> unreferences the fd.
>>>>
>>>> Code flow:
>>>>
>>>> * glfs_fsync_async_common() takes an fd ref and calls STACK_WIND
>>>>   passing that fd.
>>>> * Just after that a ref is released.
>>>> * When glfs_io_async_cbk() is called another ref is released.
>>>>
>>>> Note that if fop preparation fails, a single fd_unref() is called, but
>>>> on success two fd_unref() are called.
>>>
>>> Sorry for the inconvenience caused. I think patch #15573 hasn't
>>> caused the problem but has highlighted another ref leak in the code.
>>>
>>> From the code I see that glfs_io_async_cbk() does fd_unref (glfd->fd)
>>> but not an unref on the fd passed in STACK_WIND_COOKIE() of the fop.
>>>
>>> If I take any fop, for eg.,
>>>
>>> glfs_fsync_common() {
>>> ...
>>> fd = glfs_resolve_fd (glfd->fs, subvol, glfd);
>>> ...
>>> }
>>>
>>> Here in glfs_resolve_fd():
>>>
>>> fd_t *
>>> __glfs_resolve_fd (struct glfs *fs, xlator_t *subvol, struct glfs_fd *glfd)
>>> {
>>>     fd_t *fd = NULL;
>>>
>>>     if (glfd->fd->inode->table->xl == subvol)
>>>         return fd_ref (glfd->fd);
>>>
>>> Here we can see that we are taking an extra ref in addition to the
>>> ref already taken for glfd->fd. That means the caller of this function
>>> needs to fd_unref(fd) irrespective of the subsequent fd_unref (glfd->fd).
>>>
>>>     fd = __glfs_migrate_fd (fs, subvol, glfd);
>>>     if (!fd)
>>>         return NULL;
>>>
>>>     if (subvol == fs->active_subvol) {
>>>         fd_unref (glfd->fd);
>>>         glfd->fd = fd_ref (fd);
>>>     }
>>>
>>> I think the issue is here during graph_switch(). You have
>>> mentioned as well that the crash happens post graph_switch.
>>> Maybe here we are missing an extra ref to be taken for fd in addition
>>> to glfd->fd. I need to look through __glfs_migrate_fd() to confirm
>>> that. But these are my initial thoughts.
>>
>> Looking into this, I think we should fix glfs_io_async_cbk() not to
>> fd_unref(glfd->fd). glfd->fd should be active throughout the lifetime of
>> glfd (i.e., until it is closed). Thoughts?
>
> I don't know gfapi internals in depth, but at first sight I think this
> would be the right thing to do. Assuming that glfd keeps a reference to
> the fd until it's destroyed, and that a glfd reference is taken during the
> lifetime of each request that needs it, the fd_unref() in
> glfs_io_async_cbk() seems unnecessary. I think it was there just to release
> the fd acquired in glfs_resolve_fd(), but it's better to place it where
> it is now.
>
> Another question is whether we really need to take an additional
> reference in glfs_resolve_fd()?

This answers the first question too: we don't need the additional ref in glfs_resolve_fd() now that we have ref accounting in glfd. The confusion arose because earlier there was no ref accounting for glfd and the only refs were on fd_t. No, we don't need this additional reference.

> Can an fd returned by this function live longer than the associated
> glfd in some circumstances?
>
>> Also, could you please check if it is the second/subsequent fsync_async()
>> call which results in the crash?
>
> I'll try to test it as soon as possible, but this is on a server that we
> need to put into production very soon and we have decided to go with FUSE
> for now. We'll have a lot of work to do this week. Once I have some free
> time I'll build a test environment to check it, probably next week.

I have not been able to test this out completely. Theoretically, I don't see any possibility where the fd can outlive the glfd that points to it. I have sent a patch [1] that:

a. fixes the crash
b. handles the unref in failure cases
c.
but still has the duplicate refs, with some explanation in the commit message.
d. adds a simple test for the same.

I request that we take this patch in and make a release soon, as it is affecting many community users, and do the cleanup in another patch.

[1] http://review.gluster.org/#/c/15768/

Thanks,
Raghavendra Talur

> Xavi
>
>> Thanks,
>> Soumya
>>
>>> Please let me know your comments.
>>>
>>> Thanks,
>>> Soumya
>>>
>>>> Xavi
>>>>
>>>> On 21/10/16 09:03, Xavier Hernandez wrote:
>>>>> Hi,
>>>>>
>>>>> I've just tried Gluster 3.8.5 with Proxmox using gfapi and I
>>>>> consistently see a crash each time an attempt to connect to the
>>>>> volume is made.
>>>>>
>>>>> The backtrace of the crash shows this:
>>>>>
>>>>> #0 pthread_spin_lock () at
>>>>> ../nptl/sysdeps/x86_64/pthread_spin_lock.S:24
>>>>> #1 0x7fe5345776a5 in