[Gluster-devel] GlusterFS-3.7.6-2 packages for Debian Wheezy now available
Hi,

If you're a Debian Wheezy user, please give the new packages a try.

Thanks

-- Kaleb
Re: [Gluster-devel] [Gluster-users] Trashcan issue with vim editor
On Fri, 2016-01-29 at 18:59 +0530, PankaJ Singh wrote:
> Hi,
>
> Thanks Anoop for the help.
> Would you please tell me when we can expect this new release with
> this bug fix?

Please find the corresponding patch posted for mainline at [1]. I am
not sure whether we can backport the same and include it in 3.7.8. I
will update the thread asap.

[1] https://review.gluster.org/#/c/13346/

--Anoop C S.

> Thanks & Regards
>
> On Fri, Jan 29, 2016 at 12:42 PM, Anoop C S wrote:
> > On Wed, 2016-01-27 at 15:25 +0530, PankaJ Singh wrote:
> > >
> > > Hi,
> > >
> > > We are using gluster 3.7.6 on ubuntu 14.04. We are facing an issue
> > > with the trashcan feature. Our scenario is as follows:
> > >
> > > 1. 2 node server (ubuntu 14.04 with glusterfs 3.7.6)
> > > 2. 1 client node (ubuntu 14.04)
> > > 3. I have created one volume vol1 with 2 bricks in replica and
> > >    with transport = tcp mode.
> > > 4. I have enabled quota on vol1.
> > > 5. Now I have enabled the trashcan feature on vol1.
> > > 6. Now I have mounted vol1 on the client's home directory:
> > >    "mount -t glusterfs -o transport=tcp server-1:/vol1 /home/"
> > > 7. Now when I log in as any existing non-root user and perform
> > >    any editing via the vim editor, I get the error "E200: *ReadPre
> > >    autocommands made the file unreadable" and my user's home
> > >    directory permissions get changed to 000. After some time these
> > >    permissions revert back automatically.
> > >
> > > (NOTE: users' home directories are copied into the mounted
> > > glusterfs volume vol1)
> >
> > As discussed over IRC, we will definitely look into this issue [1]
> > and get back asap. On the other side, I have some solid reasons for
> > recommending not to use the swap/backup files created/used by Vim
> > when trash is enabled for a volume (assuming you have the basic
> > vimrc config where swap/backup files are enabled by default):
> >
> > 1. You will see a lot of foo.swpx/foo.swp files (with a time stamp
> >    appended to their filenames) inside the trashcan, as Vim creates
> >    and removes these swap files every now and then.
> >
> > 2. Regarding backup files, you will notice a list of files named
> >    4913 inside .trashcan. These files are created and deleted by Vim
> >    to make sure that it can create files in the current directory,
> >    and of course every time you save with :w.
> >
> > 3. Similar is the case with undo files like .foo.un~.
> >
> > 4. Last but not least, every time you do a :w, Vim performs a
> >    truncate operation, which causes the previous version of the file
> >    to be moved to .trashcan.
> >
> > Having said that, you can add the following lines to your vimrc file
> > to prevent the unnecessary files described in the first 3 points
> > from landing inside .trashcan:
> >
> >     set noundofile
> >     set noswapfile
> >     set nobackup
> >     set nowritebackup
> >
> > As per the current implementation, we cannot prevent previous
> > versions of a file being created inside the trash directory, and I
> > think these files will serve as backup files in the future, which is
> > a good-to-have feature.
> >
> > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1302307
> >
> > --Anoop C S
> >
> > > Thanks & Regards
> > > PankaJ Singh
Re: [Gluster-devel] Non-blocking lock for renames
- Original Message - > From: "Raghavendra Gowdappa" > To: "Vijay Bellur" > Cc: "Gluster Devel" > Sent: Thursday, February 4, 2016 11:28:29 AM > Subject: Re: [Gluster-devel] Non-blocking lock for renames > > > > - Original Message - > > From: "Vijay Bellur" > > To: "Shyamsundar Ranganathan" , "Raghavendra Gowdappa" > > > > Cc: "Gluster Devel" > > Sent: Thursday, February 4, 2016 9:55:04 AM > > Subject: Non-blocking lock for renames > > > > DHT developers, > > > > We introduced a non-blocking lock prior to a rename operation in dht and > > fail the rename if the lock acquisition is not successful with 3.6. I > > ran into an user in IRC yesterday who is affected by this behavior change: > > > > "We're seeing a behavior in Gluster 3.7.x that we did not see in 3.4.x > > and we're not sure how to fix it. When multiple processes are attempting > > to rename a file to the same destination at once, we're now seeing > > "Device or resource busy" and "Stale file handle" errors. Here's the > > command to replicate it: cd /mnt/glustermount; while true; do > > FILE=$RANDOM; touch $FILE; mv $FILE file-fv; done. The above command > > would be ran on two or three servers within the same gluster cluster. In > > the output, one would always be sucessfull in the rename, while the 2 > > other ones would fail with the above error." > > > > The use case for concurrent renames was described as: > > > > "we generate files and push them to the gluster cluster. Some are > > generated multiple times and end up being pushed to the cluster at the > > same time by different data generators; resulting in the 'rename > > collision'. We use also the cluster.extra-hash-regex to make sure the > > data is written in place. And this does the rename." > > > > Is a non-blocking lock essential? Can we not use a blocking lock instead > > of a non-blocking lock or fallback to a blocking lock if the original > > non-blocking lock acquisition fails? > > This lock synchronizes: > 1. rename from application with file migration from rebalance process [1]. > 2. multiple renames from application on same file. > > I think lock is still required for 1. However, since migration can > potentially take large time, we chose a non-blocking lock to make sure > application is not blocked for longer period. > > The case 2 is what causing the issue mentioned in this thread. We did see > some files being removed with parallel renames on the same file. But, by the > time we had identified that its a bug in 'mv' (mv issues an unlink on src if > src and dst happens to be hardlinks [2]. But test for hardlink check and > unlink are not atomic. Dht breaks rename into a series of links and > unlinks), we had introduced synchronizing b/w renames. So, we have two > options: > > 1. Use different domains for use cases 1 and 2 above. With different domains, > use-case 2 above can be changed to use blocking locks. It might not be > advisable to use blocking locks for use-case 1. > 2. Since we identified the issue is with mv (I couldn't find another bug we > filed on mv, but [2] is close to it), probably we don't need locking in 2 at > all. > > Suggestions? 
> > [1] https://bugzilla.redhat.com/show_bug.cgi?id=969298#c8 > [2] https://bugzilla.redhat.com/show_bug.cgi?id=438076 Found the bug, we had filed on mv: [2] https://bugzilla.redhat.com/show_bug.cgi?id=1141368 > > regards, > Raghavendra > > > > Thanks, > > Vijay > > > > > > > > > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
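For reference, the fallback Vijay suggests -- try a non-blocking
acquisition first and only block if that fails -- looks roughly like
the following. This is a minimal sketch using a POSIX mutex as a
stand-in for DHT's inodelk/entrylk domains; the real gluster locking
FOPs are not shown, and all names are illustrative:

    #include <pthread.h>
    #include <stdio.h>

    /* Stand-in for a DHT lock guarding renames on one file. */
    static pthread_mutex_t rename_lock = PTHREAD_MUTEX_INITIALIZER;

    int rename_with_fallback(const char *src, const char *dst)
    {
        /* First try the cheap non-blocking acquisition. */
        if (pthread_mutex_trylock(&rename_lock) != 0) {
            /* Contended: instead of failing the rename with EBUSY,
             * fall back to a blocking acquisition (use case 2,
             * rename-vs-rename serialization). */
            pthread_mutex_lock(&rename_lock);
        }

        /* ... perform the link/unlink sequence of the rename ... */
        printf("renamed %s -> %s\n", src, dst);

        pthread_mutex_unlock(&rename_lock);
        return 0;
    }

With separate lock domains (option 1 above), the rebalance-vs-rename
path could keep the non-blocking behavior while only the
rename-vs-rename path blocks.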
Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
On 02/04/2016 09:38 AM, Vijay Bellur wrote:
On 02/03/2016 11:34 AM, Venky Shankar wrote:
On Wed, Feb 03, 2016 at 09:24:06AM -0500, Jeff Darcy wrote:

Problem is with workloads which know the files that need to be read
without readdir, like hyperlinks (webserver), swift objects etc. These
are the two I know of which will have this problem, which can't be
improved because we don't have metadata and data co-located. I have
been trying to think of a solution for the past few days. Nothing good
is coming up :-/

In those cases, caching (at the MDS) would certainly help a lot. Some
variation of the compounding infrastructure under development for
Samba etc. might also apply, since this really is a compound
operation.

Compounding in this case can help, but still, without the cache, the
read has to go to the DS, and with such compounding, the MDS would
reach out to the DS for the information rather than the client.

Another possibility, based on what we decide as the cache mechanism:
when a client is done modifying a file, the MDS would refresh its size
and mtime attributes by fetching them from the DS. As part of this
refresh, the DS could additionally send back the content if the file
size falls in range, with the MDS persisting it and sending it back
for subsequent lookup calls as it does now. The content (on the MDS)
can be zapped once the file size crosses the defined limit.

Venky, when you say persisting, I assume on disk, is that right? If
so, then the MDS storage size requirements would increase (based on
the amount of file data that needs to be stored). As of now it is only
inodes, and as we move to a DB, a record. In this case we may have
*fatter* MDS partitions. Any comments/thoughts on that? As with
memory, I would assume some form of eviction of data from the MDS as a
possibility, to control the space utilization here.

I like the idea. However the memory implications of maintaining
content in the MDS are something to watch out for. quick-read is
interested in files of size 64k by default, and with a reasonable
number of files in that range, we might end up consuming significant
memory with this scheme.

Vijay, I think what Venky states is to stash the file on local storage
and not in memory. If it were in memory, then brick process restarts
would nuke the cache, and we would either need mechanisms to
rebuild/warm the cache or just start caching afresh. If we were
caching in memory, then yes, the concern is valid, and one possibility
is some form of LRU for the same, to keep memory consumption in check.
Overall I would steer away from memory for this use case and use the
disk, as we do not know which files to cache (well, in either case,
but disk offers us more space to possibly punt on that issue).

For files where the cache is missing and the file is small enough,
either perform an async read from the client (gaining some overlap
time with the app) or just let it be, as we would get the open/read
anyway, but it would slow things down.

-Vijay
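To make the eviction idea above concrete, here is a toy sketch of an
LRU index over content cached on the MDS, capped by total bytes.
Everything here -- names, limits, the flat table -- is invented for
illustration and is not gluster code; the on-disk blob handling is
only hinted at in a comment:

    #include <stdint.h>
    #include <stdio.h>

    #define CACHE_CAP_BYTES (64 * 1024 * 1024)
    #define MAX_ENTRIES     1024

    struct cache_entry {
        char     gfid[37];   /* textual GFID (UUID) of the file */
        uint64_t size;       /* bytes of cached content */
        uint64_t last_used;  /* logical clock for LRU ordering */
        int      in_use;
    };

    static struct cache_entry table[MAX_ENTRIES];
    static uint64_t lru_clock, used_bytes;

    static void evict_lru(void)
    {
        int victim = -1;
        for (int i = 0; i < MAX_ENTRIES; i++)
            if (table[i].in_use && (victim < 0 ||
                table[i].last_used < table[victim].last_used))
                victim = i;
        if (victim >= 0) {
            used_bytes -= table[victim].size;
            table[victim].in_use = 0;
            /* here the on-disk blob for this GFID would be unlinked */
        }
    }

    void cache_insert(const char *gfid, uint64_t size)
    {
        if (size > CACHE_CAP_BYTES)
            return;                  /* too big to cache at all */
        while (used_bytes + size > CACHE_CAP_BYTES)
            evict_lru();             /* make room, oldest first */
        for (int i = 0; i < MAX_ENTRIES; i++) {
            if (!table[i].in_use) {
                snprintf(table[i].gfid, sizeof(table[i].gfid),
                         "%s", gfid);
                table[i].size      = size;
                table[i].last_used = ++lru_clock;
                table[i].in_use    = 1;
                used_bytes        += size;
                return;
            }
        }
    }

The same index works whether the blobs live in memory or on disk; only
the eviction action differs.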
Re: [Gluster-devel] Non-blocking lock for renames
- Original Message - > From: "Vijay Bellur" > To: "Shyamsundar Ranganathan" , "Raghavendra Gowdappa" > > Cc: "Gluster Devel" > Sent: Thursday, February 4, 2016 9:55:04 AM > Subject: Non-blocking lock for renames > > DHT developers, > > We introduced a non-blocking lock prior to a rename operation in dht and > fail the rename if the lock acquisition is not successful with 3.6. I > ran into an user in IRC yesterday who is affected by this behavior change: > > "We're seeing a behavior in Gluster 3.7.x that we did not see in 3.4.x > and we're not sure how to fix it. When multiple processes are attempting > to rename a file to the same destination at once, we're now seeing > "Device or resource busy" and "Stale file handle" errors. Here's the > command to replicate it: cd /mnt/glustermount; while true; do > FILE=$RANDOM; touch $FILE; mv $FILE file-fv; done. The above command > would be ran on two or three servers within the same gluster cluster. In > the output, one would always be sucessfull in the rename, while the 2 > other ones would fail with the above error." > > The use case for concurrent renames was described as: > > "we generate files and push them to the gluster cluster. Some are > generated multiple times and end up being pushed to the cluster at the > same time by different data generators; resulting in the 'rename > collision'. We use also the cluster.extra-hash-regex to make sure the > data is written in place. And this does the rename." > > Is a non-blocking lock essential? Can we not use a blocking lock instead > of a non-blocking lock or fallback to a blocking lock if the original > non-blocking lock acquisition fails? This lock synchronizes: 1. rename from application with file migration from rebalance process [1]. 2. multiple renames from application on same file. I think lock is still required for 1. However, since migration can potentially take large time, we chose a non-blocking lock to make sure application is not blocked for longer period. The case 2 is what causing the issue mentioned in this thread. We did see some files being removed with parallel renames on the same file. But, by the time we had identified that its a bug in 'mv' (mv issues an unlink on src if src and dst happens to be hardlinks [2]. But test for hardlink check and unlink are not atomic. Dht breaks rename into a series of links and unlinks), we had introduced synchronizing b/w renames. So, we have two options: 1. Use different domains for use cases 1 and 2 above. With different domains, use-case 2 above can be changed to use blocking locks. It might not be advisable to use blocking locks for use-case 1. 2. Since we identified the issue is with mv (I couldn't find another bug we filed on mv, but [2] is close to it), probably we don't need locking in 2 at all. Suggestions? [1] https://bugzilla.redhat.com/show_bug.cgi?id=969298#c8 [2] https://bugzilla.redhat.com/show_bug.cgi?id=438076 regards, Raghavendra > > Thanks, > Vijay > > > > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Contributing to Gluster
Hi Willy,

It's great to see your interest in GlusterFS. But due to some changes
in the lookup-related area, the current lookup-optimize design no
longer holds good. Are there any other areas in GlusterFS that you
would like to work on? Please let us know so that we can help you
further.

- Original Message -
From: "Kaushal M"
To: "Willy Soesanto"
Cc: gluster-devel@gluster.org, "Shyam", saban...@redhat.com
Sent: Wednesday, February 3, 2016 8:15:32 PM
Subject: Re: [Gluster-devel] Contributing to Gluster

Maybe Shyam and Sakshi (in cc) can be helpful on this topic. They've
been involved in the implementation of lookup-optimize.

~kaushal

On Tue, Feb 2, 2016 at 5:10 PM, Willy Soesanto wrote:
> Hi Gluster-Devs,
>
> My name is Willy. I am a final-year undergraduate student from
> Bandung Institute of Technology. My final-year project is about
> Gluster. After researching Gluster for a while, I would like to take
> up the task of working on lookup self-heal
> (https://public.pad.fsfe.org/p/dht_lookup_optimize). Are there any
> steps I should follow beforehand?
>
> Thanks,
>
> Willy
[Gluster-devel] Non-blocking lock for renames
DHT developers,

We introduced a non-blocking lock prior to a rename operation in dht
and fail the rename if the lock acquisition is not successful with
3.6. I ran into a user on IRC yesterday who is affected by this
behavior change:

"We're seeing a behavior in Gluster 3.7.x that we did not see in 3.4.x
and we're not sure how to fix it. When multiple processes are
attempting to rename a file to the same destination at once, we're now
seeing "Device or resource busy" and "Stale file handle" errors.
Here's the command to replicate it: cd /mnt/glustermount; while true;
do FILE=$RANDOM; touch $FILE; mv $FILE file-fv; done. The above
command would be run on two or three servers within the same gluster
cluster. In the output, one would always be successful in the rename,
while the two other ones would fail with the above error."

The use case for concurrent renames was described as:

"we generate files and push them to the gluster cluster. Some are
generated multiple times and end up being pushed to the cluster at the
same time by different data generators, resulting in the 'rename
collision'. We also use the cluster.extra-hash-regex to make sure the
data is written in place. And this does the rename."

Is a non-blocking lock essential? Can we not use a blocking lock
instead of a non-blocking lock, or fall back to a blocking lock if the
original non-blocking lock acquisition fails?

Thanks,
Vijay
Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
On 02/03/2016 11:34 AM, Venky Shankar wrote:
On Wed, Feb 03, 2016 at 09:24:06AM -0500, Jeff Darcy wrote:

Problem is with workloads which know the files that need to be read
without readdir, like hyperlinks (webserver), swift objects etc. These
are the two I know of which will have this problem, which can't be
improved because we don't have metadata and data co-located. I have
been trying to think of a solution for the past few days. Nothing good
is coming up :-/

In those cases, caching (at the MDS) would certainly help a lot. Some
variation of the compounding infrastructure under development for
Samba etc. might also apply, since this really is a compound
operation.

When a client is done modifying a file, the MDS would refresh its size
and mtime attributes by fetching them from the DS. As part of this
refresh, the DS could additionally send back the content if the file
size falls in range, with the MDS persisting it and sending it back
for subsequent lookup calls as it does now. The content (on the MDS)
can be zapped once the file size crosses the defined limit.

I like the idea. However the memory implications of maintaining
content in the MDS are something to watch out for. quick-read is
interested in files of size 64k by default, and with a reasonable
number of files in that range, we might end up consuming significant
memory with this scheme.

-Vijay
Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
On Wed, Feb 03, 2016 at 09:24:06AM -0500, Jeff Darcy wrote:
> > Problem is with workloads which know the files that need to be read
> > without readdir, like hyperlinks (webserver), swift objects etc.
> > These are the two I know of which will have this problem, which
> > can't be improved because we don't have metadata and data
> > co-located. I have been trying to think of a solution for the past
> > few days. Nothing good is coming up :-/
>
> In those cases, caching (at the MDS) would certainly help a lot. Some
> variation of the compounding infrastructure under development for
> Samba etc. might also apply, since this really is a compound
> operation.

When a client is done modifying a file, the MDS would refresh its size
and mtime attributes by fetching them from the DS. As part of this
refresh, the DS could additionally send back the content if the file
size falls in range, with the MDS persisting it and sending it back
for subsequent lookup calls as it does now. The content (on the MDS)
can be zapped once the file size crosses the defined limit.

But when there are open file descriptors on an inode (O_RDWR ||
O_WRONLY on a file), the size cannot be trusted (as the MDS only knows
about the updated size after the last close), which would be the
degraded case.

Thanks,
Venky
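Venky's last point reduces to a small validity rule: inlined content
may be served from the MDS only while the size can be trusted. The
sketch below states this compactly using invented field names and an
illustrative size cap; it is not MDS code:

    #include <stdbool.h>
    #include <stdint.h>

    #define CONTENT_LIMIT (64 * 1024)  /* illustrative size cap */

    struct mds_inode {
        uint64_t size;         /* size as last known to the MDS */
        int      open_writers; /* O_RDWR/O_WRONLY descriptors open */
        bool     has_content;  /* content stashed with metadata */
    };

    /* Serve inlined content only when no writer can have changed
     * the file since the MDS last refreshed it; open writers are
     * the degraded case described above. */
    bool can_inline_content(const struct mds_inode *in)
    {
        return in->has_content &&
               in->open_writers == 0 &&
               in->size <= CONTENT_LIMIT;
    }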
Re: [Gluster-devel] REMINDER: Weekly Gluster Community meeting starts in 1 hour
I missed doing a #startmeeting, so we don't have meeting minutes for
this week's meeting. I've opened a request with fedora-infra to
manually import the log [1], and I'll update the list with the minutes
once complete.

~kaushal

[1] https://fedorahosted.org/fedora-infrastructure/ticket/5091

On Wed, Feb 3, 2016 at 4:34 PM, Niels de Vos wrote:
>
> Hi all,
>
> The weekly Gluster community meeting is starting in 1 hour at 12:00
> UTC. The current agenda for the meeting is below. Add any further
> topics to the agenda at
> https://public.pad.fsfe.org/p/gluster-community-meetings
>
> Meeting details:
> - location: #gluster-meeting on Freenode IRC
> - date: every Wednesday
> - time: 8:00 EDT, 12:00 UTC, 13:00 CET, 17:30 IST
>   (in your terminal, run: date -d "12:00 UTC")
>
> Current Agenda:
> * Roll Call
> * AIs from last meeting
> * GlusterFS 3.7
> * GlusterFS 3.6
> * GlusterFS 3.5
> * GlusterFS 3.8
> * GlusterFS 4.0
> * Open Floor
>
> See you there,
> Niels
Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
On 02/03/2016 07:54 PM, Jeff Darcy wrote:

Problem is with workloads which know the files that need to be read
without readdir, like hyperlinks (webserver), swift objects etc. These
are the two I know of which will have this problem, which can't be
improved because we don't have metadata and data co-located. I have
been trying to think of a solution for the past few days. Nothing good
is coming up :-/

In those cases, caching (at the MDS) would certainly help a lot. Some
variation of the compounding infrastructure under development for
Samba etc. might also apply, since this really is a compound
operation.

The above is certainly an option; I need to process it a bit more to
respond sanely.

Another one is to generate the GFID for a file with parGFID+basename
as input (which was something Pranith brought up a few mails back in
this chain). There was concern that we would have GFID clashes, but
further reasoning suggests that we would not. An example follows.

Good cases:
- /D1/File is created, with the top 2 bytes of the file's GFID as the
  bucket (same as the D1 bucket), and the rest of the GFID as some
  UUID generation from the pGFID (GFID of D1) + base name.
- When this file is looked up by name, its GFID can be generated at
  the client side as a hint, and the same fan-out of a lookup to the
  MDS and a read to the DS can be initiated.
  * The READ data is valid only when the lookup agrees on the same
    GFID for the file.

Bad cases:
- On a rename, the GFID of the file does not change, so if /D1/File
  was renamed to /D2/File1, then a subsequent lookup could fail to
  prefetch the read, as the GFID hint generated is now based on the
  GFID of D2 and the new name File1.
- If post-rename /D1/File is created again, the GFID
  generated/requested by the client for this file would clash with the
  already-generated GFID, hence the DHT server would decide to return
  a new GFID that has no relation to the one generated by the hint --
  again resulting in the hint failing.

So with the above scheme, as long as files are not renamed, the hint
serves its purpose of prefetching with just the name and parGFID.

One gotcha is that I see a pattern with applications that create a tmp
file and then rename it to the real file name -- sort of a swap file
that is then renamed to the real file as needed. For all such
applications the hints above would fail. I believe Swift also uses a
similar trick on the FS to rename an object once it is considered
fully written. Another case would be a compile workload.

So overall the above scheme could work to alleviate the problem
somewhat, but may cause harm in other cases (where the GFID hint is
incorrect and so we end up sending a read without reason).

The above could easily be prototyped with DHT2 to see its benefits, so
we will try that out at some point in the future.

Shyam
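One way to realize the "UUID generation of pGFID + base name" above is
a name-based digest in the style of an RFC 4122 version-5 UUID, so the
client and the DHT server derive the same hint independently. A sketch
using OpenSSL's SHA-1 -- an illustration of the scheme, not DHT2 code:

    #include <openssl/sha.h>
    #include <string.h>

    /* Deterministic GFID hint from parent GFID + basename. Equal
     * inputs always give the same hint; a rename changes pgfid or
     * name and hence the hint, which is exactly the failure mode
     * described above. */
    void gfid_hint(const unsigned char pgfid[16], const char *name,
                   unsigned char out[16])
    {
        unsigned char digest[SHA_DIGEST_LENGTH];
        SHA_CTX ctx;

        SHA1_Init(&ctx);
        SHA1_Update(&ctx, pgfid, 16);
        SHA1_Update(&ctx, name, strlen(name));
        SHA1_Final(digest, &ctx);

        memcpy(out, digest, 16);
        out[6] = (out[6] & 0x0f) | 0x50;  /* set UUID version 5 */
        out[8] = (out[8] & 0x3f) | 0x80;  /* set RFC 4122 variant */
    }

The bucket (top 2 bytes of the GFID) would still be forced to the
parent's bucket as described above; that detail is omitted here.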
Re: [Gluster-devel] Contributing to Gluster
Maybe Shyam and Sakshi (in cc) can be helpful on this topic. They've
been involved in the implementation of lookup-optimize.

~kaushal

On Tue, Feb 2, 2016 at 5:10 PM, Willy Soesanto wrote:
> Hi Gluster-Devs,
>
> My name is Willy. I am a final-year undergraduate student from
> Bandung Institute of Technology. My final-year project is about
> Gluster. After researching Gluster for a while, I would like to take
> up the task of working on lookup self-heal
> (https://public.pad.fsfe.org/p/dht_lookup_optimize). Are there any
> steps I should follow beforehand?
>
> Thanks,
>
> Willy
Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
> Problem is with workloads which know the files that need to be read
> without readdir, like hyperlinks (webserver), swift objects etc.
> These are the two I know of which will have this problem, which
> can't be improved because we don't have metadata and data
> co-located. I have been trying to think of a solution for the past
> few days. Nothing good is coming up :-/

In those cases, caching (at the MDS) would certainly help a lot. Some
variation of the compounding infrastructure under development for
Samba etc. might also apply, since this really is a compound
operation.
[Gluster-devel] REMINDER: Weekly Gluster Community meeting starts in 1 hour
Hi all,

The weekly Gluster community meeting is starting in 1 hour at 12:00
UTC. The current agenda for the meeting is below. Add any further
topics to the agenda at
https://public.pad.fsfe.org/p/gluster-community-meetings

Meeting details:
- location: #gluster-meeting on Freenode IRC
- date: every Wednesday
- time: 8:00 EDT, 12:00 UTC, 13:00 CET, 17:30 IST
  (in your terminal, run: date -d "12:00 UTC")

Current Agenda:
* Roll Call
* AIs from last meeting
* GlusterFS 3.7
* GlusterFS 3.6
* GlusterFS 3.5
* GlusterFS 3.8
* GlusterFS 4.0
* Open Floor

See you there,
Niels
Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
The file data would be located based on its GFID, so before the
*first* lookup/stat for a file, there is no way to know its GFID.

NOTE: Instead of a name hash, the GFID hash is used, to get immunity
against renames and the like, as a name hash could change the location
information for the file (among other reasons).

Another manner of achieving the same when the GFID of the file is
known (from a readdir) is to wind the lookup and the read of the size
to the respective MDS and DS, where the lookup would be responded to
once the MDS responds, and the DS response is cached for the
subsequent open+read case. So on the wire we would have a fan-out of 2
FOPs, but still satisfy the quick-read requirements.

Tar kind of workload doesn't have a problem because we know the GFID
after readdirp.

I would assume the above resolves the problem posted; are there cases
where we do not know the GFID of the file, i.e. no readdir performed
and the client knows the file name that it wants to operate on? Do we
have traces of the webserver workload to see if it generates names on
the fly or does a readdir prior to that?

Problem is with workloads which know the files that need to be read
without readdir, like hyperlinks (webserver), swift objects etc. These
are the two I know of which will have this problem, which can't be
improved because we don't have metadata and data co-located. I have
been trying to think of a solution for the past few days. Nothing good
is coming up :-/

Pranith
Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
On 02/03/2016 09:20 AM, Shyam wrote:
On 02/02/2016 06:22 PM, Jeff Darcy wrote:

Background: The quick-read and open-behind xlators were developed to
help small-file-workload reads (apache webserver, tar, etc.) get the
data of the file in the lookup FOP itself. What happens is, when a
lookup FOP is executed, GF_CONTENT_KEY is added in xdata with a
max-length, and the posix xlator reads the file and fills the data
into the xdata response if this key is present, as long as the file
size is less than the max-length given in the xdata. So when we do a
tar of something like a kernel tree with small files, if we look at
the profile of the bricks, all we see are lookups; OPEN + READ FOPs
will not be sent at all over the network. With dht2, because the data
is present on a different cluster, we can't get the data in lookup.
Shyam was telling me that opens are also sent to the metadata cluster.
That will take perf in this use case back to where it was before
introducing these two features, i.e. 1/3 of current perf (lookup vs.
lookup+open+read).

This is interesting, thanks for the heads-up. Is "1/3 of current perf"
based on actual measurements? My understanding was that the
translators in question exist to send requests *in parallel* with the
original lookup stream. That means it might be 3x the messages, but it
will only be 1/3 the performance if the network is saturated. Also,
the lookup is not guaranteed to be only one message. It might be as
many as N (the number of bricks), so by the reasoning above the
performance would only drop to N/(N+2). I think the real situation is
a bit more complicated - and less dire - than you suggest.

I suggest that we send some FOP at the time of open to the data
cluster and change quick-read to cache this data on open (if not
already cached); then we can reduce the perf hit to 1/2 of current
perf, i.e. lookup+open.

At first glance, it seems pretty simple to do something like this, and
pretty obvious that we should. The tricky question is: where should we
send that other op, before lookup has told us where the partition
containing that file is? If there's some reasonable guess we can make,
then sending an open+read in parallel with the lookup will be helpful.
If not, then it will probably be a waste of time and network
resources. Shyam, is enough of this information being cached *on the
clients* to make this effective?

The file data would be located based on its GFID, so before the
*first* lookup/stat for a file, there is no way to know its GFID.

NOTE: Instead of a name hash, the GFID hash is used, to get immunity
against renames and the like, as a name hash could change the location
information for the file (among other reasons).

Another manner of achieving the same when the GFID of the file is
known (from a readdir) is to wind the lookup and the read of the size
to the respective MDS and DS, where the lookup would be responded to
once the MDS responds, and the DS response is cached for the
subsequent open+read case. So on the wire we would have a fan-out of 2
FOPs, but still satisfy the quick-read requirements.

I would assume the above resolves the problem posted; are there cases
where we do not know the GFID of the file, i.e. no readdir performed
and the client knows the file name that it wants to operate on? Do we
have traces of the webserver workload to see if it generates names on
the fly or does a readdir prior to that?

The open+read can be done as a single FOP:
- open for a read-only case can do access checking on the client to
  allow the FOP to proceed to the DS without hitting the MDS for an
  open token.

The client-side cache is important from this and other such
perspectives. It should also leverage the upcall infra to keep the
cache loosely coherent.

One thing to note here would be: for the client to do a lookup (where
the file name should be known beforehand), either a readdir/(p) has to
have happened, or the client knows the name already (say,
application-generated names). For the former (readdir case), there is
enough information on the client to not need a lookup, but rather just
do the open+read on the DS. For the latter, the first lookup cannot be
avoided, degrading this to lookup+(open+read). A schematic of this
fan-out is sketched below.

Some further tricks can be done to do readdir prefetching on such
workloads, as the MDS runs on a DB (eventually), piggybacking more
entries than requested on a lookup. I would possibly leave that for
later, based on performance numbers in the small-file area.

Shyam
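As referenced above, a schematic of the 2-FOP fan-out: the lookup is
wound to the MDS and a size-capped read to the DS in parallel, the
application is answered from the MDS reply, and the DS bytes are kept
only as cache for the open+read that follows. pthreads stand in for
the client translator's async wind/unwind, and every name here is
invented:

    #include <pthread.h>
    #include <stdio.h>

    struct fanout {
        char gfid[37];     /* GFID known from readdirp or a hint */
        char data[65536];  /* DS response cached for open+read */
        int  data_len;
        int  lookup_ok;
    };

    static void *wind_lookup_mds(void *arg)
    {
        struct fanout *f = arg;
        /* wind LOOKUP to the MDS; the application is answered as
         * soon as this returns */
        f->lookup_ok = 1;  /* pretend the MDS replied OK */
        return NULL;
    }

    static void *wind_read_ds(void *arg)
    {
        struct fanout *f = arg;
        /* wind a size-capped READ to the DS located by GFID hash;
         * the bytes are only a cache, never the lookup answer */
        f->data_len = snprintf(f->data, sizeof(f->data), "bytes");
        return NULL;
    }

    int lookup_with_read_ahead(const char *gfid, struct fanout *f)
    {
        pthread_t mds, ds;

        snprintf(f->gfid, sizeof(f->gfid), "%s", gfid);
        pthread_create(&mds, NULL, wind_lookup_mds, f);
        pthread_create(&ds, NULL, wind_read_ds, f);
        pthread_join(mds, NULL);  /* lookup answered on MDS reply */
        pthread_join(ds, NULL);   /* in the real flow the reply would
                                     not wait for the DS; joined here
                                     only to keep the sketch simple */
        return f->lookup_ok ? 0 : -1;
    }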