Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
Pranith Kumar Karampuri writes:
> On 02/03/2016 07:54 PM, Jeff Darcy wrote:
>>> Problem is with workloads which know the files that need to be read without readdir, like hyperlinks (webserver), swift objects etc. These are two I know of which will have this problem, which can't be improved because we don't have metadata, data co-located. I have been trying to think of a solution for past few days. Nothing good is coming up :-/
>> In those cases, caching (at the MDS) would certainly help a lot. Some variation of the compounding infrastructure under development for Samba etc. might also apply, since this really is a compound operation.
> Even with compound fops it will still require two sequential network operations from dht2: one to MDC and one to DC. So I don't think it helps.

You can do better. You control the MDC. The MDC goes ahead and forwards the request, under the client's GUID. That'll cut 1/2 an RTT.

Cheers,
-Ira

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
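The saving Ira describes can be made concrete with a little latency arithmetic. This is an illustration only: the function names and the assumption that every one-way hop between any two nodes costs half a round trip are made up for the sketch, not taken from Gluster.

```python
# Compare the two message flows: the client doing two dependent round
# trips, versus the MDC forwarding the read to the DC under the
# client's GUID, with the DC replying straight to the client.

def sequential_rtts():
    # dht2 as described: lookup to the MDC, wait for the reply,
    # then a separate read round trip to the DC.
    return 1.0 + 1.0          # client<->MDC, then client<->DC

def forwarded_rtts():
    # MDC forwards the request on the client's behalf:
    # client->MDC, MDC->DC, DC->client, each a one-way hop.
    return 0.5 + 0.5 + 0.5

print(sequential_rtts())  # 2.0
print(forwarded_rtts())   # 1.5, i.e. half an RTT saved
```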
Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
> Even with compound fops it will still require two sequential network operations from dht2: one to MDC and one to DC. So I don't think it helps.

There are still two hops, but making it a compound op keeps the server-to-server communication in the compounding translator (which should already be able to handle that case) instead of having to put it in the MDS. It's not so much a matter of improving performance - turning a client/server hop into server/server or server/cache should do that - but of implementing that improvement in a clean way.
Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
On Thu, Feb 04, 2016 at 11:34:04AM +0530, Shyam wrote:
> On 02/04/2016 09:38 AM, Vijay Bellur wrote:
> >> On 02/03/2016 11:34 AM, Venky Shankar wrote:
> >>> On Wed, Feb 03, 2016 at 09:24:06AM -0500, Jeff Darcy wrote:
> >>>> > Problem is with workloads which know the files that need to be read without readdir, like hyperlinks (webserver), swift objects etc. These are two I know of which will have this problem, which can't be improved because we don't have metadata, data co-located. I have been trying to think of a solution for past few days. Nothing good is coming up :-/
> >>>> In those cases, caching (at the MDS) would certainly help a lot. Some variation of the compounding infrastructure under development for Samba etc. might also apply, since this really is a compound operation.
>
> Compounding in this case can help, but still without the cache, the read has to go to the DS, and on such a compounding, the MDS would reach out to the DS for the information rather than the client. Another possibility based on what we decide as the cache mechanism.
>
> >> When a client is done modifying a file, MDS would refresh its size, mtime attributes by fetching it from the DS. As part of this refresh, DS could additionally send back the content if the file size falls in range, with MDS persisting it, sending it back for subsequent lookup calls as it does now. The content (on MDS) can be zapped once the file size crosses the defined limit.
>
> Venky, when you say persisting, I assume on disk, is that right?

Definitely on-disk.

> If so, then the MDS storage size requirements would increase (based on the amount of file data that needs to be stored). As of now it is only inodes, and as we move to a db, a record. In this case we may have *fatter* MDS partitions. Any comments/thoughts on that?

The MDS storage requirement does go up by a considerable amount, since there would normally be far fewer MDS nodes than DS nodes. So, yes, the MDS does become fat, but it's important to have data inline with its inode to boost small file performance (at least when the file is not under modification).

> As with memory I would assume some form of eviction of data from MDS, to control the space utilization here as a possibility.

Maybe. Using TTL in a key-value store might be an option. But, IIRC, TTLs can be set for an entire record and not for parts of a record. We'd need to think more about this anyway.

> > I like the idea. However the memory implications of maintaining content in MDS is something to watch out for. quick-read is interested in files of size 64k by default and with a reasonable number of files in that range, we might end up consuming significant memory with this scheme.
>
> Vijay, I think what Venky states is to stash the file on the local storage and not in memory. If it was in memory then brick process restarts would nuke the cache, and either we need mechanisms to rebuild/warm the cache or just start caching afresh.
>
> If we were caching in memory, then yes the concern is valid, and one possibility is some form of LRU for the same, to keep memory consumption in check.

As stated earlier, it's a persistent cache which may or may not have a layer of in-memory cache itself. I would leave all that to the key-value DB (when we use one) as it most probably would be doing that.

> Overall I would steer away from memory for this use case, and use the disk, as we do not know which files to cache (well in either case, but disk offers us more space to possibly punt on that issue).
>
> For files where the cache is missing and the file is small enough, either perform async read from the client (gaining some overlap time with the app) or just let it be, as we would get the open/read anyway, but would slow things down.

Yes. Async reads for files which have the inline data missing with the inode and which satisfy the size range requirement.

> > -Vijay
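The async-read fallback agreed on above can be sketched as follows. This is a minimal illustration with made-up helper names, not Gluster code: when lookup finds no inline data but the size is within quick-read's range, the client kicks off the DS read in the background so it overlaps with whatever the application does before open()+read().

```python
# Sketch: client-side async prefetch when the MDS inline cache misses.
from concurrent.futures import ThreadPoolExecutor

INLINE_LIMIT = 64 * 1024          # quick-read's default interest: <= 64k
_pool = ThreadPoolExecutor(max_workers=4)

def ds_read(gfid):
    # stand-in for the real wire call to the data server
    return b"file-contents-for-" + gfid.encode()

def lookup_with_prefetch(attrs, gfid):
    """Return attrs plus a data future when prefetching is worthwhile."""
    if attrs.get("inline_data") is not None:
        return attrs, None                          # inline cache hit
    if attrs["size"] <= INLINE_LIMIT:
        # fire the read now; it overlaps with the app's work
        return attrs, _pool.submit(ds_read, gfid)
    return attrs, None                              # large file: read on demand
```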
Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
On 02/03/2016 07:54 PM, Jeff Darcy wrote:
>> Problem is with workloads which know the files that need to be read without readdir, like hyperlinks (webserver), swift objects etc. These are two I know of which will have this problem, which can't be improved because we don't have metadata, data co-located. I have been trying to think of a solution for past few days. Nothing good is coming up :-/
> In those cases, caching (at the MDS) would certainly help a lot. Some variation of the compounding infrastructure under development for Samba etc. might also apply, since this really is a compound operation.

Even with compound fops it will still require two sequential network operations from dht2: one to MDC and one to DC. So I don't think it helps.

Pranith
Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
On 02/04/2016 09:38 AM, Vijay Bellur wrote:
>> On 02/03/2016 11:34 AM, Venky Shankar wrote:
>>> On Wed, Feb 03, 2016 at 09:24:06AM -0500, Jeff Darcy wrote:
>>>> Problem is with workloads which know the files that need to be read without readdir, like hyperlinks (webserver), swift objects etc. These are two I know of which will have this problem, which can't be improved because we don't have metadata, data co-located. I have been trying to think of a solution for past few days. Nothing good is coming up :-/
>>> In those cases, caching (at the MDS) would certainly help a lot. Some variation of the compounding infrastructure under development for Samba etc. might also apply, since this really is a compound operation.

Compounding in this case can help, but still without the cache, the read has to go to the DS, and on such a compounding, the MDS would reach out to the DS for the information rather than the client. Another possibility based on what we decide as the cache mechanism.

>> When a client is done modifying a file, MDS would refresh its size, mtime attributes by fetching it from the DS. As part of this refresh, DS could additionally send back the content if the file size falls in range, with MDS persisting it, sending it back for subsequent lookup calls as it does now. The content (on MDS) can be zapped once the file size crosses the defined limit.

Venky, when you say persisting, I assume on disk, is that right?

If so, then the MDS storage size requirements would increase (based on the amount of file data that needs to be stored). As of now it is only inodes, and as we move to a db, a record. In this case we may have *fatter* MDS partitions. Any comments/thoughts on that?

As with memory I would assume some form of eviction of data from MDS, to control the space utilization here as a possibility.

> I like the idea. However the memory implications of maintaining content in MDS is something to watch out for. quick-read is interested in files of size 64k by default and with a reasonable number of files in that range, we might end up consuming significant memory with this scheme.

Vijay, I think what Venky states is to stash the file on the local storage and not in memory. If it was in memory then brick process restarts would nuke the cache, and either we need mechanisms to rebuild/warm the cache or just start caching afresh.

If we were caching in memory, then yes the concern is valid, and one possibility is some form of LRU for the same, to keep memory consumption in check.

Overall I would steer away from memory for this use case, and use the disk, as we do not know which files to cache (well in either case, but disk offers us more space to possibly punt on that issue).

For files where the cache is missing and the file is small enough, either perform async read from the client (gaining some overlap time with the app) or just let it be, as we would get the open/read anyway, but would slow things down.

> -Vijay
Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
On 02/03/2016 11:34 AM, Venky Shankar wrote:
> On Wed, Feb 03, 2016 at 09:24:06AM -0500, Jeff Darcy wrote:
>>> Problem is with workloads which know the files that need to be read without readdir, like hyperlinks (webserver), swift objects etc. These are two I know of which will have this problem, which can't be improved because we don't have metadata, data co-located. I have been trying to think of a solution for past few days. Nothing good is coming up :-/
>> In those cases, caching (at the MDS) would certainly help a lot. Some variation of the compounding infrastructure under development for Samba etc. might also apply, since this really is a compound operation.
>
> When a client is done modifying a file, MDS would refresh its size, mtime attributes by fetching it from the DS. As part of this refresh, DS could additionally send back the content if the file size falls in range, with MDS persisting it, sending it back for subsequent lookup calls as it does now. The content (on MDS) can be zapped once the file size crosses the defined limit.

I like the idea. However the memory implications of maintaining content in MDS is something to watch out for. quick-read is interested in files of size 64k by default and with a reasonable number of files in that range, we might end up consuming significant memory with this scheme.

-Vijay
Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
On Wed, Feb 03, 2016 at 09:24:06AM -0500, Jeff Darcy wrote:
> > Problem is with workloads which know the files that need to be read without readdir, like hyperlinks (webserver), swift objects etc. These are two I know of which will have this problem, which can't be improved because we don't have metadata, data co-located. I have been trying to think of a solution for past few days. Nothing good is coming up :-/
>
> In those cases, caching (at the MDS) would certainly help a lot. Some variation of the compounding infrastructure under development for Samba etc. might also apply, since this really is a compound operation.

When a client is done modifying a file, MDS would refresh its size, mtime attributes by fetching it from the DS. As part of this refresh, DS could additionally send back the content if the file size falls in range, with MDS persisting it, sending it back for subsequent lookup calls as it does now. The content (on MDS) can be zapped once the file size crosses the defined limit.

But, when there are open file descriptors on an inode (O_RDWR || O_WRONLY on a file), the size cannot be trusted (as MDS only knows about the updated size after the last close), which would be the degraded case.

Thanks,
Venky
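A minimal sketch of that refresh/zap cycle, including the degraded open-descriptor case. All names and structures here are illustrative assumptions, not the actual MDS code:

```python
INLINE_LIMIT = 64 * 1024  # matches quick-read's default 64k interest

class Inode:
    def __init__(self):
        self.size = 0
        self.mtime = 0
        self.open_writers = 0    # O_RDWR / O_WRONLY descriptors
        self.inline_data = None  # content persisted alongside the inode

def open_for_write(inode):
    inode.open_writers += 1

def close_and_refresh(inode, size, mtime, content):
    """Last-close refresh: DS reports size/mtime, plus content if in range."""
    inode.open_writers -= 1
    inode.size, inode.mtime = size, mtime
    if size <= INLINE_LIMIT:
        inode.inline_data = content  # persist inline with the inode
    else:
        inode.inline_data = None     # zap once the size crosses the limit

def lookup(inode):
    """Serve inline data only when the cached size can be trusted."""
    if inode.open_writers > 0:
        return {"size": inode.size, "data": None}  # degraded case
    return {"size": inode.size, "data": inode.inline_data}
```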
Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
On 02/03/2016 07:54 PM, Jeff Darcy wrote:
>> Problem is with workloads which know the files that need to be read without readdir, like hyperlinks (webserver), swift objects etc. These are two I know of which will have this problem, which can't be improved because we don't have metadata, data co-located. I have been trying to think of a solution for past few days. Nothing good is coming up :-/
> In those cases, caching (at the MDS) would certainly help a lot. Some variation of the compounding infrastructure under development for Samba etc. might also apply, since this really is a compound operation.

The above is certainly an option; I need to process it a bit more to respond sanely.

Another one is to generate the GFID for a file with parGFID+basename as input (which was something Pranith brought up a few mails back in this chain). There was concern that we will have GFID clashes, but further reasoning suggests that it would not. An example follows.

Good cases:
- /D1/File is created, with the top 2 bytes of the file's GFID as the bucket (same as D1's bucket), and the rest of the GFID as some UUID generation from pGFID (GFID of D1) + base name
- When this file is looked up by name, its GFID can be generated at the client side as a hint, and the same fan out of lookup to MDS and read to DS can be initiated
  * Validity of the READ data is good only when the lookup agrees on the same GFID for the file

Bad cases:
- On a rename, the GFID of the file does not change, so if /D1/File was renamed to /D2/File1, then a subsequent lookup could fail to prefetch the read, as the GFID hint generated is now based on the GFID of D2 and the new name File1
- If, post such a rename, /D1/File is created again, the GFID generated/requested by the client for this file would clash with the already generated GFID, hence the DHT server would decide to return a new GFID that has no relation to the one generated by the hint, again resulting in the hint failing

So with the above scheme, as long as files are not renamed, the hint serves its purpose to prefetch even with just the name and parGFID.

One gotcha is that I see a pattern with applications that create a tmp file and then rename it to the real file name, sort of a swap file that is renamed to the real file as needed. For all such applications the hints above would fail. I believe Swift also uses a similar trick on the FS, renaming an object once it is considered fully written to. Another case would be a compile workload.

So overall the above as a scheme could work to alleviate the problem somewhat, but may cause harm in other cases (where the GFID hint is incorrect and so we end up sending a read without reason).

The above could easily be prototyped with DHT2 to see its benefits, so we will try that out at some point in the future.

Shyam
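The hint generation above maps naturally onto RFC 4122 name-based UUIDs. A small sketch of both the good and bad cases; the helper name and the direct use of uuid5 are assumptions for illustration (DHT2 would carve out the bucket bits separately):

```python
import uuid

def gfid_hint(pgfid, basename):
    # Name-based (SHA-1) UUID per RFC 4122 section 4.3: deterministic,
    # so client and server derive the same value with no round trip.
    return uuid.uuid5(pgfid, basename)

d1 = uuid.uuid4()  # GFID of directory /D1 (illustrative values)
d2 = uuid.uuid4()  # GFID of directory /D2

gfid_of_file = gfid_hint(d1, "File")  # issued when /D1/File is created

# Good case: a lookup of /D1/File regenerates the matching hint client-side.
assert gfid_hint(d1, "File") == gfid_of_file

# Bad case: after a rename to /D2/File1 the hint no longer matches the
# file's real (unchanged) GFID, so the prefetched read is wasted.
assert gfid_hint(d2, "File1") != gfid_of_file

# Worse: recreating /D1/File regenerates the very GFID already issued to
# the renamed file, so the server must hand back an unrelated GFID and
# the client's hint fails again.
assert gfid_hint(d1, "File") == gfid_of_file
```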
Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
> Problem is with workloads which know the files that need to be read without readdir, like hyperlinks (webserver), swift objects etc. These are two I know of which will have this problem, which can't be improved because we don't have metadata, data co-located. I have been trying to think of a solution for past few days. Nothing good is coming up :-/

In those cases, caching (at the MDS) would certainly help a lot. Some variation of the compounding infrastructure under development for Samba etc. might also apply, since this really is a compound operation.
Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
> The file data would be located based on its GFID, so before the *first* lookup/stat for a file, there is no way to know its GFID. NOTE: Instead of a name hash the GFID hash is used, to get immunity against renames and the like, as a name hash could change the location information for the file (among other reasons).
>
> Another manner of achieving the same when the GFID of the file is known (from a readdir) is to wind the lookup and read of size to the respective MDS and DS, where the lookup would be responded to once the MDS responds, and the DS response is cached for the subsequent open+read case. So on the wire we would have a fan out of 2 FOPs, but still satisfy the quick read requirements.

Tar kind of workload doesn't have a problem because we know the GFID after readdirp.

> I would assume the above resolves the problem posted, are there cases where we do not know the GFID of the file? i.e. no readdir performed and the client knows the file name that it wants to operate on? Do we have traces of the webserver workload to see if it generates names on the fly or does a readdir prior to that?

Problem is with workloads which know the files that need to be read without readdir, like hyperlinks (webserver), swift objects etc. These are two I know of which will have this problem, which can't be improved because we don't have metadata, data co-located. I have been trying to think of a solution for past few days. Nothing good is coming up :-/

Pranith
Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
On 02/03/2016 09:20 AM, Shyam wrote:

On 02/02/2016 06:22 PM, Jeff Darcy wrote:

Background: Quick-read + open-behind xlators are developed to help in small file workload reads like apache webserver, tar etc to get the data of the file in the lookup FOP itself. What happens is, when a lookup FOP is executed, GF_CONTENT_KEY is added in xdata with max-length, and the posix xlator reads the file and fills the data in the xdata response if this key is present, as long as the file-size is less than the max-length given in the xdata. So when we do a tar of something like a kernel tree with small files, if we look at the profile of the bricks all we see are lookups. OPEN + READ fops will not be sent at all over the network.

With dht2, because data is present on a different cluster, we can't get the data in lookup. Shyam was telling me that opens are also sent to the metadata cluster. That will make perf in this usecase back to where it was before introducing these two features, i.e. 1/3 of current perf (lookup vs lookup+open+read).

This is interesting, thanks for the heads up.

Is "1/3 of current perf" based on actual measurements? My understanding was that the translators in question exist to send requests *in parallel* with the original lookup stream. That means it might be 3x the messages, but it will only be 1/3 the performance if the network is saturated. Also, the lookup is not guaranteed to be only one message. It might be as many as N (the number of bricks), so by the reasoning above the performance would only drop to N/N+2. I think the real situation is a bit more complicated - and less dire - than you suggest.

I suggest that we send some fop at the time of open to the data cluster and change quick-read to cache this data on open (if not already); then we can reduce the perf hit to 1/2 of current perf, i.e. lookup+open.

At first glance, it seems pretty simple to do something like this, and pretty obvious that we should. The tricky question is: where should we send that other op, before lookup has told us where the partition containing that file is? If there's some reasonable guess we can make, then sending an open+read in parallel with the lookup will be helpful. If not, then it will probably be a waste of time and network resources. Shyam, is enough of this information being cached *on the clients* to make this effective?

The file data would be located based on its GFID, so before the *first* lookup/stat for a file, there is no way to know its GFID. NOTE: Instead of a name hash the GFID hash is used, to get immunity against renames and the like, as a name hash could change the location information for the file (among other reasons).

Another manner of achieving the same when the GFID of the file is known (from a readdir) is to wind the lookup and read of size to the respective MDS and DS, where the lookup would be responded to once the MDS responds, and the DS response is cached for the subsequent open+read case. So on the wire we would have a fan out of 2 FOPs, but still satisfy the quick read requirements.

I would assume the above resolves the problem posted, are there cases where we do not know the GFID of the file? i.e. no readdir performed and the client knows the file name that it wants to operate on? Do we have traces of the webserver workload to see if it generates names on the fly or does a readdir prior to that?

The open+read can be done as a single FOP,
- open for a read only case can do access checking on the client to allow the FOP to proceed to the DS without hitting the MDS for an open token

The client side cache is important from this and other such perspectives. It should also leverage the upcall infra to keep the cache loosely coherent.

One thing to note here would be, for the client to do a lookup (where the file name should be known beforehand), either a readdir/(p) has to have happened, or the client knows the name already (say application generated names). For the former (readdir case), there is enough information on the client to not need a lookup, but rather just do the open+read on the DS. For the latter the first lookup cannot be avoided, degrading this to a lookup+(open+read).

Some further tricks can be done to do readdir prefetching on such workloads, as the MDS runs on a DB (eventually), piggybacking more entries than requested on a lookup. I would possibly leave that for later, based on performance numbers in the small file area.

Shyam
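The 2-FOP fan-out for the readdirp case can be sketched like this. The helper names are made up for illustration; the real FOPs are Gluster wire calls, but the shape is the same: wind both in parallel, answer the lookup as soon as the MDS replies, and stash the DS data for the later open+read.

```python
from concurrent.futures import ThreadPoolExecutor

def mds_lookup(gfid):
    # stand-in for the lookup FOP wound to the metadata server
    return {"gfid": gfid, "size": 11}

def ds_read(gfid):
    # stand-in for the size-limited read FOP wound to the data server
    return b"hello world"

read_cache = {}  # fed to quick-read for the subsequent open+read

def lookup_fanout(gfid):
    # Fan out both FOPs; the lookup reply is gated only on the MDS.
    with ThreadPoolExecutor(max_workers=2) as pool:
        lk = pool.submit(mds_lookup, gfid)
        rd = pool.submit(ds_read, gfid)
        attrs = lk.result()             # respond once the MDS answers
        read_cache[gfid] = rd.result()  # cache the DS data for open+read
    return attrs
```

The cached data is only trustworthy if the lookup reply agrees on the same GFID, which is the validity condition Shyam states above.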
Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
On 02/03/2016 11:49 AM, Pranith Kumar Karampuri wrote:

On 02/03/2016 09:20 AM, Shyam wrote:

On 02/02/2016 06:22 PM, Jeff Darcy wrote:

Background: Quick-read + open-behind xlators are developed to help in small file workload reads like apache webserver, tar etc to get the data of the file in the lookup FOP itself. What happens is, when a lookup FOP is executed, GF_CONTENT_KEY is added in xdata with max-length, and the posix xlator reads the file and fills the data in the xdata response if this key is present, as long as the file-size is less than the max-length given in the xdata. So when we do a tar of something like a kernel tree with small files, if we look at the profile of the bricks all we see are lookups. OPEN + READ fops will not be sent at all over the network.

With dht2, because data is present on a different cluster, we can't get the data in lookup. Shyam was telling me that opens are also sent to the metadata cluster. That will make perf in this usecase back to where it was before introducing these two features, i.e. 1/3 of current perf (lookup vs lookup+open+read).

This is interesting, thanks for the heads up.

Is "1/3 of current perf" based on actual measurements? My understanding was that the translators in question exist to send requests *in parallel* with the original lookup stream. That means it might be 3x the messages, but it will only be 1/3 the performance if the network is saturated. Also, the lookup is not guaranteed to be only one message. It might be as many as N (the number of bricks), so by the reasoning above the performance would only drop to N/N+2. I think the real situation is a bit more complicated - and less dire - than you suggest.

I suggest that we send some fop at the time of open to the data cluster and change quick-read to cache this data on open (if not already); then we can reduce the perf hit to 1/2 of current perf, i.e. lookup+open.

At first glance, it seems pretty simple to do something like this, and pretty obvious that we should. The tricky question is: where should we send that other op, before lookup has told us where the partition containing that file is? If there's some reasonable guess we can make, then sending an open+read in parallel with the lookup will be helpful. If not, then it will probably be a waste of time and network resources. Shyam, is enough of this information being cached *on the clients* to make this effective?

The file data would be located based on its GFID, so before the *first* lookup/stat for a file, there is no way to know its GFID. NOTE: Instead of a name hash the GFID hash is used, to get immunity against renames and the like, as a name hash could change the location information for the file (among other reasons).

The open+read can be done as a single FOP,
- open for a read only case can do access checking on the client to allow the FOP to proceed to the DS without hitting the MDS for an open token

The client side cache is important from this and other such perspectives. It should also leverage the upcall infra to keep the cache loosely coherent.

One thing to note here would be, for the client to do a lookup (where the file name should be known beforehand), either a readdir/(p) has to have happened, or the client knows the name already (say application generated names). For the former (readdir case), there is enough information on the client to not need a lookup, but rather just do the open+read on the DS. For the latter the first lookup cannot be avoided, degrading this to a lookup+(open+read).

Some further tricks can be done to do readdir prefetching on such workloads, as the MDS runs on a DB (eventually), piggybacking more entries than requested on a lookup. I would possibly leave that for later, based on performance numbers in the small file area.

Shyam

I strongly suggest that we don't postpone this to later as I think this is a solved problem. http://www.ietf.org/rfc/rfc4122.txt section 4.3 may be of help here, i.e. create a UUID based on (string, namespace). So we can use pGFID as the namespace and the filename as the string. I understand that we will get into 2 hops if the file is renamed, but it is the best we can do right now. We can take help from the crypto team in Red Hat to make sure we do the right thing. If we get this implementation in dht2 after the code is released, all the files created with the old gfid-generation will work with half the possible perf.

Gah! ignore, it will lead to gfid collisions :-/

Pranith
Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
On 02/03/2016 09:20 AM, Shyam wrote: On 02/02/2016 06:22 PM, Jeff Darcy wrote: Background: Quick-read + open-behind xlators are developed to help in small file workload reads like apache webserver, tar etc to get the data of the file in lookup FOP itself. What happens is, when a lookup FOP is executed, GF_CONTENT_KEY is added in xdata with max-length and posix xlator reads the file and fills the data in xdata response if this key is present as long as the file-size is less than max-length given in the xdata. So when we do a tar of something like a kernel tree with small files, if we look at profile of the bricks all we see are lookups. OPEN + READ fops will not be sent at all over the network. With dht2 because data is present on a different cluster. We can't get the data in lookup. Shyam was telling me that opens are also sent to metadata cluster. That will make perf in this usecase back to where it was before introducing these two features i.e. 1/3 of current perf (Lookup vs lookup+open+read) This is interesting thanks for the heads up. Is "1/3 of current perf" based on actual measurements? My understanding was that the translators in question exist to send requests *in parallel* with the original lookup stream. That means it might be 3x the messages, but it will only be 1/3 the performance if the network is saturated. Also, the lookup is not guaranteed to be only one message. It might be as many as N (the number of bricks), so by the reasoning above the performance would only drop to N/N+2. I think the real situation is a bit more complicated - and less dire - than you suggest. I suggest that we send some fop at the time of open to data cluster and change quick-read to cache this data on open (if not already) then we can reduce the perf hit to 1/2 of current perf, i.e. lookup+open. At first glance, it seems pretty simple to do something like this, and pretty obvious that we should. 
The tricky question is: where should we send that other op, before lookup has told us where the partition containing that file is? If there's some reasonable guess we can make, the sending an open+read in parallel with the lookup will be helpful. If not, then it will probably be a waste of time and network resources. Shyam, is enough of this information being cached *on the clients* to make this effective? The file data would be located based on its GFID, so before the *first* lookup/stat for a file, there is no way to know it's GFID. NOTE: Instead of a name hash the GFID hash is used, to get immunity against renames and the like, as a name hash could change the location information for the file (among other reasons). The open+read can be done as a single FOP, - open for a read only case can do access checking on the client to allow the FOP to proceed to the DS without hitting the MDS for an open token The client side cache is important from this and other such perspectives. It should also leverage upcall infra to keep the cache loosely coherent. One thing to note here would be, for the client to do a lookup (where the file name should be known before hand), either a readdir/(p) has to have happened, or the client knows the name already (say application generated names). For the former (readdir case), there is enough information on the client to not need a lookup, but rather just do the open+read on the DS. For the latter the first lookup cannot be avoided, degrading this to a lookup+(open+read). Some further tricks can be done to do readdir prefetching on such workloads, as the MDS runs on a DB (eventually), piggybacking more entries than requested on a lookup. I would possibly leave that for later, based on performance numbers in the small file area. I strongly suggest that we don't postpone this to later as I think this is a solved problem. http://www.ietf.org/rfc/rfc4122.txt section 4.3 may be of help here. i.e. create UUID based on string, namespace. 
So we can use the pgfid as the namespace and the filename as the string.
I understand that we will get into 2 hops if the file is renamed, but it
is the best we can do right now. We can take help from the crypto team
at Red Hat to make sure we do the right thing. If we get this
implementation into dht2 after the code is released, all the files
created with the old gfid generation will work with half the possible
perf.

Pranith

> Shyam
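[Editor's note: the RFC 4122 section 4.3 scheme Pranith points at is available directly in Python's standard library as `uuid.uuid5` (name-based, SHA-1). A minimal sketch, using a made-up pgfid value for illustration:]

```python
import uuid

# Hypothetical parent-directory GFID; Gluster GFIDs are UUIDs, so any
# real pgfid would work the same way.
pgfid = uuid.UUID("9a2b1c3d-4e5f-4a6b-8c7d-0e1f2a3b4c5d")

# RFC 4122 section 4.3: name-based UUID (version 5, SHA-1) derived from
# (namespace, name). It is deterministic, so every client derives the
# same GFID from the same (pgfid, filename) pair without a lookup.
gfid = uuid.uuid5(pgfid, "index.html")
same = uuid.uuid5(pgfid, "index.html")
other = uuid.uuid5(pgfid, "style.css")

assert gfid == same    # stable across clients and time
assert gfid != other   # different names map to different GFIDs
print(gfid.version)    # 5
```

As Pranith notes, a rename breaks the derivation: the name-derived GFID no longer matches the file's actual location, so renamed files degrade to the two-hop path.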
Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
----- Original Message -----
> From: "Pranith Kumar Karampuri"
> To: "Jeff Darcy"
> Cc: "Gluster Devel"
> Sent: Tuesday, February 2, 2016 7:52:25 PM
> Subject: Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf
> with small files
>
> On 02/02/2016 06:22 PM, Jeff Darcy wrote:
> >> Background: Quick-read + open-behind xlators were developed to help
> >> small-file read workloads (apache webserver, tar, etc.) get the data
> >> of the file in the lookup FOP itself. When a lookup FOP is executed,
> >> GF_CONTENT_KEY is added to xdata with a max-length, and the posix
> >> xlator reads the file and fills the data into the xdata response if
> >> this key is present, as long as the file size is less than the
> >> max-length given in the xdata. So when we do a tar of something like
> >> a kernel tree with small files, if we look at profile of the bricks
> >> all we see are lookups. OPEN + READ fops will not be sent at all over
> >> the network.
> >>
> >> With dht2 because data is present on a different cluster, we can't
> >> get the data in lookup. Shyam was telling me that opens are also sent
> >> to metadata cluster. That will make perf in this usecase back to
> >> where it was before introducing these two features i.e. 1/3 of
> >> current perf (Lookup vs lookup+open+read)
> >
> > Is "1/3 of current perf" based on actual measurements? My
> > understanding was that the translators in question exist to send
> > requests *in parallel* with the original lookup stream. That means it
> > might be 3x the messages, but it will only be 1/3 the performance if
> > the network is saturated. Also, the lookup is not guaranteed to be
> > only one message. It might be as many as N (the number of bricks), so
> > by the reasoning above the performance would only drop to N/(N+2). I
> > think the real situation is a bit more complicated - and less dire -
> > than you suggest.
> As per what I heard, when quick-read (now divided into open-behind and
> quick-read) was introduced, webserver use case users reported 300% to
> 400% perf improvement.

I second that. Even I've heard of similar improvements for webserver use
cases (quick-read was first written with apache as the use case). I
tried looking for any previous data on this, but unfortunately couldn't
find any. Nevertheless, we can do some performance benchmarking
ourselves.

> We should definitely test it once we have enough code to do so. I am
> just giving a heads up.
>
> Having said that, for 'tar' I think we can most probably do a better
> job in dht2, because even after readdirp a nameless lookup comes. If it
> has GF_CONTENT_KEY we should send it to the data cluster directly. For
> the webserver use case I don't have any ideas.
>
> At least on my laptop this is what I saw; on a setup with separate
> client and server machines the situation could be worse. This is a
> distribute volume with one brick.
>
> root@localhost - /mnt/d1
> 19:42:52 :) ⚡ time tar cf a.tgz a
>
> real    0m6.987s
> user    0m0.089s
> sys     0m0.481s
>
> root@localhost - /mnt/d1
> 19:43:22 :) ⚡ cd
>
> root@localhost - ~
> 19:43:25 :) ⚡ umount /mnt/d1
>
> root@localhost - ~
> 19:43:27 :) ⚡ gluster volume set d1 open-behind off
> volume set: success
>
> root@localhost - ~
> 19:43:47 :) ⚡ gluster volume set d1 quick-read off
> volume set: success
>
> root@localhost - ~
> 19:44:03 :( ⚡ gluster volume stop d1
> Stopping volume will make its data inaccessible. Do you want to
> continue?
> (y/n) y
> volume stop: d1: success
>
> root@localhost - ~
> 19:44:09 :) ⚡ gluster volume start d1
> volume start: d1: success
>
> root@localhost - ~
> 19:44:13 :) ⚡ mount -t glusterfs localhost.localdomain:/d1 /mnt/d1
>
> root@localhost - ~
> 19:44:29 :) ⚡ cd /mnt/d1
>
> root@localhost - /mnt/d1
> 19:44:30 :) ⚡ time tar cf b.tgz a
>
> real    0m12.176s
> user    0m0.098s
> sys     0m0.582s
>
> Pranith
>
> >> I suggest that we send some fop at the time of open to data cluster
> >> and change quick-read to cache this data on open (if not already)
> >> then we can reduce the perf hit to 1/2 of current perf, i.e.
> >> lookup+open.
> >
> > At first glance, it seems pretty simple to do something like this,
> > and pretty obvious that we should. The tricky question is: where
> > should we send that other op, before lookup has told us where the
> > partition containing that file is? If there's some reasonable guess
> > we can make, then sending an open+read in parallel with the lookup
> > will be helpful. If not, it will probably be a waste of time and
> > network resources. Shyam, is enough of this information being cached
> > *on the clients* to make this effective?
>
> Pranith
Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
On 02/02/2016 06:22 PM, Jeff Darcy wrote:
>> Background: Quick-read + open-behind xlators were developed to help
>> small-file read workloads (apache webserver, tar, etc.) get the data of
>> the file in the lookup FOP itself. When a lookup FOP is executed,
>> GF_CONTENT_KEY is added to xdata with a max-length, and the posix
>> xlator reads the file and fills the data into the xdata response if
>> this key is present, as long as the file size is less than the
>> max-length given in the xdata. So when we tar something like a kernel
>> tree with small files, if we look at the profile of the bricks, all we
>> see are lookups. OPEN + READ fops will not be sent over the network at
>> all.
>>
>> With dht2, because data is present on a different cluster, we can't get
>> the data in lookup. Shyam was telling me that opens are also sent to
>> the metadata cluster. That will take perf in this use case back to
>> where it was before introducing these two features, i.e. 1/3 of current
>> perf (lookup vs. lookup+open+read)

This is interesting, thanks for the heads up.

> Is "1/3 of current perf" based on actual measurements? My understanding
> was that the translators in question exist to send requests *in
> parallel* with the original lookup stream. That means it might be 3x the
> messages, but it will only be 1/3 the performance if the network is
> saturated. Also, the lookup is not guaranteed to be only one message. It
> might be as many as N (the number of bricks), so by the reasoning above
> the performance would only drop to N/(N+2). I think the real situation
> is a bit more complicated - and less dire - than you suggest.
>
>> I suggest that we send some fop at the time of open to the data cluster
>> and change quick-read to cache this data on open (if not already
>> cached); then we can reduce the perf hit to 1/2 of current perf, i.e.
>> lookup+open.
>
> At first glance, it seems pretty simple to do something like this, and
> pretty obvious that we should.
> The tricky question is: where should we send that other op, before
> lookup has told us where the partition containing that file is? If
> there's some reasonable guess we can make, then sending an open+read in
> parallel with the lookup will be helpful. If not, it will probably be a
> waste of time and network resources. Shyam, is enough of this
> information being cached *on the clients* to make this effective?

The file data would be located based on its GFID, so before the *first*
lookup/stat for a file, there is no way to know its GFID.

NOTE: Instead of a name hash, the GFID hash is used, to gain immunity
against renames and the like, as a name hash could change the location
information for the file (among other reasons).

The open+read can be done as a single FOP:
- open for a read-only case can do access checking on the client to
  allow the FOP to proceed to the DS without hitting the MDS for an open
  token

The client-side cache is important from this and other such
perspectives. It should also leverage the upcall infra to keep the cache
loosely coherent.

One thing to note here: for the client to do a lookup (where the file
name should be known beforehand), either a readdir(p) has to have
happened, or the client knows the name already (say, application
generated names). For the former (readdir case), there is enough
information on the client to not need a lookup, but rather just do the
open+read on the DS. For the latter, the first lookup cannot be avoided,
degrading this to a lookup+(open+read).

Some further tricks can be done to do readdir prefetching in such
workloads, as the MDS runs on a DB (eventually), piggybacking more
entries than requested on a lookup. I would possibly leave that for
later, based on performance numbers in the small-file area.

Shyam
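[Editor's note: the rename-immunity point above can be illustrated with a toy placement function. This is a sketch, not dht2's actual layout math; the subvolume count and hashing are made up for the example.]

```python
import hashlib
import uuid

N_DATA_SUBVOLS = 4  # hypothetical number of DS subvolumes

def subvol_for(key: bytes, n: int = N_DATA_SUBVOLS) -> int:
    """Toy placement: hash the key and map it onto a data subvolume."""
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:4], "big") % n

gfid = uuid.uuid4()  # assigned once, at create time

# Placement by GFID: a rename changes the name, not the GFID, so the
# file's data location is unchanged.
before = subvol_for(gfid.bytes)
after_rename = subvol_for(gfid.bytes)
assert before == after_rename

# Placement by name hash: the same rename would usually pick a different
# subvolume, forcing a data move or a stale location entry.
print(subvol_for(b"/www/a.html"), subvol_for(b"/www/b.html"))
```

The cost of this immunity is exactly the problem discussed in the thread: the GFID is unknown before the first lookup, so nothing can be sent to the DS speculatively.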
Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
On 02/02/2016 06:22 PM, Jeff Darcy wrote:
>> Background: Quick-read + open-behind xlators were developed to help
>> small-file read workloads (apache webserver, tar, etc.) get the data of
>> the file in the lookup FOP itself. When a lookup FOP is executed,
>> GF_CONTENT_KEY is added to xdata with a max-length, and the posix
>> xlator reads the file and fills the data into the xdata response if
>> this key is present, as long as the file size is less than the
>> max-length given in the xdata. So when we tar something like a kernel
>> tree with small files, if we look at the profile of the bricks, all we
>> see are lookups. OPEN + READ fops will not be sent over the network at
>> all.
>>
>> With dht2, because data is present on a different cluster, we can't get
>> the data in lookup. Shyam was telling me that opens are also sent to
>> the metadata cluster. That will take perf in this use case back to
>> where it was before introducing these two features, i.e. 1/3 of current
>> perf (lookup vs. lookup+open+read)
>
> Is "1/3 of current perf" based on actual measurements? My understanding
> was that the translators in question exist to send requests *in
> parallel* with the original lookup stream. That means it might be 3x the
> messages, but it will only be 1/3 the performance if the network is
> saturated. Also, the lookup is not guaranteed to be only one message. It
> might be as many as N (the number of bricks), so by the reasoning above
> the performance would only drop to N/(N+2). I think the real situation
> is a bit more complicated - and less dire - than you suggest.

As per what I heard, when quick-read (now divided into open-behind and
quick-read) was introduced, webserver use case users reported 300% to
400% perf improvement. We should definitely test it once we have enough
code to do so. I am just giving a heads up.

Having said that, for 'tar' I think we can most probably do a better job
in dht2, because even after readdirp a nameless lookup comes. If it has
GF_CONTENT_KEY we should send it to the data cluster directly.
For the webserver use case I don't have any ideas.

At least on my laptop this is what I saw; on a setup with separate
client and server machines the situation could be worse. This is a
distribute volume with one brick.

root@localhost - /mnt/d1
19:42:52 :) ⚡ time tar cf a.tgz a

real    0m6.987s
user    0m0.089s
sys     0m0.481s

root@localhost - /mnt/d1
19:43:22 :) ⚡ cd

root@localhost - ~
19:43:25 :) ⚡ umount /mnt/d1

root@localhost - ~
19:43:27 :) ⚡ gluster volume set d1 open-behind off
volume set: success

root@localhost - ~
19:43:47 :) ⚡ gluster volume set d1 quick-read off
volume set: success

root@localhost - ~
19:44:03 :( ⚡ gluster volume stop d1
Stopping volume will make its data inaccessible. Do you want to
continue? (y/n) y
volume stop: d1: success

root@localhost - ~
19:44:09 :) ⚡ gluster volume start d1
volume start: d1: success

root@localhost - ~
19:44:13 :) ⚡ mount -t glusterfs localhost.localdomain:/d1 /mnt/d1

root@localhost - ~
19:44:29 :) ⚡ cd /mnt/d1

root@localhost - /mnt/d1
19:44:30 :) ⚡ time tar cf b.tgz a

real    0m12.176s
user    0m0.098s
sys     0m0.582s

Pranith

>> I suggest that we send some fop at the time of open to the data cluster
>> and change quick-read to cache this data on open (if not already
>> cached); then we can reduce the perf hit to 1/2 of current perf, i.e.
>> lookup+open.
>
> At first glance, it seems pretty simple to do something like this, and
> pretty obvious that we should. The tricky question is: where should we
> send that other op, before lookup has told us where the partition
> containing that file is? If there's some reasonable guess we can make,
> then sending an open+read in parallel with the lookup will be helpful.
> If not, it will probably be a waste of time and network resources.
> Shyam, is enough of this information being cached *on the clients* to
> make this effective?

Pranith
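[Editor's note: for reference, the slowdown implied by the two `tar` runs in the transcript above works out as follows (the arithmetic is ours, not from the thread):]

```python
# Real times from the transcript: quick-read + open-behind enabled vs.
# disabled, same one-brick distribute volume.
with_xlators = 6.987      # seconds, first tar run
without_xlators = 12.176  # seconds, second tar run

slowdown = without_xlators / with_xlators
print(round(slowdown, 2))
```

That comes to roughly a 1.74x slowdown on a single brick: between the "no change" and "3x" bounds debated earlier in the thread, which supports Jeff's point that the real situation is less dire than a flat 1/3.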
Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
> Background: Quick-read + open-behind xlators are developed to help
> in small file workload reads like apache webserver, tar etc to get the
> data of the file in lookup FOP itself. What happens is, when a lookup
> FOP is executed, GF_CONTENT_KEY is added in xdata with max-length and
> posix xlator reads the file and fills the data in xdata response if this
> key is present as long as the file-size is less than max-length given in
> the xdata. So when we do a tar of something like a kernel tree with
> small files, if we look at profile of the bricks all we see are lookups.
> OPEN + READ fops will not be sent at all over the network.
>
> With dht2 because data is present on a different cluster. We can't
> get the data in lookup. Shyam was telling me that opens are also sent to
> metadata cluster. That will make perf in this usecase back to where it
> was before introducing these two features i.e. 1/3 of current perf
> (Lookup vs lookup+open+read)

Is "1/3 of current perf" based on actual measurements? My understanding
was that the translators in question exist to send requests *in
parallel* with the original lookup stream. That means it might be 3x the
messages, but it will only be 1/3 the performance if the network is
saturated. Also, the lookup is not guaranteed to be only one message. It
might be as many as N (the number of bricks), so by the reasoning above
the performance would only drop to N/(N+2). I think the real situation
is a bit more complicated - and less dire - than you suggest.

> I suggest that we send some fop at the
> time of open to data cluster and change quick-read to cache this data on
> open (if not already) then we can reduce the perf hit to 1/2 of current
> perf, i.e. lookup+open.

At first glance, it seems pretty simple to do something like this, and
pretty obvious that we should. The tricky question is: where should we
send that other op, before lookup has told us where the partition
containing that file is?
If there's some reasonable guess we can make, then sending an open+read
in parallel with the lookup will be helpful. If not, it will probably be
a waste of time and network resources. Shyam, is enough of this
information being cached *on the clients* to make this effective?
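[Editor's note: Jeff's message-count argument can be made concrete with a toy model. This is our sketch of his reasoning, not measured data: if a lookup fans out to N bricks and the network is the bottleneck, adding one open and one read in parallel raises the message count from N to N+2, so throughput scales as N/(N+2).]

```python
def relative_perf(n_bricks: int) -> float:
    """Perf of lookup+open+read relative to lookup-only, assuming a
    saturated network where throughput is inverse to message count."""
    return n_bricks / (n_bricks + 2)

for n in (1, 4, 16):
    print(n, round(relative_perf(n), 2))
```

N=1 gives the worst case, 1/3, matching Pranith's figure; a larger lookup fan-out dilutes the overhead of the two extra messages, which is Jeff's point.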
[Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files
hi,

Background: Quick-read + open-behind xlators were developed to help
small-file read workloads (apache webserver, tar, etc.) get the data of
the file in the lookup FOP itself. What happens is, when a lookup FOP is
executed, GF_CONTENT_KEY is added to xdata with a max-length, and the
posix xlator reads the file and fills the data into the xdata response
if this key is present, as long as the file size is less than the
max-length given in the xdata. So when we tar something like a kernel
tree with small files, if we look at the profile of the bricks, all we
see are lookups. OPEN + READ fops will not be sent over the network at
all.

With dht2, because data is present on a different cluster, we can't get
the data in lookup. Shyam was telling me that opens are also sent to the
metadata cluster. That will take perf in this use case back to where it
was before introducing these two features, i.e. 1/3 of current perf
(lookup vs. lookup+open+read). I suggest that we send some fop at the
time of open to the data cluster and change quick-read to cache this
data on open (if not already cached); then we can reduce the perf hit to
1/2 of current perf, i.e. lookup+open.

Sorry if this was already discussed and I didn't pay attention.

Pranith
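[Editor's note: the GF_CONTENT_KEY mechanism described above can be sketched in a few lines. This is a toy model of the posix-xlator behavior, not Gluster code; the key string and dict-based xdata are illustrative stand-ins.]

```python
import os

GF_CONTENT_KEY = "glusterfs.content"  # illustrative name for the xdata key

def posix_lookup(path: str, xdata: dict) -> dict:
    """Toy lookup handler: if the request carries GF_CONTENT_KEY with a
    max length and the file fits, piggyback the file content on the
    lookup reply so the client needs no separate OPEN + READ."""
    st = os.stat(path)
    reply = {"size": st.st_size}
    max_len = xdata.get(GF_CONTENT_KEY)
    if max_len is not None and st.st_size <= max_len:
        with open(path, "rb") as f:
            reply["content"] = f.read()
    return reply
```

A client doing `posix_lookup(path, {GF_CONTENT_KEY: 64 * 1024})` gets small-file data in one round trip; files over the limit (or lookups without the key) get only the stat data, and the client falls back to open+read.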