On 02/02/2016 06:22 PM, Jeff Darcy wrote:
Background: The quick-read and open-behind xlators were developed to
help small-file read workloads (apache webserver, tar, etc.) by
returning the file's data in the lookup FOP itself. When a lookup FOP
is executed, GF_CONTENT_KEY is added to xdata along with a max-length,
and the posix xlator reads the file and fills the data into the xdata
response whenever this key is present and the file size is less than
the given max-length. So when we tar something like a kernel tree of
small files, the brick profiles show nothing but lookups; OPEN and
READ fops are not sent over the network at all.
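A minimal, self-contained sketch of that idea (plain C, but not the
actual posix xlator code; the helper name and the 64KB limit below are
made up for illustration): the brick stats the file and, only if it
fits under the requested max-length, reads the whole thing so the
content can ride back inside the lookup reply.

/* Illustrative sketch only -- not the real posix code.  Models what
 * happens when a lookup carries GF_CONTENT_KEY with a max-length. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns a malloc'ed buffer with the file content, or NULL if the
 * file is larger than max_len (or on error).  In the real xlator the
 * buffer would be stuffed into the lookup's xdata response. */
static char *
lookup_with_content(const char *path, size_t max_len, struct stat *st)
{
        if (stat(path, st) != 0)
                return NULL;
        if ((size_t)st->st_size > max_len)
                return NULL;            /* too big: client must OPEN+READ */

        int fd = open(path, O_RDONLY);
        if (fd < 0)
                return NULL;

        char *buf = malloc(st->st_size + 1);
        if (!buf) {
                close(fd);
                return NULL;
        }
        ssize_t n = read(fd, buf, st->st_size);
        close(fd);
        if (n < 0) {
                free(buf);
                return NULL;
        }
        buf[n] = '\0';
        return buf;
}

int
main(int argc, char **argv)
{
        struct stat st;

        if (argc < 2)
                return 1;
        char *data = lookup_with_content(argv[1], 64 * 1024, &st);
        if (data) {
                printf("lookup returned %zd bytes inline\n",
                       (ssize_t)st.st_size);
                free(data);
        } else {
                printf("no inline data; client falls back to open+read\n");
        }
        return 0;
}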
With dht2, because the data is present on a different cluster, we
can't get the data in lookup. Shyam was telling me that opens are also
sent to the metadata cluster. That will take perf in this use case
back to where it was before these two features were introduced, i.e.
1/3 of current perf (lookup vs lookup+open+read).
Is "1/3 of current perf" based on actual measurements? My understanding
was that the translators in question exist to send requests *in parallel*
with the original lookup stream. That means it might be 3x the messages,
but it will only be 1/3 the performance if the network is saturated.
Also, the lookup is not guaranteed to be only one message. It might be
as many as N (the number of bricks), so by the reasoning above the
performance would only drop to N/(N+2). I think the real situation is
a bit more complicated - and less dire - than you suggest.
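To put a toy number on that (N = 4 is arbitrary, just following the
reasoning above): the lookup fan-out alone is 4 messages, and adding
one open and one read makes 6, so even on a saturated network
throughput would drop to 4/6, roughly 67%, not to 1/3.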
As per what I heard, when quick-read (now divided into open-behind and
quick-read) was introduced, users with the webserver use case reported
a 300% to 400% perf improvement.
We should definitely test it once we have enough code to do so. I am
just giving a heads up.
Having said that, for 'tar' I think we can most probably do a better
job in dht2, because even after readdirp a nameless lookup still
comes. If it has GF_CONTENT_KEY we should send it to the data cluster
directly (sketched below). For the webserver use case I don't have any
ideas.
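Roughly the routing I have in mind, as a standalone sketch (made-up
struct and enum names, not dht2 code): a nameless lookup that carries
GF_CONTENT_KEY gets steered to the data cluster, everything else keeps
going to the metadata cluster as it does today.

/* Sketch of the routing idea only -- names invented for illustration. */
#include <stdbool.h>
#include <stdio.h>

enum target { METADATA_CLUSTER, DATA_CLUSTER };

struct lookup_req {
        bool nameless;          /* lookup by gfid, no parent/name */
        bool wants_content;     /* GF_CONTENT_KEY present in xdata */
};

static enum target
route_lookup(const struct lookup_req *req)
{
        if (req->nameless && req->wants_content)
                return DATA_CLUSTER;    /* e.g. the lookups tar issues
                                           after readdirp */
        return METADATA_CLUSTER;
}

int
main(void)
{
        struct lookup_req tar_lookup = { .nameless = true,
                                         .wants_content = true };

        printf("%s\n", route_lookup(&tar_lookup) == DATA_CLUSTER ?
               "send to data cluster" : "send to metadata cluster");
        return 0;
}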
At least on my laptop this is what I saw; on a setup with separate
client and server machines the situation could be worse. This is a
distribute volume with one brick.
root@localhost - /mnt/d1
19:42:52 :) ⚡ time tar cf a.tgz a
real 0m6.987s
user 0m0.089s
sys 0m0.481s
root@localhost - /mnt/d1
19:43:22 :) ⚡ cd
root@localhost - ~
19:43:25 :) ⚡ umount /mnt/d1
root@localhost - ~
19:43:27 :) ⚡ gluster volume set d1 open-behind off
volume set: success
root@localhost - ~
19:43:47 :) ⚡ gluster volume set d1 quick-read off
volume set: success
root@localhost - ~
19:44:03 :( ⚡ gluster volume stop d1
Stopping volume will make its data inaccessible. Do you want to
continue? (y/n) y
volume stop: d1: success
root@localhost - ~
19:44:09 :) ⚡ gluster volume start d1
volume start: d1: success
root@localhost - ~
19:44:13 :) ⚡ mount -t glusterfs localhost.localdomain:/d1 /mnt/d1
root@localhost - ~
19:44:29 :) ⚡ cd /mnt/d1
root@localhost - /mnt/d1
19:44:30 :) ⚡ time tar cf b.tgz a
real 0m12.176s
user 0m0.098s
sys 0m0.582s
Pranith
I suggest that we send some fop to the data cluster at the time of
open and change quick-read to cache this data on open (if it doesn't
already); then we can reduce the perf hit to 1/2 of current perf, i.e.
lookup+open.
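Something along these lines, again only a sketch with made-up names
rather than quick-read's actual code: at open time we piggyback a read
to the data cluster and keep the result against the inode, so the
application's first read is answered from the client-side cache and we
pay for lookup+open instead of lookup+open+read.

/* Sketch of the caching idea -- fetch_from_data_cluster() is a
 * stand-in for whatever fop we end up sending at open time. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct inode_cache {
        char  *data;            /* content fetched at open time */
        size_t size;
};

/* Placeholder for the read sent to the data cluster on open. */
static char *
fetch_from_data_cluster(const char *gfid, size_t *size)
{
        const char *fake = "file contents";

        (void)gfid;             /* a real fop would be keyed on the gfid */
        *size = strlen(fake);
        return strdup(fake);
}

static void
on_open(struct inode_cache *ic, const char *gfid)
{
        if (!ic->data)                          /* cache only once */
                ic->data = fetch_from_data_cluster(gfid, &ic->size);
}

static size_t
on_read(const struct inode_cache *ic, char *buf, size_t len)
{
        if (!ic->data)
                return 0;                       /* nothing cached */

        size_t n = len < ic->size ? len : ic->size;
        memcpy(buf, ic->data, n);               /* served locally, no fop */
        return n;
}

int
main(void)
{
        struct inode_cache ic = { 0 };
        char buf[64];

        on_open(&ic, "dummy-gfid");
        size_t n = on_read(&ic, buf, sizeof(buf));
        printf("read %zu cached bytes\n", n);
        free(ic.data);
        return 0;
}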
At first glance, it seems pretty simple to do something like this, and
pretty obvious that we should. The tricky question is: where should we
send that other op, before lookup has told us where the partition
containing that file is? If there's some reasonable guess we can make,
then sending an open+read in parallel with the lookup will be helpful.
If not, then it will probably be a waste of time and network resources.
Shyam, is enough of this information being cached *on the clients* to
make this effective?
Pranith
_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel