Hi Everybody,

We recently upgraded our 220 TB Gluster deployment to 3.7.4, and we've been 
trying out the new glusterfind feature, but we've hit some serious problems 
with it. Overall, glusterfind looks very promising, so I don't want to offend 
anyone by raising these issues.

If these issues can be resolved or worked around, glusterfind will be a great 
feature.  So I would really appreciate any information or advice:

1) What can be done about the vast number of tiny changelogs? We often see 5+ 
small 89-byte changelog files per minute on EACH brick, with larger files when 
the volume is busier. We've been generating these changelogs for a few weeks 
and already have in excess of 10,000 or 12,000 on most bricks. This makes 
glusterfind runs very, very slow, especially on a node with a lot of bricks, 
and looks unsustainable in the long run. Why are these files so small, why are 
there so many of them, and how are they supposed to be managed over time? The 
sheer number of these files looks certain to hurt performance in the long run.
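
For reference, the interval that controls how often a new changelog file is 
cut appears to be tunable; if I'm reading the docs right the default is 15 
seconds, which would explain the flood of tiny files. A sketch of what we're 
considering (volume name and brick path below are examples from our setup):

```shell
# Count the accumulated changelogs on one brick (substitute your own
# backend directory; rotated changelogs live under .glusterfs/changelogs).
find /data/brick01/.glusterfs/changelogs -maxdepth 1 -name 'CHANGELOG.*' | wc -l

# Cut a new changelog every 5 minutes instead of every 15 seconds.
gluster volume set vol00 changelog.rollover-time 300
```

I don't know what implications a longer rollover interval has for glusterfind 
correctness, so treat this as an experiment rather than a recommendation.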

2) The pgfid extended attribute is wreaking havoc with our backup scheme: when 
Gluster adds this xattr to a file, it changes the file's ctime, which we were 
using to determine which files need to be archived. A warning should be added 
to the release notes and upgrade notes so that people can make a plan to 
manage this if required.
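
To illustrate what we're seeing (brick path below is an example from our 
layout): on the brick backend, a freshly-labelled file carries a 
trusted.pgfid.* xattr, and its ctime jumps to the moment the xattr was added:

```shell
# Dump any pgfid xattrs on a backend file and show its ctime.
getfattr -m 'trusted.pgfid' -d -e hex /data/brick01/some/file
stat -c '%z  %n' /data/brick01/some/file   # ctime changes when the xattr is added
```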

Also, we ran a rebalance immediately after the 3.7.4 upgrade, and it took 
about 5 days to complete, which looks like a major speed improvement over the 
older, more serial rebalance algorithm, so that's good. But I was hoping the 
rebalance would also have the side effect of labelling every file with the 
pgfid attribute by the time it completed, or failing that, that building an 
mlocate database across our entire gluster would (since that should have 
accessed every file, unless it gets the information it needs only from 
directory inodes). Yet ctimes are still being modified, and I think this can 
only be caused by files still being labelled with pgfids.

How can we force Gluster to finish this pgfid labelling for all files already 
on the volume? We can't have Gluster continuing to add pgfids in bursts here 
and there, e.g. when files are read for the first time since the upgrade; we 
need to get it over and done with. For now we have had to turn off pgfid 
creation on the volume until we can force Gluster to complete it in one go.
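
If a named lookup through the client mount is what triggers the labelling (my 
understanding, which may be wrong), then something like the following, run 
once with the option re-enabled, might force it in a single pass. The mount 
point and volume name are examples:

```shell
# Re-enable pgfid labelling, then stat every file through the FUSE mount
# to force a named lookup on each one, which should set its pgfid xattr.
gluster volume set vol00 storage.build-pgfid on
find /mnt/gluster -xdev \( -type f -o -type d \) -exec stat --format=%n {} + > /dev/null
```

If someone can confirm whether a plain lookup is enough, or whether a read is 
required, that would settle it.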

3) Files modified just before a glusterfind pre are often not included in the 
changed-files list unless the pre command is run again a bit later. I suspect 
the changelogs are missing very recent changes and need to be flushed (or 
rolled over) before the pre command consumes them?

4) BUG: glusterfind follows symlinks off the bricks and onto NFS-mounted 
directories (and will cause those shares to be mounted if you have autofs 
enabled). Glusterfind should definitely not follow symlinks, but it does. For 
now we are getting around this by turning off autofs when we run glusterfinds, 
but this should not be necessary. Glusterfind must be fixed so that it never 
follows symlinks and never leaves the brick it is currently searching.
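
For reference, this is the behaviour GNU find gives by default, and something 
equivalent in glusterfind's crawler would avoid the problem entirely. A 
self-contained demonstration with a synthetic layout standing in for a brick:

```shell
# Synthetic layout standing in for a brick containing a stray symlink
# that points outside the brick.
base=$(mktemp -d)
mkdir -p "$base/brick/sub" "$base/outside"
touch "$base/brick/sub/a.txt" "$base/outside/b.txt"
ln -s "$base/outside" "$base/brick/link"

# -P (the default) never follows symlinks, and -xdev never crosses
# filesystem boundaries, so autofs mounts would never be triggered.
find -P "$base/brick" -xdev -type f
# prints only .../brick/sub/a.txt; the symlinked b.txt is never visited
```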

5) One of our nodes has 16 bricks, and on this machine the glusterfind pre 
command seems to get stuck pegging all 8 cores at 100%. An strace of one of 
the offending processes gives an endless stream of these lseeks and reads and 
very little else. What is going on here? It doesn't look right...:

lseek(13, 17188864, SEEK_SET)           = 17188864
read(13, "\r\0\0\0\4\0J\0\3\25\2\"\0013\0J\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
lseek(13, 17189888, SEEK_SET)           = 17189888
read(13, "\r\0\0\0\4\0\"\0\3\31\0020\1#\0\"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
lseek(13, 17190912, SEEK_SET)           = 17190912
read(13, "\r\0\0\0\3\0\365\0\3\1\1\372\0\365\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
lseek(13, 17191936, SEEK_SET)           = 17191936
read(13, "\r\0\0\0\4\0F\0\3\17\2\"\0017\0F\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
lseek(13, 17192960, SEEK_SET)           = 17192960
read(13, "\r\0\0\0\4\0006\0\2\371\2\4\1\31\0006\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
lseek(13, 17193984, SEEK_SET)           = 17193984
read(13, "\r\0\0\0\4\0L\0\3\31\2\36\1/\0L\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024

I captured one of these straces for 20-30 seconds, then did a quick analysis 
of it:
    cat ~/strace.glusterfind-lseeks2.txt | wc -l
    2719285
That's 2.7 million system calls, and grepping to exclude all the lseeks and 
reads leaves only 24 other syscalls:

cat ~/strace.glusterfind-lseeks2.txt | grep -v lseek | grep -v read
Process 28076 attached - interrupt to quit
write(13, "\r\0\0\0\4\0\317\0\3N\2\241\1\322\0\317\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
write(13, "\r\0\0\0\4\0_\0\3\5\2\34\1I\0_\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
write(13, "\r\0\0\0\4\0\24\0\3\10\2\f\1\34\0\24\0\0\0\0\202\3\203\324?\f\0!\31UU?"..., 1024) = 1024
close(15)                               = 0
munmap(0x7f3570b01000, 4096)            = 0
lstat("/usr", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/var", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/var/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/var/lib/misc", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/var/lib/misc/glusterfsd", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/var/lib/misc/glusterfsd/glusterfind", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history", {st_mode=S_IFDIR|0600, st_size=4096, ...}) = 0
lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processing", {st_mode=S_IFDIR|0600, st_size=249856, ...}) = 0
lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processing/CHANGELOG.1444388354", {st_mode=S_IFREG|0644, st_size=5793, ...}) = 0
write(6, "[2015-10-16 02:59:53.437769] D ["..., 273) = 273
rename("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processing/CHANGELOG.1444388354", "/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processed/CHANGELOG.1444388354") = 0
open("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processing/CHANGELOG.1444388369", O_RDONLY) = 15
fstat(15, {st_mode=S_IFREG|0644, st_size=4026, ...}) = 0
fstat(15, {st_mode=S_IFREG|0644, st_size=4026, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3570b01000
write(13, "\r\0\0\0\4\0]\0\3\22\0027\1L\0]\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
Process 28076 detached

That seems like an enormous number of system calls to process just one 
changelog, especially when most of these changelogs are only 89 bytes long, 
few are larger than about 5 KB, and the largest is about 20 KB. We only 
upgraded to 3.7.4 several weeks ago, and we already have 12,000 or so 
changelogs to process on each brick, all of which will have to be processed if 
I want to generate a listing going back to the time of the upgrade (which I 
do). If each changelog is being processed in this apparently inefficient way, 
it must be making the whole run far slower than it needs to be.
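
As an aside, here is a quick way to tally the syscall mix from a capture like 
the one above (the sample file below is synthetic so the snippet is 
self-contained; point the awk at your own strace log instead):

```shell
# Build a tiny sample in the same shape as the strace output above.
cat > /tmp/strace.sample.txt <<'EOF'
lseek(13, 17188864, SEEK_SET)           = 17188864
read(13, "...", 1024) = 1024
lseek(13, 17189888, SEEK_SET)           = 17189888
EOF

# Tally syscall names by stripping everything from the first '(' onward.
awk '{sub(/\(.*/, "", $1); n[$1]++} END {for (s in n) print n[s], s}' \
    /tmp/strace.sample.txt | sort -rn
# prints: 2 lseek / 1 read
```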

This is a big problem and makes it almost impossible to use glusterfind for 
what we need it for...

Again, I'm not intending to be negative; I'm just hoping these issues can be 
addressed if possible, and I'm seeking advice or information on managing them 
and making glusterfind usable in the meantime.

Many thanks for any advice.

John

_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
