[Gluster-users] GlusterFS performance, concurrency and I/O blocking
Hi everybody! Love this community, and I love GlusterFS. All that, despite being burned by it, likely due to my own failures. Here's the scenario where I got burned, and my guesstimates on why it happened.

We run a popular .NET-based web app that gets a lot of traffic, where people build websites using our system. The long and short of it is, we tested, tweaked, and tested some more, over a full month. After we deployed it to production, we saw performance take a dive into the dumpster, and we had to revert back fairly quickly. The obvious blame is in our testing. We load-tested the system many, many times over the course of an entire month, but with a narrow range of test scenarios. The wide range of live production traffic proved to render our testing moot. We tucked our tail between our legs and are researching tools that will let us play back live traffic to serve as a better simulation. In our earlier load testing we were able to achieve many multiples of our peak traffic, but again, it wasn't realistic traffic.

Before I get to my suspicion of what's happening, keep in mind that we have 50+ million files (over hundreds of thousands of directories), most of them small, and each page request will pull in upwards of 10-40 supporting assets (images, Flash files, CSS, JS, etc.). We also have people executing directory listings whenever they're editing their site, as they choose images, etc. to insert onto the page. We're also exporting the volume over CIFS so our Windows servers can access the GlusterFS client on the Linux machines in the cluster. The Samba settings on there were tweaked to the hilt as well: turning off case-insensitivity, bumping up caches and async I/O, etc.

It appears as if GlusterFS has some kind of I/O blocking going on. Whenever a directory listing is being pieced together, it noticeably slows down (or stops?) other operations through the same client. For a high-concurrency app like ours, where the storage backend needs to be able to pull off 10 to 100 directory listings a second, and 5,000 to 10,000 IOPS overall, it's easy to see how perf would degrade if my blocking suspicion is correct. The biggest culprit, in my guess, is the directory listing. Executing one makes things drag. I've been able to demonstrate that through a simple script (see the sketch below).

And we're running some pretty monster machines with 24 cores, 24 GB RAM, etc. I tried as many tuning permutations as possible, only to run into the same result. Jacking up the cache-size, the io-thread-count to 64, etc. certainly helped performance, but the system continued to exhibit this blocking behavior. I also made sure that each web server accessing the GlusterFS backend was talking to its own GlusterFS client, in the hopes of increasing parallelization. I'm sure it helped, but not enough. It's nowhere close to the concurrency and performance of a straight-out Windows share. (I realize that with the overhead of a clustered file system we'll get less perf than a straight share, but we saw performance drop as load increased, in the order-of-magnitude range.)

Am I way off? Does GlusterFS block on directory listings (getdents) or any other operations? If so, is there a way to enable the database equivalent of "dirty reads" so it doesn't block?

Ken
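The "simple script" mentioned above would have had roughly this shape (a reconstructed sketch, not the actual script; the mount point and file names are hypothetical):

    #!/bin/bash
    # Sketch: show that a large directory listing on a GlusterFS client
    # mount slows down unrelated small reads through the same client.
    MNT=/mnt/gluster          # hypothetical GlusterFS client mount
    BIGDIR="$MNT/storage"     # a directory with a huge number of entries
    TESTFILE="$MNT/somefile"  # any small file on the volume

    # Background: kick off a full listing of the big directory.
    ls -lR "$BIGDIR" > /dev/null 2>&1 &
    LS_PID=$!

    # Foreground: time small reads while the listing is in flight.
    while kill -0 "$LS_PID" 2>/dev/null; do
        /usr/bin/time -f "read took %e s" cat "$TESTFILE" > /dev/null
        sleep 0.2
    done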
Re: [Gluster-users] not sure how to troubleshoot SMB / CIFS overload when using GlusterFS
Anand, Whit, and Joseph, I appreciate your help very, very much. Anand's assertion about Samba doing string comparisons was spot on. And Whit's suggestion to change the smb.conf to make it case sensitive did the trick. I am also forcing "default case = lower" and "preserve case = no" in smb.conf to make sure everything stays lower-case going in. With those changes in place, I believe I can hear a song in the back of my head, "It's a whole new world..."

On the web app side we will be writing a request handler that will automatically lower-case any requests coming in, so any referenced images and files will work no matter the casing specified. (I've noticed that not many Linux-hosted sites and SaaS platforms handle casing well; not sure why.) We will have some users complain about case sensitivity not being maintained on their files, but I think the huge win of being able to use GlusterFS is worth it. There are no great Windows solutions for ever-expandable storage, and we're well past the published limitations of Windows DFS-R. DFS-R is an amazing, refined piece of technology, but it is a solution for a different kind of problem.

Thanks again, guys, I never would have navigated to this solution without you.

Ken
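For anyone landing on this thread later, the smb.conf changes described above would look roughly like this (a sketch; the share name and path are hypothetical):

    [storage]
        path = /mnt/gluster
        # Stop Samba's case-insensitive name scans of huge directories:
        case sensitive = yes
        # Force everything to lower-case on the way in:
        default case = lower
        preserve case = no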
Re: [Gluster-users] not sure how to troubleshoot SMB / CIFS overload when using GlusterFS
Joseph,

Thank you for your response. Yours, combined with Whit's, led me to come up with a pretty solid repro case, and a pinpointing of what I think is going on.

I tried your additional SMB configuration settings, and was hopeful, but they didn't alleviate the issue. Your interpretation of the logs was helpful, though. It makes sense now that Samba was pounding on GlusterFS, doing its string of getdents operations.

I also took your advice last night on stat-cache (I assume that was on the Gluster side, which I enabled), and wasn't sure where "fast lookups" was. That didn't seem to make a noticeable difference either.

I think the lockups are happening as a result of being crippled by GlusterFS's relatively slow directory listing (5x-10x slower generating a dir listing than a raw SMB share), combined with FUSE's blocking readdir(). I'm not positive on that last point, since there was only one mention of it on the internet. Am praying that somebody will see this and say, "oh yeah, well sure, just change this one thing in FUSE and you're good to go!" Somehow I don't think that's going to happen. :)

Ken

On Sun, Jul 17, 2011 at 10:35 PM, Joe Landman <land...@scalableinformatics.com> wrote:

> On 07/17/2011 11:19 PM, Ken Randall wrote:
>> Joe,
>>
>> Thank you for your response. After seeing what you wrote, I bumped up
>> the performance.cache-size to 4096MB, the max allowed, and ran into
>> the same wall.
>
> Hmmm ...
>
>> I wouldn't think that any SMB caching would help in this case, since the
>> same Samba server on top of the raw Gluster data wasn't exhibiting any
>> trouble, or am I deceived?
>
> Samba could cache better so it didn't have to hit Gluster so hard.
>
>> I haven't used strace before, but I ran it on the glusterfs process, and
>> saw a lot of:
>>
>> epoll_wait(3, {{EPOLLIN, {u32=9, u64=9}}}, 257, 4294967295) = 1
>> readv(9, [{"\200\0\16,", 4}], 1) = 4
>> readv(9, [{"\0\n;\227\0\0\0\1", 8}], 1) = 8
>> readv(9, [{"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\31\0\0\0\0\0\0\0\1\0\0\0\0"..., 3620}], 1) = 1436
>> readv(9, 0xa90b1b8, 1) = -1 EAGAIN (Resource temporarily unavailable)
>
> Interesting ... I am not sure why it's reporting an EAGAIN for readv, other
> than it can't fill the vector from the read.
>
>> And when I ran it on smbd, I saw a constant stream of this kind of
>> activity:
>>
>> getdents(29, /* 25 entries */, 32768) = 840
>> getdents(29, /* 25 entries */, 32768) = 856
>> getdents(29, /* 25 entries */, 32768) = 848
>> getdents(29, /* 24 entries */, 32768) = 856
>> getdents(29, /* 25 entries */, 32768) = 864
>> getdents(29, /* 24 entries */, 32768) = 832
>> getdents(29, /* 25 entries */, 32768) = 832
>> getdents(29, /* 24 entries */, 32768) = 856
>> getdents(29, /* 25 entries */, 32768) = 840
>> getdents(29, /* 24 entries */, 32768) = 832
>> getdents(29, /* 25 entries */, 32768) = 784
>> getdents(29, /* 25 entries */, 32768) = 824
>> getdents(29, /* 25 entries */, 32768) = 808
>> getdents(29, /* 25 entries */, 32768) = 840
>> getdents(29, /* 25 entries */, 32768) = 864
>> getdents(29, /* 25 entries */, 32768) = 872
>> getdents(29, /* 25 entries */, 32768) = 832
>> getdents(29, /* 24 entries */, 32768) = 832
>> getdents(29, /* 25 entries */, 32768) = 840
>> getdents(29, /* 25 entries */, 32768) = 824
>> getdents(29, /* 25 entries */, 32768) = 824
>> getdents(29, /* 24 entries */, 32768) = 864
>> getdents(29, /* 25 entries */, 32768) = 848
>> getdents(29, /* 24 entries */, 32768) = 840
>
> Get directory entries. This is the stuff that NTFS is caching for its web
> server, and it appears Samba is not.
>
> Try
>
>    aio read size = 32768
>    csc policy = documents
>    dfree cache time = 60
>    directory name cache size = 10
>    fake oplocks = yes
>    getwd cache = yes
>    level2 oplocks = yes
>    max stat cache size = 16384
>
>> That chunk would get repeated over and over and over again as fast as
>> the screen could go, with the occasional (every 5-10 seconds or so)
>> appearance of something you'd normally expect to see, such as:
>>
>> close(29) = 0
>> stat("Storage/01", 0x7fff07dae870) = -1 ENOENT (No such file or directory)
>> write(23, "\0\0\0#\377SMB24\0\0\300\210A\310\0\0\0\0\0\0\0\0\0\0\0\0\1\0d\233"...,
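For readers trying to reproduce the Gluster-side tuning mentioned in this thread (cache-size, io-threads, stat-cache), the knobs map to volume options set roughly like this (a sketch; the volume name "myvol" is hypothetical, and exact option names can vary by Gluster version):

    # Bump the read cache on the volume (4096MB was the max allowed here):
    gluster volume set myvol performance.cache-size 4096MB
    # Raise the I/O thread count (64 was the max tried in this thread):
    gluster volume set myvol performance.io-thread-count 64
    # Enable the stat-prefetch (stat-cache) translator:
    gluster volume set myvol performance.stat-prefetch on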
Re: [Gluster-users] not sure how to troubleshoot SMB / CIFS overload when using GlusterFS
Whit,

Genius! This morning I set out to remove as many variables as possible, to whittle the repro case down as far as I could. I've become pretty good at debugging memory dumps on the Windows side over the years, and even inspected the web processes. Nothing looked out of the ordinary there, just a bunch of threads waiting to get file attribute data from the Gluster share.

So then, to follow your lead, I reduced the Page of Death from thousands of images down to just five. I tried accessing the page, and boom, everything's frozen for minutes. Interesting. So I reduced it to one image, accessed the page, and boom, everything's dead instantly. That one image is a file that doesn't exist.

So now, knowing that GlusterFS is kicking into overdrive fretting about a file it can't find, I decided to eliminate the web server altogether. I opened up Windows Explorer and typed in a directory that didn't exist, and sure enough, I was unable to navigate through the share in another Explorer window until it finally responded again a minute later. I think the Page of Death was exhibiting such a massive death (e.g. only able to respond again upwards of five minutes later) because it was systematically trying to access several files that weren't found, and each one it can't find causes the SMB connection to hang for close to a minute.

I feel like this is major progress toward pinpointing the problem for a possible resolution. Here are some additional details that may help: The GlusterFS directory in question, /storage, has about 80,000 subdirs in it. (As such, I'm using ext4 to overcome the subdir limitations of ext3.) The non-existent image file that is able to cause everything to freeze lives at a path like /storage/thisdirdoesntexist/images/blah.gif, where "thisdirdoesntexist" would sit in that storage directory alongside those 80,000 real subdirs. I know it's a pretty laborious thing for Gluster to piece together a directory listing, and combined with Joseph's recognition of the flood of getdents, does it seem reasonable that Gluster or Samba is freezing because, whenever it can't find one of those subdirs, it is for some reason generating a full subdir listing of /storage?

As another test, I accessed a file inside a non-existent subdir of a dir that only has five subdirs, and nothing froze. So the freezing seems to be a function of the number of subdirectories that are siblings of the first part of the path that doesn't exist, if that makes sense. So in /this/is/a/long/path, if "is" doesn't exist, then Samba will generate a list of subdirs under /this. And if /this has 100,000 immediate subdirs under it, then you're about to experience a world of hurt. (A minimal repro sketch, without the web server in the loop, follows at the end of this message.)

I read somewhere that FUSE's implementation of readdir() is a blocking operation. If true, the above explanation, plus FUSE's readdir(), are to blame. And I am therefore up a creek. It is not feasible to force the system to only have a few subdirs at any given level to prevent the lockup. Unless somebody, after reading this novel, has some ideas for me to try. =) Any magical ways to get FUSE not to block, or any trickery on Samba's side?

Ken

On Sun, Jul 17, 2011 at 10:29 PM, Whit Blauvelt wrote:

> On Sun, Jul 17, 2011 at 10:19:00PM -0500, Ken Randall wrote:
>
>> (The no such file or directory part is expected since some of the image
>> references don't exist.)
>
> Wild guess on that: Gluster may work harder at files it doesn't find than
> files it finds. It's going to look on one side or the other of the
> replicated file at first, and if it finds the file, deliver it. But if it
> doesn't find the file, wouldn't it then check the other side of the
> replicated storage to make sure this wasn't a replication error?
>
> Might be interesting to run a version of the test where all the images
> referenced do exist, to see if it's the missing files that are driving up
> the CPU cycles.
>
> Whit
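The whittled-down repro, minus the web server, can be expressed as a short sketch (all paths and names hypothetical; assumes the Samba share from earlier in the thread is mapped on the Windows box):

    # On a Linux Gluster client: populate a directory with a huge number
    # of sibling subdirs (the shape that triggers the freeze).
    cd /mnt/gluster/storage
    for i in $(seq 1 80000); do mkdir "site$i"; done

    # From the Windows box, request a path whose FIRST missing component
    # has those 80,000 siblings, then try to browse the share elsewhere:
    #
    #   C:\> dir \\fileserver\storage\thisdirdoesntexist\images\blah.gif
    #
    # Per the hypothesis above, the failed lookup drives Samba to
    # enumerate all of /storage's entries (the getdents flood) through
    # FUSE, and other operations on the same client stall behind it.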
Re: [Gluster-users] not sure how to troubleshoot SMB / CIFS overload when using GlusterFS
Joe,

Thank you for your response. After seeing what you wrote, I bumped up the performance.cache-size to 4096MB, the max allowed, and ran into the same wall.

I wouldn't think that any SMB caching would help in this case, since the same Samba server on top of the raw Gluster data wasn't exhibiting any trouble, or am I deceived?

I haven't used strace before, but I ran it on the glusterfs process, and saw a lot of:

epoll_wait(3, {{EPOLLIN, {u32=9, u64=9}}}, 257, 4294967295) = 1
readv(9, [{"\200\0\16,", 4}], 1) = 4
readv(9, [{"\0\n;\227\0\0\0\1", 8}], 1) = 8
readv(9, [{"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\31\0\0\0\0\0\0\0\1\0\0\0\0"..., 3620}], 1) = 1436
readv(9, 0xa90b1b8, 1) = -1 EAGAIN (Resource temporarily unavailable)

And when I ran it on smbd, I saw a constant stream of this kind of activity:

getdents(29, /* 25 entries */, 32768) = 840
getdents(29, /* 25 entries */, 32768) = 856
getdents(29, /* 25 entries */, 32768) = 848
getdents(29, /* 24 entries */, 32768) = 856
getdents(29, /* 25 entries */, 32768) = 864
getdents(29, /* 24 entries */, 32768) = 832
getdents(29, /* 25 entries */, 32768) = 832
getdents(29, /* 24 entries */, 32768) = 856
getdents(29, /* 25 entries */, 32768) = 840
getdents(29, /* 24 entries */, 32768) = 832
getdents(29, /* 25 entries */, 32768) = 784
getdents(29, /* 25 entries */, 32768) = 824
getdents(29, /* 25 entries */, 32768) = 808
getdents(29, /* 25 entries */, 32768) = 840
getdents(29, /* 25 entries */, 32768) = 864
getdents(29, /* 25 entries */, 32768) = 872
getdents(29, /* 25 entries */, 32768) = 832
getdents(29, /* 24 entries */, 32768) = 832
getdents(29, /* 25 entries */, 32768) = 840
getdents(29, /* 25 entries */, 32768) = 824
getdents(29, /* 25 entries */, 32768) = 824
getdents(29, /* 24 entries */, 32768) = 864
getdents(29, /* 25 entries */, 32768) = 848
getdents(29, /* 24 entries */, 32768) = 840

That chunk would get repeated over and over and over again as fast as the screen could go, with the occasional (every 5-10 seconds or so) appearance of something you'd normally expect to see, such as:

close(29) = 0
stat("Storage/01", 0x7fff07dae870) = -1 ENOENT (No such file or directory)
write(23, "\0\0\0#\377SMB24\0\0\300\210A\310\0\0\0\0\0\0\0\0\0\0\0\0\1\0d\233"..., 39) = 39
select(38, [5 20 23 27 30 31 35 36 37], [], NULL, {60, 0}) = 1 (in [23], left {60, 0})
read(23, "\0\0\0x", 4) = 4
read(23, "\377SMB2\0\0\0\0\30\7\310\0\0\0\0\0\0\0\0\0\0\0\0\1\0\250P\273\0[8"..., 120) = 120
stat("Storage", {st_mode=S_IFDIR|0755, st_size=1581056, ...}) = 0
stat("Storage/011235", 0x7fff07dad470) = -1 ENOENT (No such file or directory)
stat("Storage/011235", 0x7fff07dad470) = -1 ENOENT (No such file or directory)
open("Storage", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 29
fcntl(29, F_SETFD, FD_CLOEXEC) = 0

(The no such file or directory part is expected since some of the image references don't exist.)

Ken
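The strace runs described above would look something like this (a sketch; the flags are standard strace, and the PID selection is simplified, assuming single instances of each daemon):

    # Watch the GlusterFS client process's syscalls:
    strace -f -p "$(pidof glusterfs)"

    # Watch an smbd process serving the share; strace prints to stderr,
    # so redirect before filtering for the getdents flood:
    strace -f -p "$(pidof smbd | awk '{print $1}')" 2>&1 | grep getdents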
Re: [Gluster-users] not sure how to troubleshoot SMB / CIFS overload when using GlusterFS
Thanks for the reply, Whit! Perfectly reasonable first question. The websites have user-generated content (think CMS), where people could put in that kind of content. The likelihood of such a scenario is slim to none, but I'd rather not have that kind of vulnerability in the first place. And yes, we could also add in validation and/or stripping of content that is outside the bounds of normal, but the main reason I bring up this Page of Death scenario is that I worry it may be indicative of a weakness in the system, where a different kind of load pattern could trigger this kind of hang.

To answer the second question: running top on the Linux side during the Page of Death (with nothing else running), I get a CPU % spike of anywhere between 80-110% on glusterfsd, and 20% on glusterfs, with close to 22 GB of memory free. The machines are 16-core apiece, though. On the Windows side there is next to no effect on CPU, memory, or network utilization.

Ken

On Sun, Jul 17, 2011 at 8:06 PM, Whit Blauvelt wrote:

> On Sun, Jul 17, 2011 at 07:56:57PM -0500, Ken Randall wrote:
>
>> However, part of a different suite of tests is a Page of Death, which
>> contains tens of thousands of image references on a single page.
>
> Off-topic response: Is there ever in real production any page, anywhere,
> that contains tens of thousands of image references? I'm all for testing at
> the extreme, and capacity that goes far beyond what's needed for practical
> purposes. Is that what this is, or do you anticipate real-life Page o' Death
> scenarios?
>
> Closer to the topic: What's going on with the load on the various systems?
> On the Linux side, have you watched each of them with something like htop?
>
> Whit
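For anyone repeating the measurement, the observation above corresponds to watching the Gluster processes in batch-mode top, something like this (a sketch):

    # Sample every 2 seconds for a minute; note that per-process CPU%
    # can exceed 100% on a multi-core box, since it sums across threads.
    top -b -d 2 -n 30 | egrep 'gluster|Mem:'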
[Gluster-users] not sure how to troubleshoot SMB / CIFS overload when using GlusterFS
I'll try to keep it brief. I've been testing GlusterFS for the last month or so. My production setup will be more complex than what I'm listing below, but I've whittled things down to where the below setup will cause the problem to happen.

I'm running GlusterFS 3.2.2 on two CentOS 5.6 boxes in a replicated volume. I am connecting to it with a Windows Server 2008 R2 box over an SMB share. Basically, the web app portion runs locally on the Windows box, but content (e.g. HTML templates, images, CSS files, JS, etc.) is being pulled from the Gluster volume. (A sketch of this setup follows at the end of this message.)

I've performed a fair degree of load testing on the setup so far, scaling the load up to nearly four times what our normal production environment sees in primetime, and it seems to handle it fine. We run tens of thousands of websites, so it's pretty significant that it's able to handle that.

However, part of a different suite of tests is a Page of Death, which contains tens of thousands of image references on a single page. All I have to do is load that page for a few seconds, and it will grind my web server's SMB connection to a near-complete standstill. I can close the browser after just a few seconds, and it still takes several minutes for the web server to respond to any requests at all. Connecting to the share over Explorer is extremely slow from that same machine. (I can connect to that same share from another machine, which is an export of the same exact GlusterFS mount, and it is just fine. Similarly, accessing the Gluster mount on the Linux boxes shows zero problems at all; it's as happy to respond to requests as ever.) Even if I scale it out to a swath of web servers, loading that single page, one time, for just a few seconds will freeze every single web server, making every website on the system inaccessible.

You may be asking, why am I asking here instead of on a Samba group, or even a Windows group? Here's why: My control is a Windows file server that I can swap in Gluster's place, and I'm able to load that page without it blinking an eye (it actually becomes a test of the computer the browser is on). It does not affect any of the web servers in the slightest. My second control is that I have exported the raw Gluster data directory as an SMB share (with the same exact Samba configuration as the Gluster one), and it performs equally as well as the Windows file server. I can load the Page of Death with no consequence.

I've pushed io-threads all the way to the maximum 64 without any benefit. I can't see anything noteworthy in the Gluster or Samba logs, but perhaps I am not sure what to look for. Thank you to anybody who can point me in the right direction. I am hoping I don't have to dive into Wireshark or tcpdump territory, but I'm open if you can guide the way! ;)

Ken
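For reference, the setup described above boils down to something like the following (a sketch; hostnames, brick paths, and the volume and share names are all hypothetical, using the GlusterFS 3.2-era CLI):

    # On the CentOS boxes: a two-way replicated volume across two bricks.
    gluster volume create webcontent replica 2 \
        gluster1:/export/brick1 gluster2:/export/brick1
    gluster volume start webcontent

    # Mount it through the native (FUSE) client on a Linux box:
    mount -t glusterfs gluster1:/webcontent /mnt/gluster

    # Export the FUSE mount to the Windows web servers via Samba
    # (smb.conf):
    #
    #   [storage]
    #       path = /mnt/gluster
    #       read only = no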