[Gluster-users] GlusterFS performance, concurrency and I/O blocking

2011-08-23 Thread Ken Randall
Hi everybody!  Love this community, and I love GlusterFS.

All that, despite being burned by it, likely due to my own failures.  Here's
the scenario where I got burned, and my guesstimate of why it happened.

We run a popular .NET-based web app that gets a lot of traffic, where people
build websites using our system.  The long and short of it is, we tested,
tweaked, and tested some more over a full month.  After we deployed it to
production, we saw performance take a dive into the dumpster.  We had to
revert fairly quickly.

The obvious blame is in our testing.  We load tested the system many, many
times over the course of an entire month, but with a narrow range of test
scenarios.  The wide range of live production traffic proved to render our
testing moot.  We tucked our tail between our legs and are researching tools
that will let us play back live traffic to serve as a better simulation.  In
our earlier load-testing we were able to achieve many multiples of our peak
traffic, but again it wasn't realistic traffic.

Before I get to my suspicion of what's happening, keep in mind that we have
50+ million files (over hundreds of thousands of directories), most of them
are small, and each page request will pull in upwards of 10-40 supporting
assets (images, Flash files, CSS, JS, etc.).  We also have people executing
directory listings whenever they're editing their site, as they choose
images, etc. to insert onto the page.  We're also exporting the volume over
CIFS so our Windows servers can reach the GlusterFS client mount on the Linux
machines in the cluster.  The Samba settings there were tweaked to the
hilt as well: turning off case-insensitivity, bumping up caches and async
I/O, etc.

It appears as if GlusterFS has some kind of I/O blocking going on.  Whenever
a directory listing is being pieced together, it noticeably slows down (or
stops?) other operations through the same client.  For a high-concurrency
app like ours where the storage backend needs to be able to pull off 10 to
100 directory listings a second, and 5,000 to 10,000 IOPS overall, it's easy
to see how perf would degrade if my blocking suspicion is correct.  The
biggest culprit, by my guess, is the directory listing.  Executing one makes
things drag.  I've been able to demonstrate that through a simple script.
And we're running some pretty monster machines with 24 cores, 24 GB RAM,
etc.
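
To give a concrete sense of what I mean by a simple script, here is a rough
sketch of the comparison (the mount point and paths below are made up, and the
real script differs in its details):

    #!/bin/bash
    # Baseline: time a batch of small-file reads through the FUSE mount
    # with nothing else going on.
    time for i in $(seq 1 100); do
        cat /mnt/gluster/storage/somedir/small.txt > /dev/null
    done

    # Contended: kick off a listing of a very large directory in the
    # background, then repeat the same reads and compare elapsed times.
    ls -l /mnt/gluster/storage > /dev/null &
    time for i in $(seq 1 100); do
        cat /mnt/gluster/storage/somedir/small.txt > /dev/null
    done
    wait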

I tried as many tuning permutations as possible, only to run into the same
result.  Jacking up the cache-size, raising the io-thread-count to 64, etc.
certainly helped performance, but the blocking behavior persisted.  I also
made sure that each web server accessing the GlusterFS backend was talking
to its own GlusterFS client, in the hopes of increasing parallelization.
I'm sure it helped, but not enough.  It's nowhere close to the concurrency
and performance of a straight-out Windows share.  (I realize a clustered
file system carries overhead and will have less perf than a straight share,
but we saw performance drop as load increased, by an order of magnitude.)
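
For the record, that tuning was applied through Gluster's volume options,
roughly like this (the volume name below is a placeholder):

    # cache-size and io-thread-count as described above
    gluster volume set webcontent performance.cache-size 4096MB
    gluster volume set webcontent performance.io-thread-count 64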

Am I way off?  Does GlusterFS block on directory listings (getdents) or any
other operations?  If so, is there a way to enable the database equivalent
of "dirty reads" so it doesn't block?

Ken


Re: [Gluster-users] not sure how to troubleshoot SMB / CIFS overload when using GlusterFS

2011-07-18 Thread Ken Randall
Anand, Whit, and Joseph,

I appreciate your help very, very much.  Anand's assertion about Samba doing
string comparisons was spot on.  And Whit's suggestion to change the
smb.conf to make it be case sensitive did the trick.  I am also forcing
"default case = lower" and "preserve case = no" in smb.conf to make sure
everything stays lower-case going in.
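
For anyone who finds this thread later, the relevant bits of the share
definition now look roughly like this (the share name and path are
placeholders):

    [webcontent]
        path = /mnt/gluster/storage
        case sensitive = yes
        default case = lower
        preserve case = no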

With those changes in place, I believe I can hear a song in the back of my
head, "It's a whole new world..."

On the web app side we will be writing a request handler that will
automatically lower-case any requests coming in, so any referenced images
and files will work no matter the casing specified.  (I've noticed that not
many Linux hosted sites and SaaS platforms handle casing well, not sure
why.)

We will have some users complain about case sensitivity not being maintained
on their files, but I think that the huge win for us being able to use
GlusterFS is worth it.  There are no great Windows solutions for
ever-expandable storage, and we're well past the published limitations of
Windows DFS-R.  DFS-R is an amazing, refined piece of technology, but it is
a solution for a different kind of problem.

Thanks again, guys, I never would have navigated to this solution without
you.

Ken


Re: [Gluster-users] not sure how to troubleshoot SMB / CIFS overload when using GlusterFS

2011-07-18 Thread Ken Randall
Joseph,

Thank you for your response.  Yours, combined with Whit's, led me to come up
with a pretty solid repro case and to pinpoint what I think is going on.

I tried your additional SMB configuration settings, and was hopeful, but they
didn't alleviate the issue.  Your interpretation of the logs was helpful,
though.  It makes sense now that Samba was pounding on GlusterFS, doing its
string of getdents operations.

I also took your advice last night on stat-cache (I assumed that was on the
Gluster side, and enabled it there), though I wasn't sure where the
fast-lookups setting was.  That didn't seem to make a noticeable difference
either.
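
(In case the detail helps someone: I took "stat-cache" to mean Gluster's
stat-prefetch option, so what I enabled was along these lines -- the volume
name is a placeholder, and I may have picked the wrong knob:)

    gluster volume set webcontent performance.stat-prefetch on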

I think the lockups are happening as a result of being crippled by
GlusterFS's relatively slow directory listing (5x-10x slower generating a
dir listing than a raw SMB share), combined with FUSE's blocking readdir().
I'm not positive on that last point since there was only one mention of that
on the internet.   Am praying that somebody will see this and say, "oh yeah,
well sure, just change this one thing in FUSE and you're good to go!"
Somehow I don't think that's going to happen.  :)

Ken

On Sun, Jul 17, 2011 at 10:35 PM, Joe Landman <
land...@scalableinformatics.com> wrote:

> On 07/17/2011 11:19 PM, Ken Randall wrote:
>
>> Joe,
>>
>> Thank you for your response.  After seeing what you wrote, I bumped up
>> the performance.cache-size up to 4096MB, the max allowed, and ran into
>> the same wall.
>>
>
> Hmmm ...
>
>
>
>> I wouldn't think that any SMB caching would help in this case, since the
>> same Samba server on top of the raw Gluster data wasn't exhibiting any
>> trouble, or am I deceived?
>>
>
> Samba could cache better so it didn't have to hit Gluster so hard.
>
>
>  I haven't used strace before, but I ran it on the glusterfs process, and
>> saw a lot of:
>> epoll_wait(3, {{EPOLLIN, {u32=9, u64=9}}}, 257, 4294967295) = 1
>> readv(9, [{"\200\0\16,", 4}], 1)= 4
>> readv(9, [{"\0\n;\227\0\0\0\1", 8}], 1) = 8
>> readv(9,
>> [{"\0\0\0\0\0\0\0\0\0\0\0\0\0\**0\0\0\0\0\0\31\0\0\0\0\0\0\0\**1\0\0\0\0"...,
>> 3620}],
>> 1) = 1436
>> readv(9, 0xa90b1b8, 1)  = -1 EAGAIN (Resource
>> temporarily unavailable)
>>
>
> Interesting ... I am not sure why it's reporting an EAGAIN for readv, other
> than it can't fill the vector from the read.
>
>
>  And when I ran it on smbd, I saw a constant stream of this kind of
>> activity:
>> getdents(29, /* 25 entries */, 32768)   = 840
>> getdents(29, /* 25 entries */, 32768)   = 856
>> getdents(29, /* 25 entries */, 32768)   = 848
>> getdents(29, /* 24 entries */, 32768)   = 856
>> getdents(29, /* 25 entries */, 32768)   = 864
>> getdents(29, /* 24 entries */, 32768)   = 832
>> getdents(29, /* 25 entries */, 32768)   = 832
>> getdents(29, /* 24 entries */, 32768)   = 856
>> getdents(29, /* 25 entries */, 32768)   = 840
>> getdents(29, /* 24 entries */, 32768)   = 832
>> getdents(29, /* 25 entries */, 32768)   = 784
>> getdents(29, /* 25 entries */, 32768)   = 824
>> getdents(29, /* 25 entries */, 32768)   = 808
>> getdents(29, /* 25 entries */, 32768)   = 840
>> getdents(29, /* 25 entries */, 32768)   = 864
>> getdents(29, /* 25 entries */, 32768)   = 872
>> getdents(29, /* 25 entries */, 32768)   = 832
>> getdents(29, /* 24 entries */, 32768)   = 832
>> getdents(29, /* 25 entries */, 32768)   = 840
>> getdents(29, /* 25 entries */, 32768)   = 824
>> getdents(29, /* 25 entries */, 32768)   = 824
>> getdents(29, /* 24 entries */, 32768)   = 864
>> getdents(29, /* 25 entries */, 32768)   = 848
>> getdents(29, /* 24 entries */, 32768)   = 840
>>
>
> Get directory entries.  This is the stuff that NTFS is caching for its web
> server, and it appears Samba is not.
>
> Try
>
>aio read size = 32768
>csc policy = documents
>dfree cache time = 60
>directory name cache size = 10
>fake oplocks = yes
>getwd cache = yes
>level2 oplocks = yes
>max stat cache size = 16384
>
>
>>  That chunk would get repeated over and over again as fast as
>> the screen could scroll; only occasionally (every 5-10 seconds or so)
>> would you see anything that you'd normally expect to see, such as:
>> close(29)   = 0
>> stat("Storage/01", 0x7fff07dae870) = -1 ENOENT (No such file or directory)
>> write(23,
>> "\0\0\0#\377SMB24\0\0\300\**210A\310\0\0\0\0\0\0\0\0\0\0\**
>> 0\0\1\0d\233"...,

Re: [Gluster-users] not sure how to troubleshoot SMB / CIFS overload when using GlusterFS

2011-07-18 Thread Ken Randall
Whit,

Genius!

This morning I set out to remove as many variables as possible to whittle
down the repro case as much as possible.  I've become pretty good at
debugging memory dumps on the Windows side over the years, and even
inspected the web processes.  Nothing looked out of the ordinary there, just
a bunch of threads waiting to get file attribute data from the Gluster
share.

So then, to follow your lead, I reduced the Page of Death down from
thousands of images to just five.  I tried accessing the page, and boom,
everything's frozen for minutes.  Interesting.  So I reduced it to one
image, accessed the page, and boom, everything's dead instantly.  That one
image is a file that doesn't exist.

So now, knowing that GlusterFS is kicking into overdrive fretting about a
file it can't find, I decided to eliminate the web server altogether.  I
opened up Windows Explorer, and typed in a directory that didn't exist, and
sure enough, I'm unable to navigate through the share in another Explorer
window until it finally responds again a minute later.  I think the Page of
Death was exhibiting such a massive death (i.e. only able to respond again
upwards of five minutes later) because it was systematically trying to
access several files that weren't found, and each one it couldn't find caused
the SMB connection to hang for close to a minute.

I feel like this is a bit of major progress toward pinpointing the problem
for a possible resolution.  Here are some additional details that may help:

The GlusterFS directory in question, /storage, has about 80,000 subdirs in
it.  As such, I'm using ext4 to overcome the subdir limitations of ext3.
The non-existent image file that is able to cause everything to freeze
exists in a directory, /storage/thisdirdoesntexist/images/blah.gif, where
"thisdirdoesntexist" is in that storage directory along with those 80,000
real subdirs.  I know it's a pretty laborious thing for Gluster to piece
together a directory listing, and combined with Joseph's recognition of the
flood of "getdents", does it seem reasonable that Gluster or Samba is
freezing because it's for some reason generating a subdir listing of
/storage whenever it can't find one of its subdirs?

As another test, I accessed a file inside a non-existent subdir of a dir
that only has five subdirs, and nothing froze.

So the freezing seems to be a function of the number of subdirectories that
are siblings of the first part of the path that doesn't exist, if that makes
sense.  So in /this/is/a/long/path, if "is" doesn't exist, then Samba will
generate a list of subdirs under "/this".  And if "/this" has 100,000
immediate subdirs under it, then you're about to experience a world of hurt.
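
To put a number on that hypothesis, the cost per miss should be roughly the
cost of enumerating the offending parent directory over the FUSE mount, which
is easy to compare directly (mount point and directory names are made up):

    # What the miss appears to trigger: an enumeration of the parent dir.
    # Compare ~80,000 siblings against a handful of siblings.
    time ls -f /mnt/gluster/storage > /dev/null
    time ls -f /mnt/gluster/storage/smallparent > /dev/null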

I read somewhere that FUSE's implementation of readdir() is a blocking
operation.  If that's true, then the behavior above, plus FUSE's blocking
readdir(), are to blame.

And I am therefore up a creek.  It is not feasible to constrain the system to
only a few subdirs at any given level just to prevent the lockup.  Unless
somebody, after reading this novel, has some ideas for me to try.  =)  Any
magical ways to keep FUSE from blocking, or any trickery on Samba's side?

Ken



On Sun, Jul 17, 2011 at 10:29 PM, Whit Blauvelt
wrote:

> On Sun, Jul 17, 2011 at 10:19:00PM -0500, Ken Randall wrote:
>
> > (The no such file or directory part is expected since some of the image
> > references don't exist.)
>
> Wild guess on that: Gluster may work harder at files it doesn't find than
> files it finds. It's going to look on one side or the other of the
> replicated file at first, and if it finds the file deliver it. But if it
> doesn't find the file, wouldn't it then check the other side of the
> replicated storage to make sure this wasn't a replication error?
>
> Might be interesting to run a version of the test where all the images
> referenced do exist, to see if it's the missing files that are driving up
> the CPU cycles.
>
> Whit
>
>


Re: [Gluster-users] not sure how to troubleshoot SMB / CIFS overload when using GlusterFS

2011-07-17 Thread Ken Randall
Joe,

Thank you for your response.  After seeing what you wrote, I bumped the
performance.cache-size up to 4096MB, the max allowed, and ran into the same
wall.

I wouldn't think that any SMB caching would help in this case, since the
same Samba server on top of the raw Gluster data wasn't exhibiting any
trouble, or am I deceived?

I haven't used strace before, but I ran it on the glusterfs process, and saw
a lot of:
epoll_wait(3, {{EPOLLIN, {u32=9, u64=9}}}, 257, 4294967295) = 1
readv(9, [{"\200\0\16,", 4}], 1)= 4
readv(9, [{"\0\n;\227\0\0\0\1", 8}], 1) = 8
readv(9,
[{"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\31\0\0\0\0\0\0\0\1\0\0\0\0"...,
3620}], 1) = 1436
readv(9, 0xa90b1b8, 1)  = -1 EAGAIN (Resource temporarily
unavailable)
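
(For anyone wanting to reproduce the trace: attaching strace to the running
process looks roughly like this -- pgrep is just one way to pick the PID, and
you'd swap in smbd's PID to watch the Samba side:)

    # Attach to the running glusterfs client process; Ctrl-C to detach.
    strace -f -p "$(pgrep -xo glusterfs)"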

And when I ran it on smbd, I saw a constant stream of this kind of activity:
getdents(29, /* 25 entries */, 32768)   = 840
getdents(29, /* 25 entries */, 32768)   = 856
getdents(29, /* 25 entries */, 32768)   = 848
getdents(29, /* 24 entries */, 32768)   = 856
getdents(29, /* 25 entries */, 32768)   = 864
getdents(29, /* 24 entries */, 32768)   = 832
getdents(29, /* 25 entries */, 32768)   = 832
getdents(29, /* 24 entries */, 32768)   = 856
getdents(29, /* 25 entries */, 32768)   = 840
getdents(29, /* 24 entries */, 32768)   = 832
getdents(29, /* 25 entries */, 32768)   = 784
getdents(29, /* 25 entries */, 32768)   = 824
getdents(29, /* 25 entries */, 32768)   = 808
getdents(29, /* 25 entries */, 32768)   = 840
getdents(29, /* 25 entries */, 32768)   = 864
getdents(29, /* 25 entries */, 32768)   = 872
getdents(29, /* 25 entries */, 32768)   = 832
getdents(29, /* 24 entries */, 32768)   = 832
getdents(29, /* 25 entries */, 32768)   = 840
getdents(29, /* 25 entries */, 32768)   = 824
getdents(29, /* 25 entries */, 32768)   = 824
getdents(29, /* 24 entries */, 32768)   = 864
getdents(29, /* 25 entries */, 32768)   = 848
getdents(29, /* 24 entries */, 32768)   = 840

That chunk would get repeated over and over again as fast as the
screen could scroll; only occasionally (every 5-10 seconds or so) would you
see anything that you'd normally expect to see, such as:
close(29)   = 0
stat("Storage/01", 0x7fff07dae870) = -1 ENOENT (No such file or directory)
write(23,
"\0\0\0#\377SMB24\0\0\300\210A\310\0\0\0\0\0\0\0\0\0\0\0\0\1\0d\233"..., 39)
= 39
select(38, [5 20 23 27 30 31 35 36 37], [], NULL, {60, 0}) = 1 (in [23],
left {60, 0})
read(23, "\0\0\0x", 4)  = 4
read(23,
"\377SMB2\0\0\0\0\30\7\310\0\0\0\0\0\0\0\0\0\0\0\0\1\0\250P\273\0[8"...,
120) = 120
stat("Storage", {st_mode=S_IFDIR|0755, st_size=1581056, ...}) = 0
stat("Storage/011235", 0x7fff07dad470) = -1 ENOENT (No such file or
directory)
stat("Storage/011235", 0x7fff07dad470) = -1 ENOENT (No such file or
directory)
open("Storage", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 29
fcntl(29, F_SETFD, FD_CLOEXEC)  = 0

(The no such file or directory part is expected since some of the image
references don't exist.)

Ken


Re: [Gluster-users] not sure how to troubleshoot SMB / CIFS overload when using GlusterFS

2011-07-17 Thread Ken Randall
Thanks for the reply, Whit!

Perfectly reasonable first question.  The websites have user-generated
content (think CMS), where people could put in that kind of content.  The
likelihood of such a scenario is slim-to-none, but I'd rather not have that
kind of vulnerability in the first place.  And yes, we could also add in
validation and/or stripping of content that is outside the bounds of normal,
but the main reason I bring up this Page of Death scenario is that I worry
it may be indicative of a weakness in the system, one where a different kind
of load pattern could trigger the same kind of hang.

To answer the second question, running top on the Linux side during the Page
of Death (with nothing else running) I get a CPU % spike of anywhere between
80-110% on glusterfsd, and 20% on glusterfs, with close to 22GB of memory
free.  The machines are 16-core apiece, though.  On the Windows side there
is next to no effect on CPU, memory, or network utilization.

Ken

On Sun, Jul 17, 2011 at 8:06 PM, Whit Blauvelt
wrote:

> On Sun, Jul 17, 2011 at 07:56:57PM -0500, Ken Randall wrote:
>
> > However, part of a different suite of tests is a Page of Death, which
> > contains tens of thousands of image references on a single page.
>
> Off topic response: Is there ever in real production any page, anywhere,
> that contains tens of thousands of image references? I'm all for testing at
> the extreme, and capacity that goes far beyond what's needed for practical
> purposes. Is that what this is, or do you anticipate real-life Page o' Death
> scenarios?
>
> Closer to the topic: What's going on with the load on the various systems?
> On the Linux side, have you watched each of them with something like htop?
>
> Whit
>
>


[Gluster-users] not sure how to troubleshoot SMB / CIFS overload when using GlusterFS

2011-07-17 Thread Ken Randall
I'll try to keep it brief: I've been testing GlusterFS for the last month or
so.  My production setup will be more complex than what I'm listing below,
but I've whittled things down to where the below setup will cause the
problem to happen.

I'm running GlusterFS 3.2.2 on two CentOS 5.6 boxes in a replicated volume.
I am connecting to it with a Windows Server 2008 R2 box over an SMB share.
Basically, the web app portion runs locally on the Windows box, but content
(e.g. HTML templates, images, CSS files, JS, etc.) is being pulled from the
Gluster volume.
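
For context, the layout is essentially a two-brick replicated volume, mounted
with the native FUSE client and re-exported over Samba; a minimal sketch of
that setup (hostnames, brick paths, and the share name are placeholders, not
the real ones):

    # On the CentOS boxes (GlusterFS 3.2.x): create and start the volume
    gluster volume create webcontent replica 2 transport tcp \
        gfs1:/export/brick1 gfs2:/export/brick1
    gluster volume start webcontent

    # Mount the volume with the FUSE client on the box Samba runs on
    mount -t glusterfs gfs1:/webcontent /mnt/gluster

    # smb.conf share the Windows 2008 R2 box maps:
    #   [webcontent]
    #       path = /mnt/gluster
    #       read only = no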

I've performed a fair degree of load testing on the setup so far, scaling up
the load to nearly four times what our normal production environment sees in
primetime, and it seems to handle it fine.  We run tens of thousands of
websites, so it's pretty significant that it can handle that.

However, part of a different suite of tests is a Page of Death, which
contains tens of thousands of image references on a single page.  All I have
to do is load that page for a few seconds, and it will grind my web server's
SMB connection to a near complete standstill.  I can close the browser after
just a few seconds, and it still takes several minutes for the web server to
respond to any requests at all.  Connecting to the share over Explorer is
extremely slow from that same machine.  (I can connect to that same share
from another machine, which is an export of the same exact GlusterFS mount,
and it is just fine.  Similarly, accessing the Gluster mount on the Linux
boxes shows zero problems at all, it's as happy to respond to requests as
ever.)

Even if I scale it out to a swath of web servers, loading that single page,
one time, for just a few seconds will freeze every single web server, making
every website on the system inaccessible.

You may be asking, why am I asking here instead of on a Samba group, or even
a Windows group?  Here's why:  My control is that I have a Windows file
server that I can swap in Gluster's place, and I'm able to load that page
without it blinking an eye (it actually becomes a test of the computer that
the browser is on).  It does not affect any of the web servers in the
slightest.  My second control is that I have exported the raw Gluster data
directory as an SMB share (with the same exact Samba configuration as the
Gluster one), and it performs as well as the Windows file server.  I
can load the Page of Death with no consequence.

I've pushed IO-threads all the way to the maximum 64 without any benefit.  I
can't see anything noteworthy in the Gluster or Samba logs, but perhaps I
don't know what to look for.

Thank you to anybody who can point me in the right direction.  I am hoping I
don't have to dive into Wireshark or tcpdump territory, but I'm open if you
can guide the way!  ;)

Ken