Raghavendra and Raghavendra,

Thanks, I will enable tracing and reply with logs. I will also rebuild my test
bed to use simpler Apache configs. I appreciate your efforts. It’s good to
know we should expect things to “just work” as a starting point; that gives me
hope we can fix this here. To that end, you’ve already helped immeasurably.

From: Raghavendra Bhat <rab...@redhat.com>
Date: Wednesday, September 2, 2015 at 5:56 AM
To: "gluster-users@gluster.org" <gluster-users@gluster.org>, Christian Rice
<cr...@pandora.com>
Cc: Raghavendra Gowdappa <rgowd...@redhat.com>
Subject: Re: [Gluster-users] Gluster 3.6.3 performance.cache-size not working
as expected in some cases

On 09/02/2015 12:45 PM, Raghavendra Bhat wrote:
Hi Christian,

I have been working on this for a couple of days. I have not been able to
reproduce the issue. I will keep trying and get back to you in a day or two.

Regards,
Raghavendra Bhat


Hi Christian,

As per our tests (mine and those of Raghavendra G, in CC), we found that the
data was being served from cache. In fact, in our tests the data was being
served from the kernel page cache itself. So we dropped the kernel's data cache
(i.e. echo 1 > /proc/sys/vm/drop_caches) to see whether the read requests would
then reach glusterfs. We found that the read calls did reach glusterfs, and
that glusterfs served them from its own cache (i.e. the io-cache xlator). If
memory pressure on the system is high and the kernel is sending forgets to the
glusterfs client, then it is possible that the inodes are forgotten along with
the data cached within them.
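
The check described above can be sketched roughly as follows. This is only an
illustration, not the exact procedure we ran: the function name cold_read and
the DROP_CACHES_FILE override are hypothetical, and the real /proc file needs
root.

```shell
#!/bin/sh
# Sketch: drop the kernel page cache, then re-read a file so the read is
# no longer satisfied by the kernel cache on this client.
# DROP_CACHES_FILE is an illustrative override (assumption) so the function
# can be exercised without root; by default it is the real /proc file.
DROP_CACHES_FILE="${DROP_CACHES_FILE:-/proc/sys/vm/drop_caches}"

cold_read() {
    # $1 = path to a file on the glusterfs fuse mount
    sync                                      # flush dirty pages first
    echo 1 > "$DROP_CACHES_FILE" || return 1  # 1 = drop the page cache only
    cat "$1" > /dev/null                      # re-read after the drop
}
```

Running cold_read /path/to/file while watching the backend interface should
show whether the re-read is answered by glusterfs from its cache or fetched
from the brick again.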

Can you please enable the TRACE log level for the client and run your tests
(gluster volume set <volname> client-log-level TRACE)? Once your tests are
done, please send us the logs.

NOTE: TRACE logging is very verbose, so the client log file will grow much
faster than usual.
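
A minimal sketch of switching the level for a test run and back afterwards.
The fully qualified key diagnostics.client-log-level is used here; the
function name, the GLUSTER override and the DOCROOT default are assumptions
for illustration only.

```shell
#!/bin/sh
# Sketch: raise the client log level for a test run, then lower it again.
# VOLNAME defaults to the volume named in this thread.
VOLNAME="${VOLNAME:-DOCROOT}"
# GLUSTER is the CLI to invoke; set GLUSTER=echo for a dry run (assumption).
GLUSTER="${GLUSTER:-gluster}"

set_client_log_level() {
    # $1 = log level, e.g. TRACE before the test and INFO afterwards
    "$GLUSTER" volume set "$VOLNAME" diagnostics.client-log-level "$1"
}
```

For example: set_client_log_level TRACE, run the test, then
set_client_log_level INFO. The fuse client log is normally found under
/var/log/glusterfs/ on the client, named after the mount point.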

Regards,
Raghavendra Bhat

On 09/02/2015 12:45 AM, Christian Rice wrote:
This is still an issue for me. I don’t need anyone to tear the code apart, but
I’d be grateful if someone would even chime in and say “yeah, we’ve seen that
too.”

From: Christian Rice <cr...@pandora.com>
Date: Sunday, August 30, 2015 at 11:18 PM
To: "gluster-users@gluster.org" <gluster-users@gluster.org>
Subject: [Gluster-users] Gluster 3.6.3 performance.cache-size not working as
expected in some cases

I am confused about my caching problem.  I’ll try to keep this as 
straightforward as possible and include the basic details...

I have a sixteen node distributed volume, one brick per node, XFS isize=512, 
Debian 7/Wheezy, 32GB RAM minimally.  Every brick node is also a gluster 
client, and also importantly an HTTP server.  We use a back-end 1GbE network 
for gluster traffic (eth1).  There are a couple dozen gluster client-only 
systems accessing this volume, as well.

We had a severe hot spot on one brick due to an oft-requested file. Every time
any httpd process on any gluster client was asked to deliver the file, it was
physically fetching it (we could see this traffic using, say, ‘iftop -i
eth1’), so we thought to increase the volume cache timeout and cache size. We
set the following values for testing:

performance.cache-size: 16GB
performance.cache-refresh-timeout: 30
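
For reference, a sketch of how these two values would be applied with the
gluster CLI. The function name and the GLUSTER override are illustrative
assumptions; the volume name is the one shown in the volume info further down.

```shell
#!/bin/sh
# Sketch: apply the two io-cache settings listed above to the volume.
VOLNAME="${VOLNAME:-DOCROOT}"
# GLUSTER is the CLI to invoke; set GLUSTER=echo for a dry run (assumption).
GLUSTER="${GLUSTER:-gluster}"

apply_cache_settings() {
    # Stop at the first failure so a half-applied config is noticed.
    "$GLUSTER" volume set "$VOLNAME" performance.cache-size 16GB &&
    "$GLUSTER" volume set "$VOLNAME" performance.cache-refresh-timeout 30
}
```

Afterwards, gluster volume info DOCROOT should list both values under
"Options Reconfigured".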

This test was run from a node that didn’t have the requested file on the local 
brick:

while(true); do cat /path/to/file > /dev/null; done

and what had been very high traffic on the gluster backend network, delivering 
the data repeatedly to my requesting node, dropped to nothing visible.

I thought good, problem fixed.  Caching works.  My colleague had run a test 
early on to show this perf issue, so he ran it again to sign off.

His testing used curl, because all the real front end traffic is HTTP, and all
the gluster nodes are web servers, which are of course using the fuse mount to
access the document root. Even with our performance tuning, the traffic on the
gluster backend subnet was continuous and undiminished. I saw no evidence of
caching (again using ‘iftop -i eth1’, which showed a steady 75+% of line rate
on a 1GbE link).
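
To put a number on the difference between the two tests instead of eyeballing
iftop, something like the following could be used. It is a rough sketch: the
function names and the NET_STATS_ROOT override are hypothetical, and other
traffic on the interface during the run is counted as well.

```shell
#!/bin/sh
# Sketch: count bytes received on an interface while a command runs,
# using the kernel's per-interface counters under /sys/class/net.
# NET_STATS_ROOT is an illustrative override (assumption) for testing.
NET_STATS_ROOT="${NET_STATS_ROOT:-/sys/class/net}"

rx_bytes() {
    # $1 = interface name, e.g. eth1
    cat "$NET_STATS_ROOT/$1/statistics/rx_bytes"
}

rx_delta() {
    # $1 = interface, remaining arguments = command to run
    iface="$1"; shift
    before=$(rx_bytes "$iface")
    "$@" > /dev/null 2>&1        # discard the command's own output
    after=$(rx_bytes "$iface")
    echo $((after - before))     # bytes received while the command ran
}
```

For example, compare rx_delta eth1 cat /path/to/file with
rx_delta eth1 curl -s -o /dev/null http://localhost/path/to/file (the URL is a
placeholder). If the cache is working for both access paths, repeated runs of
either should report a delta far smaller than the file size.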

Does that make sense at all?  We had theorized that we wouldn’t get to use 
VFS/kernel page cache on any node except maybe the one which held the data in 
the local brick.  That’s what drove us to setting the gluster performance 
cache.  But it doesn’t seem to come into play with http access.


Volume info:
Volume Name: DOCROOT
Type: Distribute
Volume ID: 3aecd277-4d26-44cd-879d-cffbb1fec6ba
Status: Started
Number of Bricks: 16
Transport-type: tcp
Bricks:
<snipped list of bricks>
Options Reconfigured:
performance.cache-refresh-timeout: 30
performance.cache-size: 16GB

The net result of being overwhelmed by a hot spot is all the gluster client 
nodes lose access to the gluster volume—it becomes so busy it hangs.  When the 
traffic goes away (failing health checks by load balancers causes requests to 
be redirected elsewhere), the volume eventually unfreezes and life goes on.

I wish I could type ALL that into a google query and get a lucid answer :)

Regards,
Christian



_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users



