Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
Konstantin Ryabitsev wrote:
> On Fri, Sep 13, 2019 at 03:12:12AM +, Eric Wong wrote:
> > > No, it's just vanilla what comes with the source.
> >
> > OK, and Perl 5.16.3 from CentOS 7? (4:5.16.3-294.el7_6 RPM)
>
> Correct, though another thing that may help is the git version -- we
> install git2u from IUS, so the git version is not what is in standard
> CentOS7 (it's too ancient).
>
> # rpm -q git2u
> git2u-2.16.5-1.ius.centos7.x86_64
>
> See if you have more luck reproducing the problem when you get that in
> place. You should be able to find what you need by searching for "git2u
> ius centos".

Tried git2u, didn't help. I wouldn't expect external processes such as
nginx or git-http-backend to have any impact on leaks unless one is
slower/faster and triggers a race in public-inbox.

At this point I suspect it's timing-related in our code, but maybe not
something my hardware/VM is capable of reproducing. What level of
parallelism is grokmirror using? I am very curious if anybody else sees
it, too...

> My access over the next 10 days is going to be spotty due to travel, but
> I'll try to catch up at least once a day.

The /proc fdinfo of the eventpoll descriptor + lsof at the same time
would be helpful if you get a moment. But it's understandable if you
don't :>

> Thanks again for your help!

No problem. I've gone through the code again but nothing jumped out at
me. Will add some comments and dig into it again after I do some other
stuff. Enjoy your travels!

-- 
unsubscribe: meta+unsubscr...@public-inbox.org
archive: https://public-inbox.org/meta/
Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
On Fri, Sep 13, 2019 at 03:12:12AM +, Eric Wong wrote:
> > No, it's just vanilla what comes with the source.
>
> OK, and Perl 5.16.3 from CentOS 7? (4:5.16.3-294.el7_6 RPM)

Correct, though another thing that may help is the git version -- we
install git2u from IUS, so the git version is not what is in standard
CentOS7 (it's too ancient).

# rpm -q git2u
git2u-2.16.5-1.ius.centos7.x86_64

See if you have more luck reproducing the problem when you get that in
place. You should be able to find what you need by searching for "git2u
ius centos".

My access over the next 10 days is going to be spotty due to travel, but
I'll try to catch up at least once a day.

Thanks again for your help!

-K
Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
Eric Wong wrote:
> # unrelated note: there's 4 pipes from -W1 (the default),
> # but I think 2 can be closed, actually...

Nevermind on the extra pipes, I was running lsof on the master :x.
"[RFC 2/1] t/git-http-backend: disable worker processes" cuts it down
to a single process (no master/worker distinction), so lsof is seeing
the right pid.

> GIANT_GIT_DIR=/path/to/git.git MANY_CLONE=1 prove -lv t/git-http-backend.t

And now no pipes are visible at all, so either that test can't trigger
it, or neither of the systems I'm testing on can trigger it...
Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
Eric Wong wrote:
> Konstantin Ryabitsev wrote:
> > I sent you a dump of lsof -p of all 4 processes after about 20 minutes
> > of running. For another data point, the daemon was running in
> > SELinux-permissive mode, to make sure unlinks aren't failing because of
> > any permission errors.

Oh, another curiosity: it would be good to see
/proc/$pid/fdinfo/$fd_of_eventpoll to know what state the pipes
registered in the epoll descriptor are in, and to tie those to the lsof
output. It won't help with deleted regular files, though...
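A minimal sketch of how to capture that, assuming a Linux /proc; the pid argument is a placeholder for the -httpd worker pid:

```shell
#!/bin/sh
# Sketch (assumes Linux /proc): locate a process's eventpoll descriptor
# and dump its fdinfo.  The "tfd:" lines there name the fds registered
# in the epoll set, which can be matched against lsof's FD column
# captured at the same moment.
pid=${1:-$$}
nfds=0
ep=
for link in /proc/"$pid"/fd/*; do
	nfds=$((nfds + 1))
	if [ "$(readlink "$link")" = 'anon_inode:[eventpoll]' ]; then
		ep=${link##*/}
	fi
done
echo "$nfds open fds"
if [ -n "$ep" ]; then
	# lines look like: tfd: <fd> events: <mask> data: <ptr> ...
	cat /proc/"$pid"/fdinfo/"$ep"
fi
```

Running `lsof -p $pid` immediately afterwards and matching each tfd number against lsof's FD column would show which pipes epoll still knows about.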
Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
Konstantin Ryabitsev wrote:
> On Thu, Sep 12, 2019 at 08:35:03AM +, Eric Wong wrote:
> > Eric Wong wrote:
> > > One more thing, are you running any extra middlewares in the
> > > .psgi file? Thanks.
>
> No, it's just vanilla what comes with the source.

OK, and Perl 5.16.3 from CentOS 7? (4:5.16.3-294.el7_6 RPM)

> > That's probably not it, I suspected the non-fileno path was
> > being hit, but I just tested a debug change on top of
> > b7cfd5aeff4b6b316b61b327af9c144776d77225 (branch: "unlink")
> > ("tmpfile: support O_APPEND and use it in DS::tmpio")
> > to fake the presence of a middleware wrapping psgi.input.
>
> I sent you a dump of lsof -p of all 4 processes after about 20 minutes
> of running. For another data point, the daemon was running in
> SELinux-permissive mode, to make sure unlinks aren't failing because of
> any permission errors.

It looks like there's a Perl reference leak (cycle) of some sort holding
on to FDs, since you have lots of input files and pipes, yet only one
established IPv4 connection. And the inodes encoded into the filenames
don't point to the connected socket, even.

However, I'm not able to reproduce it on my CentOS 7 VM, which has nginx
1.12.2. I don't think nginx is a factor in this since public-inbox-httpd
is clearly not holding TCP sockets open, even.

Not at all familiar with SELinux, but I'm just using the defaults CentOS
comes with and running both nginx + public-inbox-httpd as a regular
user.

That "if (0..." GitHTTPBackend patch definitely isn't needed for testing
anymore and only makes FD exhaustion happen sooner.

> Let me know if you would like any further info.

If there's a reference leak somewhere, this could also be part of the
high memory use you showed us a few months ago. Dunno if you had many
FDs back then.

I could see about adding explicit close() calls in a few places, but
that would make a corresponding memory leak harder-to-notice, even.
I pushed out two patches to the "unlink" branch which may be able to
reproduce the issue on your end (I see nothing out of the ordinary on my
slow CentOS 7 VM or Debian machines/VMs):

* [PATCH] t/httpd-corner: check for leaking FDs and pipes
* [RFC] t/git-http-backend: add MANY_CLONE test

# no pipes should be present in -httpd with -W0
prove -lv t/httpd-corner.t

# unrelated note: there's 4 pipes from -W1 (the default),
# but I think 2 can be closed, actually...
GIANT_GIT_DIR=/path/to/git.git MANY_CLONE=1 prove -lv t/git-http-backend.t

If those updated test cases can't reproduce the problem, can you
reproduce this on erol or any other machines? Perhaps with a different
Perl? Thanks.
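For spot-checking outside the test suite, a sketch (assuming Linux /proc; the pid is a placeholder for the -httpd process) that counts the pipe fds a process holds, so growth can be watched across repeated clones:

```shell
#!/bin/sh
# Sketch: count pipe fds held by a process.  With -W0 and no requests
# in flight, the count for -httpd would be expected to stay at zero.
pid=${1:-$$}
npipes=0
for link in /proc/"$pid"/fd/*; do
	case $(readlink "$link") in
	'pipe:['*) npipes=$((npipes + 1)) ;;
	esac
done
echo "$npipes pipe(s) held by pid $pid"
```

Re-running it after each batch of clones and comparing the counts is a cruder version of what the lsof dumps in this thread show.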
Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
On Thu, Sep 12, 2019 at 08:35:03AM +, Eric Wong wrote:
> Eric Wong wrote:
> > One more thing, are you running any extra middlewares in the
> > .psgi file? Thanks.

No, it's just vanilla what comes with the source.

> That's probably not it, I suspected the non-fileno path was
> being hit, but I just tested a debug change on top of
> b7cfd5aeff4b6b316b61b327af9c144776d77225 (branch: "unlink")
> ("tmpfile: support O_APPEND and use it in DS::tmpio")
> to fake the presence of a middleware wrapping psgi.input.

I sent you a dump of lsof -p of all 4 processes after about 20 minutes
of running. For another data point, the daemon was running in
SELinux-permissive mode, to make sure unlinks aren't failing because of
any permission errors.

Let me know if you would like any further info.

Best,
-K
Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
Eric Wong wrote:
> One more thing, are you running any extra middlewares in the
> .psgi file? Thanks.

That's probably not it, I suspected the non-fileno path was being hit,
but I just tested a debug change on top of
b7cfd5aeff4b6b316b61b327af9c144776d77225 (branch: "unlink")
("tmpfile: support O_APPEND and use it in DS::tmpio")
to fake the presence of a middleware wrapping psgi.input.

I've also pushed out an "unlink-fix" branch on top of
f4f0a3be0864721d90f9557ffe1c513b0289a74b which only features one bugfix
which wouldn't affect you, as well as the tmpfile change to give more
meaningful names:

  solvergit: ensure Qspawn usage doesn't unref update-index stdin
  tmpfile: give temporary files meaningful names

Note: if reusing an existing working tree, be sure to rerun
"perl Makefile.PL", since new .pm files won't get picked up otherwise.

# force creation of extra temporary file for psgi.input
# in case some middleware overrides the psgi.input we
# set in HTTP.pm:
8<--
diff --git a/lib/PublicInbox/GitHTTPBackend.pm b/lib/PublicInbox/GitHTTPBackend.pm
index a8337035..2eaf221e 100644
--- a/lib/PublicInbox/GitHTTPBackend.pm
+++ b/lib/PublicInbox/GitHTTPBackend.pm
@@ -216,7 +216,7 @@ sub input_prepare {
 	my $input = $env->{'psgi.input'};
 	my $fd = eval { fileno($input) };
-	if (defined $fd && $fd >= 0) {
+	if (0 && defined $fd && $fd >= 0) {
 		return { 0 => $fd };
 	}
 	my $id = "git-http.input.$env->{REMOTE_HOST}:$env->{REMOTE_PORT}";
Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
One more thing, are you running any extra middlewares in the
.psgi file? Thanks.
Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
Konstantin Ryabitsev wrote:
> On Wed, Sep 11, 2019 at 05:12:50PM +, Eric Wong wrote:
> > Konstantin Ryabitsev wrote:
> > > To give some more data points, downgrading to f4f0a3be still shows a
> > > number of /tmp/PerlIO* (deleted) entries, but the number of pipes stays
> > > the same over time. If I switch to the latest master, the number of
> > > broken pipes grows steadily following each git pull (in the hundreds
> > > after 10 minutes of running).
> >
> > Thanks for that info, did those deleted entries eventually go
> > away (perhaps after several minutes)?
>
> They don't appear to go away -- since 10:23 UTC earlier today, they
> accumulated over 2,400 entries:
>
> # ls -al /proc/{2103,2104,2105,2106}/fd | grep deleted | wc -l
> 2427
> # ls -al /proc/{2103,2104,2105,2106}/fd | grep pipe | wc -l
> 26
>
> Curiously, I also have this datapoint that may or may not be making
> things more confusing. :)
>
> # ls -al /proc/{2103,2104,2105,2106}/fd | grep deleted | awk '{print $8}' | sort | uniq -c
>     695 10:24
>     356 10:27
>     843 14:45
>     175 14:46
>       6 14:50
>     372 17:19
>       4 17:20
>      11 17:23
>
> So, they appear to show up in chunks and hang around together.

Any other cronjobs or events which may fire around then? Anything in
nginx error logs?

I've never seen anything like this. It should not matter, as HTTP
should be HTTP; but I've always had -httpd talking to varnish (mainly
out of laziness -- I've never used nginx with public-inbox myself --
but also for ease of documentation, since the varnish-4.vcl example
teaches Varnish to only cache GET and HEAD requests).

> When setting up lore, I also considered serving git via git-http-backend
> instead of passing it via public-inbox-httpd -- the only reason we
> didn't try that was because it complicated SELinux bits a bit. Can you
> think of a reason why we shouldn't serve git requests directly via
> git-http-backend?
I didn't realize nginx added the ability to spawn CGI processes (but I
haven't really looked at nginx this decade).

public-inbox-httpd has the qspawn/limiter stuff and the same 32-process
default limit as git-daemon. I actually limit it to 1-2 processes
depending on the size of the inbox, given the tiny VPS which powers
public-inbox.org.

It's all documented in the -config manpage ("NAMED LIMITER" /
"httpbackendmax"). My favorite feature is being able to configure
different inboxes for different queues, so clone requests for large
inboxes won't starve requests to small inboxes.

My goal is to allow -httpd to do HTTPS and proxy to varnish soonish
(eliminating the need for nginx or similar), and eventually eliminate
the need for varnish entirely.

> Thanks for your help.

No problem; thanks for the extra info. Will try to finish my current
(non-public) tasks so I can dig into this ASAP.
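For reference, a sketch of what a named limiter looks like in the public-inbox config file, based on my reading of the public-inbox-config(5) manpage; the inbox and limiter names here are made up:

```ini
; named limiter: at most 2 concurrent git-http-backend processes
; for any inbox pointing at it
[publicinboxlimiter "big"]
	max = 2

[publicinbox "bigml"]
	; ...address/inboxdir/url as usual...
	; reference the named limiter above
	httpbackendmax = big

[publicinbox "smallml"]
	; a plain number is a per-inbox process cap instead
	httpbackendmax = 8
```

This is how clone traffic for a huge inbox can be throttled separately from the small ones sharing the same -httpd.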
Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
On Wed, Sep 11, 2019 at 05:12:50PM +, Eric Wong wrote:
> Konstantin Ryabitsev wrote:
> > To give some more data points, downgrading to f4f0a3be still shows a
> > number of /tmp/PerlIO* (deleted) entries, but the number of pipes stays
> > the same over time. If I switch to the latest master, the number of
> > broken pipes grows steadily following each git pull (in the hundreds
> > after 10 minutes of running).
>
> Thanks for that info, did those deleted entries eventually go
> away (perhaps after several minutes)?

They don't appear to go away -- since 10:23 UTC earlier today, they
accumulated over 2,400 entries:

# ls -al /proc/{2103,2104,2105,2106}/fd | grep deleted | wc -l
2427
# ls -al /proc/{2103,2104,2105,2106}/fd | grep pipe | wc -l
26

Curiously, I also have this datapoint that may or may not be making
things more confusing. :)

# ls -al /proc/{2103,2104,2105,2106}/fd | grep deleted | awk '{print $8}' | sort | uniq -c
    695 10:24
    356 10:27
    843 14:45
    175 14:46
      6 14:50
    372 17:19
      4 17:20
     11 17:23

So, they appear to show up in chunks and hang around together.

For nginx configs, this is the relevant part:

server {
    listen *:80 default_server;
    proxy_buffering off;

    location ~ ^(.*/(HEAD|info/refs|objects/info/[^/]+|git-(upload|receive)-pack))$ {
        proxy_pass            http://localhost:8080;
        proxy_read_timeout    90;
        proxy_connect_timeout 90;
        proxy_redirect        off;
        proxy_set_header      Host $host;
        proxy_set_header      X-Real-IP $remote_addr;
        proxy_set_header      X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header      X-Forwarded-Proto $http_x_forwarded_proto;
        proxy_set_header      X-Forwarded-Port $server_port;
    }
}

When setting up lore, I also considered serving git via git-http-backend
instead of passing it via public-inbox-httpd -- the only reason we
didn't try that was because it complicated SELinux bits a bit. Can you
think of a reason why we shouldn't serve git requests directly via
git-http-backend?

Thanks for your help.

-K
Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
Konstantin Ryabitsev wrote:
> To give some more data points, downgrading to f4f0a3be still shows a
> number of /tmp/PerlIO* (deleted) entries, but the number of pipes stays
> the same over time. If I switch to the latest master, the number of
> broken pipes grows steadily following each git pull (in the hundreds
> after 10 minutes of running).

Thanks for that info, did those deleted entries eventually go away
(perhaps after several minutes)?

Kinda difficult to tie those to the corresponding TCP connection atm,
though. I should probably put the TCP socket inode number into the
pre-unlink tempfile name to make that relationship traceable.

> I've downgraded lore.kernel.org to f4f0a3be for now to avoid the
> resource exhaustion.

I'll try to reproduce it later today or this week.
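A sketch of why the socket inode would make that traceable (assumes Linux /proc; the pid is illustrative): the target of a socket fd link under /proc/$pid/fd is `socket:[<inode>]`, and the same inode number appears in lsof and /proc/net/tcp for the connection, so a tempfile named after it could be tied back to its TCP peer:

```shell
#!/bin/sh
# Sketch: list each socket fd of a process with its inode number.
# That inode is what lsof and /proc/net/tcp report for the same
# connection, so it uniquely identifies the TCP socket.
pid=${1:-$$}
for link in /proc/"$pid"/fd/*; do
	tgt=$(readlink "$link")
	case $tgt in
	'socket:['*)
		ino=${tgt#*\[}	# strip leading "socket:["
		ino=${ino%\]}	# strip trailing "]"
		echo "fd ${link##*/} -> socket inode $ino"
		;;
	esac
done
```

Grepping /proc/net/tcp (or `lsof | grep $ino`) for that inode then reveals the local and remote addresses of the connection a tempfile belonged to.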
Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
On Wed, Sep 11, 2019 at 02:22:15AM +, Eric Wong wrote:
> Btw, some of the changes in public-inbox to use EPOLLET /
> EPOLLONESHOT would make it more sensitive to kernel bugs,
> and missed wakeups could cause resource exhaustion...
>
> I just noticed this now:
> https://lore.kernel.org/lkml/20190904175303.918854...@linuxfoundation.org/
>
> And I remember I found another mm-compaction-related issue
> several years back (early 3.x) which caused epoll wakeups to be
> missed...

To give some more data points, downgrading to f4f0a3be still shows a
number of /tmp/PerlIO* (deleted) entries, but the number of pipes stays
the same over time. If I switch to the latest master, the number of
broken pipes grows steadily following each git pull (in the hundreds
after 10 minutes of running).

I've downgraded lore.kernel.org to f4f0a3be for now to avoid the
resource exhaustion.

-K