Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]

2019-09-13 Thread Eric Wong
Konstantin Ryabitsev  wrote:
> On Fri, Sep 13, 2019 at 03:12:12AM +, Eric Wong wrote:
> > > No, it's just vanilla what comes with the source.
> > 
> > OK, and Perl 5.16.3 from CentOS 7?  (4:5.16.3-294.el7_6 RPM)
> 
> Correct, though another thing that may help is the git version -- we
> install git2u from IUS, so the git version is not what is in standard
> CentOS7 (it's too ancient).
> 
> # rpm -q git2u
> git2u-2.16.5-1.ius.centos7.x86_64
> 
> See if you have more luck reproducing the problem when you get that in
> place. You should be able to find what you need by searching for "git2u
> ius centos".

Tried git2u, didn't help.  I wouldn't expect external processes
such as nginx or git-http-backend to have any impact on leaks
unless one is slower/faster and triggers a race in public-inbox.

At this point I suspect it's timing-related in our code but
maybe not something my hardware/VM is capable of reproducing.

What level of parallelism is grokmirror using?

I am very curious if anybody else sees it, too...

> My access over the next 10 days is going to be spotty due to travel, but
> I'll try to catch up at least once a day.

The /proc fdinfo of the eventpoll descriptor, plus lsof output
captured at the same time, would be helpful if you get a moment.
But it's understandable if you don't :>
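
In case it helps, here's a rough sketch of the kind of capture I
mean (illustration only, not anything from the tree; assumes a
Linux /proc layout and takes the worker PID as the argument):

  # dump epoll fdinfo and open files for one -httpd worker,
  # at (nearly) the same moment
  use strict;
  use warnings;
  my $pid = shift @ARGV or die "usage: $0 PID\n";
  for my $fd (glob("/proc/$pid/fd/*")) {
      my $target = readlink($fd) or next;
      next if $target ne 'anon_inode:[eventpoll]';
      my ($n) = ($fd =~ m!/(\d+)\z!);
      print "== /proc/$pid/fdinfo/$n ==\n";
      open(my $fh, '<', "/proc/$pid/fdinfo/$n") or die "open: $!";
      print while <$fh>;
  }
  system('lsof', '-p', $pid);  # open files, right after the epoll dump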

> Thanks again for your help!

No problem.  I've gone through the code again but nothing jumped
out at me.  Will add some comments and dig into it again after
I do some other stuff.

Enjoy your travels!



Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]

2019-09-13 Thread Konstantin Ryabitsev
On Fri, Sep 13, 2019 at 03:12:12AM +, Eric Wong wrote:
> > No, it's just vanilla what comes with the source.
> 
> OK, and Perl 5.16.3 from CentOS 7?  (4:5.16.3-294.el7_6 RPM)

Correct, though another thing that may help is the git version -- we
install git2u from IUS, so the git version is not what is in standard
CentOS7 (it's too ancient).

# rpm -q git2u
git2u-2.16.5-1.ius.centos7.x86_64

See if you have more luck reproducing the problem when you get that in
place. You should be able to find what you need by searching for "git2u
ius centos".

My access over the next 10 days is going to be spotty due to travel, but
I'll try to catch up at least once a day.

Thanks again for your help!

-K



Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]

2019-09-13 Thread Eric Wong
Eric Wong  wrote:
> # unrelated note: there's 4 pipes from -W1 (the default),
> # but I think 2 can be closed, actually...

Nevermind on the extra pipes, I was running lsof on the master :x.
"[RFC 2/1] t/git-http-backend: disable worker processes"
cuts it down to a single process (no master/worker distinction)
so lsof is seeing the right pid.

> GIANT_GIT_DIR=/path/to/git.git MANY_CLONE=1 prove -lv t/git-http-backend.t

And now no pipes are visible at all, so either that test can't
trigger it, or neither of the systems I'm testing on can trigger
it...



Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]

2019-09-13 Thread Eric Wong
Eric Wong  wrote:
> Konstantin Ryabitsev  wrote:
> > I sent you a dump of lsof -p of all 4 processes after about 20 minutes
> > of running. For another data point, the daemon was running in
> > SELinux-permissive mode, to make sure unlinks aren't failing because of
> > any permission errors.

Oh, one more thing I'm curious about.  It would be good to see
/proc/$pid/fdinfo/$fd_of_eventpoll to learn what state the pipes
registered in the epoll descriptor are in, and to tie those to
the lsof output.  It won't help with deleted regular files, though...



Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]

2019-09-12 Thread Eric Wong
Konstantin Ryabitsev  wrote:
> On Thu, Sep 12, 2019 at 08:35:03AM +, Eric Wong wrote:
> > Eric Wong  wrote:
> > > One more thing, are you running any extra middlewares in the
> > > .psgi file?  Thanks.
> 
> No, it's just vanilla what comes with the source.

OK, and Perl 5.16.3 from CentOS 7?  (4:5.16.3-294.el7_6 RPM)

> > That's probably not it, I suspected the non-fileno path was
> > being hit, but I just tested a debug change on top of
> > b7cfd5aeff4b6b316b61b327af9c144776d77225 (branch: "unlink")
> > ("tmpfile: support O_APPEND and use it in DS::tmpio")
> > to fake the presence of a middleware wrapping psgi.input.
> 
> I sent you a dump of lsof -p of all 4 processes after about 20 minutes
> of running. For another data point, the daemon was running in
> SELinux-permissive mode, to make sure unlinks aren't failing because of
> any permission errors.

It looks like there's a Perl reference leak (a cycle) of some
sort holding on to FDs, since you have lots of input files and
pipes, yet only one established IPv4 connection.  And the inodes
encoded into the filenames don't even point to the connected
socket.
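
To illustrate what I mean (a toy example only, nothing from our
code), a cycle like the following is enough to keep an unlinked
tempfile's FD open indefinitely:

  # toy example of the failure mode (not public-inbox code):
  use strict;
  use warnings;
  use File::Temp qw(tempfile);

  sub leaky_client {
      my ($fh, $name) = tempfile();
      unlink($name);              # now shows as "(deleted)" in /proc/$$/fd
      my $client = { input => $fh };
      $client->{self} = $client;  # cycle: refcount never reaches zero
      return;                     # $client leaks, keeping $fh open
  }
  leaky_client() for (1..3);
  my @fds = glob("/proc/$$/fd/*");
  print scalar(@fds), " FDs open; the 3 deleted tempfiles are still there\n";

Devel::Cycle can sometimes spot cycles like that, but only if you
already know which object to hand it.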

However, I'm not able to reproduce it on my CentOS 7 VM which
has nginx 1.12.2.  I don't think nginx is a factor in this
since public-inbox-httpd is clearly not holding TCP sockets
open, even.

Not at all familiar with SELinux, but I'm just using the
defaults CentOS comes with and running both nginx +
public-inbox-httpd as a regular user.

That "if (0..." GitHTTPBackend patch definitely isn't needed
for testing anymore and only makes FD exhaustion happen sooner.

> Let me know if you would like any further info.

If there's a reference leak somewhere, this could also be part
of the high memory use you showed us a few months ago.  Dunno
if you had many FDs back then.

I could see about adding explicit close() calls in a few
places, but that would make a corresponding memory leak even
harder to notice.

I pushed out two patches to the "unlink" branch which may help
reproduce the issue on your end (I see nothing out of the
ordinary on my slow CentOS 7 VM or Debian machines/VMs):

* [PATCH] t/httpd-corner: check for leaking FDs and pipes
* [RFC] t/git-http-backend: add MANY_CLONE test

# no pipes should be present in -httpd with -W0
prove -lv t/httpd-corner.t

# unrelated note: there's 4 pipes from -W1 (the default),
# but I think 2 can be closed, actually...
GIANT_GIT_DIR=/path/to/git.git MANY_CLONE=1 prove -lv t/git-http-backend.t

If those updated test cases can't reproduce the problem,
can you reproduce this on erol or any other machines?
perhaps with a different Perl?

Thanks.



Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]

2019-09-12 Thread Konstantin Ryabitsev
On Thu, Sep 12, 2019 at 08:35:03AM +, Eric Wong wrote:
> Eric Wong  wrote:
> > One more thing, are you running any extra middlewares in the
> > .psgi file?  Thanks.

No, it's just vanilla what comes with the source.

> That's probably not it, I suspected the non-fileno path was
> being hit, but I just tested a debug change on top of
> b7cfd5aeff4b6b316b61b327af9c144776d77225 (branch: "unlink")
> ("tmpfile: support O_APPEND and use it in DS::tmpio")
> to fake the presence of a middleware wrapping psgi.input.

I sent you a dump of lsof -p of all 4 processes after about 20 minutes
of running. For another data point, the daemon was running in
SELinux-permissive mode, to make sure unlinks aren't failing because of
any permission errors.

Let me know if you would like any further info.

Best,
-K



Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]

2019-09-12 Thread Eric Wong
Eric Wong  wrote:
> One more thing, are you running any extra middlewares in the
> .psgi file?  Thanks.

That's probably not it, I suspected the non-fileno path was
being hit, but I just tested a debug change on top of
b7cfd5aeff4b6b316b61b327af9c144776d77225 (branch: "unlink")
("tmpfile: support O_APPEND and use it in DS::tmpio")
to fake the presence of a middleware wrapping psgi.input.



I've also pushed out an "unlink-fix" branch on top of
f4f0a3be0864721d90f9557ffe1c513b0289a74b which features only one
bugfix (one that wouldn't affect you), as well as the tmpfile
change to give more meaningful names:

  solvergit: ensure Qspawn usage doesn't unref update-index stdin
  tmpfile: give temporary files meaningful names

Note: if reusing an existing working tree, be sure to rerun
"perl Makefile.PL", since new .pm files won't get picked up
otherwise.

# force creation of extra temporary file for psgi.input
# in case some middleware overrides the psgi.input we
# set in HTTP.pm:
8<--
diff --git a/lib/PublicInbox/GitHTTPBackend.pm b/lib/PublicInbox/GitHTTPBackend.pm
index a8337035..2eaf221e 100644
--- a/lib/PublicInbox/GitHTTPBackend.pm
+++ b/lib/PublicInbox/GitHTTPBackend.pm
@@ -216,7 +216,7 @@ sub input_prepare {
 
my $input = $env->{'psgi.input'};
my $fd = eval { fileno($input) };
-   if (defined $fd && $fd >= 0) {
+   if (0 && defined $fd && $fd >= 0) {
return { 0 => $fd };
}
my $id = "git-http.input.$env->{REMOTE_HOST}:$env->{REMOTE_PORT}";



Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]

2019-09-11 Thread Eric Wong
One more thing, are you running any extra middlewares in the
.psgi file?  Thanks.



Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]

2019-09-11 Thread Eric Wong
Konstantin Ryabitsev  wrote:
> On Wed, Sep 11, 2019 at 05:12:50PM +, Eric Wong wrote:
> > Konstantin Ryabitsev  wrote:
> > > To give some more data points, downgrading to f4f0a3be still shows a
> > > number of /tmp/PerlIO* (deleted) entries, but the number of pipes stays
> > > the same over time. If I switch to the latest master, the number of
> > > broken pipes grows steadily following each git pull (in the hundreds
> > > after 10 minutes of running).
> > 
> > Thanks for that info, did those deleted entries eventually go
> > away (perhaps after several minutes)?
> 
> They don't appear to go away -- since 10:23 UTC earlier today, they
> accumulated over 2,400 entries:
> 
> # ls -al /proc/{2103,2104,2105,2106}/fd | grep deleted | wc -l
> 2427
> # ls -al /proc/{2103,2104,2105,2106}/fd | grep pipe | wc -l
> 26
> 
> Curiously, I also have this datapoint that may or may not be making
> things more confusing. :)
> 
> # ls -al /proc/{2103,2104,2105,2106}/fd | grep deleted | awk '{print $8}' | sort | uniq -c
> 695 10:24
> 356 10:27
> 843 14:45
> 175 14:46
>   6 14:50
> 372 17:19
>   4 17:20
>  11 17:23
> 
> So, they appear to show up in chunks and hang around together.

Any other cronjobs or events which may fire around then?

Anything in nginx error logs?

I've never seen anything like this.  It shouldn't matter, since
HTTP is HTTP; but I've always had -httpd talking to varnish and
have never used nginx with public-inbox myself (mainly out of
laziness, but also for ease of documentation, since the
varnish-4.vcl example teaches Varnish to only cache GET and HEAD
requests).

> When setting up lore, I also considered serving git via git-http-backend
> instead of passing it via public-inbox-httpd -- the only reason we
> didn't try that was because it complicated SELinux bits a bit. Can you
> think of a reason why we shouldn't serve git requests directly via
> git-http-backend?

I didn't realize nginx had added the ability to spawn CGI
processes (but I haven't really looked at nginx this decade).

public-inbox-httpd has the qspawn/limiter stuff and the same
32-process default limit as git-daemon.  I actually limit it to
1-2 processes depending on the size of the inbox, given the tiny
VPS which powers public-inbox.org.  It's all documented in the -config
manpage ("NAMED LIMITER"/"httpbackendmax").  My favorite feature
is being able to configure different inboxes for different queues,
so clone requests for large inboxes won't starve requests to
small inboxes.
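
For example (if I recall the key names correctly; the inbox and
limiter names below are made up), something like this in
~/.public-inbox/config gives a big inbox its own 2-process queue
while everything else shares the default:

  [publicinboxlimiter "bigclones"]
          max = 2

  [publicinbox "bigbox"]
          ; ...the usual address/url/path keys for the inbox...
          httpbackendmax = bigclones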

My goal is to allow -httpd to do HTTPS and proxy to varnish
soonish (eliminating the need for nginx or similar),
and eventually eliminate the need for varnish entirely.

> Thanks for your help.

No problem; thanks for the extra info.  Will try to finish my
current (non-public) tasks so I can dig into this ASAP.
--
unsubscribe: meta+unsubscr...@public-inbox.org
archive: https://public-inbox.org/meta/



Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]

2019-09-11 Thread Konstantin Ryabitsev
On Wed, Sep 11, 2019 at 05:12:50PM +, Eric Wong wrote:
> Konstantin Ryabitsev  wrote:
> > To give some more data points, downgrading to f4f0a3be still shows a
> > number of /tmp/PerlIO* (deleted) entries, but the number of pipes stays
> > the same over time. If I switch to the latest master, the number of
> > broken pipes grows steadily following each git pull (in the hundreds
> > after 10 minutes of running).
> 
> Thanks for that info, did those deleted entries eventually go
> away (perhaps after several minutes)?

They don't appear to go away -- since 10:23 UTC earlier today, they
accumulated over 2,400 entries:

# ls -al /proc/{2103,2104,2105,2106}/fd | grep deleted | wc -l
2427
# ls -al /proc/{2103,2104,2105,2106}/fd | grep pipe | wc -l
26

Curiously, I also have this datapoint that may or may not be making
things more confusing. :)

# ls -al /proc/{2103,2104,2105,2106}/fd | grep deleted | awk '{print $8}' | sort | uniq -c
695 10:24
356 10:27
843 14:45
175 14:46
  6 14:50
372 17:19
  4 17:20
 11 17:23

So, they appear to show up in chunks and hang around together.

For nginx configs, this is the relevant part:

server {
  listen *:80 default_server;

  proxy_buffering off;

  location ~ ^(.*/(HEAD|info/refs|objects/info/[^/]+|git-(upload|receive)-pack))$ {

    proxy_pass            http://localhost:8080;
    proxy_read_timeout    90;
    proxy_connect_timeout 90;
    proxy_redirect        off;

    proxy_set_header  Host $host;
    proxy_set_header  X-Real-IP $remote_addr;
    proxy_set_header  X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header  X-Forwarded-Proto $http_x_forwarded_proto;
    proxy_set_header  X-Forwarded-Port $server_port;

  }
}

When setting up lore, I also considered serving git via git-http-backend
instead of passing it via public-inbox-httpd -- the only reason we
didn't try that was because it complicated SELinux bits a bit. Can you
think of a reason why we shouldn't serve git requests directly via
git-http-backend?

Thanks for your help.

-K



Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]

2019-09-11 Thread Eric Wong
Konstantin Ryabitsev  wrote:
> To give some more data points, downgrading to f4f0a3be still shows a
> number of /tmp/PerlIO* (deleted) entries, but the number of pipes stays
> the same over time. If I switch to the latest master, the number of
> broken pipes grows steadily following each git pull (in the hundreds
> after 10 minutes of running).

Thanks for that info, did those deleted entries eventually go
away (perhaps after several minutes)?

Kinda difficult to tie those to the corresponding TCP connection
atm, though.  I should probably put the TCP socket inode number
into the pre-unlink tempfile name to make that relationship
traceable.
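
Something along these lines is what I have in mind (a sketch
only, not the eventual patch; $sock is assumed to be the
per-client socket handle):

  # Linux-specific: get the client socket's inode so it can be
  # embedded in the tempfile name before the unlink
  use strict;
  use warnings;

  sub sock_inode {
      my ($sock) = @_;
      my $ino = (stat($sock))[1];  # st_ino of a socket is its inode number
      # the same number shows up in lsof/ss output and in
      # readlink("/proc/self/fd/" . fileno($sock)) as "socket:[$ino]"
      return $ino;
  }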

> I've downgraded lore.kernel.org to f4f0a3be for now to avoid the
> resource exhaustion.

I'll try to reproduce it later today or this week.



Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]

2019-09-11 Thread Konstantin Ryabitsev
On Wed, Sep 11, 2019 at 02:22:15AM +, Eric Wong wrote:
> Btw, some of the changes in public-inbox to use EPOLLET /
> EPOLLONESHOT would make it more sensitive to kernel bugs,
> and missed wakeups could cause resource exhaustion...
> 
> I just noticed this now:
> https://lore.kernel.org/lkml/20190904175303.918854...@linuxfoundation.org/
> 
> And I remember I found another mm-compaction-related issue
> several years back (early 3.x) which caused epoll wakeups to be
> missed...

To give some more data points, downgrading to f4f0a3be still shows a
number of /tmp/PerlIO* (deleted) entries, but the number of pipes stays
the same over time. If I switch to the latest master, the number of
broken pipes grows steadily following each git pull (in the hundreds
after 10 minutes of running).

I've downgraded lore.kernel.org to f4f0a3be for now to avoid the
resource exhaustion.

-K