subject:"Weirdest Tomcat Behavior Ever\?"

Re: Weirdest Tomcat Behavior Ever?

2020-12-06 Thread Mark Thomas

On 04/12/2020 20:36, Eric Robinson wrote:

> We'll begin the packet captures and straces again on Monday. Mark, besides 
> network and fd tracing, is there anything else you want from strace to make 
> the analysis more certain?

I think the following should be sufficient:

- Wireshark trace for the Tomcat instance where the issue was observed
- debug logs for the same Tomcat instance
- fd trace for the same Tomcat instance
- a thread dump of the Java process as soon as practical after the issue

Because you are now using longer lasting connections we may need to look
at a longer period of data to figure out exactly what is happening. To
reduce the amount of data we pass around, if I need more data I'll ask
for it. The important thing is that you keep hold of a copy.

Mark

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

RE: Weirdest Tomcat Behavior Ever?

2020-12-04 Thread Eric Robinson

> -Original Message-
> From: Christopher Schultz 
> Sent: Wednesday, December 2, 2020 10:21 AM
> To: users@tomcat.apache.org
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> Mark,
>
> On 11/26/20 05:14, Mark Thomas wrote:
> > On 26/11/2020 04:57, Christopher Schultz wrote:
> >
> > 
> >
> >>> After a normal clean-up the parent then calls close on the two file
> >>> descriptors associated with the pipe for a second time."
> >>
> >> So the child cleans them up AND the parent cleans them up? Or the
> >> parent cleans when up twice? The child should be able to call close()
> >> as many times as it wants and only poison itself. Does the child
> >> process ever exit()?
> >
> > With the caveat that some of the below is educated guess work because
> > the strace was configured to look at the events we were interested in
> > so I am having to fill in some of the gaps.
> >
> > The parent "process" is a Java thread currently in native code in a
> > 3rd party library.
> >
> > The parent creates a pipe which comes with two file descriptors. One
> > for the read end, one for the write end.
> >
> > The parent process then forks. The child process now has copies of the
> > two file descriptors. (see man fork for more details).
> >
> > The parent closes its fd for the write end of the pipe. The child
> > closes its fd for the read end of the pipe.
> >
> > The child writes to the pipe and the parent reads from it.
> >
> > The child exits and closes its fd for the write end of the pipe.
> >
> > The parent closes its fd for the read end of the pipe.
> >
> > At this point all is good. All the closes completely cleanly.
> > Everything has been shutdown properly.
>
> +1
>
> > The two fds allocated to the parent are back in the pool any may be
> > reused by other threads in the JVM.
> >
> > The parent then attempts to close the fds associated with the pipe
> > again. For each fd, if it has not been reallocated an EBADF error
> > occurs. If it has been reallocated, it is closed thereby breaking
> > whatever was using it legitimately.
>
> Thanks for clarifying this. I was confused and thinking you were saying that
> the child process was the one breaking things, but it's the parent process.
> Since the parent is the JVM (the long-running process), all hell breaks loose.
>
> >> The parent process must be the JVM process, right? And the parent
> >> process (this native library, running within the JVM process)
> >> double-closes file descriptors, with some measurable delay?
> >
> > Correct. In the instance where I did most of the log analysis the
> > delay was about 0.1 seconds. In other logs I did observe longer delays
> > with what looked like a very different failure mode.
> >
> >> That's the
> >> only way this could make sense. And of course it mess mess everything
> >> up in really *really* unpredictable ways.
> >
> > Yep.
>
> Fascinating.
>
> Thanks for the wild ride, Eric and Mark :)
>
> -chris
>

I case anyone thought I had forgotten about all this... sorry, no such luck. 
You’re stuck with me!

Things have been quieter for the past several days because, at Mark's 
suggestion, we changed the nginx proxy and tomcat configurations to make them 
more tolerant of the underlying condition. Specifically, we configured nginx to 
use HTTP/1.1 instead of 1.0, enabled keepalives, and set 
maxKeepAliveRequests="-1"  in server.xml. This reduced the frequency of the 
issue.

The vendor was unable to dispute the quality of the analysis, so they accepted 
that the third-party component (which I can now name: PDFTron) could be the 
root cause. They disabled use of the component, and that seems to have quieting 
things down a bit more. We are still seeing occasional session disconnects, so 
it is possible that the component is leveraged for more than one function in 
their software and it was only disabled for a subset of them. The big 
difference now is that, instead of seeing a GET request from the proxy followed 
by a FIN from the upstream, now it’s a GET followed by an RST.

We'll begin the packet captures and straces again on Monday. Mark, besides 
network and fd tracing, is there anything else you want from strace to make the 
analysis more certain?

-Eric

Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely t

Re: Weirdest Tomcat Behavior Ever?

2020-12-02 Thread Christopher Schultz


Mark,

On 11/26/20 05:14, Mark Thomas wrote:

On 26/11/2020 04:57, Christopher Schultz wrote:




After a normal clean-up the parent then calls close on the two file
descriptors associated with the pipe for a second time."


So the child cleans them up AND the parent cleans them up? Or the parent
cleans when up twice? The child should be able to call close() as many
times as it wants and only poison itself. Does the child process ever
exit()?


With the caveat that some of the below is educated guess work because
the strace was configured to look at the events we were interested in so
I am having to fill in some of the gaps.

The parent "process" is a Java thread currently in native code in a 3rd
party library.

The parent creates a pipe which comes with two file descriptors. One for
the read end, one for the write end.

The parent process then forks. The child process now has copies of the
two file descriptors. (see man fork for more details).

The parent closes its fd for the write end of the pipe. The child closes
its fd for the read end of the pipe.

The child writes to the pipe and the parent reads from it.

The child exits and closes its fd for the write end of the pipe.

The parent closes its fd for the read end of the pipe.

At this point all is good. All the closes completely cleanly. Everything
has been shutdown properly.


+1


The two fds allocated to the parent are back in the pool any may be
reused by other threads in the JVM.

The parent then attempts to close the fds associated with the pipe
again. For each fd, if it has not been reallocated an EBADF error
occurs. If it has been reallocated, it is closed thereby breaking
whatever was using it legitimately.


Thanks for clarifying this. I was confused and thinking you were saying 
that the child process was the one breaking things, but it's the parent 
process. Since the parent is the JVM (the long-running process), all 
hell breaks loose.



The parent process must be the JVM process, right? And the parent
process (this native library, running within the JVM process)
double-closes file descriptors, with some measurable delay?


Correct. In the instance where I did most of the log analysis the delay
was about 0.1 seconds. In other logs I did observe longer delays with
what looked like a very different failure mode.


That's the
only way this could make sense. And of course it mess mess everything up
in really *really* unpredictable ways.


Yep.


Fascinating.

Thanks for the wild ride, Eric and Mark :)

-chris

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

Re: Weirdest Tomcat Behavior Ever?

2020-11-26 Thread Mark Thomas

On 26/11/2020 04:57, Christopher Schultz wrote:

>> After a normal clean-up the parent then calls close on the two file
>> descriptors associated with the pipe for a second time."
> 
> So the child cleans them up AND the parent cleans them up? Or the parent
> cleans when up twice? The child should be able to call close() as many
> times as it wants and only poison itself. Does the child process ever
> exit()?

With the caveat that some of the below is educated guess work because
the strace was configured to look at the events we were interested in so
I am having to fill in some of the gaps.

The parent "process" is a Java thread currently in native code in a 3rd
party library.

The parent creates a pipe which comes with two file descriptors. One for
the read end, one for the write end.

The parent process then forks. The child process now has copies of the
two file descriptors. (see man fork for more details).

The parent closes its fd for the write end of the pipe. The child closes
its fd for the read end of the pipe.

The child writes to the pipe and the parent reads from it.

The child exits and closes its fd for the write end of the pipe.

The parent closes its fd for the read end of the pipe.

At this point all is good. All the closes completely cleanly. Everything
has been shutdown properly.

The two fds allocated to the parent are back in the pool any may be
reused by other threads in the JVM.

The parent then attempts to close the fds associated with the pipe
again. For each fd, if it has not been reallocated an EBADF error
occurs. If it has been reallocated, it is closed thereby breaking
whatever was using it legitimately.

> The parent process must be the JVM process, right? And the parent
> process (this native library, running within the JVM process)
> double-closes file descriptors, with some measurable delay?

Correct. In the instance where I did most of the log analysis the delay
was about 0.1 seconds. In other logs I did observe longer delays with
what looked like a very different failure mode.

> That's the
> only way this could make sense. And of course it mess mess everything up
> in really *really* unpredictable ways.

Yep.

> -chris
> 
> PS There are some cases where calling close(int) more than one time
> actually makes sense, but it's very rare. Evidently, HP-UX requires this
> if close() fails in certain ways, while most other systems prohibit
> double-calls to close() even when a failure occurs. So much for
> portability. Anyway, my guess is that this isn't an HP-UX library ported
> to Linux with a logic flaw introduced by that port. It's more likely
> just an implementation bug.

The original close calls don't fail so even if this were HP-UX code
ported to Linux this case does not apply.

Mark

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

Re: Weirdest Tomcat Behavior Ever?

2020-11-25 Thread Christopher Schultz


Mark,

On 11/25/20 22:30, Christopher Schultz wrote:

Mark,

On 11/25/20 12:49, Mark Thomas wrote:

On 25/11/2020 16:46, Christopher Schultz wrote:




Strange that the library is getting a direct reference to the response's
file descriptor. That seems (to me) to be a tactical mistake in an
implementation. Then again, I'm not looking at what is actually being
accomplished, there... it may make perfect sense in context.


The library creates a pipe and so gets two file descriptors. It closes
the file descriptors so they are returned to the OS's pool of file
descriptors. Shortly afterwards the OS reuses one of the file
descriptors and assigns it to the socket. So far, so good. Then the
library closes the file descriptors for a second time and that closes
the socket.


That would do it for most situations, but this is a child process 
closing a file descriptor twice. It shouldn't break the parent process, 
should it? I mean, if I could write a program that could steal fds from 
its parent process, that would be a pretty big security bug.


Or is the issue that this is actually a "thread process" and not a 
completely separate process, so they share a fd pool?


The more I think about this the less it makes sense unless I'm missing 
something.


I want to go back to your earlier statement:

"

strace shows the .so file being accessed, then we see the process
forking, a pipe set up between the parent and child and the child
calling execve on the same .so file.

"

So this is a typical fork-and-exec to start another process. The child 
inherits the fds from the parent. Calling execve means that the child 
process is replaced so this isn't a "thread" if it somehow started that 
way. (Specifically, this means that we aren't sharing a fd *table* 
between processes, even if some fds are actually mapped to the same OS 
fd between processes.)


The non-JVM native library is being invoked, and it's decided to 
fork-off a process ... of itself? Hmm... that's odd in a JVM 
environment, but ... okay, I guess.


"

After a normal clean-up the parent then calls close on the two file
descriptors associated with the pipe for a second time."


So the child cleans them up AND the parent cleans them up? Or the parent 
cleans when up twice? The child should be able to call close() as many 
times as it wants and only poison itself. Does the child process ever 
exit()?


The parent process must be the JVM process, right? And the parent 
process (this native library, running within the JVM process) 
double-closes file descriptors, with some measurable delay? That's the 
only way this could make sense. And of course it mess mess everything up 
in really *really* unpredictable ways.


-chris

PS There are some cases where calling close(int) more than one time 
actually makes sense, but it's very rare. Evidently, HP-UX requires this 
if close() fails in certain ways, while most other systems prohibit 
double-calls to close() even when a failure occurs. So much for 
portability. Anyway, my guess is that this isn't an HP-UX library ported 
to Linux with a logic flaw introduced by that port. It's more likely 
just an implementation bug.


-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

Re: Weirdest Tomcat Behavior Ever?

2020-11-25 Thread Christopher Schultz


Mark,

On 11/25/20 12:49, Mark Thomas wrote:

On 25/11/2020 16:46, Christopher Schultz wrote:




Strange that the library is getting a direct reference to the response's
file descriptor. That seems (to me) to be a tactical mistake in an
implementation. Then again, I'm not looking at what is actually being
accomplished, there... it may make perfect sense in context.


The library creates a pipe and so gets two file descriptors. It closes
the file descriptors so they are returned to the OS's pool of file
descriptors. Shortly afterwards the OS reuses one of the file
descriptors and assigns it to the socket. So far, so good. Then the
library closes the file descriptors for a second time and that closes
the socket.


That would do it for most situations, but this is a child process 
closing a file descriptor twice. It shouldn't break the parent process, 
should it? I mean, if I could write a program that could steal fds from 
its parent process, that would be a pretty big security bug.


Or is the issue that this is actually a "thread process" and not a 
completely separate process, so they share a fd pool?


-chris

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

Re: Weirdest Tomcat Behavior Ever?

2020-11-25 Thread Mark Thomas

On 25/11/2020 16:46, Christopher Schultz wrote:

> Strange that the library is getting a direct reference to the response's
> file descriptor. That seems (to me) to be a tactical mistake in an
> implementation. Then again, I'm not looking at what is actually being
> accomplished, there... it may make perfect sense in context.

The library creates a pipe and so gets two file descriptors. It closes
the file descriptors so they are returned to the OS's pool of file
descriptors. Shortly afterwards the OS reuses one of the file
descriptors and assigns it to the socket. So far, so good. Then the
library closes the file descriptors for a second time and that closes
the socket.

Mark

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

Re: [OT] Weirdest Tomcat Behavior Ever?

2020-11-25 Thread Christopher Schultz

Eric,

On 11/25/20 09:34, Eric Robinson wrote:

-Original Message-
From: Christopher Schultz 
Sent: Tuesday, November 24, 2020 8:11 AM
To: users@tomcat.apache.org
Subject: Re: Weirdest Tomcat Behavior Ever?

Also, v5.0.8 is like 13 years old. Eric, you guys *really* have to upgrade that.
Somewhat surprisingly, there are CVEs against that library which allow
unauthenticated remote attackers to take-over the MySQL client
connections opened by that library.

Chris, I'm in full agreement with you on that. We'd love to update 
the connector but we are under vendor constraints. They only support

certain versions.

*eyeroll*

How about "the officially-supported version of the library causes 
ClassLoader leaks and leaves threads running when it shouldn't. Please 
support a non-ancient version of this library"?

You should be able to argue based solely upon security reports that them 
not supporting a newer version reduces the security of your system which 
is not acceptable. Read your SLAs to see if you can force them to do the 
right thing.

Honestly, dropping the newer version of the driver on top of the old one 
will work in nearly 100% of the cases. I have a lot of experience with 
the Connector/J library and I've never had an upgrade break anything.

Since I'm dumping on Connector/J right now, I just have to post this 
item from the changelog of the most recent 5.1.x release[1] because it's 
priceless:

"
Bugs Fixed

When trying to set a parameter for a PreparedStatement using the 
method PreparedStatement.setObject(parameterIndex, "false", 
Types.BOOLEAN), the value was set to true instead of false. (Bug 
#30911870, Bug #98237)

"

So, to recap: in MySQL Connector/J versions before 5.1.49, "false" means 
"true".

*sigh* This *is* year-2020, I guess.

What better reason to upgrade?

-chris

[1] https://dev.mysql.com/doc/relnotes/connector-j/5.1/en/news-5-1-49.html

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

Re: Weirdest Tomcat Behavior Ever?

2020-11-25 Thread Christopher Schultz


Mark,

On 11/24/20 09:56, Mark Thomas wrote:

On 24/11/2020 14:11, Christopher Schultz wrote:

On 11/20/20 11:08, Mark Thomas wrote:





A second look at the strace file from yesterday provided hard evidence
of a native library mis-using file descriptors and strong circumstantial
evidence linking that same library to a specific instance of the
observed failure.


Interesting. How can you tell it's a library instead of, for example,
the JVM itself (which is of course itself a set of native libraries).


strace shows the .so file being accessed, then we see the process
forking, a pipe set up between the parent and child and the child
calling execve on the same .so file. After a normal clean-up the parent
then calls close on the two file descriptors associated with the pipe
for a second time.


Looks pretty straightforward. Clearly not Connector/J... at least not 
one with a normal configuration.



I'm as sure as I can be without access to the source code for the .so
file that it is mis-handling file descriptors.


I'm assuming that when you say "native library" you mean "some native
component which is not a part of the core JVM".


The .so file in question is not part of the JVM. It appears to be a
third-party native library that ships as part of the commercial web
application where the original issue is observed.


Aha!


TL;DR, an issue in an external library, not a Tomcat issue.





So does this really look like a (pretty serious) bug in a native
library? Any idea which one?


I'm reasonably sure but I had to make a couple of assumptions based on
file paths to ID the library. I've passed that info to Eric but until it
is confirmed it doesn't seem right to name it on list.


+1 and +1

Reporting this privately to the vendor (by Eric) seems appropriate to me.

Strange that the library is getting a direct reference to the response's 
file descriptor. That seems (to me) to be a tactical mistake in an 
implementation. Then again, I'm not looking at what is actually being 
accomplished, there... it may make perfect sense in context.


-chris

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

RE: Weirdest Tomcat Behavior Ever?

2020-11-25 Thread Eric Robinson

> -Original Message-
> From: Mark Thomas 
> Sent: Tuesday, November 24, 2020 8:57 AM
> To: users@tomcat.apache.org
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> On 24/11/2020 14:11, Christopher Schultz wrote:
> > On 11/20/20 11:08, Mark Thomas wrote:
>
> 
>
> >> A second look at the strace file from yesterday provided hard
> >> evidence of a native library mis-using file descriptors and strong
> >> circumstantial evidence linking that same library to a specific
> >> instance of the observed failure.
> >
> > Interesting. How can you tell it's a library instead of, for example,
> > the JVM itself (which is of course itself a set of native libraries).
>
> strace shows the .so file being accessed, then we see the process forking, a
> pipe set up between the parent and child and the child calling execve on the
> same .so file. After a normal clean-up the parent then calls close on the two
> file descriptors associated with the pipe for a second time.
>
> I'm as sure as I can be without access to the source code for the .so file 
> that it
> is mis-handling file descriptors.
>
> > I'm assuming that when you say "native library" you mean "some native
> > component which is not a part of the core JVM".
>
> The .so file in question is not part of the JVM. It appears to be a 
> third-party
> native library that ships as part of the commercial web application where the
> original issue is observed.
>
> >> TL;DR, an issue in an external library, not a Tomcat issue.
>
> 
>
> > So does this really look like a (pretty serious) bug in a native
> > library? Any idea which one?
>
> I'm reasonably sure but I had to make a couple of assumptions based on file
> paths to ID the library. I've passed that info to Eric but until it is 
> confirmed it
> doesn't seem right to name it on list.
>
> Mark
>

The full evidence package was submitted to the application vendor this morning, 
including all relevant logs, packet captures, strace files, and the 
accompanying analysis (courtesy of Mark) which seems pretty conclusive. We're 
eager to hear their feedback. That said, I'm not too shy about mentioning the 
names of the suspected libraries, as long as we're clear that the cause is not 
confirmed, and specific about the evidence that points in that direction. I may 
do so in a follow-up posting. After all, we already know the names of the other 
major components involved--tomcat, java, and the jdbc connector. Naming the 
suspect libraries may do someone else a service who has a gut feeling about 
them but hasn't seen solid evidence to support their concerns.

-Eric

> -
> To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
> For additional commands, e-mail: users-h...@tomcat.apache.org

Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

RE: Weirdest Tomcat Behavior Ever?

2020-11-25 Thread Eric Robinson

> -Original Message-
> From: Christopher Schultz 
> Sent: Tuesday, November 24, 2020 8:11 AM
> To: users@tomcat.apache.org
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> Mark,
>
> On 11/20/20 11:08, Mark Thomas wrote:
> > On 20/11/2020 15:43, Eric Robinson wrote:
> >>> From: Mark Thomas 
> >
> >
> > 
> >
> >>> This information might turn out to be unnecessary. I have been
> >>> re-looking at the previous logs and I think I have enough evidence
> >>> to ID a root cause. I'll email off-list as I need to quote logs to prove 
> >>> my
> point.
> >>>
> >>
> >> I'll be watching my inbox like a hawk!
> >
> > Sent.
> >
> > For the curious, the analysis indicated either JVM code or a native
> > library was incorrectly closing the file descriptor associated with
> > the socket being used by an in progress HTTP request.
>
> FWIW, Connector/J is a Type 4 JDBC Driver (aka "pure Java"), so no native
> components.
>
> I'm not sure how it would be able to close the connection.
>
> Also, v5.0.8 is like 13 years old. Eric, you guys *really* have to upgrade 
> that.
> Somewhat surprisingly, there are CVEs against that library which allow
> unauthenticated remote attackers to take-over the MySQL client
> connections opened by that library.
>

Chris, I'm in full agreement with you on that. We'd love to update the 
connector but we are under vendor constraints. They only support certain 
versions.

> > A second look at the strace file from yesterday provided hard evidence
> > of a native library mis-using file descriptors and strong
> > circumstantial evidence linking that same library to a specific
> > instance of the observed failure.
>
> Interesting. How can you tell it's a library instead of, for example, the JVM
> itself (which is of course itself a set of native libraries).
> I'm assiming that when you say "native library" you mean "some native
> component which is not a part of the core JVM".
>
> > TL;DR, an issue in an external library, not a Tomcat issue.
>
> I've recently been trying to optimize IO throughput of certain web-to-
> database (and vice-versa) operations in my own web applications.
> The gist is that we are wrapping request InputStreams or JDBC InputStreams
> (depending upon whether we are reading from request->db or
> db->response) to allow streaming directly in or out of the database with
> a minimum of buffering.
>
> If such a scheme were to be implemented particularly badly (e.g.
> allowing the database code direct-access to the request and response
> objects instead of just the stream(s)), it would certainly be possible
> for such an implementation to close the response's output stream before
> the servlet was expecting it.
>
> But such an implementation would be caught red-handed if a simple
> wrapper around the response stream were to be installed and simply log
> all the calls to close(), which I think was one of your first debugging
> steps, here.
>
> So does this really look like a (pretty serious) bug in a native
> library? Any idea which one?
>
> -chris
>
> -
> To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
> For additional commands, e-mail: users-h...@tomcat.apache.org

Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.

Re: Weirdest Tomcat Behavior Ever?

2020-11-24 Thread Mark Thomas

On 24/11/2020 14:11, Christopher Schultz wrote:
> On 11/20/20 11:08, Mark Thomas wrote:

>> A second look at the strace file from yesterday provided hard evidence
>> of a native library mis-using file descriptors and strong circumstantial
>> evidence linking that same library to a specific instance of the
>> observed failure.
> 
> Interesting. How can you tell it's a library instead of, for example,
> the JVM itself (which is of course itself a set of native libraries).

strace shows the .so file being accessed, then we see the process
forking, a pipe set up between the parent and child and the child
calling execve on the same .so file. After a normal clean-up the parent
then calls close on the two file descriptors associated with the pipe
for a second time.

I'm as sure as I can be without access to the source code for the .so
file that it is mis-handling file descriptors.

> I'm assuming that when you say "native library" you mean "some native
> component which is not a part of the core JVM".

The .so file in question is not part of the JVM. It appears to be a
third-party native library that ships as part of the commercial web
application where the original issue is observed.

>> TL;DR, an issue in an external library, not a Tomcat issue.

> So does this really look like a (pretty serious) bug in a native
> library? Any idea which one?

I'm reasonably sure but I had to make a couple of assumptions based on
file paths to ID the library. I've passed that info to Eric but until it
is confirmed it doesn't seem right to name it on list.

Mark

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

Re: Weirdest Tomcat Behavior Ever?

2020-11-24 Thread Christopher Schultz


Mark,

On 11/20/20 11:08, Mark Thomas wrote:

On 20/11/2020 15:43, Eric Robinson wrote:

From: Mark Thomas 






This information might turn out to be unnecessary. I have been re-looking at
the previous logs and I think I have enough evidence to ID a root cause. I'll
email off-list as I need to quote logs to prove my point.



I'll be watching my inbox like a hawk!


Sent.

For the curious, the analysis indicated either JVM code or a native
library was incorrectly closing the file descriptor associated with the
socket being used by an in progress HTTP request.


FWIW, Connector/J is a Type 4 JDBC Driver (aka "pure Java"), so no 
native components.


I'm not sure how it would be able to close the connection.

Also, v5.0.8 is like 13 years old. Eric, you guys *really* have to 
upgrade that. Somewhat surprisingly, there are CVEs against that library 
which allow unauthenticated remote attackers to take-over the MySQL 
client connections opened by that library.



A second look at the strace file from yesterday provided hard evidence
of a native library mis-using file descriptors and strong circumstantial
evidence linking that same library to a specific instance of the
observed failure.


Interesting. How can you tell it's a library instead of, for example, 
the JVM itself (which is of course itself a set of native libraries). 
I'm assiming that when you say "native library" you mean "some native 
component which is not a part of the core JVM".



TL;DR, an issue in an external library, not a Tomcat issue.


I've recently been trying to optimize IO throughput of certain 
web-to-database (and vice-versa) operations in my own web applications. 
The gist is that we are wrapping request InputStreams or JDBC 
InputStreams (depending upon whether we are reading from request->db or 
db->response) to allow streaming directly in or out of the database with 
a minimum of buffering.


If such a scheme were to be implemented particularly badly (e.g. 
allowing the database code direct-access to the request and response 
objects instead of just the stream(s)), it would certainly be possible 
for such an implementation to close the response's output stream before 
the servlet was expecting it.


But such an implementation would be caught red-handed if a simple 
wrapper around the response stream were to be installed and simply log 
all the calls to close(), which I think was one of your first debugging 
steps, here.


So does this really look like a (pretty serious) bug in a native 
library? Any idea which one?


-chris

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

Re: Weirdest Tomcat Behavior Ever?

2020-11-20 Thread Mark Thomas

On 20/11/2020 15:43, Eric Robinson wrote:
>> From: Mark Thomas 

>> This information might turn out to be unnecessary. I have been re-looking at
>> the previous logs and I think I have enough evidence to ID a root cause. I'll
>> email off-list as I need to quote logs to prove my point.
>>
> 
> I'll be watching my inbox like a hawk!

Sent.

For the curious, the analysis indicated either JVM code or a native
library was incorrectly closing the file descriptor associated with the
socket being used by an in progress HTTP request.

A second look at the strace file from yesterday provided hard evidence
of a native library mis-using file descriptors and strong circumstantial
evidence linking that same library to a specific instance of the
observed failure.

TL;DR, an issue in an external library, not a Tomcat issue.

Mark

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

RE: Weirdest Tomcat Behavior Ever?

2020-11-20 Thread Eric Robinson

> -Original Message-
> From: Mark Thomas 
> Sent: Friday, November 20, 2020 9:32 AM
> To: users@tomcat.apache.org
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> On 20/11/2020 14:55, Eric Robinson wrote:
> >> From: Mark Thomas 
> >> It looks like you are using MySQL from the data in the strace log. Is
> >> that correct? What I really need to know is which version of the
> >> MySQL JDBC driver (Connector/J) are you using? I'm trying to figure
> >> out where the root cause might be (or if it is a known issue fixed in a 
> >> later
> release).
> >>
> >
> > Yes, MySQL 5.6.41.
> >
> > Now sure how to tell the JDBC version. I unzipped the tomcat-jdbc.jar file
> and inspected the MANIFEST file, which shows the following. Is that what
> you need?
>
> That is one of the database connection pools provided by Tomcat. There
> should be a JAR somewhere called myswl-connector-java-x.y.x.jar or
> something close to that. It will either be in Tomcat's lib directory or under
> WEB-INF/lib in the webapp.
>

mysql-connector-java-commercial-5.0.8-bin.jar

> This information might turn out to be unnecessary. I have been re-looking at
> the previous logs and I think I have enough evidence to ID a root cause. I'll
> email off-list as I need to quote logs to prove my point.
>

I'll be watching my inbox like a hawk!

> Mark
>
> -
> To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
> For additional commands, e-mail: users-h...@tomcat.apache.org

Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.

Re: Weirdest Tomcat Behavior Ever?

2020-11-20 Thread Mark Thomas

On 20/11/2020 14:55, Eric Robinson wrote:
>> From: Mark Thomas 
>> It looks like you are using MySQL from the data in the strace log. Is that
>> correct? What I really need to know is which version of the MySQL JDBC
>> driver (Connector/J) are you using? I'm trying to figure out where the root
>> cause might be (or if it is a known issue fixed in a later release).
>>
> 
> Yes, MySQL 5.6.41.
> 
> Now sure how to tell the JDBC version. I unzipped the tomcat-jdbc.jar file 
> and inspected the MANIFEST file, which shows the following. Is that what you 
> need?

That is one of the database connection pools provided by Tomcat. There
should be a JAR somewhere called myswl-connector-java-x.y.x.jar or
something close to that. It will either be in Tomcat's lib directory or
under WEB-INF/lib in the webapp.

This information might turn out to be unnecessary. I have been
re-looking at the previous logs and I think I have enough evidence to ID
a root cause. I'll email off-list as I need to quote logs to prove my point.

Mark

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

RE: Weirdest Tomcat Behavior Ever?

2020-11-20 Thread Eric Robinson

> -Original Message-
> From: Mark Thomas 
> Sent: Friday, November 20, 2020 3:17 AM
> To: users@tomcat.apache.org
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> On 19/11/2020 16:03, Mark Thomas wrote:
> > On 19/11/2020 15:55, Eric Robinson wrote:
>
> 
>
> >> Unfortunately, the tomcats get restarted every night, so I'll have to watch
> closely for the error to happen today and I will do a thread dump as close as
> possible to the time of the error.
> >
> > Sounds good. Tx.
> >
> > My guess based on what I saw in the strace is that the thread will be
> > something database related. Which database and which database driver
> > (version number etc) is in use? In particular, are there any native
> > components to the JDBC driver?
>
> Morning Eric,
>
> I got the latest logs. Unfortunately, it looks like the file descriptor 
> information
> wasn't included in the strace output. I'm not seeing any calls to close a file
> descriptor.
>

My fault again. I'll make sure it gets in there next time.

> The thread dump looks good. I can map information from strace to the
> thread dump successfully.
>
> It looks like you are using MySQL from the data in the strace log. Is that
> correct? What I really need to know is which version of the MySQL JDBC
> driver (Connector/J) are you using? I'm trying to figure out where the root
> cause might be (or if it is a known issue fixed in a later release).
>

Yes, MySQL 5.6.41.

Now sure how to tell the JDBC version. I unzipped the tomcat-jdbc.jar file and 
inspected the MANIFEST file, which shows the following. Is that what you need?

Manifest-Version: 1.0
Ant-Version: Apache Ant 1.9.7
Created-By: 1.6.0_45-b06 (Sun Microsystems Inc.)
Export-Package: org.apache.tomcat.jdbc.naming;uses:="javax.naming,org.
 apache.juli.logging,javax.naming.spi";version="7.0.72",org.apache.tom
 cat.jdbc.pool;uses:="org.apache.juli.logging,javax.sql,org.apache.tom
 cat.jdbc.pool.jmx,javax.management,javax.naming,javax.naming.spi,org.
 apache.tomcat.jdbc.pool.interceptor";version="7.0.72",org.apache.tomc
 at.jdbc.pool.interceptor;uses:="org.apache.tomcat.jdbc.pool,org.apach
 e.juli.logging,javax.management.openmbean,javax.management";version="
 7.0.72",org.apache.tomcat.jdbc.pool.jmx;uses:="org.apache.tomcat.jdbc
 .pool,org.apache.juli.logging,javax.management";version="7.0.72"
Bundle-Vendor: Apache Software Foundation
Bundle-Version: 7.0.72
Bundle-Name: Apache Tomcat JDBC Connection Pool
Bundle-ManifestVersion: 2
Bundle-SymbolicName: org.apache.tomcat.jdbc
Import-Package:  javax.management;version="0", javax.management.openmb
 ean;version="0", javax.naming;version="0", javax.naming.spi;version="
 0", javax.sql;version="0", org.apache.juli.logging;version="0"

> Mark
>
> -
> To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
> For additional commands, e-mail: users-h...@tomcat.apache.org

Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.

Re: Weirdest Tomcat Behavior Ever?

2020-11-20 Thread Mark Thomas

On 19/11/2020 16:03, Mark Thomas wrote:
> On 19/11/2020 15:55, Eric Robinson wrote:

>> Unfortunately, the tomcats get restarted every night, so I'll have to watch 
>> closely for the error to happen today and I will do a thread dump as close 
>> as possible to the time of the error.
> 
> Sounds good. Tx.
> 
> My guess based on what I saw in the strace is that the thread will be
> something database related. Which database and which database driver
> (version number etc) is in use? In particular, are there any native
> components to the JDBC driver?

Morning Eric,

I got the latest logs. Unfortunately, it looks like the file descriptor
information wasn't included in the strace output. I'm not seeing any
calls to close a file descriptor.

The thread dump looks good. I can map information from strace to the
thread dump successfully.

It looks like you are using MySQL from the data in the strace log. Is
that correct? What I really need to know is which version of the MySQL
JDBC driver (Connector/J) are you using? I'm trying to figure out where
the root cause might be (or if it is a known issue fixed in a later
release).

Mark

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

Re: Weirdest Tomcat Behavior Ever?

2020-11-19 Thread Mark Thomas

On 19/11/2020 15:55, Eric Robinson wrote:
>> From: Mark Thomas 
>> Sent: Thursday, November 19, 2020 4:34 AM
>> To: users@tomcat.apache.org
>> Subject: Re: Weirdest Tomcat Behavior Ever?
>>
>> On 18/11/2020 16:28, Mark Thomas wrote:
>>> On 18/11/2020 15:41, Eric Robinson wrote:
>>
>> 
>>
>>>> I tested it and we're now capturing file descriptor operations.
>>
>> 
>>
>>> I am very curious as to what we are going to see in these logs.
>>
>> Good news. We are a step closer to figuring out what is going on.
>>
>> The new strace logs show the file descriptor associated with the socket being
>> closed. That means that the socket close MUST be coming from within the
>> JVM. The key lines in the strace output that show this are (I've removed
>> unrelated events from between these entries):
>>
>> 19166 15:24:21.108653 (+ 0.20) dup2(50, 336 
>> 19166 15:24:21.108722 (+ 0.17) <... dup2 resumed>) = 336
>> 19166 15:24:21.108778 (+ 0.27) close(336 
>> 19166 15:24:21.109015 (+ 0.000152) <... close resumed>) = 0
>>
>> Has the Tomcat instance on the machine you obtained the last set of logs
>> from been restarted since you obtained the logs? If not it should be possible
>> to identify the thread that triggered the socket close as follows:
>>
> 
> Sounds promising!
> 
> Unfortunately, the tomcats get restarted every night, so I'll have to watch 
> closely for the error to happen today and I will do a thread dump as close as 
> possible to the time of the error.

Sounds good. Tx.

My guess based on what I saw in the strace is that the thread will be
something database related. Which database and which database driver
(version number etc) is in use? In particular, are there any native
components to the JDBC driver?

Mark


-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

RE: Weirdest Tomcat Behavior Ever?

2020-11-19 Thread Eric Robinson

> From: Mark Thomas 
> Sent: Thursday, November 19, 2020 4:34 AM
> To: users@tomcat.apache.org
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> On 18/11/2020 16:28, Mark Thomas wrote:
> > On 18/11/2020 15:41, Eric Robinson wrote:
>
> 
>
> >>I tested it and we're now capturing file descriptor operations.
>
> 
>
> > I am very curious as to what we are going to see in these logs.
>
> Good news. We are a step closer to figuring out what is going on.
>
> The new strace logs show the file descriptor associated with the socket being
> closed. That means that the socket close MUST be coming from within the
> JVM. The key lines in the strace output that show this are (I've removed
> unrelated events from between these entries):
>
> 19166 15:24:21.108653 (+ 0.20) dup2(50, 336 
> 19166 15:24:21.108722 (+ 0.17) <... dup2 resumed>) = 336
> 19166 15:24:21.108778 (+ 0.27) close(336 
> 19166 15:24:21.109015 (+ 0.000152) <... close resumed>) = 0
>
> Has the Tomcat instance on the machine you obtained the last set of logs
> from been restarted since you obtained the logs? If not it should be possible
> to identify the thread that triggered the socket close as follows:
>

Sounds promising!

Unfortunately, the tomcats get restarted every night, so I'll have to watch 
closely for the error to happen today and I will do a thread dump as close as 
possible to the time of the error.

> Using the following command should trigger the generation of a thread
> dump to standard out (which may be redirected to a file).
>
> kill -3 
>
> Locate the thread dump.
>
> Each thread lists its native thread id e.g. nid=0x3522
>
> Look for the thread that has nid=0x4ade
>
> What is the first line for that thread? It should look something like this:
>
> "http-bio-8080-exec-8" #24 daemon prio=5 os_prio=0 cpu=0.17ms
> elapsed=20.54s tid=0x7f6e546c0800 nid=0x374e waiting on condition
> [0x7f6db38f7000]
>
>
> If the Tomcat instance has been restarted the native thread IDs will have
> changed. To identify the thread we'll need to repeat the last set of logs and
> once a failure is detected, take a thread dump and provide that along with
> the logs and I should be able to join the dots.
>
> Almost there...
>
> Mark
>
> -
> To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
> For additional commands, e-mail: users-h...@tomcat.apache.org

Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.

Re: Weirdest Tomcat Behavior Ever?

2020-11-19 Thread Mark Thomas

On 18/11/2020 16:28, Mark Thomas wrote:
> On 18/11/2020 15:41, Eric Robinson wrote:

>>I tested it and we're now capturing file descriptor operations.

> I am very curious as to what we are going to see in these logs.

Good news. We are a step closer to figuring out what is going on.

The new strace logs show the file descriptor associated with the socket
being closed. That means that the socket close MUST be coming from
within the JVM. The key lines in the strace output that show this are
(I've removed unrelated events from between these entries):

19166 15:24:21.108653 (+ 0.20) dup2(50, 336 
19166 15:24:21.108722 (+ 0.17) <... dup2 resumed>) = 336
19166 15:24:21.108778 (+ 0.27) close(336 
19166 15:24:21.109015 (+ 0.000152) <... close resumed>) = 0

Has the Tomcat instance on the machine you obtained the last set of logs
from been restarted since you obtained the logs? If not it should be
possible to identify the thread that triggered the socket close as follows:

Using the following command should trigger the generation of a thread
dump to standard out (which may be redirected to a file).

kill -3 

Locate the thread dump.

Each thread lists its native thread id e.g. nid=0x3522

Look for the thread that has nid=0x4ade

What is the first line for that thread? It should look something like this:

"http-bio-8080-exec-8" #24 daemon prio=5 os_prio=0 cpu=0.17ms
elapsed=20.54s tid=0x7f6e546c0800 nid=0x374e waiting on condition
[0x7f6db38f7000]

If the Tomcat instance has been restarted the native thread IDs will
have changed. To identify the thread we'll need to repeat the last set
of logs and once a failure is detected, take a thread dump and provide
that along with the logs and I should be able to join the dots.

Almost there...

Mark

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

Re: Weirdest Tomcat Behavior Ever?

2020-11-18 Thread Mark Thomas

On 18/11/2020 15:41, Eric Robinson wrote:
>> -Original Message-
>> From: Mark Thomas 
>> Sent: Wednesday, November 18, 2020 3:03 AM
>> To: users@tomcat.apache.org
>> Subject: Re: Weirdest Tomcat Behavior Ever?
>>
>> On 13/11/2020 23:46, Mark Thomas wrote:
>>> Eric sent me a copy of the strace (thanks Eric) and while it is
>>> consistent with what has already been observed, it didn't provide any
>>> new information on the socket / file descriptor being closed.
>>>
>>> I'd like to suggest running again with the following:
>>>
>>> sudo strace -r -f -e trace=network,desc -p 
>>>
>>> That should log the file descriptor being closed (and other fd
>>> activity). There are a couple of things we might be able to do with this:
>>>
>>> - we'll be able to determine if the socket is closed on the same or a
>>>   different thread
>>> - we might be able to correlate the time of closure with other logs
>>>   (seems unlikely as we have this from Wireshark but you never know)
>>> - the class before the close might be enlightening
>>
>> Hi Eric,
>>
>> I looked at the updated logs this morning. I don't see any additional logging
>> for file descriptors in the strace output.
>>
>> I wonder if you need a slightly different command on your platform?
>>
>> I'd expect to see entries like this:
>>
>> [pid  8062]  0.70 openat(AT_FDCWD,
>> "/home/mark/repos/asf-tomcat-master/output/build/webapps/ROOT/bg-
>> nav.png",
>> O_RDONLY) = 57
>> [pid  8062]  0.27 fstat(57,  
>> [pid  8062]  0.05 <... fstat resumed>{st_mode=S_IFREG|0664,
>> st_size=1401, ...}) = 0
>> [pid  8062]  0.43 read(57,  
>> [pid  8062]  0.33 <... read
>> resumed>"\211PNG\r\n\32\n\0\0\0\rIHDR\0\0\0\n\0\0\0002\10\6\0\0\0e\3
>> 3J".
>> resumed>..,
>> 1401) = 1401
>> [pid  8062]  0.13 close(57 
>>
>> showing file access although what I really want to see are the calls to close
>> the sockets (like the last two in the sequence below from a test where I used
>> telnet to perform an HTTP/1.0 request)
>>
>> pid  8069]  0.124099 <... accept resumed>{sa_family=AF_INET6,
>> sin6_port=htons(52656), sin6_flowinfo=htonl(0), inet_pton(AF_INET6,
>> ":::127.0.0.1", &sin6_addr), sin6_scope_id=0}, [28]) = 50 ...
>> [pid  8063]  0.000216 read(50,  
>> [pid  8063]  0.58 <... read resumed>"GET / HTTP/1.0\r\n", 8192) = 16
>> [pid  8063]  0.29 read(50,  
>> [pid  8063]  0.30 <... read resumed>0x7f4f6c000e70, 8192) = -1
>> EAGAIN (Resource temporarily unavailable)
>> [pid  8064]  0.001061 read(50, "Host: a\r\n", 8192) = 9
>> [pid  8064]  0.000239 read(50, 0x7f4f6e70, 8192) = -1 EAGAIN
>> (Resource temporarily unavailable)
>> [pid  8062]  0.000214 read(50, "\r\n", 8192) = 2
>> [pid  8062]  0.007897 write(50, "HTTP/1.1 200 \r\nContent-Type:
>> tex"..., 8192) = 8192
>> [pid  8062]  0.000353 write(50, ">Tomcat Native\n
>> "..., 3079) = 3079
>> [pid  8062]  0.002071 getsockopt(50, SOL_SOCKET, SO_LINGER,
>> {l_onoff=0, l_linger=0}, [8]) = 0
>> [pid  8062]  0.000102 shutdown(50, SHUT_WR) = 0
>> [pid  8068]  0.000342 close(50) = 0
>>
>> It is probably worth running a couple of quick tests to figure out the 
>> correct
>> form of the strace command on your platform and then retesting.
>>
>> Mark
>>
> 
> Entirely my fault. I'm new to strace, so I didn't know what to expect. I have 
> now read the strace man page and I'm more up to speed. I tested it and we're 
> now capturing file descriptor operations. The next batch of logs will be 
> better.

No worries. This strace stuff is new to me as well. I shouldn't have
assumed what worked on my Ubuntu desktop was going to work the same way
on your CentOS server.

I am very curious as to what we are going to see in these logs.

Cheers,

Mark

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

RE: Weirdest Tomcat Behavior Ever?

2020-11-18 Thread Eric Robinson

> -Original Message-
> From: Mark Thomas 
> Sent: Wednesday, November 18, 2020 3:03 AM
> To: users@tomcat.apache.org
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> On 13/11/2020 23:46, Mark Thomas wrote:
> > Eric sent me a copy of the strace (thanks Eric) and while it is
> > consistent with what has already been observed, it didn't provide any
> > new information on the socket / file descriptor being closed.
> >
> > I'd like to suggest running again with the following:
> >
> > sudo strace -r -f -e trace=network,desc -p 
> >
> > That should log the file descriptor being closed (and other fd
> > activity). There are a couple of things we might be able to do with this:
> >
> > - we'll be able to determine if the socket is closed on the same or a
> >   different thread
> > - we might be able to correlate the time of closure with other logs
> >   (seems unlikely as we have this from Wireshark but you never know)
> > - the class before the close might be enlightening
>
> Hi Eric,
>
> I looked at the updated logs this morning. I don't see any additional logging
> for file descriptors in the strace output.
>
> I wonder if you need a slightly different command on your platform?
>
> I'd expect to see entries like this:
>
> [pid  8062]  0.70 openat(AT_FDCWD,
> "/home/mark/repos/asf-tomcat-master/output/build/webapps/ROOT/bg-
> nav.png",
> O_RDONLY) = 57
> [pid  8062]  0.27 fstat(57,  
> [pid  8062]  0.05 <... fstat resumed>{st_mode=S_IFREG|0664,
> st_size=1401, ...}) = 0
> [pid  8062]  0.43 read(57,  
> [pid  8062]  0.33 <... read
> resumed>"\211PNG\r\n\32\n\0\0\0\rIHDR\0\0\0\n\0\0\0002\10\6\0\0\0e\3
> 3J".
> resumed>..,
> 1401) = 1401
> [pid  8062]  0.13 close(57 
>
> showing file access although what I really want to see are the calls to close
> the sockets (like the last two in the sequence below from a test where I used
> telnet to perform an HTTP/1.0 request)
>
> pid  8069]  0.124099 <... accept resumed>{sa_family=AF_INET6,
> sin6_port=htons(52656), sin6_flowinfo=htonl(0), inet_pton(AF_INET6,
> ":::127.0.0.1", &sin6_addr), sin6_scope_id=0}, [28]) = 50 ...
> [pid  8063]  0.000216 read(50,  
> [pid  8063]  0.58 <... read resumed>"GET / HTTP/1.0\r\n", 8192) = 16
> [pid  8063]  0.29 read(50,  
> [pid  8063]  0.30 <... read resumed>0x7f4f6c000e70, 8192) = -1
> EAGAIN (Resource temporarily unavailable)
> [pid  8064]  0.001061 read(50, "Host: a\r\n", 8192) = 9
> [pid  8064]  0.000239 read(50, 0x7f4f6e70, 8192) = -1 EAGAIN
> (Resource temporarily unavailable)
> [pid  8062]  0.000214 read(50, "\r\n", 8192) = 2
> [pid  8062]  0.007897 write(50, "HTTP/1.1 200 \r\nContent-Type:
> tex"..., 8192) = 8192
> [pid  8062]  0.000353 write(50, ">Tomcat Native\n
> "..., 3079) = 3079
> [pid  8062]  0.002071 getsockopt(50, SOL_SOCKET, SO_LINGER,
> {l_onoff=0, l_linger=0}, [8]) = 0
> [pid  8062]  0.000102 shutdown(50, SHUT_WR) = 0
> [pid  8068]  0.000342 close(50) = 0
>
> It is probably worth running a couple of quick tests to figure out the correct
> form of the strace command on your platform and then retesting.
>
> Mark
>

Entirely my fault. I'm new to strace, so I didn't know what to expect. I have 
now read the strace man page and I'm more up to speed. I tested it and we're 
now capturing file descriptor operations. The next batch of logs will be better.

-Eric
Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

Re: Weirdest Tomcat Behavior Ever?

2020-11-18 Thread Mark Thomas

On 13/11/2020 23:46, Mark Thomas wrote:
> Eric sent me a copy of the strace (thanks Eric) and while it is
> consistent with what has already been observed, it didn't provide any
> new information on the socket / file descriptor being closed.
> 
> I'd like to suggest running again with the following:
> 
> sudo strace -r -f -e trace=network,desc -p 
> 
> That should log the file descriptor being closed (and other fd
> activity). There are a couple of things we might be able to do with this:
> 
> - we'll be able to determine if the socket is closed on the same or a
>   different thread
> - we might be able to correlate the time of closure with other logs
>   (seems unlikely as we have this from Wireshark but you never know)
> - the class before the close might be enlightening

Hi Eric,

I looked at the updated logs this morning. I don't see any additional
logging for file descriptors in the strace output.

I wonder if you need a slightly different command on your platform?

I'd expect to see entries like this:

[pid  8062]  0.70 openat(AT_FDCWD,
"/home/mark/repos/asf-tomcat-master/output/build/webapps/ROOT/bg-nav.png",
O_RDONLY) = 57
[pid  8062]  0.27 fstat(57,  
[pid  8062]  0.05 <... fstat resumed>{st_mode=S_IFREG|0664,
st_size=1401, ...}) = 0
[pid  8062]  0.43 read(57,  
[pid  8062]  0.33 <... read
resumed>"\211PNG\r\n\32\n\0\0\0\rIHDR\0\0\0\n\0\0\0002\10\6\0\0\0e\33J"...,
1401) = 1401
[pid  8062]  0.13 close(57 

showing file access although what I really want to see are the calls to
close the sockets (like the last two in the sequence below from a test
where I used telnet to perform an HTTP/1.0 request)

pid  8069]  0.124099 <... accept resumed>{sa_family=AF_INET6,
sin6_port=htons(52656), sin6_flowinfo=htonl(0), inet_pton(AF_INET6,
":::127.0.0.1", &sin6_addr), sin6_scope_id=0}, [28]) = 50
...
[pid  8063]  0.000216 read(50,  
[pid  8063]  0.58 <... read resumed>"GET / HTTP/1.0\r\n", 8192) = 16
[pid  8063]  0.29 read(50,  
[pid  8063]  0.30 <... read resumed>0x7f4f6c000e70, 8192) = -1
EAGAIN (Resource temporarily unavailable)
[pid  8064]  0.001061 read(50, "Host: a\r\n", 8192) = 9
[pid  8064]  0.000239 read(50, 0x7f4f6e70, 8192) = -1 EAGAIN
(Resource temporarily unavailable)
[pid  8062]  0.000214 read(50, "\r\n", 8192) = 2
[pid  8062]  0.007897 write(50, "HTTP/1.1 200 \r\nContent-Type:
tex"..., 8192) = 8192
[pid  8062]  0.000353 write(50, ">Tomcat Native\n
"..., 3079) = 3079
[pid  8062]  0.002071 getsockopt(50, SOL_SOCKET, SO_LINGER,
{l_onoff=0, l_linger=0}, [8]) = 0
[pid  8062]  0.000102 shutdown(50, SHUT_WR) = 0
[pid  8068]  0.000342 close(50) = 0

It is probably worth running a couple of quick tests to figure out the
correct form of the strace command on your platform and then retesting.

Mark

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

RE: Weirdest Tomcat Behavior Ever?

2020-11-15 Thread Eric Robinson





> -Original Message-
> From: Mark Thomas 
> Sent: Friday, November 13, 2020 5:47 PM
> To: users@tomcat.apache.org
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> Eric sent me a copy of the strace (thanks Eric) and while it is consistent 
> with
> what has already been observed, it didn't provide any new information on
> the socket / file descriptor being closed.
>
> I'd like to suggest running again with the following:
>
> sudo strace -r -f -e trace=network,desc -p 
>
> That should log the file descriptor being closed (and other fd activity). 
> There
> are a couple of things we might be able to do with this:
>
> - we'll be able to determine if the socket is closed on the same or a
>   different thread
> - we might be able to correlate the time of closure with other logs
>   (seems unlikely as we have this from Wireshark but you never know)
> - the class before the close might be enlightening
>
> Mark

We will do that first thing Monday when the users start work. Hopefully it will 
turn up something!

-Eric

>
> On 13/11/2020 22:05, Paul Carter-Brown wrote:
> > lol, and there I was feeling ignored :-)
> >
> > That was the first thing I would have looked at. Is the OS reporting
> > errors to the JVM writing data or is the JVM not writing the data.
> > Strace will tell you this quite easily.
> >
> >
> > On Fri, Nov 13, 2020 at 5:27 PM Eric Robinson
> > 
> > wrote:
> >
> >>
> >>> -Original Message-
> >>> From: Paul Carter-Brown 
> >>> Sent: Friday, October 16, 2020 6:11 AM
> >>> To: Tomcat Users List 
> >>> Subject: Re: Weirdest Tomcat Behavior Ever?
> >>>
> >>> Hi Eric,
> >>>
> >>> These weird situations are sometimes best looked at by confirming
> >>> what
> >> the
> >>> OS is seeing from user-space.
> >>>
> >>> Can you run: sudo strace -r -f -e trace=network -p 
> >>>
> >>> You can then log that to a file and correlate and see if the kernel
> >>> is
> >> in fact
> >>> being asked to send the response.
> >>>
> >>> It's very insightful to  see what is actually going on between the
> >>> JVM
> >> and
> >>> Kernel.
> >>>
> >>> Paul
> >>
> >> Paul, this message went to spam and I just found it!
> >>
> >> I will try this suggestion immediately.
> >>
> >> -Eric
> >>
> >>>
> >>> On Fri, Oct 16, 2020 at 12:16 PM Mark Thomas 
> wrote:
> >>>
> >>>> On 16/10/2020 10:05, Eric Robinson wrote:
> >>>>> Hi Mark --
> >>>>>
> >>>>> Those are great questions. See answers below.
> >>>>>
> >>>>>
> >>>>>> -Original Message-
> >>>>>> From: Mark Thomas 
> >>>>>> Sent: Friday, October 16, 2020 2:20 AM
> >>>>>> To: users@tomcat.apache.org
> >>>>>> Subject: Re: Weirdest Tomcat Behavior Ever?
> >>>>>>
> >>>>>> On 16/10/2020 00:27, Eric Robinson wrote:
> >>>>>>
> >>>>>> 
> >>>>>>
> >>>>>>> The localhost_access log shows a request received and an HTTP
> >>>>>>> 200
> >>>>>> response sent, as follows...
> >>>>>>>
> >>>>>>> 10.51.14.133 [15/Oct/2020:12:52:45 -0400] 57 GET
> >>>>>>> /app/code.jsp?gizmoid=64438&weekday=5&aptdate=2020-10-
> >>>>>> 15&multiResFacFi
> >>>>>>>
> >>>>>>
> >>>
> lterId=0&sessionDID=0&GzUserId=71340&zztusrlogtblid=321072&zztapppr
> >>>>>> oc
> >>>>>> e
> >>>>>>> ssid=40696&rnd2=0.0715816×tamp=15102020125245.789063
> >>> HTTP/1.0
> >>>>>>> ?gizmoid=64438&weekday=5&aptdate=2020-10-
> >>>>>> 15&multiResFacFilterId=0&sess
> >>>>>>>
> >>>>>>
> >>>
> ionDID=0&GzUserId=71340&zztusrlogtblid=321072&zztappprocessid=40696
> >>>>>> &
> >>>>>> rn
> >>>>>>> d2=0.0715816×tamp=15102020125245.789063 200
> >>>>>>>
> >>>>>>> But WireShark

Re: Weirdest Tomcat Behavior Ever?

2020-11-13 Thread Mark Thomas

Eric sent me a copy of the strace (thanks Eric) and while it is
consistent with what has already been observed, it didn't provide any
new information on the socket / file descriptor being closed.

I'd like to suggest running again with the following:

sudo strace -r -f -e trace=network,desc -p 

That should log the file descriptor being closed (and other fd
activity). There are a couple of things we might be able to do with this:

- we'll be able to determine if the socket is closed on the same or a
  different thread
- we might be able to correlate the time of closure with other logs
  (seems unlikely as we have this from Wireshark but you never know)
- the class before the close might be enlightening

Mark

On 13/11/2020 22:05, Paul Carter-Brown wrote:
> lol, and there I was feeling ignored :-)
> 
> That was the first thing I would have looked at. Is the OS reporting errors
> to the JVM writing data or is the JVM not writing the data. Strace will
> tell you this quite easily.
> 
> 
> On Fri, Nov 13, 2020 at 5:27 PM Eric Robinson 
> wrote:
> 
>>
>>> -Original Message-
>>> From: Paul Carter-Brown 
>>> Sent: Friday, October 16, 2020 6:11 AM
>>> To: Tomcat Users List 
>>> Subject: Re: Weirdest Tomcat Behavior Ever?
>>>
>>> Hi Eric,
>>>
>>> These weird situations are sometimes best looked at by confirming what
>> the
>>> OS is seeing from user-space.
>>>
>>> Can you run: sudo strace -r -f -e trace=network -p 
>>>
>>> You can then log that to a file and correlate and see if the kernel is
>> in fact
>>> being asked to send the response.
>>>
>>> It's very insightful to  see what is actually going on between the JVM
>> and
>>> Kernel.
>>>
>>> Paul
>>
>> Paul, this message went to spam and I just found it!
>>
>> I will try this suggestion immediately.
>>
>> -Eric
>>
>>>
>>> On Fri, Oct 16, 2020 at 12:16 PM Mark Thomas  wrote:
>>>
>>>> On 16/10/2020 10:05, Eric Robinson wrote:
>>>>> Hi Mark --
>>>>>
>>>>> Those are great questions. See answers below.
>>>>>
>>>>>
>>>>>> -Original Message-
>>>>>> From: Mark Thomas 
>>>>>> Sent: Friday, October 16, 2020 2:20 AM
>>>>>> To: users@tomcat.apache.org
>>>>>> Subject: Re: Weirdest Tomcat Behavior Ever?
>>>>>>
>>>>>> On 16/10/2020 00:27, Eric Robinson wrote:
>>>>>>
>>>>>> 
>>>>>>
>>>>>>> The localhost_access log shows a request received and an HTTP 200
>>>>>> response sent, as follows...
>>>>>>>
>>>>>>> 10.51.14.133 [15/Oct/2020:12:52:45 -0400] 57 GET
>>>>>>> /app/code.jsp?gizmoid=64438&weekday=5&aptdate=2020-10-
>>>>>> 15&multiResFacFi
>>>>>>>
>>>>>>
>>> lterId=0&sessionDID=0&GzUserId=71340&zztusrlogtblid=321072&zztapppr
>>>>>> oc
>>>>>> e
>>>>>>> ssid=40696&rnd2=0.0715816×tamp=15102020125245.789063
>>> HTTP/1.0
>>>>>>> ?gizmoid=64438&weekday=5&aptdate=2020-10-
>>>>>> 15&multiResFacFilterId=0&sess
>>>>>>>
>>>>>>
>>> ionDID=0&GzUserId=71340&zztusrlogtblid=321072&zztappprocessid=40696
>>>>>> &
>>>>>> rn
>>>>>>> d2=0.0715816×tamp=15102020125245.789063 200
>>>>>>>
>>>>>>> But WireShark shows what really happened. The server received the
>>>>>>> GET
>>>>>> request, and then it sent a FIN to terminate the connection. So if
>>>> tomcat sent
>>>>>> an HTTP response, it did not make it out the Ethernet card.
>>>>>>>
>>>>>>> Is this the weirdest thing or what? Ideas would sure be
>> appreciated!
>>>>>>
>>>>>> I am assuming there is a typo in your Java version and you are
>>>>>> using
>>>> Java 8.
>>>>>>
>>>>>
>>>>> Yes, Java 8.
>>>>>
>>>>>> That Tomcat version is over 3.5 years old (and Tomcat 7 is EOL in
>>>>>> less
>>>> than 6
>>>>>> months). If you aren't already planning to upgrade (I'd sug

Re: Weirdest Tomcat Behavior Ever?

2020-11-13 Thread Paul Carter-Brown

lol, and there I was feeling ignored :-)

That was the first thing I would have looked at. Is the OS reporting errors
to the JVM writing data or is the JVM not writing the data. Strace will
tell you this quite easily.


On Fri, Nov 13, 2020 at 5:27 PM Eric Robinson 
wrote:

>
> > -Original Message-
> > From: Paul Carter-Brown 
> > Sent: Friday, October 16, 2020 6:11 AM
> > To: Tomcat Users List 
> > Subject: Re: Weirdest Tomcat Behavior Ever?
> >
> > Hi Eric,
> >
> > These weird situations are sometimes best looked at by confirming what
> the
> > OS is seeing from user-space.
> >
> > Can you run: sudo strace -r -f -e trace=network -p 
> >
> > You can then log that to a file and correlate and see if the kernel is
> in fact
> > being asked to send the response.
> >
> > It's very insightful to  see what is actually going on between the JVM
> and
> > Kernel.
> >
> > Paul
>
> Paul, this message went to spam and I just found it!
>
> I will try this suggestion immediately.
>
> -Eric
>
> >
> > On Fri, Oct 16, 2020 at 12:16 PM Mark Thomas  wrote:
> >
> > > On 16/10/2020 10:05, Eric Robinson wrote:
> > > > Hi Mark --
> > > >
> > > > Those are great questions. See answers below.
> > > >
> > > >
> > > >> -Original Message-
> > > >> From: Mark Thomas 
> > > >> Sent: Friday, October 16, 2020 2:20 AM
> > > >> To: users@tomcat.apache.org
> > > >> Subject: Re: Weirdest Tomcat Behavior Ever?
> > > >>
> > > >> On 16/10/2020 00:27, Eric Robinson wrote:
> > > >>
> > > >> 
> > > >>
> > > >>> The localhost_access log shows a request received and an HTTP 200
> > > >> response sent, as follows...
> > > >>>
> > > >>> 10.51.14.133 [15/Oct/2020:12:52:45 -0400] 57 GET
> > > >>> /app/code.jsp?gizmoid=64438&weekday=5&aptdate=2020-10-
> > > >> 15&multiResFacFi
> > > >>>
> > > >>
> > lterId=0&sessionDID=0&GzUserId=71340&zztusrlogtblid=321072&zztapppr
> > > >> oc
> > > >> e
> > > >>> ssid=40696&rnd2=0.0715816×tamp=15102020125245.789063
> > HTTP/1.0
> > > >>> ?gizmoid=64438&weekday=5&aptdate=2020-10-
> > > >> 15&multiResFacFilterId=0&sess
> > > >>>
> > > >>
> > ionDID=0&GzUserId=71340&zztusrlogtblid=321072&zztappprocessid=40696
> > > >> &
> > > >> rn
> > > >>> d2=0.0715816×tamp=15102020125245.789063 200
> > > >>>
> > > >>> But WireShark shows what really happened. The server received the
> > > >>> GET
> > > >> request, and then it sent a FIN to terminate the connection. So if
> > > tomcat sent
> > > >> an HTTP response, it did not make it out the Ethernet card.
> > > >>>
> > > >>> Is this the weirdest thing or what? Ideas would sure be
> appreciated!
> > > >>
> > > >> I am assuming there is a typo in your Java version and you are
> > > >> using
> > > Java 8.
> > > >>
> > > >
> > > > Yes, Java 8.
> > > >
> > > >> That Tomcat version is over 3.5 years old (and Tomcat 7 is EOL in
> > > >> less
> > > than 6
> > > >> months). If you aren't already planning to upgrade (I'd suggest to
> > > 9.0.x) then
> > > >> you might want to start thinking about it.
> > > >>
> > > >
> > > > Vendor constraint. It's a canned application published by a national
> > > software company, and they have not officially approved tomcat 8 for
> > > use on Linux yet.
> > > >
> > > >> I have a few ideas about what might be going on but rather than
> > > >> fire out random theories I have some questions that might help
> > > >> narrow things
> > > down.
> > > >>
> > > >> 1. If this request was successful, how big is the response?
> > > >>
> > > >
> > > > 1035 bytes.
> > > >
> > > >> 2. If this request was successful, how long would it typically take
> > > >> to complete?
> > > >>
> > > >
> > > > Under 60 ms

RE: Weirdest Tomcat Behavior Ever?

2020-11-13 Thread Eric Robinson

> From: Thomas Meyer 
> Sent: Friday, November 13, 2020 9:37 AM
> To: Tomcat Users List ; Mark Thomas
> ; users@tomcat.apache.org
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
>
>
> Am 13. November 2020 10:06:18 MEZ schrieb Mark Thomas
> :
> >On 12/11/2020 14:19, Eric Robinson wrote:
> >>> From: Mark Thomas 
> >
> >
> >
> >>> I keep coming back to this. Something triggered this problem (note
> >that
> >>> trigger not necessarily the same as root cause). Given that the app,
> >Tomcat
> >>> and JVM versions didn't change that again points to some other
> >component.
> >>>
> >>
> >> Perfectly understandable. It's the oldest question in the diagnostic
> >playbook. What changed? I wish I had an answer. Whatever it was, if
> >impacted both upstream servers.
> >>
> >>> Picking just one of the wild ideas I've had is there some sort of
> >firewall, IDS,
> >>> IPS etc. that might be doing connection tracking and is, for some
> >reason,
> >>> getting it wrong and closing the connection in error?
> >>>
> >>
> >> Three is no firewall or IDS software running on the upstreams. The
> >only thing that comes to mind that may have been installed during that
> >timeframe is Sophos antivirus and Solar Winds RMM. Sophos was the first
> >thing I disabled when I saw the packet issues.
> >
> >ACK.
> >
> >>>>> The aim with this logging is to provide evidence of whether or not
> >>>>> there is a file descriptor handling problem in the JRE. My
> >>>>> expectation is that with these logs we will have reached the limit
> >of
> >>>>> what we can do with Tomcat but will be able to point you in the
> >right
> >>> direction for further investigation.
> >
> >I've had a chance to review these logs.
> >
> >To answer your (offlist) question about the HTTP/1.1 vs. HTTP/1.0 in
> >the Nginx logs I *think* the Nginx logs are showing that the request
> >received by Nginx is using HTTP/1.1.
> >
> >The logging does not indicate any issue with Java's handling of file
> >descriptors. The file descriptor associated with the socket where the
> >request fails is only observed to be associated with the socket where
> >the request fails. There is no indication that the file descriptor is
> >corrupted nor is there any indication that another thread tries to use
> >the same file descriptor.
> >
> >I dug a little into the exception where the write fails:
> >
> >java.net.SocketException: Bad file descriptor (Write failed)
> >at java.net.SocketOutputStream.socketWrite0(Native Method)
> >at
> >java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
> >at
> java.net.SocketOutputStream.write(SocketOutputStream.java:155)
> >at
> >org.apache.tomcat.util.net.JIoEndpoint$DebugOutputStream.write(JIoEnd
> point.java:1491)
> >at
> >org.apache.coyote.http11.InternalOutputBuffer.realWriteBytes(InternalOu
> tputBuffer.java:247)
> >at
> >org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:480)
> >at
> >org.apache.coyote.http11.InternalOutputBuffer.endRequest(InternalOutp
> ut
> >Buffer.java:183)
> >...
> >
> >
> >I took a look at the JRE source code. That exception is triggered by an
> >OS level error (9, EBADF, "Bad file descriptor") when the JRE makes the
> >OS call to write to the socket.
> >
> >Everything I have found online points to one of two causes for such an
> >error:
> >a) the socket has already been closed
> >b) the OS has run out of file descriptors
>
> Was it mentioned what OS is used? What Linux kernel version?
> Are any security modules like SELinux or similar is in use?
> It's maybe possible that a tracepoint exists that can be activated to get 
> better
> understanding when the OS closes the socket.
>

Way back at the beginning of the thread. 😊

CentOS Linux release 7.8.2003 (Core)

[root@001app01a ~]# uname -a
Linux 001app01a.ccnva.local 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 
17:23:54 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

[root@001app01a ~]# sestatus
SELinux status: enabled
SELinuxfs mount:/sys/fs/selinux
SELinux root directory: /etc/selinux
Loaded policy name: targeted
Current mode:   permissive
Mode from config file:  permissive
Policy MLS status:  enabled
Policy deny_unknown status: allowed
Max kernel policy version:  31


> &g

Re: Weirdest Tomcat Behavior Ever?

2020-11-13 Thread Thomas Meyer




Am 13. November 2020 10:06:18 MEZ schrieb Mark Thomas :
>On 12/11/2020 14:19, Eric Robinson wrote:
>>> From: Mark Thomas 
>
>
>
>>> I keep coming back to this. Something triggered this problem (note
>that
>>> trigger not necessarily the same as root cause). Given that the app,
>Tomcat
>>> and JVM versions didn't change that again points to some other
>component.
>>>
>> 
>> Perfectly understandable. It's the oldest question in the diagnostic
>playbook. What changed? I wish I had an answer. Whatever it was, if
>impacted both upstream servers.
>> 
>>> Picking just one of the wild ideas I've had is there some sort of
>firewall, IDS,
>>> IPS etc. that might be doing connection tracking and is, for some
>reason,
>>> getting it wrong and closing the connection in error?
>>>
>> 
>> Three is no firewall or IDS software running on the upstreams. The
>only thing that comes to mind that may have been installed during that
>timeframe is Sophos antivirus and Solar Winds RMM. Sophos was the first
>thing I disabled when I saw the packet issues.
>
>ACK.
>
> The aim with this logging is to provide evidence of whether or not
> there is a file descriptor handling problem in the JRE. My
> expectation is that with these logs we will have reached the limit
>of
> what we can do with Tomcat but will be able to point you in the
>right
>>> direction for further investigation.
>
>I've had a chance to review these logs.
>
>To answer your (offlist) question about the HTTP/1.1 vs. HTTP/1.0 in
>the
>Nginx logs I *think* the Nginx logs are showing that the request
>received by Nginx is using HTTP/1.1.
>
>The logging does not indicate any issue with Java's handling of file
>descriptors. The file descriptor associated with the socket where the
>request fails is only observed to be associated with the socket where
>the request fails. There is no indication that the file descriptor is
>corrupted nor is there any indication that another thread tries to use
>the same file descriptor.
>
>I dug a little into the exception where the write fails:
>
>java.net.SocketException: Bad file descriptor (Write failed)
>   at java.net.SocketOutputStream.socketWrite0(Native Method)
>   at
>java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
>   at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
>   at
>org.apache.tomcat.util.net.JIoEndpoint$DebugOutputStream.write(JIoEndpoint.java:1491)
>   at
>org.apache.coyote.http11.InternalOutputBuffer.realWriteBytes(InternalOutputBuffer.java:247)
>   at
>org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:480)
>   at
>org.apache.coyote.http11.InternalOutputBuffer.endRequest(InternalOutputBuffer.java:183)
>...
>
>
>I took a look at the JRE source code. That exception is triggered by an
>OS level error (9, EBADF, "Bad file descriptor") when the JRE makes the
>OS call to write to the socket.
>
>Everything I have found online points to one of two causes for such an
>error:
>a) the socket has already been closed
>b) the OS has run out of file descriptors

Was it mentioned what OS is used? What Linux kernel version?
Are any security modules like SELinux or similar is in use?
It's maybe possible that a tracepoint exists that can be activated to get 
better understanding when the OS closes the socket.

>
>There is no indication that the JRE or Tomcat or the application is
>doing a)
>Previous investigations have ruled out b)
>
>The wireshark trace indicates that the socket is closed before the
>write
>takes place which suggests a) rather more than b). Even so, I'd be
>tempted to double check b) and maybe try running Tomcat with
>-XX:+MaxFDLimit just to be sure.
>
>If you haven't already, I think now is the time to follow Paul
>Carter-Brown's advice from earlier in this thread and use strace to see
>what is going on between the JRE and the OS. The aim being to answer
>the
>question "what is triggering the socket close"
>
>I can try and help interpret that log but I am far from an expert. You
>may want to seek help elsewhere.
>
>Mark
>
>-
>To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
>For additional commands, e-mail: users-h...@tomcat.apache.org

-- 
Diese Nachricht wurde von meinem Android-Gerät mit K-9 Mail gesendet.

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

RE: Weirdest Tomcat Behavior Ever?

2020-11-13 Thread Eric Robinson


> -Original Message-
> From: Paul Carter-Brown 
> Sent: Friday, October 16, 2020 6:11 AM
> To: Tomcat Users List 
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> Hi Eric,
>
> These weird situations are sometimes best looked at by confirming what the
> OS is seeing from user-space.
>
> Can you run: sudo strace -r -f -e trace=network -p 
>
> You can then log that to a file and correlate and see if the kernel is in fact
> being asked to send the response.
>
> It's very insightful to  see what is actually going on between the JVM and
> Kernel.
>
> Paul

Paul, this message went to spam and I just found it!

I will try this suggestion immediately.

-Eric

>
> On Fri, Oct 16, 2020 at 12:16 PM Mark Thomas  wrote:
>
> > On 16/10/2020 10:05, Eric Robinson wrote:
> > > Hi Mark --
> > >
> > > Those are great questions. See answers below.
> > >
> > >
> > >> -Original Message-
> > >> From: Mark Thomas 
> > >> Sent: Friday, October 16, 2020 2:20 AM
> > >> To: users@tomcat.apache.org
> > >> Subject: Re: Weirdest Tomcat Behavior Ever?
> > >>
> > >> On 16/10/2020 00:27, Eric Robinson wrote:
> > >>
> > >> 
> > >>
> > >>> The localhost_access log shows a request received and an HTTP 200
> > >> response sent, as follows...
> > >>>
> > >>> 10.51.14.133 [15/Oct/2020:12:52:45 -0400] 57 GET
> > >>> /app/code.jsp?gizmoid=64438&weekday=5&aptdate=2020-10-
> > >> 15&multiResFacFi
> > >>>
> > >>
> lterId=0&sessionDID=0&GzUserId=71340&zztusrlogtblid=321072&zztapppr
> > >> oc
> > >> e
> > >>> ssid=40696&rnd2=0.0715816×tamp=15102020125245.789063
> HTTP/1.0
> > >>> ?gizmoid=64438&weekday=5&aptdate=2020-10-
> > >> 15&multiResFacFilterId=0&sess
> > >>>
> > >>
> ionDID=0&GzUserId=71340&zztusrlogtblid=321072&zztappprocessid=40696
> > >> &
> > >> rn
> > >>> d2=0.0715816×tamp=15102020125245.789063 200
> > >>>
> > >>> But WireShark shows what really happened. The server received the
> > >>> GET
> > >> request, and then it sent a FIN to terminate the connection. So if
> > tomcat sent
> > >> an HTTP response, it did not make it out the Ethernet card.
> > >>>
> > >>> Is this the weirdest thing or what? Ideas would sure be appreciated!
> > >>
> > >> I am assuming there is a typo in your Java version and you are
> > >> using
> > Java 8.
> > >>
> > >
> > > Yes, Java 8.
> > >
> > >> That Tomcat version is over 3.5 years old (and Tomcat 7 is EOL in
> > >> less
> > than 6
> > >> months). If you aren't already planning to upgrade (I'd suggest to
> > 9.0.x) then
> > >> you might want to start thinking about it.
> > >>
> > >
> > > Vendor constraint. It's a canned application published by a national
> > software company, and they have not officially approved tomcat 8 for
> > use on Linux yet.
> > >
> > >> I have a few ideas about what might be going on but rather than
> > >> fire out random theories I have some questions that might help
> > >> narrow things
> > down.
> > >>
> > >> 1. If this request was successful, how big is the response?
> > >>
> > >
> > > 1035 bytes.
> > >
> > >> 2. If this request was successful, how long would it typically take
> > >> to complete?
> > >>
> > >
> > > Under 60 ms.
> > >
> > >> 3. Looking at the Wireshark trace for a failed request, how long
> > >> after
> > the last
> > >> byte of the request is sent by the client does Tomcat send the FIN?
> > >>
> > >
> > > Maybe 100 microseconds.
> > >
> > >> 4. Looking at the Wireshark trace for a failed request, is the
> > >> request
> > fully sent
> > >> (including terminating CRLF etc)?
> > >>
> > >
> > > Yes, the request as seen by the tomcat server is complete and is
> > terminated by 0D 0A.
> > >
> > >> 5. Are there any proxies, firewalls etc between the user agent and
> > Tomcat?
> > >>
> > >
> > > User agent -> firewal

RE: Weirdest Tomcat Behavior Ever?

2020-11-13 Thread Eric Robinson

> -Original Message-
> From: Mark Thomas 
> Sent: Friday, November 13, 2020 3:06 AM
> To: users@tomcat.apache.org
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> On 12/11/2020 14:19, Eric Robinson wrote:
> >> From: Mark Thomas 
>
> 
>
> >> I keep coming back to this. Something triggered this problem (note
> >> that trigger not necessarily the same as root cause). Given that the
> >> app, Tomcat and JVM versions didn't change that again points to some
> other component.
> >>
> >
> > Perfectly understandable. It's the oldest question in the diagnostic
> playbook. What changed? I wish I had an answer. Whatever it was, if
> impacted both upstream servers.
> >
> >> Picking just one of the wild ideas I've had is there some sort of
> >> firewall, IDS, IPS etc. that might be doing connection tracking and
> >> is, for some reason, getting it wrong and closing the connection in error?
> >>
> >
> > Three is no firewall or IDS software running on the upstreams. The only
> thing that comes to mind that may have been installed during that timeframe
> is Sophos antivirus and Solar Winds RMM. Sophos was the first thing I
> disabled when I saw the packet issues.
>
> ACK.
>
> >>>> The aim with this logging is to provide evidence of whether or not
> >>>> there is a file descriptor handling problem in the JRE. My
> >>>> expectation is that with these logs we will have reached the limit
> >>>> of what we can do with Tomcat but will be able to point you in the
> >>>> right
> >> direction for further investigation.
>
> I've had a chance to review these logs.
>
> To answer your (offlist) question about the HTTP/1.1 vs. HTTP/1.0 in the
> Nginx logs I *think* the Nginx logs are showing that the request received by
> Nginx is using HTTP/1.1.
>
> The logging does not indicate any issue with Java's handling of file
> descriptors. The file descriptor associated with the socket where the request
> fails is only observed to be associated with the socket where the request
> fails. There is no indication that the file descriptor is corrupted nor is 
> there
> any indication that another thread tries to use the same file descriptor.
>
> I dug a little into the exception where the write fails:
>
> java.net.SocketException: Bad file descriptor (Write failed)
> at java.net.SocketOutputStream.socketWrite0(Native Method)
> at
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
> at
> java.net.SocketOutputStream.write(SocketOutputStream.java:155)
> at
> org.apache.tomcat.util.net.JIoEndpoint$DebugOutputStream.write(JIoEndp
> oint.java:1491)
> at
> org.apache.coyote.http11.InternalOutputBuffer.realWriteBytes(InternalOut
> putBuffer.java:247)
> at
> org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:480)
> at
> org.apache.coyote.http11.InternalOutputBuffer.endRequest(InternalOutput
> Buffer.java:183)
> ...
>
>
> I took a look at the JRE source code. That exception is triggered by an OS 
> level
> error (9, EBADF, "Bad file descriptor") when the JRE makes the OS call to
> write to the socket.
>
> Everything I have found online points to one of two causes for such an
> error:
> a) the socket has already been closed
> b) the OS has run out of file descriptors
>
> There is no indication that the JRE or Tomcat or the application is doing a)
> Previous investigations have ruled out b)
>
> The wireshark trace indicates that the socket is closed before the write takes
> place which suggests a) rather more than b). Even so, I'd be tempted to
> double check b) and maybe try running Tomcat with -XX:+MaxFDLimit just to
> be sure.
>
> If you haven't already, I think now is the time to follow Paul Carter-Brown's
> advice from earlier in this thread and use strace to see what is going on
> between the JRE and the OS. The aim being to answer the question "what is
> triggering the socket close"
>

This is the second time you alluded to comments from a someone I haven't seen 
in the thread . I just checked my spam folder and found that, for some unknown 
reason, 4 messages in this long thread went to spam. They were from Paul 
Carter-Brown, Konstantin Kolinko, and Daniel Skiles. They probably thought I 
ignored them.  Now I'll go check out their recommendations.

> I can try and help interpret that log but I am far from an expert. You may
> want to seek help elsewhere.
>
> Mark
>
> -
> To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
&

Re: Weirdest Tomcat Behavior Ever?

2020-11-13 Thread Mark Thomas

On 12/11/2020 14:19, Eric Robinson wrote:
>> From: Mark Thomas 

>> I keep coming back to this. Something triggered this problem (note that
>> trigger not necessarily the same as root cause). Given that the app, Tomcat
>> and JVM versions didn't change that again points to some other component.
>>
> 
> Perfectly understandable. It's the oldest question in the diagnostic 
> playbook. What changed? I wish I had an answer. Whatever it was, if impacted 
> both upstream servers.
> 
>> Picking just one of the wild ideas I've had is there some sort of firewall, 
>> IDS,
>> IPS etc. that might be doing connection tracking and is, for some reason,
>> getting it wrong and closing the connection in error?
>>
> 
> Three is no firewall or IDS software running on the upstreams. The only thing 
> that comes to mind that may have been installed during that timeframe is 
> Sophos antivirus and Solar Winds RMM. Sophos was the first thing I disabled 
> when I saw the packet issues.

ACK.

 The aim with this logging is to provide evidence of whether or not
 there is a file descriptor handling problem in the JRE. My
 expectation is that with these logs we will have reached the limit of
 what we can do with Tomcat but will be able to point you in the right
>> direction for further investigation.

I've had a chance to review these logs.

To answer your (offlist) question about the HTTP/1.1 vs. HTTP/1.0 in the
Nginx logs I *think* the Nginx logs are showing that the request
received by Nginx is using HTTP/1.1.

The logging does not indicate any issue with Java's handling of file
descriptors. The file descriptor associated with the socket where the
request fails is only observed to be associated with the socket where
the request fails. There is no indication that the file descriptor is
corrupted nor is there any indication that another thread tries to use
the same file descriptor.

I dug a little into the exception where the write fails:

java.net.SocketException: Bad file descriptor (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at
org.apache.tomcat.util.net.JIoEndpoint$DebugOutputStream.write(JIoEndpoint.java:1491)
at
org.apache.coyote.http11.InternalOutputBuffer.realWriteBytes(InternalOutputBuffer.java:247)
at org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:480)
at
org.apache.coyote.http11.InternalOutputBuffer.endRequest(InternalOutputBuffer.java:183)
...

I took a look at the JRE source code. That exception is triggered by an
OS level error (9, EBADF, "Bad file descriptor") when the JRE makes the
OS call to write to the socket.

Everything I have found online points to one of two causes for such an
error:
a) the socket has already been closed
b) the OS has run out of file descriptors

There is no indication that the JRE or Tomcat or the application is doing a)
Previous investigations have ruled out b)

The wireshark trace indicates that the socket is closed before the write
takes place which suggests a) rather more than b). Even so, I'd be
tempted to double check b) and maybe try running Tomcat with
-XX:+MaxFDLimit just to be sure.

If you haven't already, I think now is the time to follow Paul
Carter-Brown's advice from earlier in this thread and use strace to see
what is going on between the JRE and the OS. The aim being to answer the
question "what is triggering the socket close"

I can try and help interpret that log but I am far from an expert. You
may want to seek help elsewhere.

Mark

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

RE: Weirdest Tomcat Behavior Ever?

2020-11-12 Thread Eric Robinson

> -Original Message-
> From: Mark Thomas 
> Sent: Thursday, November 12, 2020 4:08 AM
> To: Tomcat Users List ; Eric Robinson
> 
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> On 11/11/2020 22:48, Eric Robinson wrote:
> >> -Original Message-
> >> From: Mark Thomas 
> >> Sent: Monday, November 9, 2020 5:59 AM
> >> To: users@tomcat.apache.org
> >> Subject: Re: Weirdest Tomcat Behavior Ever?
> >>
> >> Eric,
> >>
> >> Time to prune the history and provide another summary I think. This
> >> summary isn't complete. There is more information in the history of
> >> the thread. I'm trying to focus on what seems to be the key information.
> >>
> >
> > Hi Mark -- So sorry for going silent for a couple of days. Our organization 
> > is
> neck-deep in a huge compliance project. Combine that with this issue we're
> working on together, and it's Perfect Storm time around here. We have a big
> meeting with the client and vendor tomorrow about all this and I'm working
> like heck to prevent this important customer from jumping ship.
>
> Understood. Let me know if there is anything I can do to help.
>
> > Now back to it!
> >
> >>
> >> Overview:
> >> A small number of requests are receiving a completely empty (no
> >> headers, no body) response.
> >>
> >
> > Just a FIN packet and that's all.
>
> Agreed.
>
> >> Environment
> >> Tomcat 7.0.72
> >>  - BIO HTTP (issue also observed with NIO)
> >>  - Source unknown (probably ASF)
> >> Java 1.8.0_221, Oracle
> >> CentOS 7.5, Azure
> >> Nginx reverse proxy
> >>  - Using HTTP/1.0
> >>  - No keep-alive
> >>  - No compression
> >> No (known) environment changes in the time period where this issue
> >> started
>
> I keep coming back to this. Something triggered this problem (note that
> trigger not necessarily the same as root cause). Given that the app, Tomcat
> and JVM versions didn't change that again points to some other component.
>

Perfectly understandable. It's the oldest question in the diagnostic playbook. 
What changed? I wish I had an answer. Whatever it was, if impacted both 
upstream servers.

> Picking just one of the wild ideas I've had is there some sort of firewall, 
> IDS,
> IPS etc. that might be doing connection tracking and is, for some reason,
> getting it wrong and closing the connection in error?
>

Three is no firewall or IDS software running on the upstreams. The only thing 
that comes to mind that may have been installed during that timeframe is Sophos 
antivirus and Solar Winds RMM. Sophos was the first thing I disabled when I saw 
the packet issues.

> As an aside, I mentioned earlier in this thread a similar issue we have been
> observing in the CI system. I tracked that down yesterday and I am certain
> the issues are unrelated. The CI issue was NIO specific (we see this issue 
> with
> BIO and NIO) and introduced by refactoring in 8.5.x (we see this issue in
> 7.0.x). Sorry this doesn't help.
>
> >> Results from debug logging
> >> - The request is read without error
> >> - The connection close is initiated from the Tomcat/Java side
> >> - The socket is closed before Tomcat tries to write the response
> >> - The application is not triggering the close of the socket
> >> - Tomcat is not triggering the close of the socket
> >> - When Tomcat does try and write we see the following exception
> >> java.net.SocketException: Bad file descriptor (Write failed)
> >>
> >> We have confirmed that the Java process is not hitting the limit for
> >> file descriptors.
> >>
> >> The file descriptor must have been valid when the request was read
> >> from the socket.
> >>
> >> The first debug log shows 2 other active connections from Nginx to
> >> Tomcat at the point the connection is closed unexpectedly.
> >>
> >> The second debug log shows 1 other active connection from Nginx to
> >> Tomcat at the point the connection is closed unexpectedly.
> >>
> >> The third debug log shows 1 other active connection from Nginx to
> >> Tomcat at the point the connection is closed unexpectedly.
> >>
> >> The fourth debug log shows no other active connection from Nginx to
> >> Tomcat at the point the connection is closed unexpectedly.
> >>
> >>
> >> Analysis
> >>
> >> We know the connection close isn't coming from Tomcat or the
>

Re: Weirdest Tomcat Behavior Ever?

2020-11-12 Thread Mark Thomas

On 11/11/2020 22:48, Eric Robinson wrote:
>> -Original Message-
>> From: Mark Thomas 
>> Sent: Monday, November 9, 2020 5:59 AM
>> To: users@tomcat.apache.org
>> Subject: Re: Weirdest Tomcat Behavior Ever?
>>
>> Eric,
>>
>> Time to prune the history and provide another summary I think. This
>> summary isn't complete. There is more information in the history of the
>> thread. I'm trying to focus on what seems to be the key information.
>>
> 
> Hi Mark -- So sorry for going silent for a couple of days. Our organization 
> is neck-deep in a huge compliance project. Combine that with this issue we're 
> working on together, and it's Perfect Storm time around here. We have a big 
> meeting with the client and vendor tomorrow about all this and I'm working 
> like heck to prevent this important customer from jumping ship.

Understood. Let me know if there is anything I can do to help.

> Now back to it!
> 
>>
>> Overview:
>> A small number of requests are receiving a completely empty (no headers,
>> no body) response.
>>
> 
> Just a FIN packet and that's all.

Agreed.

>> Environment
>> Tomcat 7.0.72
>>  - BIO HTTP (issue also observed with NIO)
>>  - Source unknown (probably ASF)
>> Java 1.8.0_221, Oracle
>> CentOS 7.5, Azure
>> Nginx reverse proxy
>>  - Using HTTP/1.0
>>  - No keep-alive
>>  - No compression
>> No (known) environment changes in the time period where this issue started

I keep coming back to this. Something triggered this problem (note that
trigger not necessarily the same as root cause). Given that the app,
Tomcat and JVM versions didn't change that again points to some other
component.

Picking just one of the wild ideas I've had is there some sort of
firewall, IDS, IPS etc. that might be doing connection tracking and is,
for some reason, getting it wrong and closing the connection in error?

As an aside, I mentioned earlier in this thread a similar issue we have
been observing in the CI system. I tracked that down yesterday and I am
certain the issues are unrelated. The CI issue was NIO specific (we see
this issue with BIO and NIO) and introduced by refactoring in 8.5.x (we
see this issue in 7.0.x). Sorry this doesn't help.

>> Results from debug logging
>> - The request is read without error
>> - The connection close is initiated from the Tomcat/Java side
>> - The socket is closed before Tomcat tries to write the response
>> - The application is not triggering the close of the socket
>> - Tomcat is not triggering the close of the socket
>> - When Tomcat does try and write we see the following exception
>> java.net.SocketException: Bad file descriptor (Write failed)
>>
>> We have confirmed that the Java process is not hitting the limit for file
>> descriptors.
>>
>> The file descriptor must have been valid when the request was read from
>> the socket.
>>
>> The first debug log shows 2 other active connections from Nginx to Tomcat at
>> the point the connection is closed unexpectedly.
>>
>> The second debug log shows 1 other active connection from Nginx to Tomcat
>> at the point the connection is closed unexpectedly.
>>
>> The third debug log shows 1 other active connection from Nginx to Tomcat at
>> the point the connection is closed unexpectedly.
>>
>> The fourth debug log shows no other active connection from Nginx to
>> Tomcat at the point the connection is closed unexpectedly.
>>
>>
>> Analysis
>>
>> We know the connection close isn't coming from Tomcat or the application.
>> That leaves:
>> - the JVM
>> - the OS
>> - the virtualisation layer (since this is Azure I am assuming there is
>>   one)
>>
>> We are approaching the limit of what we can debug via Tomcat (and my area
>> of expertise. The evidence so far is pointing to an issue lower down the
>> network stack (JVM, OS or virtualisation layer).
>>
> 
> Can't disagree with you there.
> 
>> I think the next, and possibly last, thing we can do from Tomcat is log some
>> information on the file descriptor associated with the socket. That is going 
>> to
>> require some reflection to read JVM internals.
>>
>> Patch files here:
>> http://home.apache.org/~markt/dev/v7.0.72-custom-patch-v4/
>>
>> Source code here:
>> https://github.com/markt-asf/tomcat/tree/debug-7.0.72
>>
> 
> I will apply these tonight.
>
>> The file descriptor usage count is guarded by a lock object so this patch 
>> adds
>> quite a few s

RE: Weirdest Tomcat Behavior Ever?

2020-11-11 Thread Eric Robinson

> -Original Message-
> From: Mark Thomas 
> Sent: Monday, November 9, 2020 5:59 AM
> To: users@tomcat.apache.org
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> Eric,
>
> Time to prune the history and provide another summary I think. This
> summary isn't complete. There is more information in the history of the
> thread. I'm trying to focus on what seems to be the key information.
>

Hi Mark -- So sorry for going silent for a couple of days. Our organization is 
neck-deep in a huge compliance project. Combine that with this issue we're 
working on together, and it's Perfect Storm time around here. We have a big 
meeting with the client and vendor tomorrow about all this and I'm working like 
heck to prevent this important customer from jumping ship.

Now back to it!

>
> Overview:
> A small number of requests are receiving a completely empty (no headers,
> no body) response.
>

Just a FIN packet and that's all.

> Environment
> Tomcat 7.0.72
>  - BIO HTTP (issue also observed with NIO)
>  - Source unknown (probably ASF)
> Java 1.8.0_221, Oracle
> CentOS 7.5, Azure
> Nginx reverse proxy
>  - Using HTTP/1.0
>  - No keep-alive
>  - No compression
> No (known) environment changes in the time period where this issue started
>
> Results from debug logging
> - The request is read without error
> - The connection close is initiated from the Tomcat/Java side
> - The socket is closed before Tomcat tries to write the response
> - The application is not triggering the close of the socket
> - Tomcat is not triggering the close of the socket
> - When Tomcat does try and write we see the following exception
> java.net.SocketException: Bad file descriptor (Write failed)
>
> We have confirmed that the Java process is not hitting the limit for file
> descriptors.
>
> The file descriptor must have been valid when the request was read from
> the socket.
>
> The first debug log shows 2 other active connections from Nginx to Tomcat at
> the point the connection is closed unexpectedly.
>
> The second debug log shows 1 other active connection from Nginx to Tomcat
> at the point the connection is closed unexpectedly.
>
> The third debug log shows 1 other active connection from Nginx to Tomcat at
> the point the connection is closed unexpectedly.
>
> The fourth debug log shows no other active connection from Nginx to
> Tomcat at the point the connection is closed unexpectedly.
>
>
> Analysis
>
> We know the connection close isn't coming from Tomcat or the application.
> That leaves:
> - the JVM
> - the OS
> - the virtualisation layer (since this is Azure I am assuming there is
>   one)
>
> We are approaching the limit of what we can debug via Tomcat (and my area
> of expertise. The evidence so far is pointing to an issue lower down the
> network stack (JVM, OS or virtualisation layer).
>

Can't disagree with you there.

> I think the next, and possibly last, thing we can do from Tomcat is log some
> information on the file descriptor associated with the socket. That is going 
> to
> require some reflection to read JVM internals.
>
> Patch files here:
> http://home.apache.org/~markt/dev/v7.0.72-custom-patch-v4/
>
> Source code here:
> https://github.com/markt-asf/tomcat/tree/debug-7.0.72
>

I will apply these tonight.

> The file descriptor usage count is guarded by a lock object so this patch adds
> quite a few syncs. For the load you are seeing that shouldn't an issue but
> there is a change it will impact performance.
>

Based on observation of load, I'm not too concerned about that. Maybe a little. 
I'll keep an eye on it.

> The aim with this logging is to provide evidence of whether or not there is a
> file descriptor handling problem in the JRE. My expectation is that with these
> logs we will have reached the limit of what we can do with Tomcat but will be
> able to point you in the right direction for further investigation.
>

I'll get this done right away.

> Mark
>
> -
> To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
> For additional commands, e-mail: users-h...@tomcat.apache.org

Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

Re: Weirdest Tomcat Behavior Ever?

2020-11-09 Thread Mark Thomas

Eric,

Time to prune the history and provide another summary I think. This
summary isn't complete. There is more information in the history of the
thread. I'm trying to focus on what seems to be the key information.


Overview:
A small number of requests are receiving a completely empty (no headers,
no body) response.

Environment
Tomcat 7.0.72
 - BIO HTTP (issue also observed with NIO)
 - Source unknown (probably ASF)
Java 1.8.0_221, Oracle
CentOS 7.5, Azure
Nginx reverse proxy
 - Using HTTP/1.0
 - No keep-alive
 - No compression
No (known) environment changes in the time period where this issue started

Results from debug logging
- The request is read without error
- The connection close is initiated from the Tomcat/Java side
- The socket is closed before Tomcat tries to write the response
- The application is not triggering the close of the socket
- Tomcat is not triggering the close of the socket
- When Tomcat does try and write we see the following exception
java.net.SocketException: Bad file descriptor (Write failed)

We have confirmed that the Java process is not hitting the limit for
file descriptors.

The file descriptor must have been valid when the request was read from
the socket.

The first debug log shows 2 other active connections from Nginx to
Tomcat at the point the connection is closed unexpectedly.

The second debug log shows 1 other active connection from Nginx to
Tomcat at the point the connection is closed unexpectedly.

The third debug log shows 1 other active connection from Nginx to Tomcat
at the point the connection is closed unexpectedly.

The fourth debug log shows no other active connection from Nginx to
Tomcat at the point the connection is closed unexpectedly.


Analysis

We know the connection close isn't coming from Tomcat or the
application. That leaves:
- the JVM
- the OS
- the virtualisation layer (since this is Azure I am assuming there is
  one)

We are approaching the limit of what we can debug via Tomcat (and my
area of expertise. The evidence so far is pointing to an issue lower
down the network stack (JVM, OS or virtualisation layer).

I think the next, and possibly last, thing we can do from Tomcat is log
some information on the file descriptor associated with the socket. That
is going to require some reflection to read JVM internals.

Patch files here:
http://home.apache.org/~markt/dev/v7.0.72-custom-patch-v4/

Source code here:
https://github.com/markt-asf/tomcat/tree/debug-7.0.72

The file descriptor usage count is guarded by a lock object so this
patch adds quite a few syncs. For the load you are seeing that shouldn't
an issue but there is a change it will impact performance.

The aim with this logging is to provide evidence of whether or not there
is a file descriptor handling problem in the JRE. My expectation is that
with these logs we will have reached the limit of what we can do with
Tomcat but will be able to point you in the right direction for further
investigation.

Mark

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

RE: Weirdest Tomcat Behavior Ever?

2020-11-06 Thread Eric Robinson

> > -Original Message-
> > From: Stefan Mayr 
> > Sent: Thursday, November 5, 2020 4:24 PM
> > To: users@tomcat.apache.org
> > Subject: Re: Weirdest Tomcat Behavior Ever?
> >
> > Am 03.11.2020 um 16:05 schrieb Eric Robinson:
> > >> -Original Message-
> > >> From: Eric Robinson 
> > >> Sent: Tuesday, November 3, 2020 8:21 AM
> > >> To: Tomcat Users List 
> > >> Subject: RE: Weirdest Tomcat Behavior Ever?
> > >>
> > >>> From: Mark Thomas 
> > >>> Sent: Tuesday, November 3, 2020 2:06 AM
> > >>> To: Tomcat Users List 
> > >>> Subject: Re: Weirdest Tomcat Behavior Ever?
> > >>>
> > >>> On 02/11/2020 12:16, Eric Robinson wrote:
> > >>>
> > >>> 
> > >>>
> > >>>> Gotcha, thanks for the clarification. Let's see what happens when
> > >>>> the users
> > >>> start hitting it at 8:00 am Eastern.
> > >>>
> > >>> Progress. The first attempt to write to the socket triggers the
> > >>> following
> > >>> exception:
> > >>>
> > >>> 02-Nov-2020 14:33:54.083 FINE [http-bio-3016-exec-13]
> > >>> org.apache.tomcat.util.net.JIoEndpoint$DebugOutputStream.write
> > >>> [301361476]
> > >>>  java.net.SocketException: Bad file descriptor (Write failed)
> > >>> at java.net.SocketOutputStream.socketWrite0(Native Method)
> > >>> at
> > >>>
> > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
> > >>> at
> java.net.SocketOutputStream.write(SocketOutputStream.java:155)
> > >>> at
> > >>>
> > ...
> >
> > >>> Because this is an instance of an IOException, Tomcat assumes it
> > >>> has been caused by the client dropping the connection and silently
> > >>> swallows it. I'll be changing that later today so the exception is
> > >>> logged as DEBUG level for all new Tomcat versions.
> > >>>
> > >>> Possible causes of "java.net.SocketException: Bad file descriptor"
> > >>> I've been able to find are:
> > >>>
> > >>> 1. OS running out of file descriptors.
> > >>>
> > >>> 2.Trying to use a closed socket.
> > >>>
> > >>> I want to review the source code to see if there are any others.
> > >>>
> > >>> I don't think we are seeing 2 as there is no indication of the
> > >>> Socket, InputStream or OutputStream being closed in the logs.
> > >>>
> > >>> That leaves 1. Possible causes here are a file descriptor leak or
> > >>> normal operations occasionally needing more than the current limit.
> > >>> I don't think it is a leak as I'd expect to see many more errors
> > >>> of this type after the first and we aren't seeing that. That
> > >>> leaves the possibility of the current limit being a little too low.
> > >>>
> > >>> My recommendation at this point is to increase the limit for file
> > descriptors.
> > >>> Meanwhile, I'll look at the JRE source to see if there are any
> > >>> other possible triggers for this exception.
> > >>>
> > >>> Mark
> > >>>
> > >>>
> > >>
> > >> On the tomcat server, max open file descriptors is currently
> > >> 2853957
> > >>
> > >> [root@001app01a ~]# sysctl fs.file-max fs.file-max = 2853957
> > >>
> > >> Most of the time, the number of open files appears to run about
> 600,000.
> > >>
> > >>  What do you think of watching the open file count and seeing if
> > >> the number gets up around the ceiling when the socket write failure
> > >> occurs? Something like...
> > >>
> > >> [root@001app01a ~]#  while [ TRUE ];do FILES=$(lsof|wc -l);echo
> > >> "$(date
> > >> +%H:%M:%S) $FILES";done
> > >> 09:11:15 591671
> > >> 09:11:35 627347
> > >> 09:11:54 626851
> > >> 09:12:11 626429
> > >> 09:12:26 545748
> > >> 09:12:42 548578
> > >> 09:12:58 551487
> > >> 09:13:14 516700
> > >> 09:13:30 513312
> > >> 09:13:45 512830
> > >> 09:14

RE: Weirdest Tomcat Behavior Ever?

2020-11-06 Thread Eric Robinson

> -Original Message-
> From: Stefan Mayr 
> Sent: Thursday, November 5, 2020 4:24 PM
> To: users@tomcat.apache.org
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> Am 03.11.2020 um 16:05 schrieb Eric Robinson:
> >> -Original Message-
> >> From: Eric Robinson 
> >> Sent: Tuesday, November 3, 2020 8:21 AM
> >> To: Tomcat Users List 
> >> Subject: RE: Weirdest Tomcat Behavior Ever?
> >>
> >>> From: Mark Thomas 
> >>> Sent: Tuesday, November 3, 2020 2:06 AM
> >>> To: Tomcat Users List 
> >>> Subject: Re: Weirdest Tomcat Behavior Ever?
> >>>
> >>> On 02/11/2020 12:16, Eric Robinson wrote:
> >>>
> >>> 
> >>>
> >>>> Gotcha, thanks for the clarification. Let's see what happens when
> >>>> the users
> >>> start hitting it at 8:00 am Eastern.
> >>>
> >>> Progress. The first attempt to write to the socket triggers the
> >>> following
> >>> exception:
> >>>
> >>> 02-Nov-2020 14:33:54.083 FINE [http-bio-3016-exec-13]
> >>> org.apache.tomcat.util.net.JIoEndpoint$DebugOutputStream.write
> >>> [301361476]
> >>>  java.net.SocketException: Bad file descriptor (Write failed)
> >>> at java.net.SocketOutputStream.socketWrite0(Native Method)
> >>> at
> >>>
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
> >>> at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
> >>> at
> >>>
> ...
>
> >>> Because this is an instance of an IOException, Tomcat assumes it has
> >>> been caused by the client dropping the connection and silently
> >>> swallows it. I'll be changing that later today so the exception is
> >>> logged as DEBUG level for all new Tomcat versions.
> >>>
> >>> Possible causes of "java.net.SocketException: Bad file descriptor"
> >>> I've been able to find are:
> >>>
> >>> 1. OS running out of file descriptors.
> >>>
> >>> 2.Trying to use a closed socket.
> >>>
> >>> I want to review the source code to see if there are any others.
> >>>
> >>> I don't think we are seeing 2 as there is no indication of the
> >>> Socket, InputStream or OutputStream being closed in the logs.
> >>>
> >>> That leaves 1. Possible causes here are a file descriptor leak or
> >>> normal operations occasionally needing more than the current limit.
> >>> I don't think it is a leak as I'd expect to see many more errors of
> >>> this type after the first and we aren't seeing that. That leaves the
> >>> possibility of the current limit being a little too low.
> >>>
> >>> My recommendation at this point is to increase the limit for file
> descriptors.
> >>> Meanwhile, I'll look at the JRE source to see if there are any other
> >>> possible triggers for this exception.
> >>>
> >>> Mark
> >>>
> >>>
> >>
> >> On the tomcat server, max open file descriptors is currently 2853957
> >>
> >> [root@001app01a ~]# sysctl fs.file-max fs.file-max = 2853957
> >>
> >> Most of the time, the number of open files appears to run about 600,000.
> >>
> >>  What do you think of watching the open file count and seeing if the
> >> number gets up around the ceiling when the socket write failure
> >> occurs? Something like...
> >>
> >> [root@001app01a ~]#  while [ TRUE ];do FILES=$(lsof|wc -l);echo
> >> "$(date
> >> +%H:%M:%S) $FILES";done
> >> 09:11:15 591671
> >> 09:11:35 627347
> >> 09:11:54 626851
> >> 09:12:11 626429
> >> 09:12:26 545748
> >> 09:12:42 548578
> >> 09:12:58 551487
> >> 09:13:14 516700
> >> 09:13:30 513312
> >> 09:13:45 512830
> >> 09:14:02 58
> >> 09:14:18 568233
> >> 09:14:35 570158
> >> 09:14:51 566269
> >> 09:15:07 547389
> >> 09:15:23 544203
> >> 09:15:38 546395
> >>
> >> It's not ideal; as it seems to take 15-20 seconds to count them using lsof.
> >>
> >>
> >>
> >
> > Wait, never mind. I realized the per-process limits are what matters. I
> checked, and nofile was set to 4096 for the relevant java process.
> >
> > I

Re: Weirdest Tomcat Behavior Ever?

2020-11-05 Thread Stefan Mayr

Am 03.11.2020 um 16:05 schrieb Eric Robinson:
>> -Original Message-
>> From: Eric Robinson 
>> Sent: Tuesday, November 3, 2020 8:21 AM
>> To: Tomcat Users List 
>> Subject: RE: Weirdest Tomcat Behavior Ever?
>>
>>> From: Mark Thomas 
>>> Sent: Tuesday, November 3, 2020 2:06 AM
>>> To: Tomcat Users List 
>>> Subject: Re: Weirdest Tomcat Behavior Ever?
>>>
>>> On 02/11/2020 12:16, Eric Robinson wrote:
>>>
>>> 
>>>
>>>> Gotcha, thanks for the clarification. Let's see what happens when
>>>> the users
>>> start hitting it at 8:00 am Eastern.
>>>
>>> Progress. The first attempt to write to the socket triggers the
>>> following
>>> exception:
>>>
>>> 02-Nov-2020 14:33:54.083 FINE [http-bio-3016-exec-13]
>>> org.apache.tomcat.util.net.JIoEndpoint$DebugOutputStream.write
>>> [301361476]
>>>  java.net.SocketException: Bad file descriptor (Write failed)
>>> at java.net.SocketOutputStream.socketWrite0(Native Method)
>>> at
>>> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
>>> at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
>>> at
>>>
...

>>> Because this is an instance of an IOException, Tomcat assumes it has
>>> been caused by the client dropping the connection and silently
>>> swallows it. I'll be changing that later today so the exception is
>>> logged as DEBUG level for all new Tomcat versions.
>>>
>>> Possible causes of "java.net.SocketException: Bad file descriptor"
>>> I've been able to find are:
>>>
>>> 1. OS running out of file descriptors.
>>>
>>> 2.Trying to use a closed socket.
>>>
>>> I want to review the source code to see if there are any others.
>>>
>>> I don't think we are seeing 2 as there is no indication of the Socket,
>>> InputStream or OutputStream being closed in the logs.
>>>
>>> That leaves 1. Possible causes here are a file descriptor leak or
>>> normal operations occasionally needing more than the current limit. I
>>> don't think it is a leak as I'd expect to see many more errors of this
>>> type after the first and we aren't seeing that. That leaves the
>>> possibility of the current limit being a little too low.
>>>
>>> My recommendation at this point is to increase the limit for file 
>>> descriptors.
>>> Meanwhile, I'll look at the JRE source to see if there are any other
>>> possible triggers for this exception.
>>>
>>> Mark
>>>
>>>
>>
>> On the tomcat server, max open file descriptors is currently 2853957
>>
>> [root@001app01a ~]# sysctl fs.file-max
>> fs.file-max = 2853957
>>
>> Most of the time, the number of open files appears to run about 600,000.
>>
>>  What do you think of watching the open file count and seeing if the number
>> gets up around the ceiling when the socket write failure occurs? Something
>> like...
>>
>> [root@001app01a ~]#  while [ TRUE ];do FILES=$(lsof|wc -l);echo "$(date
>> +%H:%M:%S) $FILES";done
>> 09:11:15 591671
>> 09:11:35 627347
>> 09:11:54 626851
>> 09:12:11 626429
>> 09:12:26 545748
>> 09:12:42 548578
>> 09:12:58 551487
>> 09:13:14 516700
>> 09:13:30 513312
>> 09:13:45 512830
>> 09:14:02 58
>> 09:14:18 568233
>> 09:14:35 570158
>> 09:14:51 566269
>> 09:15:07 547389
>> 09:15:23 544203
>> 09:15:38 546395
>>
>> It's not ideal; as it seems to take 15-20 seconds to count them using lsof.
>>
>>
>>
> 
> Wait, never mind. I realized the per-process limits are what matters. I 
> checked, and nofile was set to 4096 for the relevant java process.
> 
> I did...
> 
> # prlimit --pid 8730 --nofile=16384:16384
> 
> That should give java some extra breathing room if the issue is max open 
> files, right?

>From my experience you should see a different exception if you hit the
NOFILE limit: java.net.SocketException: Too many open files

But I've only seen that when you open or accept a new connection. Never
seen this later when something is written to that already open socket.

To me a bad file descriptor sounds more like a closed socket. This
reminds me of a database or http-client connection pool that handed out
connections with already closed sockets. I think this could be
suspicious because Mark wrote this happens on the first write to the socket.

- Stefan

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

RE: Weirdest Tomcat Behavior Ever?

2020-11-04 Thread Eric Robinson

> From: Mark Thomas 
> Sent: Wednesday, November 4, 2020 11:39 AM
> To: users@tomcat.apache.org
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> On 03/11/2020 15:05, Eric Robinson wrote:
> >> From: Eric Robinson 
> >>> From: Mark Thomas 
>
> 
>
> >>> Progress. The first attempt to write to the socket triggers the
> >>> following
> >>> exception:
> >>>
> >>> 02-Nov-2020 14:33:54.083 FINE [http-bio-3016-exec-13]
> >>> org.apache.tomcat.util.net.JIoEndpoint$DebugOutputStream.write
> >>> [301361476]
> >>>  java.net.SocketException: Bad file descriptor (Write failed)
>
> 
>
> >>> 1. OS running out of file descriptors.
> >>>
> >>> 2.Trying to use a closed socket.
> >>>
> >>> I want to review the source code to see if there are any others.
>
> There is an option 3 - a JVM bug. There were bugs in this area back in the
> 1.2/1.3/1.4 days. It seems unlikely that such a bug resurfaced now - 
> especially
> given that the issue happens with NIO as well as BIO.
>
> >>> I don't think we are seeing 2 as there is no indication of the
> >>> Socket, InputStream or OutputStream being closed in the logs.
> >>>
> >>> That leaves 1. Possible causes here are a file descriptor leak or
> >>> normal operations occasionally needing more than the current limit.
> >>> I don't think it is a leak as I'd expect to see many more errors of
> >>> this type after the first and we aren't seeing that. That leaves the
> >>> possibility of the current limit being a little too low.
> >>>
> >>> My recommendation at this point is to increase the limit for file
> descriptors.
> >>> Meanwhile, I'll look at the JRE source to see if there are any other
> >>> possible triggers for this exception.
>
> 
>
> > Wait, never mind. I realized the per-process limits are what matters. I
> checked, and nofile was set to 4096 for the relevant java process.
> >
> > I did...
> >
> > # prlimit --pid 8730 --nofile=16384:16384
> >
> > That should give java some extra breathing room if the issue is max open
> files, right?
>
> I'm not the person to ask that question. Linux administration is not an area 
> I'd
> consider myself sufficiently knowledgeable to give a definitive answer. It
> looks OK based on some quick searching.
>
> How have things been with the higher limit? More issues, fewer issues,
> about the same? Or maybe even no issues (he asks hopefully)?
>
> Mark
>

Not enough data collected to know yet. We did see at least one instance of the 
error, but I'll know better tomorrow.

Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

Re: Weirdest Tomcat Behavior Ever?

2020-11-04 Thread Mark Thomas

On 03/11/2020 15:05, Eric Robinson wrote:
>> From: Eric Robinson 
>>> From: Mark Thomas 



>>> Progress. The first attempt to write to the socket triggers the
>>> following
>>> exception:
>>>
>>> 02-Nov-2020 14:33:54.083 FINE [http-bio-3016-exec-13]
>>> org.apache.tomcat.util.net.JIoEndpoint$DebugOutputStream.write
>>> [301361476]
>>>  java.net.SocketException: Bad file descriptor (Write failed)



>>> 1. OS running out of file descriptors.
>>>
>>> 2.Trying to use a closed socket.
>>>
>>> I want to review the source code to see if there are any others.

There is an option 3 - a JVM bug. There were bugs in this area back in
the 1.2/1.3/1.4 days. It seems unlikely that such a bug resurfaced now -
especially given that the issue happens with NIO as well as BIO.

>>> I don't think we are seeing 2 as there is no indication of the Socket,
>>> InputStream or OutputStream being closed in the logs.
>>>
>>> That leaves 1. Possible causes here are a file descriptor leak or
>>> normal operations occasionally needing more than the current limit. I
>>> don't think it is a leak as I'd expect to see many more errors of this
>>> type after the first and we aren't seeing that. That leaves the
>>> possibility of the current limit being a little too low.
>>>
>>> My recommendation at this point is to increase the limit for file 
>>> descriptors.
>>> Meanwhile, I'll look at the JRE source to see if there are any other
>>> possible triggers for this exception.



> Wait, never mind. I realized the per-process limits are what matters. I 
> checked, and nofile was set to 4096 for the relevant java process.
> 
> I did...
> 
> # prlimit --pid 8730 --nofile=16384:16384
> 
> That should give java some extra breathing room if the issue is max open 
> files, right?

I'm not the person to ask that question. Linux administration is not an
area I'd consider myself sufficiently knowledgeable to give a definitive
answer. It looks OK based on some quick searching.

How have things been with the higher limit? More issues, fewer issues,
about the same? Or maybe even no issues (he asks hopefully)?

Mark

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

RE: Weirdest Tomcat Behavior Ever?

2020-11-03 Thread Eric Robinson

> From: Christopher Schultz 
> Sent: Tuesday, November 3, 2020 9:26 AM
> To: users@tomcat.apache.org
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> Eric,
>
> On 11/3/20 10:05, Eric Robinson wrote:
> >> -Original Message-
> >> From: Eric Robinson 
> >> Sent: Tuesday, November 3, 2020 8:21 AM
> >> To: Tomcat Users List 
> >> Subject: RE: Weirdest Tomcat Behavior Ever?
> >>
> >>> From: Mark Thomas 
> >>> Sent: Tuesday, November 3, 2020 2:06 AM
> >>> To: Tomcat Users List 
> >>> Subject: Re: Weirdest Tomcat Behavior Ever?
> >>>
> >>> On 02/11/2020 12:16, Eric Robinson wrote:
> >>>
> >>> 
> >>>
> >>>> Gotcha, thanks for the clarification. Let's see what happens when
> >>>> the users
> >>> start hitting it at 8:00 am Eastern.
> >>>
> >>> Progress. The first attempt to write to the socket triggers the
> >>> following
> >>> exception:
> >>>
> >>> 02-Nov-2020 14:33:54.083 FINE [http-bio-3016-exec-13]
> >>> org.apache.tomcat.util.net.JIoEndpoint$DebugOutputStream.write
> >>> [301361476]
> >>>   java.net.SocketException: Bad file descriptor (Write failed)
> >>>  at java.net.SocketOutputStream.socketWrite0(Native Method)
> >>>  at
> >>>
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
> >>>  at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
> >>>  at
> >>>
> >>
> org.apache.tomcat.util.net.JIoEndpoint$DebugOutputStream.write(JIoEnd
> >> p
> >>> oint.java:1409)
> >>>  at
> >>> org.apache.coyote.http11.InternalOutputBuffer.realWriteBytes(Interna
> >>> lO
> >>> ut
> >>> putBuffer.java:247)
> >>>  at
> >> org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:480)
> >>>  at
> >>>
> >>
> org.apache.coyote.http11.InternalOutputBuffer.endRequest(InternalOutp
> >> u
> >>> t
> >>> Buffer.java:183)
> >>>  at
> >>> org.apache.coyote.http11.AbstractHttp11Processor.action(AbstractHttp
> >>> 11
> >>> Pr
> >>> ocessor.java:761)
> >>>  at org.apache.coyote.Response.action(Response.java:174)
> >>>  at org.apache.coyote.Response.finish(Response.java:274)
> >>>  at
> >>>
> org.apache.catalina.connector.OutputBuffer.close(OutputBuffer.java:322)
> >>>  at
> >>>
> >> org.apache.catalina.connector.Response.finishResponse(Response.java:5
> >> 37)
> >>>  at
> >>> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.ja
> >>> va
> >>> :4
> >>> 80)
> >>>  at
> >>> org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHtt
> >>> p1
> >>> 1P
> >>> rocessor.java:1083)
> >>>  at
> >>>
> >>
> org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(
> >> A
> >>> b
> >>> stractProtocol.java:640)
> >>>  at
> >>> org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoi
> >>> nt
> >>> .ja
> >>> va:321)
> >>>  at
> >>>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor
> >>> .j
> >>> av
> >>> a:1149)
> >>>  at
> >>>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.
> >>> ja
> >>> va:624)
> >>>  at
> >>>
> >>
> org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskTh
> >> r
> >>> ead.java:61)
> >>>  at java.lang.Thread.run(Thread.java:748)
> >>>
> >>> Because this is an instance of an IOException, Tomcat assumes it has
> >>> been caused by the client dropping the connection and silently
> >>> swallows it. I'll be changing that later today so the exception is
> >>> logged as DEBUG level for all new Tomcat versions.
> >>>
> >>> Possible causes of "java.net.SocketException: Bad file descriptor"
> >>> I've been able to find are:
> >>>
> >>> 1. OS running out of file descriptors.
> >>>
> >>> 2.Trying

Re: Weirdest Tomcat Behavior Ever?

2020-11-03 Thread Christopher Schultz

Eric,

On 11/3/20 10:05, Eric Robinson wrote:

-Original Message-
From: Eric Robinson 
Sent: Tuesday, November 3, 2020 8:21 AM
To: Tomcat Users List 
Subject: RE: Weirdest Tomcat Behavior Ever?

From: Mark Thomas 
Sent: Tuesday, November 3, 2020 2:06 AM
To: Tomcat Users List 
Subject: Re: Weirdest Tomcat Behavior Ever?

On 02/11/2020 12:16, Eric Robinson wrote:

Gotcha, thanks for the clarification. Let's see what happens when
the users

start hitting it at 8:00 am Eastern.

Progress. The first attempt to write to the socket triggers the
following
exception:

02-Nov-2020 14:33:54.083 FINE [http-bio-3016-exec-13]
org.apache.tomcat.util.net.JIoEndpoint$DebugOutputStream.write
[301361476]
  java.net.SocketException: Bad file descriptor (Write failed)
 at java.net.SocketOutputStream.socketWrite0(Native Method)
 at
java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
 at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
 at

org.apache.tomcat.util.net.JIoEndpoint$DebugOutputStream.write(JIoEndp

oint.java:1409)
 at
org.apache.coyote.http11.InternalOutputBuffer.realWriteBytes(InternalO
ut
putBuffer.java:247)
 at

org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:480)

 at

org.apache.coyote.http11.InternalOutputBuffer.endRequest(InternalOutpu

t
Buffer.java:183)
 at
org.apache.coyote.http11.AbstractHttp11Processor.action(AbstractHttp11
Pr
ocessor.java:761)
 at org.apache.coyote.Response.action(Response.java:174)
 at org.apache.coyote.Response.finish(Response.java:274)
 at
org.apache.catalina.connector.OutputBuffer.close(OutputBuffer.java:322)
 at

org.apache.catalina.connector.Response.finishResponse(Response.java:537)

 at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java
:4
80)
 at
org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp1
1P
rocessor.java:1083)
 at

org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(A

b
stractProtocol.java:640)
 at
org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint
.ja
va:321)
 at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.j
av
a:1149)
 at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.
ja
va:624)
 at

org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThr

ead.java:61)
 at java.lang.Thread.run(Thread.java:748)

Because this is an instance of an IOException, Tomcat assumes it has
been caused by the client dropping the connection and silently
swallows it. I'll be changing that later today so the exception is
logged as DEBUG level for all new Tomcat versions.

Possible causes of "java.net.SocketException: Bad file descriptor"
I've been able to find are:

1. OS running out of file descriptors.

2.Trying to use a closed socket.

I want to review the source code to see if there are any others.

I don't think we are seeing 2 as there is no indication of the Socket,
InputStream or OutputStream being closed in the logs.

That leaves 1. Possible causes here are a file descriptor leak or
normal operations occasionally needing more than the current limit. I
don't think it is a leak as I'd expect to see many more errors of this
type after the first and we aren't seeing that. That leaves the
possibility of the current limit being a little too low.

My recommendation at this point is to increase the limit for file descriptors.
Meanwhile, I'll look at the JRE source to see if there are any other
possible triggers for this exception.

Mark

On the tomcat server, max open file descriptors is currently 2853957

[root@001app01a ~]# sysctl fs.file-max
fs.file-max = 2853957

Most of the time, the number of open files appears to run about 600,000.

  What do you think of watching the open file count and seeing if the number
gets up around the ceiling when the socket write failure occurs? Something
like...

[root@001app01a ~]#  while [ TRUE ];do FILES=$(lsof|wc -l);echo "$(date
+%H:%M:%S) $FILES";done
09:11:15 591671
09:11:35 627347
09:11:54 626851
09:12:11 626429
09:12:26 545748
09:12:42 548578
09:12:58 551487
09:13:14 516700
09:13:30 513312
09:13:45 512830
09:14:02 58
09:14:18 568233
09:14:35 570158
09:14:51 566269
09:15:07 547389
09:15:23 544203
09:15:38 546395

It's not ideal; as it seems to take 15-20 seconds to count them using lsof.

Wait, never mind. I realized the per-process limits are what matters.
I checked, and nofile was set to 4096 for the relevant java process.

:)

Your  configuration is really simple:

How many of these do you have per server.xml? The default maxConnections 
for BIO is 8192 which was double your previous ulimit. If you have 10 of 
them, then you could potentially have 20 times the number of incoming 
connections than you have file descriptors available.

Have you noticed whether or not a large number of

Re: [OT] Weirdest Tomcat Behavior Ever?

2020-11-03 Thread Christopher Schultz


Mark,

On 11/3/20 03:05, Mark Thomas wrote:

On 02/11/2020 12:16, Eric Robinson wrote:




Gotcha, thanks for the clarification. Let's see what happens when the users 
start hitting it at 8:00 am Eastern.


Progress. The first attempt to write to the socket triggers the
following exception:

02-Nov-2020 14:33:54.083 FINE [http-bio-3016-exec-13]
org.apache.tomcat.util.net.JIoEndpoint$DebugOutputStream.write [301361476]
  java.net.SocketException: Bad file descriptor (Write failed)
     at java.net.SocketOutputStream.socketWrite0(Native Method)
     at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
     at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
     at
org.apache.tomcat.util.net.JIoEndpoint$DebugOutputStream.write(JIoEndpoint.java:1409)
     at
org.apache.coyote.http11.InternalOutputBuffer.realWriteBytes(InternalOutputBuffer.java:247)
     at org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:480)
     at
org.apache.coyote.http11.InternalOutputBuffer.endRequest(InternalOutputBuffer.java:183)
     at
org.apache.coyote.http11.AbstractHttp11Processor.action(AbstractHttp11Processor.java:761)
     at org.apache.coyote.Response.action(Response.java:174)
     at org.apache.coyote.Response.finish(Response.java:274)
     at
org.apache.catalina.connector.OutputBuffer.close(OutputBuffer.java:322)
     at
org.apache.catalina.connector.Response.finishResponse(Response.java:537)
     at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:480)
     at
org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1083)
     at
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:640)
     at
org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:321)
     at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
     at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     at
org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
     at java.lang.Thread.run(Thread.java:748)

Because this is an instance of an IOException, Tomcat assumes it has
been caused by the client dropping the connection and silently swallows
it. I'll be changing that later today so the exception is logged as
DEBUG level for all new Tomcat versions.

Possible causes of "java.net.SocketException: Bad file descriptor" I've
been able to find are:

1. OS running out of file descriptors.

2. Trying to use a closed socket.


Wouldn't it be great if the Java library had more than a dozen or so 
IOException subclasses? It's really frustrating to try to figure out 
WHAT KIND of error occurred when all you get is IOException with a 
(possibly localized) error message. :(


-chris

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

RE: Weirdest Tomcat Behavior Ever?

2020-11-03 Thread Eric Robinson

> -Original Message-
> From: Eric Robinson 
> Sent: Tuesday, November 3, 2020 8:21 AM
> To: Tomcat Users List 
> Subject: RE: Weirdest Tomcat Behavior Ever?
>
> > From: Mark Thomas 
> > Sent: Tuesday, November 3, 2020 2:06 AM
> > To: Tomcat Users List 
> > Subject: Re: Weirdest Tomcat Behavior Ever?
> >
> > On 02/11/2020 12:16, Eric Robinson wrote:
> >
> > 
> >
> > > Gotcha, thanks for the clarification. Let's see what happens when
> > > the users
> > start hitting it at 8:00 am Eastern.
> >
> > Progress. The first attempt to write to the socket triggers the
> > following
> > exception:
> >
> > 02-Nov-2020 14:33:54.083 FINE [http-bio-3016-exec-13]
> > org.apache.tomcat.util.net.JIoEndpoint$DebugOutputStream.write
> > [301361476]
> >  java.net.SocketException: Bad file descriptor (Write failed)
> > at java.net.SocketOutputStream.socketWrite0(Native Method)
> > at
> > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
> > at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
> > at
> >
> org.apache.tomcat.util.net.JIoEndpoint$DebugOutputStream.write(JIoEndp
> > oint.java:1409)
> > at
> > org.apache.coyote.http11.InternalOutputBuffer.realWriteBytes(InternalO
> > ut
> > putBuffer.java:247)
> > at
> org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:480)
> > at
> >
> org.apache.coyote.http11.InternalOutputBuffer.endRequest(InternalOutpu
> > t
> > Buffer.java:183)
> > at
> > org.apache.coyote.http11.AbstractHttp11Processor.action(AbstractHttp11
> > Pr
> > ocessor.java:761)
> > at org.apache.coyote.Response.action(Response.java:174)
> > at org.apache.coyote.Response.finish(Response.java:274)
> > at
> > org.apache.catalina.connector.OutputBuffer.close(OutputBuffer.java:322)
> > at
> >
> org.apache.catalina.connector.Response.finishResponse(Response.java:537)
> > at
> > org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java
> > :4
> > 80)
> > at
> > org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp1
> > 1P
> > rocessor.java:1083)
> > at
> >
> org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(A
> > b
> > stractProtocol.java:640)
> > at
> > org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint
> > .ja
> > va:321)
> > at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.j
> > av
> > a:1149)
> > at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.
> > ja
> > va:624)
> > at
> >
> org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThr
> > ead.java:61)
> > at java.lang.Thread.run(Thread.java:748)
> >
> > Because this is an instance of an IOException, Tomcat assumes it has
> > been caused by the client dropping the connection and silently
> > swallows it. I'll be changing that later today so the exception is
> > logged as DEBUG level for all new Tomcat versions.
> >
> > Possible causes of "java.net.SocketException: Bad file descriptor"
> > I've been able to find are:
> >
> > 1. OS running out of file descriptors.
> >
> > 2.Trying to use a closed socket.
> >
> > I want to review the source code to see if there are any others.
> >
> > I don't think we are seeing 2 as there is no indication of the Socket,
> > InputStream or OutputStream being closed in the logs.
> >
> > That leaves 1. Possible causes here are a file descriptor leak or
> > normal operations occasionally needing more than the current limit. I
> > don't think it is a leak as I'd expect to see many more errors of this
> > type after the first and we aren't seeing that. That leaves the
> > possibility of the current limit being a little too low.
> >
> > My recommendation at this point is to increase the limit for file 
> > descriptors.
> > Meanwhile, I'll look at the JRE source to see if there are any other
> > possible triggers for this exception.
> >
> > Mark
> >
> >
>
> On the tomcat server, max open file descriptors is currently 2853957
>
> [root@001app01a ~]# sysctl fs.file-max
> fs.file-max = 2853957
>
> Most of the time, the number of open files appears to run about 600,000.
>
>  What do you think of watching the open file count and seeing if the

RE: Weirdest Tomcat Behavior Ever?

2020-11-03 Thread Eric Robinson

> From: Mark Thomas 
> Sent: Tuesday, November 3, 2020 2:06 AM
> To: Tomcat Users List 
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> On 02/11/2020 12:16, Eric Robinson wrote:
>
> 
>
> > Gotcha, thanks for the clarification. Let's see what happens when the users
> start hitting it at 8:00 am Eastern.
>
> Progress. The first attempt to write to the socket triggers the following
> exception:
>
> 02-Nov-2020 14:33:54.083 FINE [http-bio-3016-exec-13]
> org.apache.tomcat.util.net.JIoEndpoint$DebugOutputStream.write
> [301361476]
>  java.net.SocketException: Bad file descriptor (Write failed)
> at java.net.SocketOutputStream.socketWrite0(Native Method)
> at
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
> at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
> at
> org.apache.tomcat.util.net.JIoEndpoint$DebugOutputStream.write(JIoEndp
> oint.java:1409)
> at
> org.apache.coyote.http11.InternalOutputBuffer.realWriteBytes(InternalOut
> putBuffer.java:247)
> at org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:480)
> at
> org.apache.coyote.http11.InternalOutputBuffer.endRequest(InternalOutput
> Buffer.java:183)
> at
> org.apache.coyote.http11.AbstractHttp11Processor.action(AbstractHttp11Pr
> ocessor.java:761)
> at org.apache.coyote.Response.action(Response.java:174)
> at org.apache.coyote.Response.finish(Response.java:274)
> at
> org.apache.catalina.connector.OutputBuffer.close(OutputBuffer.java:322)
> at
> org.apache.catalina.connector.Response.finishResponse(Response.java:537)
> at
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:4
> 80)
> at
> org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11P
> rocessor.java:1083)
> at
> org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(Ab
> stractProtocol.java:640)
> at
> org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.ja
> va:321)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.jav
> a:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.ja
> va:624)
> at
> org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThr
> ead.java:61)
> at java.lang.Thread.run(Thread.java:748)
>
> Because this is an instance of an IOException, Tomcat assumes it has been
> caused by the client dropping the connection and silently swallows it. I'll be
> changing that later today so the exception is logged as DEBUG level for all
> new Tomcat versions.
>
> Possible causes of "java.net.SocketException: Bad file descriptor" I've been
> able to find are:
>
> 1. OS running out of file descriptors.
>
> 2.Trying to use a closed socket.
>
> I want to review the source code to see if there are any others.
>
> I don't think we are seeing 2 as there is no indication of the Socket,
> InputStream or OutputStream being closed in the logs.
>
> That leaves 1. Possible causes here are a file descriptor leak or normal
> operations occasionally needing more than the current limit. I don't think it 
> is
> a leak as I'd expect to see many more errors of this type after the first and
> we aren't seeing that. That leaves the possibility of the current limit being 
> a
> little too low.
>
> My recommendation at this point is to increase the limit for file descriptors.
> Meanwhile, I'll look at the JRE source to see if there are any other possible
> triggers for this exception.
>
> Mark
>
>

On the tomcat server, max open file descriptors is currently 2853957

[root@001app01a ~]# sysctl fs.file-max
fs.file-max = 2853957

Most of the time, the number of open files appears to run about 600,000.

 What do you think of watching the open file count and seeing if the number 
gets up around the ceiling when the socket write failure occurs? Something 
like...

[root@001app01a ~]#  while [ TRUE ];do FILES=$(lsof|wc -l);echo "$(date 
+%H:%M:%S) $FILES";done
09:11:15 591671
09:11:35 627347
09:11:54 626851
09:12:11 626429
09:12:26 545748
09:12:42 548578
09:12:58 551487
09:13:14 516700
09:13:30 513312
09:13:45 512830
09:14:02 58
09:14:18 568233
09:14:35 570158
09:14:51 566269
09:15:07 547389
09:15:23 544203
09:15:38 546395

It's not ideal; as it seems to take 15-20 seconds to count them using lsof.



Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.

Re: Weirdest Tomcat Behavior Ever?

2020-11-03 Thread Mark Thomas

On 02/11/2020 12:16, Eric Robinson wrote:

> Gotcha, thanks for the clarification. Let's see what happens when the users 
> start hitting it at 8:00 am Eastern.

Progress. The first attempt to write to the socket triggers the
following exception:

02-Nov-2020 14:33:54.083 FINE [http-bio-3016-exec-13]
org.apache.tomcat.util.net.JIoEndpoint$DebugOutputStream.write [301361476]
 java.net.SocketException: Bad file descriptor (Write failed)
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
    at
org.apache.tomcat.util.net.JIoEndpoint$DebugOutputStream.write(JIoEndpoint.java:1409)
    at
org.apache.coyote.http11.InternalOutputBuffer.realWriteBytes(InternalOutputBuffer.java:247)
    at org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:480)
    at
org.apache.coyote.http11.InternalOutputBuffer.endRequest(InternalOutputBuffer.java:183)
    at
org.apache.coyote.http11.AbstractHttp11Processor.action(AbstractHttp11Processor.java:761)
    at org.apache.coyote.Response.action(Response.java:174)
    at org.apache.coyote.Response.finish(Response.java:274)
    at
org.apache.catalina.connector.OutputBuffer.close(OutputBuffer.java:322)
    at
org.apache.catalina.connector.Response.finishResponse(Response.java:537)
    at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:480)
    at
org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1083)
    at
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:640)
    at
org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:321)
    at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at
org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Thread.java:748)

Because this is an instance of an IOException, Tomcat assumes it has
been caused by the client dropping the connection and silently swallows
it. I'll be changing that later today so the exception is logged as
DEBUG level for all new Tomcat versions.

Possible causes of "java.net.SocketException: Bad file descriptor" I've
been able to find are:

1. OS running out of file descriptors.

2.Trying to use a closed socket.

I want to review the source code to see if there are any others.

I don't think we are seeing 2 as there is no indication of the Socket,
InputStream or OutputStream being closed in the logs.

That leaves 1. Possible causes here are a file descriptor leak or normal
operations occasionally needing more than the current limit. I don't
think it is a leak as I'd expect to see many more errors of this type
after the first and we aren't seeing that. That leaves the possibility
of the current limit being a little too low.

My recommendation at this point is to increase the limit for file
descriptors. Meanwhile, I'll look at the JRE source to see if there are
any other possible triggers for this exception.

Mark

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

RE: Weirdest Tomcat Behavior Ever?

2020-11-02 Thread Eric Robinson

> From: Mark Thomas 
> Sent: Monday, November 2, 2020 5:38 AM
> To: users@tomcat.apache.org
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> On 02/11/2020 11:18, Eric Robinson wrote:
> >> -Original Message-
> >> From: Mark Thomas 
> >> Sent: Sunday, November 1, 2020 11:50 AM
> >> To: users@tomcat.apache.org
> >> Subject: Re: Weirdest Tomcat Behavior Ever?
> >>
> >> On 01/11/2020 16:25, Mark Thomas wrote:
> >>> 
> >>>
> >>> Keeping the previous logs for reference:
> >>>
> >>>>> Source  Time Activity
> >>>>> 
> >>>>> pcap15:14:25.375451  SYN proxy to Tomcat
> >>>>> pcap15:14:25.375493  SYN, ACK Tomcat to proxy
> >>>>> pcap15:14:25.375839  ACK proxy to Tomcat
> >>>>> pcap15:14:25.375892  GET request proxy to Tomcat
> >>>>> pcap15:14:25.375911  ACK Tomcat to proxy
> >>>>> debug   15:14:25.376 o.a.c.http11.InternalOutputBuffer.init
> >>>>> pcap15:14:25.376777  FIN Tomcat to proxy
> >>>>> pcap15:14:25.377036  FIN, ACK proxy to Tomcat
> >>>>> pcap15:14:25.377048  ACK Tomcat proxy
> >>>>> debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer.commit
> >>>>> debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer$1.doWrite
> >>>>> debug   15:14:25.383
> o.a.c.http11.InternalOutputBuffer$1.nextRequest
> >>>>> debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer$1.recycle
> >>>>>
> >>>>> Tomcat writes the request body to the buffer but when Tomcat tries
> >>>>> to flush those bytes out to the network it finds that the socket
> >>>>> has been closed. That normally indicates that the client has
> >>>>> dropped the connection. This is sufficiently common that Tomcat
> >>>>> swallows the exception. However, we know (from the pcap file) that
> >>>>> the client did not drop the connection, Tomcat did.
> >>>
> >>> The next round of logging added a little more detail for the
> >>> InternalOutputBuffer and wrapped the Socket to see when close() was
> >> called.
> >>>
> >>> The results from the next round of logging are:
> >>>
> >>> Source  Time Activity
> >>> 
> >>> pcap13:31:26.344453  SYN proxy to Tomcat
> >>> pcap13:31:26.344481  SYN, ACK Tomcat to proxy
> >>> debug   13:31:26.345 DebugSocket object created
> >>> debug   13:31:26.345 o.a.c.http11.InternalOutputBuffer.init
> >>> pcap13:31:26.345138  ACK proxy to Tomcat
> >>> pcap13:31:26.345174  GET request proxy to Tomcat
> >>> pcap13:31:26.345194  ACK Tomcat to proxy
> >>> pcap13:31:26.395281  FIN, ACK Tomcat to proxy
> >>> pcap13:31:26.395725  ACK proxy to Tomcat
> >>> pcap13:31:26.395741  FIN, ACK proxy to Tomcat
> >>> pcap13:31:26.395754  ACK Tomcat to proxy
> >>> debug   13:31:26.403 o.a.c.http11.InternalOutputBuffer.commit
> >>> debug   13:31:26.403 o.a.c.http11.InternalOutputBuffer$1.doWrite
> >>> debug   13:31:26.404 o.a.c.http11.InternalOutputBuffer$1.nextRequest
> >>> debug   13:31:26.404 o.a.c.http11.InternalOutputBuffer$1.recycle
> >>> debug   13:31:26.404 DebugSocket.close called
> >>>
> >>> This shows that the socket is closed long before Tomcat tries to
> >>> write to it (that would be after the doWrite but before nextRequest)
> >>> or Tomcat explicitly closes the socket.
> >>>
> >>> This also shows that neither Tomcat nor the application are directly
> >>> calling close() on the socket to trigger the close shown by pcap.
> >>>
> >>> I continue to come up with theories as to what might be happening
> >>> but they all seem unlikely.
> >>>
> >>> This is the BIO connector so the only time the socket should change
> >>> state is during a method call. While it might seem a little over the
> >>> top I think the next step is to log every single method call to
> >>> DebugSocket along with any exception generated. We need to try and
> >>> correlate the premature socket closure with something in the JVM. If
> >>>

Re: Weirdest Tomcat Behavior Ever?

2020-11-02 Thread Mark Thomas

On 02/11/2020 11:18, Eric Robinson wrote:
>> -Original Message-
>> From: Mark Thomas 
>> Sent: Sunday, November 1, 2020 11:50 AM
>> To: users@tomcat.apache.org
>> Subject: Re: Weirdest Tomcat Behavior Ever?
>>
>> On 01/11/2020 16:25, Mark Thomas wrote:
>>> 
>>>
>>> Keeping the previous logs for reference:
>>>
>>>>> Source  Time Activity
>>>>> 
>>>>> pcap15:14:25.375451  SYN proxy to Tomcat
>>>>> pcap15:14:25.375493  SYN, ACK Tomcat to proxy
>>>>> pcap15:14:25.375839  ACK proxy to Tomcat
>>>>> pcap15:14:25.375892  GET request proxy to Tomcat
>>>>> pcap15:14:25.375911  ACK Tomcat to proxy
>>>>> debug   15:14:25.376 o.a.c.http11.InternalOutputBuffer.init
>>>>> pcap15:14:25.376777  FIN Tomcat to proxy
>>>>> pcap15:14:25.377036  FIN, ACK proxy to Tomcat
>>>>> pcap15:14:25.377048  ACK Tomcat proxy
>>>>> debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer.commit
>>>>> debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer$1.doWrite
>>>>> debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer$1.nextRequest
>>>>> debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer$1.recycle
>>>>>
>>>>> Tomcat writes the request body to the buffer but when Tomcat tries
>>>>> to flush those bytes out to the network it finds that the socket has
>>>>> been closed. That normally indicates that the client has dropped the
>>>>> connection. This is sufficiently common that Tomcat swallows the
>>>>> exception. However, we know (from the pcap file) that the client did
>>>>> not drop the connection, Tomcat did.
>>>
>>> The next round of logging added a little more detail for the
>>> InternalOutputBuffer and wrapped the Socket to see when close() was
>> called.
>>>
>>> The results from the next round of logging are:
>>>
>>> Source  Time Activity
>>> 
>>> pcap13:31:26.344453  SYN proxy to Tomcat
>>> pcap13:31:26.344481  SYN, ACK Tomcat to proxy
>>> debug   13:31:26.345 DebugSocket object created
>>> debug   13:31:26.345 o.a.c.http11.InternalOutputBuffer.init
>>> pcap13:31:26.345138  ACK proxy to Tomcat
>>> pcap13:31:26.345174  GET request proxy to Tomcat
>>> pcap13:31:26.345194  ACK Tomcat to proxy
>>> pcap13:31:26.395281  FIN, ACK Tomcat to proxy
>>> pcap13:31:26.395725  ACK proxy to Tomcat
>>> pcap13:31:26.395741  FIN, ACK proxy to Tomcat
>>> pcap13:31:26.395754  ACK Tomcat to proxy
>>> debug   13:31:26.403 o.a.c.http11.InternalOutputBuffer.commit
>>> debug   13:31:26.403 o.a.c.http11.InternalOutputBuffer$1.doWrite
>>> debug   13:31:26.404 o.a.c.http11.InternalOutputBuffer$1.nextRequest
>>> debug   13:31:26.404 o.a.c.http11.InternalOutputBuffer$1.recycle
>>> debug   13:31:26.404 DebugSocket.close called
>>>
>>> This shows that the socket is closed long before Tomcat tries to write
>>> to it (that would be after the doWrite but before nextRequest) or
>>> Tomcat explicitly closes the socket.
>>>
>>> This also shows that neither Tomcat nor the application are directly
>>> calling close() on the socket to trigger the close shown by pcap.
>>>
>>> I continue to come up with theories as to what might be happening but
>>> they all seem unlikely.
>>>
>>> This is the BIO connector so the only time the socket should change
>>> state is during a method call. While it might seem a little over the
>>> top I think the next step is to log every single method call to
>>> DebugSocket along with any exception generated. We need to try and
>>> correlate the premature socket closure with something in the JVM. If
>>> there is no correlation to find then we really are into the realm of
>>> very strange JVM  and/or OS bugs.
>>>
>>> I'll start work on v3 of the debug patch.
>>
>> http://home.apache.org/~markt/dev/v7.0.72-custom-patch-v3/
>>
>> I opted to wrap both the InputStream and OutputStream associated with the
>> socket and log every method call and every exception thrown. I opted not to
>> log parameters with the exception of socket timeout since that might be
>>

RE: Weirdest Tomcat Behavior Ever?

2020-11-02 Thread Eric Robinson

> -Original Message-
> From: Mark Thomas 
> Sent: Sunday, November 1, 2020 11:50 AM
> To: users@tomcat.apache.org
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> On 01/11/2020 16:25, Mark Thomas wrote:
> > 
> >
> > Keeping the previous logs for reference:
> >
> >>> Source  Time Activity
> >>> 
> >>> pcap15:14:25.375451  SYN proxy to Tomcat
> >>> pcap15:14:25.375493  SYN, ACK Tomcat to proxy
> >>> pcap15:14:25.375839  ACK proxy to Tomcat
> >>> pcap15:14:25.375892  GET request proxy to Tomcat
> >>> pcap15:14:25.375911  ACK Tomcat to proxy
> >>> debug   15:14:25.376 o.a.c.http11.InternalOutputBuffer.init
> >>> pcap15:14:25.376777  FIN Tomcat to proxy
> >>> pcap15:14:25.377036  FIN, ACK proxy to Tomcat
> >>> pcap15:14:25.377048  ACK Tomcat proxy
> >>> debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer.commit
> >>> debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer$1.doWrite
> >>> debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer$1.nextRequest
> >>> debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer$1.recycle
> >>>
> >>> Tomcat writes the request body to the buffer but when Tomcat tries
> >>> to flush those bytes out to the network it finds that the socket has
> >>> been closed. That normally indicates that the client has dropped the
> >>> connection. This is sufficiently common that Tomcat swallows the
> >>> exception. However, we know (from the pcap file) that the client did
> >>> not drop the connection, Tomcat did.
> >
> > The next round of logging added a little more detail for the
> > InternalOutputBuffer and wrapped the Socket to see when close() was
> called.
> >
> > The results from the next round of logging are:
> >
> > Source  Time Activity
> > 
> > pcap13:31:26.344453  SYN proxy to Tomcat
> > pcap13:31:26.344481  SYN, ACK Tomcat to proxy
> > debug   13:31:26.345 DebugSocket object created
> > debug   13:31:26.345 o.a.c.http11.InternalOutputBuffer.init
> > pcap13:31:26.345138  ACK proxy to Tomcat
> > pcap13:31:26.345174  GET request proxy to Tomcat
> > pcap13:31:26.345194  ACK Tomcat to proxy
> > pcap13:31:26.395281  FIN, ACK Tomcat to proxy
> > pcap13:31:26.395725  ACK proxy to Tomcat
> > pcap13:31:26.395741  FIN, ACK proxy to Tomcat
> > pcap13:31:26.395754  ACK Tomcat to proxy
> > debug   13:31:26.403 o.a.c.http11.InternalOutputBuffer.commit
> > debug   13:31:26.403 o.a.c.http11.InternalOutputBuffer$1.doWrite
> > debug   13:31:26.404 o.a.c.http11.InternalOutputBuffer$1.nextRequest
> > debug   13:31:26.404 o.a.c.http11.InternalOutputBuffer$1.recycle
> > debug   13:31:26.404 DebugSocket.close called
> >
> > This shows that the socket is closed long before Tomcat tries to write
> > to it (that would be after the doWrite but before nextRequest) or
> > Tomcat explicitly closes the socket.
> >
> > This also shows that neither Tomcat nor the application are directly
> > calling close() on the socket to trigger the close shown by pcap.
> >
> > I continue to come up with theories as to what might be happening but
> > they all seem unlikely.
> >
> > This is the BIO connector so the only time the socket should change
> > state is during a method call. While it might seem a little over the
> > top I think the next step is to log every single method call to
> > DebugSocket along with any exception generated. We need to try and
> > correlate the premature socket closure with something in the JVM. If
> > there is no correlation to find then we really are into the realm of
> > very strange JVM  and/or OS bugs.
> >
> > I'll start work on v3 of the debug patch.
>
> http://home.apache.org/~markt/dev/v7.0.72-custom-patch-v3/
>
> I opted to wrap both the InputStream and OutputStream associated with the
> socket and log every method call and every exception thrown. I opted not to
> log parameters with the exception of socket timeout since that might be
> relevant.
>
> The debug logs will be noticeably more verbose that last time. Feel free to
> adjust the number/size of debug log files as suits your environment.
>
> Mark
>

I applied the V3 patch and confirmed that content is being written to 
debug0.log, but right away I noticed an error recur

Re: Weirdest Tomcat Behavior Ever?

2020-11-01 Thread Mark Thomas

On 01/11/2020 16:25, Mark Thomas wrote:
> 
> 
> Keeping the previous logs for reference:
> 
>>> Source  Time Activity
>>> 
>>> pcap15:14:25.375451  SYN proxy to Tomcat
>>> pcap15:14:25.375493  SYN, ACK Tomcat to proxy
>>> pcap15:14:25.375839  ACK proxy to Tomcat
>>> pcap15:14:25.375892  GET request proxy to Tomcat
>>> pcap15:14:25.375911  ACK Tomcat to proxy
>>> debug   15:14:25.376 o.a.c.http11.InternalOutputBuffer.init
>>> pcap15:14:25.376777  FIN Tomcat to proxy
>>> pcap15:14:25.377036  FIN, ACK proxy to Tomcat
>>> pcap15:14:25.377048  ACK Tomcat proxy
>>> debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer.commit
>>> debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer$1.doWrite
>>> debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer$1.nextRequest
>>> debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer$1.recycle
>>>
>>> Tomcat writes the request body to the buffer but when Tomcat tries to flush
>>> those bytes out to the network it finds that the socket has been closed. 
>>> That
>>> normally indicates that the client has dropped the connection. This is
>>> sufficiently common that Tomcat swallows the exception. However, we
>>> know (from the pcap file) that the client did not drop the connection, 
>>> Tomcat
>>> did.
> 
> The next round of logging added a little more detail for the
> InternalOutputBuffer and wrapped the Socket to see when close() was called.
> 
> The results from the next round of logging are:
> 
> Source  Time Activity
> 
> pcap13:31:26.344453  SYN proxy to Tomcat
> pcap13:31:26.344481  SYN, ACK Tomcat to proxy
> debug   13:31:26.345 DebugSocket object created
> debug   13:31:26.345 o.a.c.http11.InternalOutputBuffer.init
> pcap13:31:26.345138  ACK proxy to Tomcat
> pcap13:31:26.345174  GET request proxy to Tomcat
> pcap13:31:26.345194  ACK Tomcat to proxy
> pcap13:31:26.395281  FIN, ACK Tomcat to proxy
> pcap13:31:26.395725  ACK proxy to Tomcat
> pcap13:31:26.395741  FIN, ACK proxy to Tomcat
> pcap13:31:26.395754  ACK Tomcat to proxy
> debug   13:31:26.403 o.a.c.http11.InternalOutputBuffer.commit
> debug   13:31:26.403 o.a.c.http11.InternalOutputBuffer$1.doWrite
> debug   13:31:26.404 o.a.c.http11.InternalOutputBuffer$1.nextRequest
> debug   13:31:26.404 o.a.c.http11.InternalOutputBuffer$1.recycle
> debug   13:31:26.404 DebugSocket.close called
> 
> This shows that the socket is closed long before Tomcat tries to write
> to it (that would be after the doWrite but before nextRequest) or Tomcat
> explicitly closes the socket.
> 
> This also shows that neither Tomcat nor the application are directly
> calling close() on the socket to trigger the close shown by pcap.
> 
> I continue to come up with theories as to what might be happening but
> they all seem unlikely.
> 
> This is the BIO connector so the only time the socket should change
> state is during a method call. While it might seem a little over the top
> I think the next step is to log every single method call to DebugSocket
> along with any exception generated. We need to try and correlate the
> premature socket closure with something in the JVM. If there is no
> correlation to find then we really are into the realm of very strange
> JVM  and/or OS bugs.
> 
> I'll start work on v3 of the debug patch.

http://home.apache.org/~markt/dev/v7.0.72-custom-patch-v3/

I opted to wrap both the InputStream and OutputStream associated with
the socket and log every method call and every exception thrown. I opted
not to log parameters with the exception of socket timeout since that
might be relevant.

The debug logs will be noticeably more verbose that last time. Feel free
to adjust the number/size of debug log files as suits your environment.

Mark

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

Re: Weirdest Tomcat Behavior Ever?

2020-11-01 Thread Mark Thomas



Keeping the previous logs for reference:

>> Source  Time Activity
>> 
>> pcap15:14:25.375451  SYN proxy to Tomcat
>> pcap15:14:25.375493  SYN, ACK Tomcat to proxy
>> pcap15:14:25.375839  ACK proxy to Tomcat
>> pcap15:14:25.375892  GET request proxy to Tomcat
>> pcap15:14:25.375911  ACK Tomcat to proxy
>> debug   15:14:25.376 o.a.c.http11.InternalOutputBuffer.init
>> pcap15:14:25.376777  FIN Tomcat to proxy
>> pcap15:14:25.377036  FIN, ACK proxy to Tomcat
>> pcap15:14:25.377048  ACK Tomcat proxy
>> debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer.commit
>> debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer$1.doWrite
>> debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer$1.nextRequest
>> debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer$1.recycle
>>
>> Tomcat writes the request body to the buffer but when Tomcat tries to flush
>> those bytes out to the network it finds that the socket has been closed. That
>> normally indicates that the client has dropped the connection. This is
>> sufficiently common that Tomcat swallows the exception. However, we
>> know (from the pcap file) that the client did not drop the connection, Tomcat
>> did.

The next round of logging added a little more detail for the
InternalOutputBuffer and wrapped the Socket to see when close() was called.

The results from the next round of logging are:

Source  Time Activity

pcap13:31:26.344453  SYN proxy to Tomcat
pcap13:31:26.344481  SYN, ACK Tomcat to proxy
debug   13:31:26.345 DebugSocket object created
debug   13:31:26.345 o.a.c.http11.InternalOutputBuffer.init
pcap13:31:26.345138  ACK proxy to Tomcat
pcap13:31:26.345174  GET request proxy to Tomcat
pcap13:31:26.345194  ACK Tomcat to proxy
pcap13:31:26.395281  FIN, ACK Tomcat to proxy
pcap13:31:26.395725  ACK proxy to Tomcat
pcap13:31:26.395741  FIN, ACK proxy to Tomcat
pcap13:31:26.395754  ACK Tomcat to proxy
debug   13:31:26.403 o.a.c.http11.InternalOutputBuffer.commit
debug   13:31:26.403 o.a.c.http11.InternalOutputBuffer$1.doWrite
debug   13:31:26.404 o.a.c.http11.InternalOutputBuffer$1.nextRequest
debug   13:31:26.404 o.a.c.http11.InternalOutputBuffer$1.recycle
debug   13:31:26.404 DebugSocket.close called

This shows that the socket is closed long before Tomcat tries to write
to it (that would be after the doWrite but before nextRequest) or Tomcat
explicitly closes the socket.

This also shows that neither Tomcat nor the application are directly
calling close() on the socket to trigger the close shown by pcap.

I continue to come up with theories as to what might be happening but
they all seem unlikely.

This is the BIO connector so the only time the socket should change
state is during a method call. While it might seem a little over the top
I think the next step is to log every single method call to DebugSocket
along with any exception generated. We need to try and correlate the
premature socket closure with something in the JVM. If there is no
correlation to find then we really are into the realm of very strange
JVM  and/or OS bugs.

I'll start work on v3 of the debug patch.

Mark

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

RE: Weirdest Tomcat Behavior Ever?

2020-10-29 Thread Eric Robinson

> -Original Message-
> From: Mark Thomas 
> Sent: Thursday, October 29, 2020 5:45 AM
> To: users@tomcat.apache.org
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> On 28/10/2020 20:32, Mark Thomas wrote:
>
> 
>
> > I have the off-list mail and will start looking at the logs shortly.
>
> Progress. I think. I'll start with the following summary of the log data.
>
> Source  Time Activity
> 
> pcap15:14:25.375451  SYN proxy to Tomcat
> pcap15:14:25.375493  SYN, ACK Tomcat to proxy
> pcap15:14:25.375839  ACK proxy to Tomcat
> pcap15:14:25.375892  GET request proxy to Tomcat
> pcap15:14:25.375911  ACK Tomcat to proxy
> debug   15:14:25.376 o.a.c.http11.InternalOutputBuffer.init
> pcap15:14:25.376777  FIN Tomcat to proxy
> pcap15:14:25.377036  FIN, ACK proxy to Tomcat
> pcap15:14:25.377048  ACK Tomcat proxy
> debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer.commit
> debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer$1.doWrite
> debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer$1.nextRequest
> debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer$1.recycle
>
> Tomcat writes the request body to the buffer but when Tomcat tries to flush
> those bytes out to the network it finds that the socket has been closed. That
> normally indicates that the client has dropped the connection. This is
> sufficiently common that Tomcat swallows the exception. However, we
> know (from the pcap file) that the client did not drop the connection, Tomcat
> did.
>

That's the first hard evidence of where the problem lies. I feel like we're 
zeroing in on it.

> What is strange here is that with BIO is there is a 1-2-1 relationship between
> threads and sockets for the life of the socket. While I can see how a retained
> reference could truncate a response (including the
> headers) I don't yet see how the socket could get closed.
>
> I think more debug logging is required. I am currently working on that.
>

I'll apply the new patch and restart the tomcat this evening. Just to be safe, 
I'm only applying it to one of the tomcat instances.

--Eric

> Mark
>
> -
> To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
> For additional commands, e-mail: users-h...@tomcat.apache.org

Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.

Re: Weirdest Tomcat Behavior Ever?

2020-10-29 Thread Mark Thomas

On 29/10/2020 10:44, Mark Thomas wrote:



> I think more debug logging is required. I am currently working on that.

v2 of the debug logging patch is:
http://home.apache.org/~markt/dev/v7.0.72-custom-patch-v2/

Full disclosure: This patch is more invasive. I had to wrap the
java.net.Socket objects to ensure I catch all attempts to close the
socket. That patch should be safe but it goes further than just adding
some log messages.

Source code for patches:
https://github.com/markt-asf/tomcat/commits/debug-7.0.72

Mark

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

Re: Weirdest Tomcat Behavior Ever?

2020-10-29 Thread Mark Thomas

On 28/10/2020 20:32, Mark Thomas wrote:



> I have the off-list mail and will start looking at the logs shortly.

Progress. I think. I'll start with the following summary of the log data.

Source  Time Activity

pcap15:14:25.375451  SYN proxy to Tomcat
pcap15:14:25.375493  SYN, ACK Tomcat to proxy
pcap15:14:25.375839  ACK proxy to Tomcat
pcap15:14:25.375892  GET request proxy to Tomcat
pcap15:14:25.375911  ACK Tomcat to proxy
debug   15:14:25.376 o.a.c.http11.InternalOutputBuffer.init
pcap15:14:25.376777  FIN Tomcat to proxy
pcap15:14:25.377036  FIN, ACK proxy to Tomcat
pcap15:14:25.377048  ACK Tomcat proxy
debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer.commit
debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer$1.doWrite
debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer$1.nextRequest
debug   15:14:25.383 o.a.c.http11.InternalOutputBuffer$1.recycle

Tomcat writes the request body to the buffer but when Tomcat tries to
flush those bytes out to the network it finds that the socket has been
closed. That normally indicates that the client has dropped the
connection. This is sufficiently common that Tomcat swallows the
exception. However, we know (from the pcap file) that the client did not
drop the connection, Tomcat did.

What is strange here is that with BIO is there is a 1-2-1 relationship
between threads and sockets for the life of the socket. While I can see
how a retained reference could truncate a response (including the
headers) I don't yet see how the socket could get closed.

I think more debug logging is required. I am currently working on that.

Mark

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

Re: Weirdest Tomcat Behavior Ever?

2020-10-28 Thread Mark Thomas

On 28/10/2020 20:06, Eric Robinson wrote:
>> From: Eric Robinson 
>> Sent: Tuesday, October 27, 2020 11:33 PM
>> To: Tomcat Users List 
>> Subject: RE: Weirdest Tomcat Behavior Ever?
>>
>>> From: Mark Thomas 
>>> Sent: Tuesday, October 27, 2020 12:06 PM
>>> To: users@tomcat.apache.org
>>> Subject: Re: Weirdest Tomcat Behavior Ever?
>>>
>>> On 27/10/2020 16:29, Eric Robinson wrote:
>>>>> On 27/10/2020 15:22, Eric Robinson wrote:
>>>
>>> 
>>>
>>>>>> I had switched to the NIO connector last week. Is that why the
>>>>>> logs are still
>>>>> at 0 bytes?
>>>>>
>>>>> Yes. I only added the debug logging to BIO.
>>>>>
>>>>> Somewhere in a previous thread I recommended switching back to BIO
>>>>> as the code is simpler and therefore easier to debug.
>>>>>
>>>>> Note that my previous analysis of the access log valve entries was
>>>>> based on the assumption you had switched back to BIO.
>>>>>
>>>>> Given everything else is in place, I'd suggest switching back to
>>>>> BIO when you can, waiting for the issue to re-appear and then
>>>>> looking at the
>>> debug logs.
>>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>
>>>> Oh man, I must have missed the recommendation to switch back to BIO.
>>>> I
>>> will do that ASAP, most likely this evening. My apologies for causing
>>> the wasted effort.
>>>
>>> No worries.
>>>
>>> Looking at the NIO code, if you had switched back to BIO last night
>>> I'm fairly sure we'd now be adding the debug logging to BIO anyway so
>>> we'll end up in the same place.
>>>
>>> Now seems like a good time to point out that, as email based debugging
>>> sessions go, this is very much at the better end. Having someone as
>>> responsive as you are, who actually answers the questions asked and is
>>> willing to run custom patches to collect debug info makes it much more
>>> likely that not only will we reach a successful conclusion, but that
>>> we'll get there reasonably quickly.
>>>
>>
>> All I can say is that I greatly appreciate the detailed help!
>>
>> Tomcat has been put back into BIO mode and has been restarted. I
>> confirmed that the debug0.log file is now accumulating data.
>>
>> I am very eager to see what turns up in the logs Wednesday morning.
>>
>>
>>> Thanks,
>>>
>>> Mark
>>>
> 
> The custom patch restarted the debug0.log every few minutes and did not keep 
> previous copies, so it took me several tries today to catch the problem in 
> the act, but I finally did. I emailed you the pcaps and logs off list.

Sorry about that. I must have missed an setting for the FileHandler.

I have the off-list mail and will start looking at the logs shortly.

Mark

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

RE: Weirdest Tomcat Behavior Ever?

2020-10-28 Thread Eric Robinson

> From: Eric Robinson 
> Sent: Tuesday, October 27, 2020 11:33 PM
> To: Tomcat Users List 
> Subject: RE: Weirdest Tomcat Behavior Ever?
>
> > From: Mark Thomas 
> > Sent: Tuesday, October 27, 2020 12:06 PM
> > To: users@tomcat.apache.org
> > Subject: Re: Weirdest Tomcat Behavior Ever?
> >
> > On 27/10/2020 16:29, Eric Robinson wrote:
> > >> On 27/10/2020 15:22, Eric Robinson wrote:
> >
> > 
> >
> > >>> I had switched to the NIO connector last week. Is that why the
> > >>> logs are still
> > >> at 0 bytes?
> > >>
> > >> Yes. I only added the debug logging to BIO.
> > >>
> > >> Somewhere in a previous thread I recommended switching back to BIO
> > >> as the code is simpler and therefore easier to debug.
> > >>
> > >> Note that my previous analysis of the access log valve entries was
> > >> based on the assumption you had switched back to BIO.
> > >>
> > >> Given everything else is in place, I'd suggest switching back to
> > >> BIO when you can, waiting for the issue to re-appear and then
> > >> looking at the
> > debug logs.
> > >>
> > >> Mark
> > >>
> > >>
> > >
> > > Oh man, I must have missed the recommendation to switch back to BIO.
> > > I
> > will do that ASAP, most likely this evening. My apologies for causing
> > the wasted effort.
> >
> > No worries.
> >
> > Looking at the NIO code, if you had switched back to BIO last night
> > I'm fairly sure we'd now be adding the debug logging to BIO anyway so
> > we'll end up in the same place.
> >
> > Now seems like a good time to point out that, as email based debugging
> > sessions go, this is very much at the better end. Having someone as
> > responsive as you are, who actually answers the questions asked and is
> > willing to run custom patches to collect debug info makes it much more
> > likely that not only will we reach a successful conclusion, but that
> > we'll get there reasonably quickly.
> >
>
> All I can say is that I greatly appreciate the detailed help!
>
> Tomcat has been put back into BIO mode and has been restarted. I
> confirmed that the debug0.log file is now accumulating data.
>
> I am very eager to see what turns up in the logs Wednesday morning.
>
>
> > Thanks,
> >
> > Mark
> >

The custom patch restarted the debug0.log every few minutes and did not keep 
previous copies, so it took me several tries today to catch the problem in the 
act, but I finally did. I emailed you the pcaps and logs off list.

--Eric

Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.

RE: Weirdest Tomcat Behavior Ever?

2020-10-27 Thread Eric Robinson

> From: Mark Thomas 
> Sent: Tuesday, October 27, 2020 12:06 PM
> To: users@tomcat.apache.org
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> On 27/10/2020 16:29, Eric Robinson wrote:
> >> On 27/10/2020 15:22, Eric Robinson wrote:
>
> 
>
> >>> I had switched to the NIO connector last week. Is that why the logs
> >>> are still
> >> at 0 bytes?
> >>
> >> Yes. I only added the debug logging to BIO.
> >>
> >> Somewhere in a previous thread I recommended switching back to BIO as
> >> the code is simpler and therefore easier to debug.
> >>
> >> Note that my previous analysis of the access log valve entries was
> >> based on the assumption you had switched back to BIO.
> >>
> >> Given everything else is in place, I'd suggest switching back to BIO
> >> when you can, waiting for the issue to re-appear and then looking at the
> debug logs.
> >>
> >> Mark
> >>
> >>
> >
> > Oh man, I must have missed the recommendation to switch back to BIO. I
> will do that ASAP, most likely this evening. My apologies for causing the
> wasted effort.
>
> No worries.
>
> Looking at the NIO code, if you had switched back to BIO last night I'm fairly
> sure we'd now be adding the debug logging to BIO anyway so we'll end up in
> the same place.
>
> Now seems like a good time to point out that, as email based debugging
> sessions go, this is very much at the better end. Having someone as
> responsive as you are, who actually answers the questions asked and is
> willing to run custom patches to collect debug info makes it much more likely
> that not only will we reach a successful conclusion, but that we'll get there
> reasonably quickly.
>

All I can say is that I greatly appreciate the detailed help!

Tomcat has been put back into BIO mode and has been restarted. I confirmed that 
the debug0.log file is now accumulating data.

I am very eager to see what turns up in the logs Wednesday morning.

> Thanks,
>
> Mark
>
> -
> To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
> For additional commands, e-mail: users-h...@tomcat.apache.org

Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.

Re: Weirdest Tomcat Behavior Ever?

2020-10-27 Thread Mark Thomas

On 27/10/2020 16:29, Eric Robinson wrote:
>> On 27/10/2020 15:22, Eric Robinson wrote:

>>> I had switched to the NIO connector last week. Is that why the logs are 
>>> still
>> at 0 bytes?
>>
>> Yes. I only added the debug logging to BIO.
>>
>> Somewhere in a previous thread I recommended switching back to BIO as
>> the code is simpler and therefore easier to debug.
>>
>> Note that my previous analysis of the access log valve entries was based on
>> the assumption you had switched back to BIO.
>>
>> Given everything else is in place, I'd suggest switching back to BIO when you
>> can, waiting for the issue to re-appear and then looking at the debug logs.
>>
>> Mark
>>
>>
> 
> Oh man, I must have missed the recommendation to switch back to BIO. I will 
> do that ASAP, most likely this evening. My apologies for causing the wasted 
> effort.

No worries.

Looking at the NIO code, if you had switched back to BIO last night I'm
fairly sure we'd now be adding the debug logging to BIO anyway so we'll
end up in the same place.

Now seems like a good time to point out that, as email based debugging
sessions go, this is very much at the better end. Having someone as
responsive as you are, who actually answers the questions asked and is
willing to run custom patches to collect debug info makes it much more
likely that not only will we reach a successful conclusion, but that
we'll get there reasonably quickly.

Thanks,

Mark

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

RE: Weirdest Tomcat Behavior Ever?

2020-10-27 Thread Eric Robinson

> On 27/10/2020 15:22, Eric Robinson wrote:
> >> On 27/10/2020 09:16, Mark Thomas wrote:
> >>> On 27/10/2020 04:43, Eric Robinson wrote:
> >>>
> >>> 
> >>>
>  Any changes in the Nginx configuration in the relevant timescale?
> 
> >>
> >> The last change to the nginx config files was on 8/21. The first
> >> report of problems from the users in question was on 9/16. There
> >> is another set of users on a different tomcat instance who
> >> reported issues around 8/26, 5 days after nginx config change. It
> >> seems unlikely to be related. Also, I can't imagine what nginx
> >> could be sending that would induce the upstream tomcat to behave
> this way.
> >>>
> >>> If there is some sort of retaining references to request/response/etc.
> >>> at the root of this then that sort of issue is very sensitive to timing.
> >>> For example, I have had reliable reproduction cases in the past that
> >>> stopped working with the addition of a single debug statement. Any
> >>> sort of configuration change might have changed the timing
> >>> sufficiently to trigger the issue
> >>>
> >>> At this point, I'd say the Nginx config change might be a potential
> >>> trigger if the root cause is retaining references.
> >>>
>  Any updates to the application in the relevant timescale?
> 
> >>
> >> Their application was patched to a newer version on 6/5.
> >>>
> >>> That seems far enough away to be unlikely.
> >>>
>  Any features users started using that hadn't been used before
>  in that timescale?
> >>
> >> That one I couldn't answer, as we are only the hosting facility
> >> and we are not in the loop when it comes to the users' workflow,
> >> but it seems unlikely given the nature of their business.
> >>>
> >>> Fair enough. That one was a bit of a shot in the dark.
> >>>
> >>> 
> >>>
> >> 1. Now that you have provided this patch, should I still enable
> >> RECYCLE_FACADES=true?
> >>>
> >>> I'd recommend yes. At least until the issue is resolved.
> >>>
> >> 2. The servers in question are multi-tenanted. There are 17
> >> instances of tomcat, each running on a different set of ports and
> >> out of a separate directory tree, but they are all started with
> >> essentially the same init script, which exports certain
> >> site-specific path variables and executes tomcat from the
> >> specified folder structure. Can you think of any potential issues
> >> where making this change for one instance could have a negative
> >> effect on any of the other instances? Probably not, but just
> >> being careful. I will have these changes implemented during our
> >> nightly maintenance window will begin to gather relevant logs
> > first thing tomorrow!
> >>>
> >>> I can't think of any side effects.
> >>>
> >>
> >> --Eric
> >
> > Mark, the changes have been made per your instructions and tomcat
> > has been restarted. The debug0.log, and debug0.log.lck files were
> > created in the directory, but they currently both have 0 bytes.
> >>>
> >>> Drat. That suggests something isn't quite right as the logs should
> >>> start filling up as soon as a single request is made. I'll double
> >>> check my instructions if you could double check your end.
> >>
> >> I've clarified a few things in the instructions and confirmed they
> >> work with my local 7.0.72 build.
> >>
> >> Note: You will need to be using the BIO connector
> >>
> >
> > I had switched to the NIO connector last week. Is that why the logs are 
> > still
> at 0 bytes?
>
> Yes. I only added the debug logging to BIO.
>
> Somewhere in a previous thread I recommended switching back to BIO as
> the code is simpler and therefore easier to debug.
>
> Note that my previous analysis of the access log valve entries was based on
> the assumption you had switched back to BIO.
>
> Given everything else is in place, I'd suggest switching back to BIO when you
> can, waiting for the issue to re-appear and then looking at the debug logs.
>
> Mark
>
>

Oh man, I must have missed the recommendation to switch back to BIO. I will do 
that ASAP, most likely this evening. My apologies for causing the wasted effort.

> >
> >
> >> Mark
> >>
> >>>
> >>> Konstantin noted there was no source provided. I've pushed the
> >>> branch to
> >>> https://github.com/markt-asf/tomcat/tree/debug-7.0.72 so you can see
> >>> the changes I made.
> >>>
>  Also, RECYCLE_FACADES has been enabled and I confirmed that it is
> >> referenced in the logs as follows...
> 
>  INFO: Command line argument:
>  -Dorg.apache.catalina.connector.RECYCLE_FACADES=true
> >>>
> >>> Great.
> >>>
> >>> Mark
> >>>
> >>> 
> >>> - To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
> >>> For additional commands, e-mail: users-h...@tomcat.apache.org
> >>>
> >>
> >>
> >> --

Re: Weirdest Tomcat Behavior Ever?

2020-10-27 Thread Mark Thomas

On 27/10/2020 15:22, Eric Robinson wrote:
>> On 27/10/2020 09:16, Mark Thomas wrote:
>>> On 27/10/2020 04:43, Eric Robinson wrote:
>>>
>>> 
>>>
 Any changes in the Nginx configuration in the relevant timescale?

>>
>> The last change to the nginx config files was on 8/21. The first
>> report of problems from the users in question was on 9/16. There is
>> another set of users on a different tomcat instance who reported
>> issues around 8/26, 5 days after nginx config change. It seems
>> unlikely to be related. Also, I can't imagine what nginx could be
>> sending that would induce the upstream tomcat to behave this way.
>>>
>>> If there is some sort of retaining references to request/response/etc.
>>> at the root of this then that sort of issue is very sensitive to timing.
>>> For example, I have had reliable reproduction cases in the past that
>>> stopped working with the addition of a single debug statement. Any
>>> sort of configuration change might have changed the timing
>>> sufficiently to trigger the issue
>>>
>>> At this point, I'd say the Nginx config change might be a potential
>>> trigger if the root cause is retaining references.
>>>
 Any updates to the application in the relevant timescale?

>>
>> Their application was patched to a newer version on 6/5.
>>>
>>> That seems far enough away to be unlikely.
>>>
 Any features users started using that hadn't been used before in
 that timescale?
>>
>> That one I couldn't answer, as we are only the hosting facility and
>> we are not in the loop when it comes to the users' workflow, but it
>> seems unlikely given the nature of their business.
>>>
>>> Fair enough. That one was a bit of a shot in the dark.
>>>
>>> 
>>>
>> 1. Now that you have provided this patch, should I still enable
>> RECYCLE_FACADES=true?
>>>
>>> I'd recommend yes. At least until the issue is resolved.
>>>
>> 2. The servers in question are multi-tenanted. There are 17
>> instances of tomcat, each running on a different set of ports and
>> out of a separate directory tree, but they are all started with
>> essentially the same init script, which exports certain
>> site-specific path variables and executes tomcat from the specified
>> folder structure. Can you think of any potential issues where
>> making this change for one instance could have a negative effect on
>> any of the other instances? Probably not, but just being careful. I
>> will have these changes implemented during our nightly maintenance
>> window will begin to gather relevant logs
> first thing tomorrow!
>>>
>>> I can't think of any side effects.
>>>
>>
>> --Eric
>
> Mark, the changes have been made per your instructions and tomcat
> has been restarted. The debug0.log, and debug0.log.lck files were
> created in the directory, but they currently both have 0 bytes.
>>>
>>> Drat. That suggests something isn't quite right as the logs should
>>> start filling up as soon as a single request is made. I'll double
>>> check my instructions if you could double check your end.
>>
>> I've clarified a few things in the instructions and confirmed they work with
>> my local 7.0.72 build.
>>
>> Note: You will need to be using the BIO connector
>>
> 
> I had switched to the NIO connector last week. Is that why the logs are still 
> at 0 bytes?

Yes. I only added the debug logging to BIO.

Somewhere in a previous thread I recommended switching back to BIO as
the code is simpler and therefore easier to debug.

Note that my previous analysis of the access log valve entries was based
on the assumption you had switched back to BIO.

Given everything else is in place, I'd suggest switching back to BIO
when you can, waiting for the issue to re-appear and then looking at the
debug logs.

Mark


> 
> 
>> Mark
>>
>>>
>>> Konstantin noted there was no source provided. I've pushed the branch
>>> to
>>> https://github.com/markt-asf/tomcat/tree/debug-7.0.72 so you can see
>>> the changes I made.
>>>
 Also, RECYCLE_FACADES has been enabled and I confirmed that it is
>> referenced in the logs as follows...

 INFO: Command line argument:
 -Dorg.apache.catalina.connector.RECYCLE_FACADES=true
>>>
>>> Great.
>>>
>>> Mark
>>>
>>> -
>>> To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
>>> For additional commands, e-mail: users-h...@tomcat.apache.org
>>>
>>
>>
>> -
>> To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
>> For additional commands, e-mail: users-h...@tomcat.apache.org
> 
> Disclaimer : This email and any files transmitted with it are confidential 
> and intended solely for intended recipients. If you are not the named 
> addressee you should not disseminate, distribute, copy or alter this email. 
> Any vie

RE: Weirdest Tomcat Behavior Ever?

2020-10-27 Thread Eric Robinson

> On 27/10/2020 09:16, Mark Thomas wrote:
> > On 27/10/2020 04:43, Eric Robinson wrote:
> >
> > 
> >
> >> Any changes in the Nginx configuration in the relevant timescale?
> >>
> 
>  The last change to the nginx config files was on 8/21. The first
>  report of problems from the users in question was on 9/16. There is
>  another set of users on a different tomcat instance who reported
>  issues around 8/26, 5 days after nginx config change. It seems
>  unlikely to be related. Also, I can't imagine what nginx could be
>  sending that would induce the upstream tomcat to behave this way.
> >
> > If there is some sort of retaining references to request/response/etc.
> > at the root of this then that sort of issue is very sensitive to timing.
> > For example, I have had reliable reproduction cases in the past that
> > stopped working with the addition of a single debug statement. Any
> > sort of configuration change might have changed the timing
> > sufficiently to trigger the issue
> >
> > At this point, I'd say the Nginx config change might be a potential
> > trigger if the root cause is retaining references.
> >
> >> Any updates to the application in the relevant timescale?
> >>
> 
>  Their application was patched to a newer version on 6/5.
> >
> > That seems far enough away to be unlikely.
> >
> >> Any features users started using that hadn't been used before in
> >> that timescale?
> 
>  That one I couldn't answer, as we are only the hosting facility and
>  we are not in the loop when it comes to the users' workflow, but it
>  seems unlikely given the nature of their business.
> >
> > Fair enough. That one was a bit of a shot in the dark.
> >
> > 
> >
>  1. Now that you have provided this patch, should I still enable
>  RECYCLE_FACADES=true?
> >
> > I'd recommend yes. At least until the issue is resolved.
> >
>  2. The servers in question are multi-tenanted. There are 17
>  instances of tomcat, each running on a different set of ports and
>  out of a separate directory tree, but they are all started with
>  essentially the same init script, which exports certain
>  site-specific path variables and executes tomcat from the specified
>  folder structure. Can you think of any potential issues where
>  making this change for one instance could have a negative effect on
>  any of the other instances? Probably not, but just being careful. I
>  will have these changes implemented during our nightly maintenance
>  window will begin to gather relevant logs
> >>> first thing tomorrow!
> >
> > I can't think of any side effects.
> >
> 
>  --Eric
> >>>
> >>> Mark, the changes have been made per your instructions and tomcat
> >>> has been restarted. The debug0.log, and debug0.log.lck files were
> >>> created in the directory, but they currently both have 0 bytes.
> >
> > Drat. That suggests something isn't quite right as the logs should
> > start filling up as soon as a single request is made. I'll double
> > check my instructions if you could double check your end.
>
> I've clarified a few things in the instructions and confirmed they work with
> my local 7.0.72 build.
>
> Note: You will need to be using the BIO connector
>

I had switched to the NIO connector last week. Is that why the logs are still 
at 0 bytes?


> Mark
>
> >
> > Konstantin noted there was no source provided. I've pushed the branch
> > to
> > https://github.com/markt-asf/tomcat/tree/debug-7.0.72 so you can see
> > the changes I made.
> >
> >> Also, RECYCLE_FACADES has been enabled and I confirmed that it is
> referenced in the logs as follows...
> >>
> >> INFO: Command line argument:
> >> -Dorg.apache.catalina.connector.RECYCLE_FACADES=true
> >
> > Great.
> >
> > Mark
> >
> > -
> > To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
> > For additional commands, e-mail: users-h...@tomcat.apache.org
> >
>
>
> -
> To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
> For additional commands, e-mail: users-h...@tomcat.apache.org

Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional comma

Re: Weirdest Tomcat Behavior Ever?

2020-10-27 Thread Mark Thomas

On 27/10/2020 09:16, Mark Thomas wrote:
> On 27/10/2020 04:43, Eric Robinson wrote:
> 
> 
> 
>> Any changes in the Nginx configuration in the relevant timescale?
>>

 The last change to the nginx config files was on 8/21. The first
 report of problems from the users in question was on 9/16. There is
 another set of users on a different tomcat instance who reported
 issues around 8/26, 5 days after nginx config change. It seems
 unlikely to be related. Also, I can't imagine what nginx could be
 sending that would induce the upstream tomcat to behave this way.
> 
> If there is some sort of retaining references to request/response/etc.
> at the root of this then that sort of issue is very sensitive to timing.
> For example, I have had reliable reproduction cases in the past that
> stopped working with the addition of a single debug statement. Any sort
> of configuration change might have changed the timing sufficiently to
> trigger the issue
> 
> At this point, I'd say the Nginx config change might be a potential
> trigger if the root cause is retaining references.
> 
>> Any updates to the application in the relevant timescale?
>>

 Their application was patched to a newer version on 6/5.
> 
> That seems far enough away to be unlikely.
> 
>> Any features users started using that hadn't been used before in
>> that timescale?

 That one I couldn't answer, as we are only the hosting facility and we
 are not in the loop when it comes to the users' workflow, but it seems
 unlikely given the nature of their business.
> 
> Fair enough. That one was a bit of a shot in the dark.
> 
> 
> 
 1. Now that you have provided this patch, should I still enable
 RECYCLE_FACADES=true?
> 
> I'd recommend yes. At least until the issue is resolved.
> 
 2. The servers in question are multi-tenanted. There are 17 instances
 of tomcat, each running on a different set of ports and out of a
 separate directory tree, but they are all started with essentially the
 same init script, which exports certain site-specific path variables
 and executes tomcat from the specified folder structure. Can you think
 of any potential issues where making this change for one instance
 could have a negative effect on any of the other instances? Probably
 not, but just being careful. I will have these changes implemented
 during our nightly maintenance window will begin to gather relevant logs
>>> first thing tomorrow!
> 
> I can't think of any side effects.
> 

 --Eric
>>>
>>> Mark, the changes have been made per your instructions and tomcat has
>>> been restarted. The debug0.log, and debug0.log.lck files were created in the
>>> directory, but they currently both have 0 bytes.
> 
> Drat. That suggests something isn't quite right as the logs should start
> filling up as soon as a single request is made. I'll double check my
> instructions if you could double check your end.

I've clarified a few things in the instructions and confirmed they work
with my local 7.0.72 build.

Note: You will need to be using the BIO connector

Mark

> 
> Konstantin noted there was no source provided. I've pushed the branch to
> https://github.com/markt-asf/tomcat/tree/debug-7.0.72 so you can see the
> changes I made.
> 
>> Also, RECYCLE_FACADES has been enabled and I confirmed that it is referenced 
>> in the logs as follows...
>>
>> INFO: Command line argument: 
>> -Dorg.apache.catalina.connector.RECYCLE_FACADES=true
> 
> Great.
> 
> Mark
> 
> -
> To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
> For additional commands, e-mail: users-h...@tomcat.apache.org
> 


-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

Re: Weirdest Tomcat Behavior Ever?

2020-10-27 Thread Mark Thomas

On 27/10/2020 04:43, Eric Robinson wrote:

> Any changes in the Nginx configuration in the relevant timescale?
>
>>>
>>> The last change to the nginx config files was on 8/21. The first
>>> report of problems from the users in question was on 9/16. There is
>>> another set of users on a different tomcat instance who reported
>>> issues around 8/26, 5 days after nginx config change. It seems
>>> unlikely to be related. Also, I can't imagine what nginx could be
>>> sending that would induce the upstream tomcat to behave this way.

If there is some sort of retaining references to request/response/etc.
at the root of this then that sort of issue is very sensitive to timing.
For example, I have had reliable reproduction cases in the past that
stopped working with the addition of a single debug statement. Any sort
of configuration change might have changed the timing sufficiently to
trigger the issue

At this point, I'd say the Nginx config change might be a potential
trigger if the root cause is retaining references.

> Any updates to the application in the relevant timescale?
>
>>>
>>> Their application was patched to a newer version on 6/5.

That seems far enough away to be unlikely.

> Any features users started using that hadn't been used before in
> that timescale?
>>>
>>> That one I couldn't answer, as we are only the hosting facility and we
>>> are not in the loop when it comes to the users' workflow, but it seems
>>> unlikely given the nature of their business.

Fair enough. That one was a bit of a shot in the dark.

>>> 1. Now that you have provided this patch, should I still enable
>>> RECYCLE_FACADES=true?

I'd recommend yes. At least until the issue is resolved.

>>> 2. The servers in question are multi-tenanted. There are 17 instances
>>> of tomcat, each running on a different set of ports and out of a
>>> separate directory tree, but they are all started with essentially the
>>> same init script, which exports certain site-specific path variables
>>> and executes tomcat from the specified folder structure. Can you think
>>> of any potential issues where making this change for one instance
>>> could have a negative effect on any of the other instances? Probably
>>> not, but just being careful. I will have these changes implemented
>>> during our nightly maintenance window will begin to gather relevant logs
>> first thing tomorrow!

I can't think of any side effects.

>>>
>>> --Eric
>>
>> Mark, the changes have been made per your instructions and tomcat has
>> been restarted. The debug0.log, and debug0.log.lck files were created in the
>> directory, but they currently both have 0 bytes.

Drat. That suggests something isn't quite right as the logs should start
filling up as soon as a single request is made. I'll double check my
instructions if you could double check your end.

Konstantin noted there was no source provided. I've pushed the branch to
https://github.com/markt-asf/tomcat/tree/debug-7.0.72 so you can see the
changes I made.

> Also, RECYCLE_FACADES has been enabled and I confirmed that it is referenced 
> in the logs as follows...
> 
> INFO: Command line argument: 
> -Dorg.apache.catalina.connector.RECYCLE_FACADES=true

Great.

Mark

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

RE: Weirdest Tomcat Behavior Ever?

2020-10-26 Thread Eric Robinson

> -Original Message-
> From: Eric Robinson 
> Sent: Monday, October 26, 2020 11:37 PM
> To: Tomcat Users List 
> Subject: RE: Weirdest Tomcat Behavior Ever?
>
> > > On 26/10/2020 10:26, Mark Thomas wrote:
> > > > On 24/10/2020 01:32, Eric Robinson wrote:
> > > >
> > > > 
> > > >
> > > >>>> -Original Message-
> > > >>>> From: Mark Thomas 
> > > >
> > > > 
> > > >
> > > >>>> The failed request:
> > > >>>> - Completes in ~6ms
> > > >>>
> > > >>> I think we've seen the failed requests take as much as 50ms.
> > > >
> > > > Ack. That is still orders of magnitude smaller that the timeout
> > > > and consistent with generation time of some of the larger responses.
> > > >
> > > > I wouldn't sat it confirms any of my previous conclusions but it
> > > > doesn't invalidate them either.
> > > >
> > > >>>> Follow-up questions:
> > > >>>> - JVM
> > > >>>>   - Vendor?
> > > >>>>   - OS package or direct from Vendor?
> > > >>>>
> > > >>>
> > > >>> jdk-8u221-linux-x64.tar.gz downloaded from the Oracle web site.
> > > >
> > > > OK. That is post Java 8u202 so it should be a paid for,
> > > > commercially supported version of Java 8.
> > > >
> > > > The latest Java 8 release from Oracle is 8u271.
> > > >
> > > > The latest Java 8 release from AdoptOpenJDK is 8u272.
> > > >
> > > > I don't think we are quite at this point yet but what is your view
> > > > on updating to the latest Java 8 JDK (from either Oracle or
> > AdoptOpenJDK).
> > > >
> > > >>>> - Tomcat
> > > >>>>   - OS package, 3rd-party package or direct from ASF?
> > > >>>>
> > > >>>
> > > >>> tomcat.noarch  7.0.76-6.el7 from CentOS base repository
> > > >>>
> > > >>
> > > >> Drat, slight correction. I now recall that although we initially
> > > >> installed 7.0.76
> > > from the CentOS repo, the application vendor made us lower the
> > > version to 7.0.72, and I DO NOT know where we got that. However, it
> > > has not changed since October-ish of 2018.
> > > >
> > > > I've reviewed the 7.0.72 to 7.0.76 changelog and I don't see any
> > > > relevant changes.
> > > >
> > > >>>> - Config
> > > >>>>   - Any changes at all around the time the problems started? I'm
> > > >>>> thinking OS updates, VM restarted etc?
> > > >>>>
> > > >>>
> > > >>> server.xml has not changed since 4/20/20, which was well before
> > > >>> the problem manifested, and all the other files in the conf
> > > >>> folder are even older than that. We're seeing this symptom on
> > > >>> both production servers. One of them was rebooted a week ago,
> > > >>> but the other has been up continuously for
> > > >>> 258 days.
> > > >
> > > > OK. That rules a few things out which is good but it does make the
> > > > trigger for this issue even more mysterious.
> > > >
> > > > Any changes in the Nginx configuration in the relevant timescale?
> > > >
> >
> > The last change to the nginx config files was on 8/21. The first
> > report of problems from the users in question was on 9/16. There is
> > another set of users on a different tomcat instance who reported
> > issues around 8/26, 5 days after nginx config change. It seems
> > unlikely to be related. Also, I can't imagine what nginx could be
> > sending that would induce the upstream tomcat to behave this way.
> >
> > > > Any updates to the application in the relevant timescale?
> > > >
> >
> > Their application was patched to a newer version on 6/5.
> >
> > > > Any features users started using that hadn't been used before in
> > > > that timescale?
> >
> > That one I couldn't answer, as we are only the hosting facility and we
> > are not in the loop when it comes to the users' workflow, but it seems
> > unlikely given the nature of their business

RE: Weirdest Tomcat Behavior Ever?

2020-10-26 Thread Eric Robinson

> > On 26/10/2020 10:26, Mark Thomas wrote:
> > > On 24/10/2020 01:32, Eric Robinson wrote:
> > >
> > > 
> > >
> >  -Original Message-
> >  From: Mark Thomas 
> > >
> > > 
> > >
> >  The failed request:
> >  - Completes in ~6ms
> > >>>
> > >>> I think we've seen the failed requests take as much as 50ms.
> > >
> > > Ack. That is still orders of magnitude smaller that the timeout and
> > > consistent with generation time of some of the larger responses.
> > >
> > > I wouldn't sat it confirms any of my previous conclusions but it
> > > doesn't invalidate them either.
> > >
> >  Follow-up questions:
> >  - JVM
> >    - Vendor?
> >    - OS package or direct from Vendor?
> > 
> > >>>
> > >>> jdk-8u221-linux-x64.tar.gz downloaded from the Oracle web site.
> > >
> > > OK. That is post Java 8u202 so it should be a paid for, commercially
> > > supported version of Java 8.
> > >
> > > The latest Java 8 release from Oracle is 8u271.
> > >
> > > The latest Java 8 release from AdoptOpenJDK is 8u272.
> > >
> > > I don't think we are quite at this point yet but what is your view
> > > on updating to the latest Java 8 JDK (from either Oracle or
> AdoptOpenJDK).
> > >
> >  - Tomcat
> >    - OS package, 3rd-party package or direct from ASF?
> > 
> > >>>
> > >>> tomcat.noarch  7.0.76-6.el7 from CentOS base repository
> > >>>
> > >>
> > >> Drat, slight correction. I now recall that although we initially
> > >> installed 7.0.76
> > from the CentOS repo, the application vendor made us lower the version
> > to 7.0.72, and I DO NOT know where we got that. However, it has not
> > changed since October-ish of 2018.
> > >
> > > I've reviewed the 7.0.72 to 7.0.76 changelog and I don't see any
> > > relevant changes.
> > >
> >  - Config
> >    - Any changes at all around the time the problems started? I'm
> >  thinking OS updates, VM restarted etc?
> > 
> > >>>
> > >>> server.xml has not changed since 4/20/20, which was well before
> > >>> the problem manifested, and all the other files in the conf folder
> > >>> are even older than that. We're seeing this symptom on both
> > >>> production servers. One of them was rebooted a week ago, but the
> > >>> other has been up continuously for
> > >>> 258 days.
> > >
> > > OK. That rules a few things out which is good but it does make the
> > > trigger for this issue even more mysterious.
> > >
> > > Any changes in the Nginx configuration in the relevant timescale?
> > >
>
> The last change to the nginx config files was on 8/21. The first report of
> problems from the users in question was on 9/16. There is another set of
> users on a different tomcat instance who reported issues around 8/26, 5 days
> after nginx config change. It seems unlikely to be related. Also, I can't
> imagine what nginx could be sending that would induce the upstream tomcat
> to behave this way.
>
> > > Any updates to the application in the relevant timescale?
> > >
>
> Their application was patched to a newer version on 6/5.
>
> > > Any features users started using that hadn't been used before in
> > > that timescale?
>
> That one I couldn't answer, as we are only the hosting facility and we are not
> in the loop when it comes to the users' workflow, but it seems unlikely given
> the nature of their business.
>
> > >
> > > 
> > >
> >  Recommendations:
> >  - Switch back to the BIO connector if you haven't already. It has fewer
> >    moving parts than NIO so it is simpler debug.
> >  - Add "%b" to the access log pattern for Tomcat's access log valve to
> >    record the number of body (excluding headers) bytes Tomcat
> >  believes
> > it
> >    has written to the response.
> > 
> > 
> >  Next steps:
> >  - Wait for the issue to re-occur after the recommended changes
> >  above
> > and
> >    depending on what is recorded in the access log for %b for a failed
> >    request, shift the focus accordingly.
> >  - Answers to the additional questions would be nice but the access
> log
> >    %b value for a failed request is the key piece of information 
> >  required
> >    at this point.
> > 
> > >>>
> > >>> Good news! I enabled that parameter a few days ago and we have
> > >>> already caught some instances of the problem occurring.
> > >
> > > Excellent!
> > >
> > >>> Here is the logging format...
> > >>>
> > >>>  > >>> directory="logs"
> > >>>prefix="localhost_access." suffix=".log"
> > >>> pattern="%h %l %D %u %t %{JSESSIONID}c %{cookie}i %r %s %b %S
> %q"
> > >>> />
> > >>>
> > >>> Due to some sensitive content in the HTTP requests below, I have
> > >>> globally replaced certain words and addresses with random-ish
> > >>> strings, but I don't think I've done anything to impact the issue.
> > >>>
> > >>> Following is an example from Wednesday.
> > >>>
> > >>> This is a request being sent from the nginx proxy to the first of
> > >>> 2 upstream servers, 10.

Re: Weirdest Tomcat Behavior Ever?

2020-10-26 Thread Konstantin Kolinko

вт, 27 окт. 2020 г. в 00:07, Eric Robinson :
>
> > On 26/10/2020 10:26, Mark Thomas wrote:
> > > On 24/10/2020 01:32, Eric Robinson wrote:
> > >
> > > At this point I'd echo Konstantin's recommendation to add the
> > > following system property:
> > > org.apache.catalina.connector.RECYCLE_FACADES=true
> > >
> > > You'd normally do this in $CATALINA_HOME/bin/setenv.sh (creating that
> > > file if necessary) with a line like:
> > >
> > > CATALINA_OPTS="$CATALINA_OPTS
> > > -Dorg.apache.catalina.connector.RECYCLE_FACADES=true"
> > >
> > > You can confirm that the setting has been applied by looking in the
> > > log for the start-up. You should see something like:
> > >
> > > Oct 26, 2020 10:18:45 AM
> > > org.apache.catalina.startup.VersionLoggerListener log
> > > INFO: Command line argument:
> > > -Dorg.apache.catalina.connector.RECYCLE_FACADES=true
> > >
> > >
> > > That option reduces the re-use of request, response and related
> > > objects between requests and, if an application is retaining
> > > references it shouldn't, you usually see a bunch of
> > > NullPointerExceptions in the logs when the application tries to re-use 
> > > those
> > objects.
> > >
> > > Meanwhile, I'm going to work on a custom patch for 7.0.72 to add some
> > > additional logging around the network writes.
> >
> > Patch files and instructions for use:
> >
> > http://home.apache.org/~markt/dev/v7.0.72-custom-patch-v1/
> >
> > Mark
>
> Hi Mark,
>
> A couple of questions.
>
> 1. Now that you have provided this patch, should I still enable 
> RECYCLE_FACADES=true?

Regarding the patch,
there is no source code for it, but I think that it adds debug
logging, nothing more.

RECYCLE_FACADES makes your configuration more safe, protecting Tomcat
from misbehaving web applications. I have that property set on all
Tomcat installations that I care about. Thus I think that you should
set it anyway.

I usually add that property into the conf/catalina.property file.

See the wiki for a more detailed answer.
https://cwiki.apache.org/confluence/display/TOMCAT/Troubleshooting+and+Diagnostics#TroubleshootingandDiagnostics-TroubleshootingunexpectedResponsestateproblems

My thought that you case could be caused by something like the "Java
ImageIO" issue mentioned there. If something in the web application
produces dangling references to java.io.OutputStream and they are
closed during garbage collection, corrupting Tomcat internals.

> 2. [...] Can you think of any potential issues where making this change for 
> one instance could have a negative effect on any of the other instances? 
> Probably not, but just being careful.

I hope that you can cope with the amount of logging that this generates.

Best regards,
Konstantin Kolinko

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

RE: Weirdest Tomcat Behavior Ever?

2020-10-26 Thread Eric Robinson

> On 26/10/2020 10:26, Mark Thomas wrote:
> > On 24/10/2020 01:32, Eric Robinson wrote:
> >
> > 
> >
>  -Original Message-
>  From: Mark Thomas 
> >
> > 
> >
>  The failed request:
>  - Completes in ~6ms
> >>>
> >>> I think we've seen the failed requests take as much as 50ms.
> >
> > Ack. That is still orders of magnitude smaller that the timeout and
> > consistent with generation time of some of the larger responses.
> >
> > I wouldn't sat it confirms any of my previous conclusions but it
> > doesn't invalidate them either.
> >
>  Follow-up questions:
>  - JVM
>    - Vendor?
>    - OS package or direct from Vendor?
> 
> >>>
> >>> jdk-8u221-linux-x64.tar.gz downloaded from the Oracle web site.
> >
> > OK. That is post Java 8u202 so it should be a paid for, commercially
> > supported version of Java 8.
> >
> > The latest Java 8 release from Oracle is 8u271.
> >
> > The latest Java 8 release from AdoptOpenJDK is 8u272.
> >
> > I don't think we are quite at this point yet but what is your view on
> > updating to the latest Java 8 JDK (from either Oracle or AdoptOpenJDK).
> >
>  - Tomcat
>    - OS package, 3rd-party package or direct from ASF?
> 
> >>>
> >>> tomcat.noarch  7.0.76-6.el7 from CentOS base repository
> >>>
> >>
> >> Drat, slight correction. I now recall that although we initially installed 
> >> 7.0.76
> from the CentOS repo, the application vendor made us lower the version to
> 7.0.72, and I DO NOT know where we got that. However, it has not changed
> since October-ish of 2018.
> >
> > I've reviewed the 7.0.72 to 7.0.76 changelog and I don't see any
> > relevant changes.
> >
>  - Config
>    - Any changes at all around the time the problems started? I'm
>  thinking OS updates, VM restarted etc?
> 
> >>>
> >>> server.xml has not changed since 4/20/20, which was well before the
> >>> problem manifested, and all the other files in the conf folder are
> >>> even older than that. We're seeing this symptom on both production
> >>> servers. One of them was rebooted a week ago, but the other has been
> >>> up continuously for
> >>> 258 days.
> >
> > OK. That rules a few things out which is good but it does make the
> > trigger for this issue even more mysterious.
> >
> > Any changes in the Nginx configuration in the relevant timescale?
> >

The last change to the nginx config files was on 8/21. The first report of 
problems from the users in question was on 9/16. There is another set of users 
on a different tomcat instance who reported issues around 8/26, 5 days after 
nginx config change. It seems unlikely to be related. Also, I can't imagine 
what nginx could be sending that would induce the upstream tomcat to behave 
this way.

> > Any updates to the application in the relevant timescale?
> >

Their application was patched to a newer version on 6/5.

> > Any features users started using that hadn't been used before in that
> > timescale?

That one I couldn't answer, as we are only the hosting facility and we are not 
in the loop when it comes to the users' workflow, but it seems unlikely given 
the nature of their business.

> >
> > 
> >
>  Recommendations:
>  - Switch back to the BIO connector if you haven't already. It has fewer
>    moving parts than NIO so it is simpler debug.
>  - Add "%b" to the access log pattern for Tomcat's access log valve to
>    record the number of body (excluding headers) bytes Tomcat believes
> it
>    has written to the response.
> 
> 
>  Next steps:
>  - Wait for the issue to re-occur after the recommended changes above
> and
>    depending on what is recorded in the access log for %b for a failed
>    request, shift the focus accordingly.
>  - Answers to the additional questions would be nice but the access log
>    %b value for a failed request is the key piece of information required
>    at this point.
> 
> >>>
> >>> Good news! I enabled that parameter a few days ago and we have
> >>> already caught some instances of the problem occurring.
> >
> > Excellent!
> >
> >>> Here is the logging format...
> >>>
> >>>  >>> directory="logs"
> >>>prefix="localhost_access." suffix=".log" pattern="%h
> >>> %l %D %u %t %{JSESSIONID}c %{cookie}i %r %s %b %S %q" />
> >>>
> >>> Due to some sensitive content in the HTTP requests below, I have
> >>> globally replaced certain words and addresses with random-ish
> >>> strings, but I don't think I've done anything to impact the issue.
> >>>
> >>> Following is an example from Wednesday.
> >>>
> >>> This is a request being sent from the nginx proxy to the first of 2
> >>> upstream servers, 10.51.14.46
> >>>
> >>> 2020/10/21 15:51:22 [error] 39268#39268: *842342531 upstream
> >>> prematurely closed connection while reading response header from
> upstream, client:
> >>> 99.88.77.66, server: redacted.domain.com, request: "GET
> >>> /sandhut/jsp/catalog/xml/getWidgets.jsp?eDate=2020-10-

Re: Weirdest Tomcat Behavior Ever?

2020-10-26 Thread Mark Thomas

On 26/10/2020 10:26, Mark Thomas wrote:
> On 24/10/2020 01:32, Eric Robinson wrote:
> 
> 
> 
 -Original Message-
 From: Mark Thomas 
> 
> 
> 
 The failed request:
 - Completes in ~6ms
>>>
>>> I think we've seen the failed requests take as much as 50ms.
> 
> Ack. That is still orders of magnitude smaller that the timeout and
> consistent with generation time of some of the larger responses.
> 
> I wouldn't sat it confirms any of my previous conclusions but it doesn't
> invalidate them either.
> 
 Follow-up questions:
 - JVM
   - Vendor?
   - OS package or direct from Vendor?

>>>
>>> jdk-8u221-linux-x64.tar.gz downloaded from the Oracle web site.
> 
> OK. That is post Java 8u202 so it should be a paid for, commercially
> supported version of Java 8.
> 
> The latest Java 8 release from Oracle is 8u271.
> 
> The latest Java 8 release from AdoptOpenJDK is 8u272.
> 
> I don't think we are quite at this point yet but what is your view on
> updating to the latest Java 8 JDK (from either Oracle or AdoptOpenJDK).
> 
 - Tomcat
   - OS package, 3rd-party package or direct from ASF?

>>>
>>> tomcat.noarch  7.0.76-6.el7 from CentOS base repository
>>>
>>
>> Drat, slight correction. I now recall that although we initially installed 
>> 7.0.76 from the CentOS repo, the application vendor made us lower the 
>> version to 7.0.72, and I DO NOT know where we got that. However, it has not 
>> changed since October-ish of 2018.
> 
> I've reviewed the 7.0.72 to 7.0.76 changelog and I don't see any
> relevant changes.
> 
 - Config
   - Any changes at all around the time the problems started? I'm
 thinking OS updates, VM restarted etc?

>>>
>>> server.xml has not changed since 4/20/20, which was well before the
>>> problem manifested, and all the other files in the conf folder are even 
>>> older
>>> than that. We're seeing this symptom on both production servers. One of
>>> them was rebooted a week ago, but the other has been up continuously for
>>> 258 days.
> 
> OK. That rules a few things out which is good but it does make the
> trigger for this issue even more mysterious.
> 
> Any changes in the Nginx configuration in the relevant timescale?
> 
> Any updates to the application in the relevant timescale?
> 
> Any features users started using that hadn't been used before in that
> timescale?
> 
> 
> 
 Recommendations:
 - Switch back to the BIO connector if you haven't already. It has fewer
   moving parts than NIO so it is simpler debug.
 - Add "%b" to the access log pattern for Tomcat's access log valve to
   record the number of body (excluding headers) bytes Tomcat believes it
   has written to the response.


 Next steps:
 - Wait for the issue to re-occur after the recommended changes above and
   depending on what is recorded in the access log for %b for a failed
   request, shift the focus accordingly.
 - Answers to the additional questions would be nice but the access log
   %b value for a failed request is the key piece of information required
   at this point.

>>>
>>> Good news! I enabled that parameter a few days ago and we have already
>>> caught some instances of the problem occurring.
> 
> Excellent!
> 
>>> Here is the logging format...
>>>
>>> >> directory="logs"
>>>prefix="localhost_access." suffix=".log" pattern="%h %l %D 
>>> %u %t
>>> %{JSESSIONID}c %{cookie}i %r %s %b %S %q" />
>>>
>>> Due to some sensitive content in the HTTP requests below, I have globally
>>> replaced certain words and addresses with random-ish strings, but I don't
>>> think I've done anything to impact the issue.
>>>
>>> Following is an example from Wednesday.
>>>
>>> This is a request being sent from the nginx proxy to the first of 2 upstream
>>> servers, 10.51.14.46
>>>
>>> 2020/10/21 15:51:22 [error] 39268#39268: *842342531 upstream prematurely
>>> closed connection while reading response header from upstream, client:
>>> 99.88.77.66, server: redacted.domain.com, request: "GET
>>> /sandhut/jsp/catalog/xml/getWidgets.jsp?eDate=2020-10-
>>> 21&statueId=64438&facilityId=0&DeptId=0&CompressXML=Yes&sessionDID
>>> =0&GzUserId=75064&gizusrlogtblid=322095&gizappprocessid=8568&rnd2=0.
>>> 5650846×tamp=21102020155122.472656 HTTP/1.1", upstream:
>>> "http://10.51.14.46:3016/sandhut/jsp/catalog/xml/getWidgets.jsp?eDate=20
>>> 20-10-
>>> 21&statueId=64438&facilityId=0&DeptId=0&CompressXML=Yes&sessionDID
>>> =0&GzUserId=75064&gizusrlogtblid=322095&gizappprocessid=8568&rnd2=0.
>>> 5650846×tamp=21102020155122.472656", host:
>>> "redacted.domain.com"
>>>
>>> Here is the matching localhost_access log entry from that server….
>>>
>>> 10.51.14.133 - 144 - [21/Oct/2020:15:51:22 -0400]
>>> F405E25E49E3DCB81A36A87DED1FE573
>>> JSESSIONID=F405E25E49E3DCB81A36A87DED1FE573;
>>> srv_id=dea8d61a7d725e980a6093cb78d8ec73;
>>> JSESSIONID=F405E25E49E3DCB81A36A87DED1FE573;
>>> srv_id=dea8d61a7d725e980a6093cb78d8e

Re: Weirdest Tomcat Behavior Ever?

2020-10-26 Thread Mark Thomas

On 24/10/2020 01:32, Eric Robinson wrote:



>>> -Original Message-
>>> From: Mark Thomas 



>>> The failed request:
>>> - Completes in ~6ms
>>
>> I think we've seen the failed requests take as much as 50ms.

Ack. That is still orders of magnitude smaller that the timeout and
consistent with generation time of some of the larger responses.

I wouldn't sat it confirms any of my previous conclusions but it doesn't
invalidate them either.

>>> Follow-up questions:
>>> - JVM
>>>   - Vendor?
>>>   - OS package or direct from Vendor?
>>>
>>
>> jdk-8u221-linux-x64.tar.gz downloaded from the Oracle web site.

OK. That is post Java 8u202 so it should be a paid for, commercially
supported version of Java 8.

The latest Java 8 release from Oracle is 8u271.

The latest Java 8 release from AdoptOpenJDK is 8u272.

I don't think we are quite at this point yet but what is your view on
updating to the latest Java 8 JDK (from either Oracle or AdoptOpenJDK).

>>> - Tomcat
>>>   - OS package, 3rd-party package or direct from ASF?
>>>
>>
>> tomcat.noarch  7.0.76-6.el7 from CentOS base repository
>>
> 
> Drat, slight correction. I now recall that although we initially installed 
> 7.0.76 from the CentOS repo, the application vendor made us lower the version 
> to 7.0.72, and I DO NOT know where we got that. However, it has not changed 
> since October-ish of 2018.

I've reviewed the 7.0.72 to 7.0.76 changelog and I don't see any
relevant changes.

>>> - Config
>>>   - Any changes at all around the time the problems started? I'm
>>> thinking OS updates, VM restarted etc?
>>>
>>
>> server.xml has not changed since 4/20/20, which was well before the
>> problem manifested, and all the other files in the conf folder are even older
>> than that. We're seeing this symptom on both production servers. One of
>> them was rebooted a week ago, but the other has been up continuously for
>> 258 days.

OK. That rules a few things out which is good but it does make the
trigger for this issue even more mysterious.

Any changes in the Nginx configuration in the relevant timescale?

Any updates to the application in the relevant timescale?

Any features users started using that hadn't been used before in that
timescale?



>>> Recommendations:
>>> - Switch back to the BIO connector if you haven't already. It has fewer
>>>   moving parts than NIO so it is simpler debug.
>>> - Add "%b" to the access log pattern for Tomcat's access log valve to
>>>   record the number of body (excluding headers) bytes Tomcat believes it
>>>   has written to the response.
>>>
>>>
>>> Next steps:
>>> - Wait for the issue to re-occur after the recommended changes above and
>>>   depending on what is recorded in the access log for %b for a failed
>>>   request, shift the focus accordingly.
>>> - Answers to the additional questions would be nice but the access log
>>>   %b value for a failed request is the key piece of information required
>>>   at this point.
>>>
>>
>> Good news! I enabled that parameter a few days ago and we have already
>> caught some instances of the problem occurring.

Excellent!

>> Here is the logging format...
>>
>> > directory="logs"
>>prefix="localhost_access." suffix=".log" pattern="%h %l %D %u 
>> %t
>> %{JSESSIONID}c %{cookie}i %r %s %b %S %q" />
>>
>> Due to some sensitive content in the HTTP requests below, I have globally
>> replaced certain words and addresses with random-ish strings, but I don't
>> think I've done anything to impact the issue.
>>
>> Following is an example from Wednesday.
>>
>> This is a request being sent from the nginx proxy to the first of 2 upstream
>> servers, 10.51.14.46
>>
>> 2020/10/21 15:51:22 [error] 39268#39268: *842342531 upstream prematurely
>> closed connection while reading response header from upstream, client:
>> 99.88.77.66, server: redacted.domain.com, request: "GET
>> /sandhut/jsp/catalog/xml/getWidgets.jsp?eDate=2020-10-
>> 21&statueId=64438&facilityId=0&DeptId=0&CompressXML=Yes&sessionDID
>> =0&GzUserId=75064&gizusrlogtblid=322095&gizappprocessid=8568&rnd2=0.
>> 5650846×tamp=21102020155122.472656 HTTP/1.1", upstream:
>> "http://10.51.14.46:3016/sandhut/jsp/catalog/xml/getWidgets.jsp?eDate=20
>> 20-10-
>> 21&statueId=64438&facilityId=0&DeptId=0&CompressXML=Yes&sessionDID
>> =0&GzUserId=75064&gizusrlogtblid=322095&gizappprocessid=8568&rnd2=0.
>> 5650846×tamp=21102020155122.472656", host:
>> "redacted.domain.com"
>>
>> Here is the matching localhost_access log entry from that server….
>>
>> 10.51.14.133 - 144 - [21/Oct/2020:15:51:22 -0400]
>> F405E25E49E3DCB81A36A87DED1FE573
>> JSESSIONID=F405E25E49E3DCB81A36A87DED1FE573;
>> srv_id=dea8d61a7d725e980a6093cb78d8ec73;
>> JSESSIONID=F405E25E49E3DCB81A36A87DED1FE573;
>> srv_id=dea8d61a7d725e980a6093cb78d8ec73 GET
>> /sandhut/jsp/catalog/xml/getWidgets.jsp?eDate=2020-10-
>> 21&statueId=64438&facilityId=0&DeptId=0&CompressXML=Yes&sessionDID
>> =0&GzUserId=75064&gizusrlogtblid=322095&gizappprocessid=8568&rnd2=0.
>> 5650846×tamp=21

Re: Weirdest Tomcat Behavior Ever?

2020-10-24 Thread Konstantin Kolinko

сб, 24 окт. 2020 г. в 03:08, Eric Robinson :
>
> > - Answers to the additional questions would be nice but the access log
> >   %b value for a failed request is the key piece of information required
> >   at this point.
> >
>
> Good news! I enabled that parameter a few days ago and we have already caught 
> some instances of the problem occurring. Here is the logging format...
>
> prefix="localhost_access." suffix=".log" pattern="%h %l %D %u 
> %t %{JSESSIONID}c %{cookie}i %r %s %b %S %q" />

I usually also have the following pattern in my access logs:
[%{org.apache.catalina.parameter_parse_failed}r
%{org.apache.catalina.parameter_parse_failed_reason}r]

It captures whether a request.getParameter() call completed abnormally and why.
That is unlikely your cause, but there may be some other request
attributes that may be of interest, but nothing comes to mind at the
moment.

> Due to some sensitive content in the HTTP requests below, I have globally 
> replaced certain words and addresses with random-ish strings, but I don't 
> think I've done anything to impact the issue.
>
> Following is an example from Wednesday.
>
> This is a request being sent from the nginx proxy to the first of 2 upstream 
> servers, 10.51.14.46
>
> 2020/10/21 15:51:22 [error] 39268#39268: *842342531 upstream prematurely 
> closed connection while reading response header from upstream, client: 
> 99.88.77.66, server: redacted.domain.com, request: "GET 
> /sandhut/jsp/catalog/xml/getWidgets.jsp?eDate=2020-10-21&statueId=64438&facilityId=0&DeptId=0&CompressXML=Yes&sessionDID=0&GzUserId=75064&gizusrlogtblid=322095&gizappprocessid=8568&rnd2=0.5650846×tamp=21102020155122.472656
>  HTTP/1.1", upstream: 
> "http://10.51.14.46:3016/sandhut/jsp/catalog/xml/getWidgets.jsp?eDate=2020-10-21&statueId=64438&facilityId=0&DeptId=0&CompressXML=Yes&sessionDID=0&GzUserId=75064&gizusrlogtblid=322095&gizappprocessid=8568&rnd2=0.5650846×tamp=21102020155122.472656";,
>  host: "redacted.domain.com"

I wonder what "CompressXML=Yes" does.


I also recommend to add the following system property to the configuration:
org.apache.catalina.connector.RECYCLE_FACADES=true

https://cwiki.apache.org/confluence/display/TOMCAT/Troubleshooting+and+Diagnostics#TroubleshootingandDiagnostics-TroubleshootingunexpectedResponsestateproblems

> Drat, slight correction. I now recall that although we initially installed 
> 7.0.76 from the CentOS repo, the application vendor made us lower the version 
> to 7.0.72, and I DO NOT know where we got that.

It is possible to compare the files with pristine version from ASF
archive dist site,
https://archive.apache.org/dist/tomcat/tomcat-7/

If it was installed as an RPM package, `rpm -qf` should be able to
answer what installed package those files belong to.
https://stackoverflow.com/questions/1133495/how-do-i-find-which-rpm-package-supplies-a-file-im-looking-for

Best regards,
Konstantin Kolinko

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

RE: Weirdest Tomcat Behavior Ever?

2020-10-23 Thread Eric Robinson

> -Original Message-
> From: Eric Robinson 
> Sent: Friday, October 23, 2020 7:09 PM
> To: Tomcat Users List 
> Subject: RE: Weirdest Tomcat Behavior Ever?
>
> Hi Mark --
>
> Thanks tons for digging into this. See my answers below.
>
> > -Original Message-
> > From: Mark Thomas 
> > Sent: Friday, October 23, 2020 5:09 PM
> > To: users@tomcat.apache.org
> > Subject: Re: Weirdest Tomcat Behavior Ever?
> >
> > Hi Eric (and those following along),
> >
> > Eric sent me some network captures off-list from multiple points in
> > the network from the system where this is happening. The following is
> > the summary so far:
> >
> > Overview:
> > A small number of requests are receiving a completely empty (no
> > headers, no body) response.
> >
> >
> > Information Gathering:
> > Successful requests that are similar to the failed request:
> > - Take 7ms to 13ms to complete
> > - Have relatively small responses (~1k)
> > - Use HTTP/1.0
> > - Do not use keep-alive (request has Connection: close header)
> > - The request target is a JSP
> >
> > The failed request:
> > - Completes in ~6ms
>
> I think we've seen the failed requests take as much as 50ms.
>
> > - Has no response headers or body
> > - Records a successful entry in the access log
> >
> > System:
> > Tomcat 7.0.76, BIO HTTP connector
> > Java 1.8.0_221,
> > CentOS 7.5 server running in Azure
> >
> > Requests are received from an nginx reverse proxy. It looks like nginx
> > is using
> > HTTP/1.0 without keep-alive to proxy requests to Tomcat. This actually
> > makes things a little easier as we have one TCP/IP connection per request.
> >
> > After switching to NIO, the issue is still observable (info received
> > off-list along with access to network traces etc.).
> >
> > The issue appeared ~1 months ago after running without error since
> > October 2018. No known changes were made ~1 month ago.
> >
> > The TCP sequence numbers show that, as far as the network stack is
> > concerned, Tomcat did not write any data before closing the connection
> > cleanly.
> >
> > There is no other activity on the client port associated with the
> > failed request in the provided trace.
> >
> > The failed request does not appear to violate the HTTP specification.
> >
> > Neither the request nor the response are compressed.
> >
> > No WebSocket or other HTTP upgrade requests present in the network
> > traces.
> >
> > No obviously relevant bugs fixed since 7.0.76.
> >
> >
> > Follow-up questions:
> > - JVM
> >   - Vendor?
> >   - OS package or direct from Vendor?
> >
>
> jdk-8u221-linux-x64.tar.gz downloaded from the Oracle web site.
>
>
> > - Tomcat
> >   - OS package, 3rd-party package or direct from ASF?
> >
>
> tomcat.noarch  7.0.76-6.el7 from CentOS base repository
>

Drat, slight correction. I now recall that although we initially installed 
7.0.76 from the CentOS repo, the application vendor made us lower the version 
to 7.0.72, and I DO NOT know where we got that. However, it has not changed 
since October-ish of 2018.

> > - Config
> >   - Any changes at all around the time the problems started? I'm
> > thinking OS updates, VM restarted etc?
> >
>
> server.xml has not changed since 4/20/20, which was well before the
> problem manifested, and all the other files in the conf folder are even older
> than that. We're seeing this symptom on both production servers. One of
> them was rebooted a week ago, but the other has been up continuously for
> 258 days.
>
> >
> > Conclusions:
> > - It isn't timeout related. The request is completing in orders of
> >   magnitude less time than the timeout.
> >
> > - Based on the timings it looks like the JSP is processing the request
> >   and generating the response.
> >
> > - It happens with BIO so sendfile isn't a factor.
> >
> > - No compression so GZIP issues aren't a factor.
> >
> > - Given that the issue occurs with both BIO and NIO that rules out a bug
> >   in any BIO or NIO specific code. Note that while 7.0.x has largely
> >   separate code for BIO and NIO there are still significant sections of
> >   code that are essentially identical so it isn't quite as simple as
> >   just ruling out all the code in the BIO and NIO specific classes.
> >   It also makes a JVM issue seem less likely at this point.
> >
> >
> > Current thin

RE: Weirdest Tomcat Behavior Ever?

2020-10-23 Thread Eric Robinson

Hi Mark --

Thanks tons for digging into this. See my answers below.

> -Original Message-
> From: Mark Thomas 
> Sent: Friday, October 23, 2020 5:09 PM
> To: users@tomcat.apache.org
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> Hi Eric (and those following along),
>
> Eric sent me some network captures off-list from multiple points in the
> network from the system where this is happening. The following is the
> summary so far:
>
> Overview:
> A small number of requests are receiving a completely empty (no headers,
> no body) response.
>
>
> Information Gathering:
> Successful requests that are similar to the failed request:
> - Take 7ms to 13ms to complete
> - Have relatively small responses (~1k)
> - Use HTTP/1.0
> - Do not use keep-alive (request has Connection: close header)
> - The request target is a JSP
>
> The failed request:
> - Completes in ~6ms

I think we've seen the failed requests take as much as 50ms.

> - Has no response headers or body
> - Records a successful entry in the access log
>
> System:
> Tomcat 7.0.76, BIO HTTP connector
> Java 1.8.0_221,
> CentOS 7.5 server running in Azure
>
> Requests are received from an nginx reverse proxy. It looks like nginx is 
> using
> HTTP/1.0 without keep-alive to proxy requests to Tomcat. This actually
> makes things a little easier as we have one TCP/IP connection per request.
>
> After switching to NIO, the issue is still observable (info received off-list 
> along
> with access to network traces etc.).
>
> The issue appeared ~1 months ago after running without error since October
> 2018. No known changes were made ~1 month ago.
>
> The TCP sequence numbers show that, as far as the network stack is
> concerned, Tomcat did not write any data before closing the connection
> cleanly.
>
> There is no other activity on the client port associated with the failed 
> request
> in the provided trace.
>
> The failed request does not appear to violate the HTTP specification.
>
> Neither the request nor the response are compressed.
>
> No WebSocket or other HTTP upgrade requests present in the network
> traces.
>
> No obviously relevant bugs fixed since 7.0.76.
>
>
> Follow-up questions:
> - JVM
>   - Vendor?
>   - OS package or direct from Vendor?
>

jdk-8u221-linux-x64.tar.gz downloaded from the Oracle web site.


> - Tomcat
>   - OS package, 3rd-party package or direct from ASF?
>

tomcat.noarch  7.0.76-6.el7 from CentOS base repository

> - Config
>   - Any changes at all around the time the problems started? I'm
> thinking OS updates, VM restarted etc?
>

server.xml has not changed since 4/20/20, which was well before the problem 
manifested, and all the other files in the conf folder are even older than 
that. We're seeing this symptom on both production servers. One of them was 
rebooted a week ago, but the other has been up continuously for 258 days.

>
> Conclusions:
> - It isn't timeout related. The request is completing in orders of
>   magnitude less time than the timeout.
>
> - Based on the timings it looks like the JSP is processing the request
>   and generating the response.
>
> - It happens with BIO so sendfile isn't a factor.
>
> - No compression so GZIP issues aren't a factor.
>
> - Given that the issue occurs with both BIO and NIO that rules out a bug
>   in any BIO or NIO specific code. Note that while 7.0.x has largely
>   separate code for BIO and NIO there are still significant sections of
>   code that are essentially identical so it isn't quite as simple as
>   just ruling out all the code in the BIO and NIO specific classes.
>   It also makes a JVM issue seem less likely at this point.
>
>
> Current thinking:
> - I can think of various ways this might be happening but they all seem
>   pretty unlikely. The next steps will be to enable existing logging
>   (and potentially add some custom logging) to try and narrow down where
>   the response data is disappearing.
>
> - Having reviewed the BIO code, there is a mercifully simple way to see
>   how many bytes Tomcat thinks it has written to the response. The "%b"
>   pattern in the access log valve will show how many bytes from the
>   request body that Tomcat has written to the network socket without an
>   IOException. I'd prefer something that recorded header bytes as well
>   but this is a good first start and doesn't require custom patches.
>
>
> Recommendations:
> - Switch back to the BIO connector if you haven't already. It has fewer
>   moving parts than NIO so it is simpler debug.
> - Add "%b" to the access log pattern for Tomcat's access log va

Re: Weirdest Tomcat Behavior Ever?

2020-10-23 Thread Mark Thomas

Hi Eric (and those following along),

Eric sent me some network captures off-list from multiple points in the
network from the system where this is happening. The following is the
summary so far:

Overview:
A small number of requests are receiving a completely empty (no headers,
no body) response.


Information Gathering:
Successful requests that are similar to the failed request:
- Take 7ms to 13ms to complete
- Have relatively small responses (~1k)
- Use HTTP/1.0
- Do not use keep-alive (request has Connection: close header)
- The request target is a JSP

The failed request:
- Completes in ~6ms
- Has no response headers or body
- Records a successful entry in the access log

System:
Tomcat 7.0.76, BIO HTTP connector
Java 1.8.0_221,
CentOS 7.5 server running in Azure

Requests are received from an nginx reverse proxy. It looks like nginx
is using HTTP/1.0 without keep-alive to proxy requests to Tomcat. This
actually makes things a little easier as we have one TCP/IP connection
per request.

After switching to NIO, the issue is still observable (info received
off-list along with access to network traces etc.).

The issue appeared ~1 months ago after running without error since
October 2018. No known changes were made ~1 month ago.

The TCP sequence numbers show that, as far as the network stack is
concerned, Tomcat did not write any data before closing the connection
cleanly.

There is no other activity on the client port associated with the failed
request in the provided trace.

The failed request does not appear to violate the HTTP specification.

Neither the request nor the response are compressed.

No WebSocket or other HTTP upgrade requests present in the network traces.

No obviously relevant bugs fixed since 7.0.76.


Follow-up questions:
- JVM
  - Vendor?
  - OS package or direct from Vendor?

- Tomcat
  - OS package, 3rd-party package or direct from ASF?

- Config
  - Any changes at all around the time the problems started? I'm
thinking OS updates, VM restarted etc?


Conclusions:
- It isn't timeout related. The request is completing in orders of
  magnitude less time than the timeout.

- Based on the timings it looks like the JSP is processing the request
  and generating the response.

- It happens with BIO so sendfile isn't a factor.

- No compression so GZIP issues aren't a factor.

- Given that the issue occurs with both BIO and NIO that rules out a bug
  in any BIO or NIO specific code. Note that while 7.0.x has largely
  separate code for BIO and NIO there are still significant sections of
  code that are essentially identical so it isn't quite as simple as
  just ruling out all the code in the BIO and NIO specific classes.
  It also makes a JVM issue seem less likely at this point.


Current thinking:
- I can think of various ways this might be happening but they all seem
  pretty unlikely. The next steps will be to enable existing logging
  (and potentially add some custom logging) to try and narrow down where
  the response data is disappearing.

- Having reviewed the BIO code, there is a mercifully simple way to see
  how many bytes Tomcat thinks it has written to the response. The "%b"
  pattern in the access log valve will show how many bytes from the
  request body that Tomcat has written to the network socket without an
  IOException. I'd prefer something that recorded header bytes as well
  but this is a good first start and doesn't require custom patches.


Recommendations:
- Switch back to the BIO connector if you haven't already. It has fewer
  moving parts than NIO so it is simpler debug.
- Add "%b" to the access log pattern for Tomcat's access log valve to
  record the number of body (excluding headers) bytes Tomcat believes it
  has written to the response.


Next steps:
- Wait for the issue to re-occur after the recommended changes above and
  depending on what is recorded in the access log for %b for a failed
  request, shift the focus accordingly.
- Answers to the additional questions would be nice but the access log
  %b value for a failed request is the key piece of information required
  at this point.


HTH,

Mark

(I'm trimming the previous discussion on this thread as I think all the
relevant info is summarized above but do point out if I have missed
something.)

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

RE: Weirdest Tomcat Behavior Ever?

2020-10-16 Thread Eric Robinson

> -Original Message-
> From: Mark Thomas 
> Sent: Friday, October 16, 2020 8:02 AM
> To: users@tomcat.apache.org
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> On 16/10/2020 12:37, Eric Robinson wrote:
> >> From: Mark Thomas 
>
> 
>
> >> I'd like to see those screen shots please. Better still would be
> >> access to the captures themselves (just the relevant connections not
> >> the whole thing). I believe what you are telling us but long
> >> experience tells me it is best to double check the original data as well.
> >>
> >
> > I'll send you a link to the screen shot first, then I'll package up the 
> > captures
> and send a link to that in a bit. As the files may contain somewhat sensitive
> information, I'll send a secure mail direct to your inbox.
>
> Thanks. The screenshots didn't shed any light on this so far.
>
> >> I have observed something similar ish in the CI systems. In that case
> >> it is the requests that disappear. Client side logging shows the
> >> request was made but there is no sign of it ever being received by
> >> Tomcat. I don't have network traces for that (yet) so I'm not sure where
> the data is going missing.
> >>
> >> I am beginning to suspect there is a hard to trigger Tomcat or JVM
> >> bug here. I think a Tomcat bug is more likely although I have been
> >> over the code several times and I don't see anything.
> >>
> >
> > I'm thinking a bug of some kind, too, but I've been hosting about 1800
> instances of tomcat for 15 years and I have never seen this behavior before.
> >
> >> A few more questions:
> >>
> >
> > This is where I will begin to struggle bit.
> >
> >> Which HTTP connector are you using? BIO, NIO or APR/Native?
> >>
> >
> > I believe BIO is the default? server.xml just says...
> >
> >  >connectionTimeout="2"
> >redirectPort="8443" />
>
> That will be BIO or APR/Native depending on whether you have Tomcat
> Native installed. If you look at the logs for when Tomcat starts you should 
> see
> something like:
>
> INFO: Initializing ProtocolHandler ["http-bio-3016"] or
> INFO: Initializing ProtocolHandler ["http-apr-3016"]
>
> What do you see between the square brackets?

["http-bio-3016"]

>
> >> Is the issue reproducible if you switch to a different connector?
> >>
> >
> > In 15 years of using tomcat in production, we've never tried switching the
> connector type. (Probably because the app vendor never suggested it.) I did
> a little research and I'm beginning to think about the pros/cons.
>
> If you wanted to try this, I'd recommend:
>
> protocol="org.apache.coyote.http11.Http11NioProtocol"
>

We're in the middle of a production day so I want to avoid restarting tomcat if 
I can, but I'll plan to change that tonight.

> >> How easy is it for you to reproduce this issue?
> >>
> >
> > It's not reproducible at will but it happens frequently enough that we don't
> have to wait long for it to happen. I have wireshark capturing to disk
> continuously and rotating the files at 10 minute intervals to keep them
> smallish. Then I just tail the logs and wait.
>
> Ack.
>
> >> How are you linking the request you see in the access log with the
> >> request you see in Wireshark?
> >
> > Aside from the timestamp of the packets and the timestamp of the tomcat
> log messages, each HTTP request also contains a high-resolution timestamp
> and a unique random number. That way, even if the same request occurs
> multiple times in rapid succession, we can still isolate the exact one that
> failed.
>
> Excellent.
>
> >> How comfortable are you running a patched version of Tomcat (drop
> >> class files provided by me into $CATALINA_BASE/lib in the right
> >> directory structure and restart Tomcat)? Just thinking ahead about
> >> collecting additional debug information.
> >
> > That would be a tricky in our production environment, but the users are
> getting desperate enough that we'd be willing to explore that approach.
>
> Understood.
>
> Some other questions that have come to mind:
>
> - Has this app always had this problem?
>

No, it's been running fine in this environment since October 2018.

> - If not, when did it start and what changed at that point (JVM version,
> Tomcat version etc)
>

This is a new thing

Re: Weirdest Tomcat Behavior Ever?

2020-10-16 Thread Mark Thomas

On 16/10/2020 12:37, Eric Robinson wrote:
>> From: Mark Thomas 



>> I'd like to see those screen shots please. Better still would be access to 
>> the
>> captures themselves (just the relevant connections not the whole thing). I
>> believe what you are telling us but long experience tells me it is best to
>> double check the original data as well.
>>
> 
> I'll send you a link to the screen shot first, then I'll package up the 
> captures and send a link to that in a bit. As the files may contain somewhat 
> sensitive information, I'll send a secure mail direct to your inbox.

Thanks. The screenshots didn't shed any light on this so far.

>> I have observed something similar ish in the CI systems. In that case it is 
>> the
>> requests that disappear. Client side logging shows the request was made but
>> there is no sign of it ever being received by Tomcat. I don't have network
>> traces for that (yet) so I'm not sure where the data is going missing.
>>
>> I am beginning to suspect there is a hard to trigger Tomcat or JVM bug here. 
>> I
>> think a Tomcat bug is more likely although I have been over the code several
>> times and I don't see anything.
>>
> 
> I'm thinking a bug of some kind, too, but I've been hosting about 1800 
> instances of tomcat for 15 years and I have never seen this behavior before.
> 
>> A few more questions:
>>
> 
> This is where I will begin to struggle bit.
> 
>> Which HTTP connector are you using? BIO, NIO or APR/Native?
>>
> 
> I believe BIO is the default? server.xml just says...
> 
> connectionTimeout="2"
>redirectPort="8443" />

That will be BIO or APR/Native depending on whether you have Tomcat
Native installed. If you look at the logs for when Tomcat starts you
should see something like:

INFO: Initializing ProtocolHandler ["http-bio-3016"]
or
INFO: Initializing ProtocolHandler ["http-apr-3016"]

What do you see between the square brackets?

>> Is the issue reproducible if you switch to a different connector?
>>
> 
> In 15 years of using tomcat in production, we've never tried switching the 
> connector type. (Probably because the app vendor never suggested it.) I did a 
> little research and I'm beginning to think about the pros/cons.

If you wanted to try this, I'd recommend:

protocol="org.apache.coyote.http11.Http11NioProtocol"

>> How easy is it for you to reproduce this issue?
>>
> 
> It's not reproducible at will but it happens frequently enough that we don't 
> have to wait long for it to happen. I have wireshark capturing to disk 
> continuously and rotating the files at 10 minute intervals to keep them 
> smallish. Then I just tail the logs and wait.

Ack.

>> How are you linking the request you see in the access log with the request
>> you see in Wireshark?
> 
> Aside from the timestamp of the packets and the timestamp of the tomcat log 
> messages, each HTTP request also contains a high-resolution timestamp and a 
> unique random number. That way, even if the same request occurs multiple 
> times in rapid succession, we can still isolate the exact one that failed.

Excellent.

>> How comfortable are you running a patched version of Tomcat (drop class
>> files provided by me into $CATALINA_BASE/lib in the right directory structure
>> and restart Tomcat)? Just thinking ahead about collecting additional debug
>> information.
> 
> That would be a tricky in our production environment, but the users are 
> getting desperate enough that we'd be willing to explore that approach.

Understood.

Some other questions that have come to mind:

- Has this app always had this problem?

- If not, when did it start and what changed at that point (JVM version,
Tomcat version etc)

- I notice the the requests in the screenshots aren't using HTTP
keep-alive. Do you know why?

- How heavily loaded is the system (typical requests per second etc)?

- I'm assuming there is nothing in the log files when the failed request
happens. Is this correct?

Mark

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

Re: Weirdest Tomcat Behavior Ever?

2020-10-16 Thread Daniel Skiles

This is a bit of a long shot, but are you running on VMWare ESX by chance?
A few years ago, I had a problem where Tomcat would send data down the
socket, but about half the message wouldn't actually arrive at the client.

On Fri, Oct 16, 2020 at 5:05 AM Eric Robinson 
wrote:

> Hi Mark --
>
> Those are great questions. See answers below.
>
>
> > -Original Message-
> > From: Mark Thomas 
> > Sent: Friday, October 16, 2020 2:20 AM
> > To: users@tomcat.apache.org
> > Subject: Re: Weirdest Tomcat Behavior Ever?
> >
> > On 16/10/2020 00:27, Eric Robinson wrote:
> >
> > 
> >
> > > The localhost_access log shows a request received and an HTTP 200
> > response sent, as follows...
> > >
> > > 10.51.14.133 [15/Oct/2020:12:52:45 -0400] 57 GET
> > > /app/code.jsp?gizmoid=64438&weekday=5&aptdate=2020-10-
> > 15&multiResFacFi
> > >
> > lterId=0&sessionDID=0&GzUserId=71340&zztusrlogtblid=321072&zztappproc
> > e
> > > ssid=40696&rnd2=0.0715816×tamp=15102020125245.789063 HTTP/1.0
> > > ?gizmoid=64438&weekday=5&aptdate=2020-10-
> > 15&multiResFacFilterId=0&sess
> > >
> > ionDID=0&GzUserId=71340&zztusrlogtblid=321072&zztappprocessid=40696&
> > rn
> > > d2=0.0715816×tamp=15102020125245.789063 200
> > >
> > > But WireShark shows what really happened. The server received the GET
> > request, and then it sent a FIN to terminate the connection. So if
> tomcat sent
> > an HTTP response, it did not make it out the Ethernet card.
> > >
> > > Is this the weirdest thing or what? Ideas would sure be appreciated!
> >
> > I am assuming there is a typo in your Java version and you are using
> Java 8.
> >
>
> Yes, Java 8.
>
> > That Tomcat version is over 3.5 years old (and Tomcat 7 is EOL in less
> than 6
> > months). If you aren't already planning to upgrade (I'd suggest to
> 9.0.x) then
> > you might want to start thinking about it.
> >
>
> Vendor constraint. It's a canned application published by a national
> software company, and they have not officially approved tomcat 8 for use on
> Linux yet.
>
> > I have a few ideas about what might be going on but rather than fire out
> > random theories I have some questions that might help narrow things down.
> >
> > 1. If this request was successful, how big is the response?
> >
>
> 1035 bytes.
>
> > 2. If this request was successful, how long would it typically take to
> > complete?
> >
>
> Under 60 ms.
>
> > 3. Looking at the Wireshark trace for a failed request, how long after
> the last
> > byte of the request is sent by the client does Tomcat send the FIN?
> >
>
> Maybe 100 microseconds.
>
> > 4. Looking at the Wireshark trace for a failed request, is the request
> fully sent
> > (including terminating CRLF etc)?
> >
>
> Yes, the request as seen by the tomcat server is complete and is
> terminated by 0D 0A.
>
> > 5. Are there any proxies, firewalls etc between the user agent and
> Tomcat?
> >
>
> User agent -> firewall -> nginx plus -> upstream tomcat servers
>
> > 6. What timeouts are configured for the Connector?
> >
>
> Sorry, which connector are you referring to?
>
> > 7. Is this HTTP/1.1, HTTP/2, AJP, with or without TLS?
> >
>
> HTTP/1.1
>
> > 8. Where are you running Wireshark? User agent? Tomcat? Somewhere
> > else?
>
> On the nginx proxy and both upstream tomcat servers. (On the user agent,
> too, but that doesn't help us in this case.)
>
> If you would like to see a screen shot showing all 4 captures
> side-by-size, I can send you a secure link. It will verify my answers
> above. It shows 4 separate WireShark captures taken simultaneously:
>
> (a) the request going from the nginx proxy to tomcat 1
> (b) tomcat 1 receiving the request and terminating the connection
> (c) nginx sending the request to tomcat 2
> (d) tomcat 2 replying to the request (but the reply does not help the user
> because the tomcat server does not recognize the user agent's JSESSIONID
> cookie, so it responds "invalid session."
>
> >
> > Mark
> >
> > -
> > To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
> > For additional commands, e-mail: users-h...@tomcat.apache.org
>
> Disclaimer : This email and any files transmitted with it are confidential
> and intended solely for intended recipients. If you are not the named
> addressee you should not disseminate, distribute, copy or alter this email.
> Any views or opinions presented in this email are solely those of the
> author and might not represent those of Physician Select Management.
> Warning: Although Physician Select Management has taken reasonable
> precautions to ensure no viruses are present in this email, the company
> cannot accept responsibility for any loss or damage arising from the use of
> this email or attachments.
>
> -
> To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
> For additional commands, e-mail: users-h...@tomcat.apache.org
>
>

RE: Weirdest Tomcat Behavior Ever?

2020-10-16 Thread Eric Robinson

> -Original Message-
> From: Mark Thomas 
> Sent: Friday, October 16, 2020 5:17 AM
> To: users@tomcat.apache.org
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> On 16/10/2020 10:05, Eric Robinson wrote:
> > Hi Mark --
> >
> > Those are great questions. See answers below.
> >
> >
> >> -Original Message-
> >> From: Mark Thomas 
> >> Sent: Friday, October 16, 2020 2:20 AM
> >> To: users@tomcat.apache.org
> >> Subject: Re: Weirdest Tomcat Behavior Ever?
> >>
> >> On 16/10/2020 00:27, Eric Robinson wrote:
> >>
> >> 
> >>
> >>> The localhost_access log shows a request received and an HTTP 200
> >> response sent, as follows...
> >>>
> >>> 10.51.14.133 [15/Oct/2020:12:52:45 -0400] 57 GET
> >>> /app/code.jsp?gizmoid=64438&weekday=5&aptdate=2020-10-
> >> 15&multiResFacFi
> >>>
> >>
> lterId=0&sessionDID=0&GzUserId=71340&zztusrlogtblid=321072&zztappproc
> >> e
> >>> ssid=40696&rnd2=0.0715816×tamp=15102020125245.789063
> HTTP/1.0
> >>> ?gizmoid=64438&weekday=5&aptdate=2020-10-
> >> 15&multiResFacFilterId=0&sess
> >>>
> >>
> ionDID=0&GzUserId=71340&zztusrlogtblid=321072&zztappprocessid=40696&
> >> rn
> >>> d2=0.0715816×tamp=15102020125245.789063 200
> >>>
> >>> But WireShark shows what really happened. The server received the
> >>> GET
> >> request, and then it sent a FIN to terminate the connection. So if
> >> tomcat sent an HTTP response, it did not make it out the Ethernet card.
> >>>
> >>> Is this the weirdest thing or what? Ideas would sure be appreciated!
> >>
> >> I am assuming there is a typo in your Java version and you are using Java 
> >> 8.
> >>
> >
> > Yes, Java 8.
> >
> >> That Tomcat version is over 3.5 years old (and Tomcat 7 is EOL in
> >> less than 6 months). If you aren't already planning to upgrade (I'd
> >> suggest to 9.0.x) then you might want to start thinking about it.
> >>
> >
> > Vendor constraint. It's a canned application published by a national
> software company, and they have not officially approved tomcat 8 for use on
> Linux yet.
> >
> >> I have a few ideas about what might be going on but rather than fire
> >> out random theories I have some questions that might help narrow things
> down.
> >>
> >> 1. If this request was successful, how big is the response?
> >>
> >
> > 1035 bytes.
> >
> >> 2. If this request was successful, how long would it typically take
> >> to complete?
> >>
> >
> > Under 60 ms.
> >
> >> 3. Looking at the Wireshark trace for a failed request, how long
> >> after the last byte of the request is sent by the client does Tomcat send
> the FIN?
> >>
> >
> > Maybe 100 microseconds.
> >
> >> 4. Looking at the Wireshark trace for a failed request, is the
> >> request fully sent (including terminating CRLF etc)?
> >>
> >
> > Yes, the request as seen by the tomcat server is complete and is
> terminated by 0D 0A.
> >
> >> 5. Are there any proxies, firewalls etc between the user agent and
> Tomcat?
> >>
> >
> > User agent -> firewall -> nginx plus -> upstream tomcat servers
> >
> >> 6. What timeouts are configured for the Connector?
> >>
> >
> > Sorry, which connector are you referring to?
> >
> >> 7. Is this HTTP/1.1, HTTP/2, AJP, with or without TLS?
> >>
> >
> > HTTP/1.1
> >
> >> 8. Where are you running Wireshark? User agent? Tomcat? Somewhere
> >> else?
> >
> > On the nginx proxy and both upstream tomcat servers. (On the user
> > agent, too, but that doesn't help us in this case.)
> >
> > If you would like to see a screen shot showing all 4 captures side-by-size, 
> > I
> can send you a secure link. It will verify my answers above. It shows 4
> separate WireShark captures taken simultaneously:
> >
> > (a) the request going from the nginx proxy to tomcat 1
> > (b) tomcat 1 receiving the request and terminating the connection
> > (c) nginx sending the request to tomcat 2
> > (d) tomcat 2 replying to the request (but the reply does not help the user
> because the tomcat server does not recognize the user agent's JSESSIONID
> cookie, so it responds &

Re: Weirdest Tomcat Behavior Ever?

2020-10-16 Thread Paul Carter-Brown

Hi Eric,

These weird situations are sometimes best looked at by confirming what the
OS is seeing from user-space.

Can you run: sudo strace -r -f -e trace=network -p 

You can then log that to a file and correlate and see if the kernel is in
fact being asked to send the response.

It's very insightful to  see what is actually going on between the JVM and
Kernel.

Paul

On Fri, Oct 16, 2020 at 12:16 PM Mark Thomas  wrote:

> On 16/10/2020 10:05, Eric Robinson wrote:
> > Hi Mark --
> >
> > Those are great questions. See answers below.
> >
> >
> >> -Original Message-
> >> From: Mark Thomas 
> >> Sent: Friday, October 16, 2020 2:20 AM
> >> To: users@tomcat.apache.org
> >> Subject: Re: Weirdest Tomcat Behavior Ever?
> >>
> >> On 16/10/2020 00:27, Eric Robinson wrote:
> >>
> >> 
> >>
> >>> The localhost_access log shows a request received and an HTTP 200
> >> response sent, as follows...
> >>>
> >>> 10.51.14.133 [15/Oct/2020:12:52:45 -0400] 57 GET
> >>> /app/code.jsp?gizmoid=64438&weekday=5&aptdate=2020-10-
> >> 15&multiResFacFi
> >>>
> >> lterId=0&sessionDID=0&GzUserId=71340&zztusrlogtblid=321072&zztappproc
> >> e
> >>> ssid=40696&rnd2=0.0715816×tamp=15102020125245.789063 HTTP/1.0
> >>> ?gizmoid=64438&weekday=5&aptdate=2020-10-
> >> 15&multiResFacFilterId=0&sess
> >>>
> >> ionDID=0&GzUserId=71340&zztusrlogtblid=321072&zztappprocessid=40696&
> >> rn
> >>> d2=0.0715816×tamp=15102020125245.789063 200
> >>>
> >>> But WireShark shows what really happened. The server received the GET
> >> request, and then it sent a FIN to terminate the connection. So if
> tomcat sent
> >> an HTTP response, it did not make it out the Ethernet card.
> >>>
> >>> Is this the weirdest thing or what? Ideas would sure be appreciated!
> >>
> >> I am assuming there is a typo in your Java version and you are using
> Java 8.
> >>
> >
> > Yes, Java 8.
> >
> >> That Tomcat version is over 3.5 years old (and Tomcat 7 is EOL in less
> than 6
> >> months). If you aren't already planning to upgrade (I'd suggest to
> 9.0.x) then
> >> you might want to start thinking about it.
> >>
> >
> > Vendor constraint. It's a canned application published by a national
> software company, and they have not officially approved tomcat 8 for use on
> Linux yet.
> >
> >> I have a few ideas about what might be going on but rather than fire out
> >> random theories I have some questions that might help narrow things
> down.
> >>
> >> 1. If this request was successful, how big is the response?
> >>
> >
> > 1035 bytes.
> >
> >> 2. If this request was successful, how long would it typically take to
> >> complete?
> >>
> >
> > Under 60 ms.
> >
> >> 3. Looking at the Wireshark trace for a failed request, how long after
> the last
> >> byte of the request is sent by the client does Tomcat send the FIN?
> >>
> >
> > Maybe 100 microseconds.
> >
> >> 4. Looking at the Wireshark trace for a failed request, is the request
> fully sent
> >> (including terminating CRLF etc)?
> >>
> >
> > Yes, the request as seen by the tomcat server is complete and is
> terminated by 0D 0A.
> >
> >> 5. Are there any proxies, firewalls etc between the user agent and
> Tomcat?
> >>
> >
> > User agent -> firewall -> nginx plus -> upstream tomcat servers
> >
> >> 6. What timeouts are configured for the Connector?
> >>
> >
> > Sorry, which connector are you referring to?
> >
> >> 7. Is this HTTP/1.1, HTTP/2, AJP, with or without TLS?
> >>
> >
> > HTTP/1.1
> >
> >> 8. Where are you running Wireshark? User agent? Tomcat? Somewhere
> >> else?
> >
> > On the nginx proxy and both upstream tomcat servers. (On the user agent,
> too, but that doesn't help us in this case.)
> >
> > If you would like to see a screen shot showing all 4 captures
> side-by-size, I can send you a secure link. It will verify my answers
> above. It shows 4 separate WireShark captures taken simultaneously:
> >
> > (a) the request going from the nginx proxy to tomcat 1
> > (b) tomcat 1 receiving the request and terminating the connection
> > (c) nginx sending the requ

Re: Weirdest Tomcat Behavior Ever?

2020-10-16 Thread Mark Thomas

On 16/10/2020 10:05, Eric Robinson wrote:
> Hi Mark --
> 
> Those are great questions. See answers below.
> 
> 
>> -Original Message-
>> From: Mark Thomas 
>> Sent: Friday, October 16, 2020 2:20 AM
>> To: users@tomcat.apache.org
>> Subject: Re: Weirdest Tomcat Behavior Ever?
>>
>> On 16/10/2020 00:27, Eric Robinson wrote:
>>
>> 
>>
>>> The localhost_access log shows a request received and an HTTP 200
>> response sent, as follows...
>>>
>>> 10.51.14.133 [15/Oct/2020:12:52:45 -0400] 57 GET
>>> /app/code.jsp?gizmoid=64438&weekday=5&aptdate=2020-10-
>> 15&multiResFacFi
>>>
>> lterId=0&sessionDID=0&GzUserId=71340&zztusrlogtblid=321072&zztappproc
>> e
>>> ssid=40696&rnd2=0.0715816×tamp=15102020125245.789063 HTTP/1.0
>>> ?gizmoid=64438&weekday=5&aptdate=2020-10-
>> 15&multiResFacFilterId=0&sess
>>>
>> ionDID=0&GzUserId=71340&zztusrlogtblid=321072&zztappprocessid=40696&
>> rn
>>> d2=0.0715816×tamp=15102020125245.789063 200
>>>
>>> But WireShark shows what really happened. The server received the GET
>> request, and then it sent a FIN to terminate the connection. So if tomcat 
>> sent
>> an HTTP response, it did not make it out the Ethernet card.
>>>
>>> Is this the weirdest thing or what? Ideas would sure be appreciated!
>>
>> I am assuming there is a typo in your Java version and you are using Java 8.
>>
> 
> Yes, Java 8.
> 
>> That Tomcat version is over 3.5 years old (and Tomcat 7 is EOL in less than 6
>> months). If you aren't already planning to upgrade (I'd suggest to 9.0.x) 
>> then
>> you might want to start thinking about it.
>>
> 
> Vendor constraint. It's a canned application published by a national software 
> company, and they have not officially approved tomcat 8 for use on Linux yet.
> 
>> I have a few ideas about what might be going on but rather than fire out
>> random theories I have some questions that might help narrow things down.
>>
>> 1. If this request was successful, how big is the response?
>>
> 
> 1035 bytes.
> 
>> 2. If this request was successful, how long would it typically take to
>> complete?
>>
> 
> Under 60 ms.
> 
>> 3. Looking at the Wireshark trace for a failed request, how long after the 
>> last
>> byte of the request is sent by the client does Tomcat send the FIN?
>>
> 
> Maybe 100 microseconds.
> 
>> 4. Looking at the Wireshark trace for a failed request, is the request fully 
>> sent
>> (including terminating CRLF etc)?
>>
> 
> Yes, the request as seen by the tomcat server is complete and is terminated 
> by 0D 0A.
> 
>> 5. Are there any proxies, firewalls etc between the user agent and Tomcat?
>>
> 
> User agent -> firewall -> nginx plus -> upstream tomcat servers
> 
>> 6. What timeouts are configured for the Connector?
>>
> 
> Sorry, which connector are you referring to?
> 
>> 7. Is this HTTP/1.1, HTTP/2, AJP, with or without TLS?
>>
> 
> HTTP/1.1
> 
>> 8. Where are you running Wireshark? User agent? Tomcat? Somewhere
>> else?
> 
> On the nginx proxy and both upstream tomcat servers. (On the user agent, too, 
> but that doesn't help us in this case.)
> 
> If you would like to see a screen shot showing all 4 captures side-by-size, I 
> can send you a secure link. It will verify my answers above. It shows 4 
> separate WireShark captures taken simultaneously:
> 
> (a) the request going from the nginx proxy to tomcat 1
> (b) tomcat 1 receiving the request and terminating the connection
> (c) nginx sending the request to tomcat 2
> (d) tomcat 2 replying to the request (but the reply does not help the user 
> because the tomcat server does not recognize the user agent's JSESSIONID 
> cookie, so it responds "invalid session."

Hmm.

That rules out most of my ideas.

I'd like to see those screen shots please. Better still would be access
to the captures themselves (just the relevant connections not the whole
thing). I believe what you are telling us but long experience tells me
it is best to double check the original data as well.

I have observed something similar ish in the CI systems. In that case it
is the requests that disappear. Client side logging shows the request
was made but there is no sign of it ever being received by Tomcat. I
don't have network traces for that (yet) so I'm not sure where the data
is going missing.

I am beginning to suspect there is a hard t

RE: Weirdest Tomcat Behavior Ever?

2020-10-16 Thread Eric Robinson

> > 6. What timeouts are configured for the Connector?
> >
>
> Sorry, which connector are you referring to?
>

Sorry, I'm a dummy. Obviously you mean the tomcat connector.

connectionTimeout="2"

-Eric
Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

RE: Weirdest Tomcat Behavior Ever?

2020-10-16 Thread Eric Robinson

Hi Mark --

Those are great questions. See answers below.


> -Original Message-
> From: Mark Thomas 
> Sent: Friday, October 16, 2020 2:20 AM
> To: users@tomcat.apache.org
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> On 16/10/2020 00:27, Eric Robinson wrote:
>
> 
>
> > The localhost_access log shows a request received and an HTTP 200
> response sent, as follows...
> >
> > 10.51.14.133 [15/Oct/2020:12:52:45 -0400] 57 GET
> > /app/code.jsp?gizmoid=64438&weekday=5&aptdate=2020-10-
> 15&multiResFacFi
> >
> lterId=0&sessionDID=0&GzUserId=71340&zztusrlogtblid=321072&zztappproc
> e
> > ssid=40696&rnd2=0.0715816×tamp=15102020125245.789063 HTTP/1.0
> > ?gizmoid=64438&weekday=5&aptdate=2020-10-
> 15&multiResFacFilterId=0&sess
> >
> ionDID=0&GzUserId=71340&zztusrlogtblid=321072&zztappprocessid=40696&
> rn
> > d2=0.0715816×tamp=15102020125245.789063 200
> >
> > But WireShark shows what really happened. The server received the GET
> request, and then it sent a FIN to terminate the connection. So if tomcat sent
> an HTTP response, it did not make it out the Ethernet card.
> >
> > Is this the weirdest thing or what? Ideas would sure be appreciated!
>
> I am assuming there is a typo in your Java version and you are using Java 8.
>

Yes, Java 8.

> That Tomcat version is over 3.5 years old (and Tomcat 7 is EOL in less than 6
> months). If you aren't already planning to upgrade (I'd suggest to 9.0.x) then
> you might want to start thinking about it.
>

Vendor constraint. It's a canned application published by a national software 
company, and they have not officially approved tomcat 8 for use on Linux yet.

> I have a few ideas about what might be going on but rather than fire out
> random theories I have some questions that might help narrow things down.
>
> 1. If this request was successful, how big is the response?
>

1035 bytes.

> 2. If this request was successful, how long would it typically take to
> complete?
>

Under 60 ms.

> 3. Looking at the Wireshark trace for a failed request, how long after the 
> last
> byte of the request is sent by the client does Tomcat send the FIN?
>

Maybe 100 microseconds.

> 4. Looking at the Wireshark trace for a failed request, is the request fully 
> sent
> (including terminating CRLF etc)?
>

Yes, the request as seen by the tomcat server is complete and is terminated by 
0D 0A.

> 5. Are there any proxies, firewalls etc between the user agent and Tomcat?
>

User agent -> firewall -> nginx plus -> upstream tomcat servers

> 6. What timeouts are configured for the Connector?
>

Sorry, which connector are you referring to?

> 7. Is this HTTP/1.1, HTTP/2, AJP, with or without TLS?
>

HTTP/1.1

> 8. Where are you running Wireshark? User agent? Tomcat? Somewhere
> else?

On the nginx proxy and both upstream tomcat servers. (On the user agent, too, 
but that doesn't help us in this case.)

If you would like to see a screen shot showing all 4 captures side-by-size, I 
can send you a secure link. It will verify my answers above. It shows 4 
separate WireShark captures taken simultaneously:

(a) the request going from the nginx proxy to tomcat 1
(b) tomcat 1 receiving the request and terminating the connection
(c) nginx sending the request to tomcat 2
(d) tomcat 2 replying to the request (but the reply does not help the user 
because the tomcat server does not recognize the user agent's JSESSIONID 
cookie, so it responds "invalid session."

>
> Mark
>
> -
> To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
> For additional commands, e-mail: users-h...@tomcat.apache.org

Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

Re: Weirdest Tomcat Behavior Ever?

2020-10-16 Thread Mark Thomas

On 16/10/2020 00:27, Eric Robinson wrote:

> The localhost_access log shows a request received and an HTTP 200 response 
> sent, as follows...
> 
> 10.51.14.133 [15/Oct/2020:12:52:45 -0400] 57 GET 
> /app/code.jsp?gizmoid=64438&weekday=5&aptdate=2020-10-15&multiResFacFilterId=0&sessionDID=0&GzUserId=71340&zztusrlogtblid=321072&zztappprocessid=40696&rnd2=0.0715816×tamp=15102020125245.789063
>  HTTP/1.0 
> ?gizmoid=64438&weekday=5&aptdate=2020-10-15&multiResFacFilterId=0&sessionDID=0&GzUserId=71340&zztusrlogtblid=321072&zztappprocessid=40696&rnd2=0.0715816×tamp=15102020125245.789063
>  200
> 
> But WireShark shows what really happened. The server received the GET 
> request, and then it sent a FIN to terminate the connection. So if tomcat 
> sent an HTTP response, it did not make it out the Ethernet card.
> 
> Is this the weirdest thing or what? Ideas would sure be appreciated!

I am assuming there is a typo in your Java version and you are using Java 8.

That Tomcat version is over 3.5 years old (and Tomcat 7 is EOL in less
than 6 months). If you aren't already planning to upgrade (I'd suggest
to 9.0.x) then you might want to start thinking about it.

I have a few ideas about what might be going on but rather than fire out
random theories I have some questions that might help narrow things down.

1. If this request was successful, how big is the response?

2. If this request was successful, how long would it typically take to
complete?

3. Looking at the Wireshark trace for a failed request, how long after
the last byte of the request is sent by the client does Tomcat send the FIN?

4. Looking at the Wireshark trace for a failed request, is the request
fully sent (including terminating CRLF etc)?

5. Are there any proxies, firewalls etc between the user agent and Tomcat?

6. What timeouts are configured for the Connector?

7. Is this HTTP/1.1, HTTP/2, AJP, with or without TLS?

8. Where are you running Wireshark? User agent? Tomcat? Somewhere else?

Mark

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

Weirdest Tomcat Behavior Ever?

2020-10-15 Thread Eric Robinson

Has anyone ever seen a situation where tomcat occasionally fails to send 
responses but still logs them?

On a CentOS 7.5 server running in Azure with tomcat 7.0.76 with java 1.0.0_221, 
everything runs fine 99.99% of the time, but that last hundredth of a percent 
is a bitch. However, intermittently the server receives a request and then just 
terminates the connection without responding. But the localhost_access log 
shows that it DID respond.

Let me say that again.

The localhost_access log shows a request received and an HTTP 200 response 
sent, as follows...

10.51.14.133 [15/Oct/2020:12:52:45 -0400] 57 GET 
/app/code.jsp?gizmoid=64438&weekday=5&aptdate=2020-10-15&multiResFacFilterId=0&sessionDID=0&GzUserId=71340&zztusrlogtblid=321072&zztappprocessid=40696&rnd2=0.0715816×tamp=15102020125245.789063
 HTTP/1.0 
?gizmoid=64438&weekday=5&aptdate=2020-10-15&multiResFacFilterId=0&sessionDID=0&GzUserId=71340&zztusrlogtblid=321072&zztappprocessid=40696&rnd2=0.0715816×tamp=15102020125245.789063
 200

But WireShark shows what really happened. The server received the GET request, 
and then it sent a FIN to terminate the connection. So if tomcat sent an HTTP 
response, it did not make it out the Ethernet card.

Is this the weirdest thing or what? Ideas would sure be appreciated!

-Eric

Disclaimer : This email and any files transmitted with it are confidential and 
intended solely for intended recipients. If you are not the named addressee you 
should not disseminate, distribute, copy or alter this email. Any views or 
opinions presented in this email are solely those of the author and might not 
represent those of Physician Select Management. Warning: Although Physician 
Select Management has taken reasonable precautions to ensure no viruses are 
present in this email, the company cannot accept responsibility for any loss or 
damage arising from the use of this email or attachments.

84 matches

Mail list logo