> From: Christopher Schultz <ch...@christopherschultz.net>
> Sent: Tuesday, November 3, 2020 9:26 AM
> To: users@tomcat.apache.org
> Subject: Re: Weirdest Tomcat Behavior Ever?
>
> Eric,
>
> On 11/3/20 10:05, Eric Robinson wrote:
> >> -----Original Message-----
> >> From: Eric Robinson <eric.robin...@psmnv.com>
> >> Sent: Tuesday, November 3, 2020 8:21 AM
> >> To: Tomcat Users List <users@tomcat.apache.org>
> >> Subject: RE: Weirdest Tomcat Behavior Ever?
> >>
> >>> From: Mark Thomas <ma...@apache.org>
> >>> Sent: Tuesday, November 3, 2020 2:06 AM
> >>> To: Tomcat Users List <users@tomcat.apache.org>
> >>> Subject: Re: Weirdest Tomcat Behavior Ever?
> >>>
> >>> On 02/11/2020 12:16, Eric Robinson wrote:
> >>>
> >>> <snip/>
> >>>
> >>>> Gotcha, thanks for the clarification. Let's see what happens when
> >>>> the users
> >>> start hitting it at 8:00 am Eastern.
> >>>
> >>> Progress. The first attempt to write to the socket triggers the
> >>> following
> >>> exception:
> >>>
> >>> 02-Nov-2020 14:33:54.083 FINE [http-bio-3016-exec-13]
> >>> org.apache.tomcat.util.net.JIoEndpoint$DebugOutputStream.write [301361476]
> >>>   java.net.SocketException: Bad file descriptor (Write failed)
> >>>      at java.net.SocketOutputStream.socketWrite0(Native Method)
> >>>      at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
> >>>      at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
> >>>      at org.apache.tomcat.util.net.JIoEndpoint$DebugOutputStream.write(JIoEndpoint.java:1409)
> >>>      at org.apache.coyote.http11.InternalOutputBuffer.realWriteBytes(InternalOutputBuffer.java:247)
> >>>      at org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:480)
> >>>      at org.apache.coyote.http11.InternalOutputBuffer.endRequest(InternalOutputBuffer.java:183)
> >>>      at org.apache.coyote.http11.AbstractHttp11Processor.action(AbstractHttp11Processor.java:761)
> >>>      at org.apache.coyote.Response.action(Response.java:174)
> >>>      at org.apache.coyote.Response.finish(Response.java:274)
> >>>      at org.apache.catalina.connector.OutputBuffer.close(OutputBuffer.java:322)
> >>>      at org.apache.catalina.connector.Response.finishResponse(Response.java:537)
> >>>      at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:480)
> >>>      at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1083)
> >>>      at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:640)
> >>>      at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:321)
> >>>      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >>>      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >>>      at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
> >>>      at java.lang.Thread.run(Thread.java:748)
> >>>
> >>> Because this is an instance of an IOException, Tomcat assumes it has
> >>> been caused by the client dropping the connection and silently
> >>> swallows it. I'll be changing that later today so the exception is
> >>> logged at DEBUG level for all new Tomcat versions.
> >>>
> >>> Possible causes of "java.net.SocketException: Bad file descriptor"
> >>> I've been able to find are:
> >>>
> >>> 1. OS running out of file descriptors.
> >>>
> >>> 2. Trying to use a closed socket.
> >>>
> >>> I want to review the source code to see if there are any others.
> >>>
> >>> I don't think we are seeing 2 as there is no indication of the
> >>> Socket, InputStream or OutputStream being closed in the logs.
> >>>
> >>> That leaves 1. Possible causes here are a file descriptor leak or
> >>> normal operations occasionally needing more than the current limit.
> >>> I don't think it is a leak as I'd expect to see many more errors of
> >>> this type after the first and we aren't seeing that. That leaves the
> >>> possibility of the current limit being a little too low.
> >>>
> >>> My recommendation at this point is to increase the limit for file
> >>> descriptors.
> >>> Meanwhile, I'll look at the JRE source to see if there are any other
> >>> possible triggers for this exception.
> >>>
> >>> Mark
> >>>
> >>>
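
A quick way to act on that recommendation, assuming the instances run on Linux
(8730 below is only a placeholder for the affected JVM's PID, and 65536 an
arbitrary example value, not a specific recommendation): first check what limit
the JVM actually has, then raise it through whatever mechanism starts Tomcat.

PID=8730                                   # placeholder PID of one Tomcat JVM
grep 'Max open files' /proc/$PID/limits    # soft and hard limits in effect

# If the instances start from a PAM login session, entries along these lines in
# /etc/security/limits.conf raise the limit for new sessions:
#   tomcat  soft  nofile  65536
#   tomcat  hard  nofile  65536
# If they are systemd services, LimitNOFILE=65536 in the unit's [Service]
# section does the same for that service.
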
> >>
> >> On the tomcat server, max open file descriptors is currently 2853957
> >>
> >> [root@001app01a ~]# sysctl fs.file-max
> >> fs.file-max = 2853957
> >>
> >> Most of the time, the number of open files appears to run about 600,000.
> >>
> >>   What do you think of watching the open file count and seeing if the
> >> number gets up around the ceiling when the socket write failure
> >> occurs? Something like...
> >>
> >> [root@001app01a ~]# while [ TRUE ];do FILES=$(lsof|wc -l);echo "$(date +%H:%M:%S) $FILES";done
> >> 09:11:15 591671
> >> 09:11:35 627347
> >> 09:11:54 626851
> >> 09:12:11 626429
> >> 09:12:26 545748
> >> 09:12:42 548578
> >> 09:12:58 551487
> >> 09:13:14 516700
> >> 09:13:30 513312
> >> 09:13:45 512830
> >> 09:14:02 533338
> >> 09:14:18 568233
> >> 09:14:35 570158
> >> 09:14:51 566269
> >> 09:15:07 547389
> >> 09:15:23 544203
> >> 09:15:38 546395
> >>
> >> It's not ideal, as it seems to take 15-20 seconds to count them using lsof.
> >>
> >>
> >>
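
As a side note on the sampling speed: on Linux, /proc/sys/fs/file-nr gives the
system-wide count of allocated file handles instantly, which avoids the
15-20 second lsof run (the absolute numbers won't match lsof, which also lists
memory maps and the like, but the trend against fs.file-max is what matters).
A rough sketch, with file-nr.txt as an arbitrary output name:

while true; do
    # fields: allocated handles, free handles, fs.file-max ceiling
    echo "$(date +%H:%M:%S) $(cat /proc/sys/fs/file-nr)"
    sleep 5
done | tee file-nr.txt
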
> >
> > Wait, never mind. I realized the per-process limits are what matters.
> > I checked, and nofile was set to 4096 for the relevant java process.
> > :)
>
> Your <Connector> configuration is really simple:
>
> >     <Connector port="3016" protocol="HTTP/1.1"
> >                connectionTimeout="20000"
> >                redirectPort="8443" />
>
> How many of these do you have per server.xml? The default
> maxConnections for BIO is 8192, which was double your previous ulimit. If you
> have 10 of them, then you could potentially have 20 times as many incoming
> connections as you have file descriptors available.
>

Hi Chris --

Just 1 per server.xml.
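
To see how close that single connector actually gets to the BIO default
maxConnections of 8192, a rough one-liner (port 3016 taken from the quoted
<Connector>, so adjust per instance) would be something like:

netstat -ant | awk '$4 ~ /:3016$/ && $6 == "ESTABLISHED"' | wc -l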

> Have you noticed whether or not a large number of (simultaneous)
> connections are required to trigger this behavior? Mark said "[t]he first
> attempt to write to the socket triggers the following exception."
> I'm not sure if that was the *very first* write, or just the first write we
> care about.
>

I have not seen the connection count get especially high. There are 17 
instances of tomcat on the server, and the number of connections per tomcat is 
relatively low. My count_connections script does a netstat -ant against each 
running java process and typically reports something like...

[root@001app01a conf]# count_connections
site001: 275
site002: 1379
site003: 532
site004: 144
site005: 50
site006: 662
site007: 155
site008: 1074
site009: 672
site010: 277
site011: 1096
site012: 1476
site014: 547
site015: 1884
site016: 483     <--The site in question.
site017: 8
site018: 985
site019: 123
site622: 126
site666: 8
site999: 12

I have other servers with as many as 100 instances of tomcat running the same 
canned application.
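
The count_connections script isn't reproduced here, but a script in that
spirit could be put together along these lines. It is only a sketch, not the
real thing: it assumes the instances are launched via the usual
org.apache.catalina.startup.Bootstrap main class and reports PIDs rather than
site names.

#!/bin/bash
# Count ESTABLISHED connections per running Tomcat JVM (run as root so that
# netstat -p can attribute sockets to processes).
for pid in $(pgrep -f org.apache.catalina.startup.Bootstrap); do
    count=$(netstat -antp 2>/dev/null | awk -v p="$pid/java" '$7 == p && $6 == "ESTABLISHED"' | wc -l)
    echo "pid $pid: $count"
done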

> > I did...
> >
> > # prlimit --pid 8730 --nofile=16384:16384
> >
> > That should give java some extra breathing room if the issue is max
> > open files, right?
> That will heavily depend upon your <Connector> configurations.
>

Just 1 per server.xml.
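
Since the limit was raised on the live process with prlimit, it's probably
worth double-checking that the new values actually took effect (and keeping in
mind they won't survive a restart of that JVM unless the startup configuration
is changed too):

prlimit --pid 8730 --nofile
grep 'Max open files' /proc/8730/limits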

> You might want to run your "lsof" monitor on that one single JVM process
> while you are instrumenting things. Perhaps we'll see that these errors are
> triggered as soon as you hit e.g. 4096 file descriptors.
>

I'm glad we are thinking along the same lines. I did that this morning and 
found that the number of open files does not appear to spike when the error 
occurs. I ran the command...

# while [ TRUE ];do FILES=$(lsof -p 8730|wc -l);echo "$(date +%H:%M:%S) $FILES";done|tee open_files.txt

...where 8730 was the java pid.

An error was logged at 10:42:19 am. I checked open_files.txt, and this is what 
it showed at that same time.

10:42:17 671
10:42:17 670
10:42:17 672
10:42:18 670
10:42:18 670
10:42:18 661
10:42:18 661
10:42:19 661
10:42:19 661
10:42:19 661
10:42:19 661
10:42:20 661
10:42:20 661
10:42:20 663
10:42:20 662
10:42:21 662
10:42:21 660
10:42:21 665

If anything, the number of open files went down slightly, rather than spiking.
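
One caveat on the sampling method: lsof -p does a fair amount of work per pass,
and its count includes memory-mapped files and other non-fd entries, so a
lighter-weight alternative is to count the entries under /proc/<pid>/fd
directly. A sketch (8730 again being the java pid, open_fds.txt an arbitrary
file name):

while true; do
    echo "$(date +%H:%M:%S) $(ls /proc/8730/fd | wc -l)"
    sleep 1
done | tee open_fds.txt

The absolute numbers will differ slightly from the lsof output above, but the
per-second resolution makes a short spike around the moment of the error much
harder to miss.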

> You might actually want to *restore* the ulimit and then reproduce, looking
> at the file-descriptor count.
>
> You said that this behavior was "new" and you've been running fine for
> 15 years or so. Is there any reason to suspect that recently you have gotten
> significantly more traffic than in the past? Maybe you've been getting close
> to the edge for 15 years and you have finally spilled over it.
>

The environment in question has only been up since October 2018, and is 
separate from our main hosting environment where we have been running lots of 
tomcat instances for the past 15 years. The symptom first presented in August. 
I've wracked my brain and been unable to come up with any changes to the system 
or user behavior that seem like possible culprits.

--Eric

> -chris
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
> For additional commands, e-mail: users-h...@tomcat.apache.org

