Re: Recent tcnative null-dereference with 8.0.0-RC3 and 7.0.45 [tcnative-1.dll+0x7e23]

Mark Thomas Thu, 03 Oct 2013 08:42:51 -0700

On 03/10/2013 16:12, Christopher Schultz wrote:
> On 10/3/13 9:18 AM, Mark Thomas wrote:


>> Having been responsible for introducing and fixing a number of
>> these issues I disagree. It is usually fairly easy to tell what
>> the problem is from the crash report (double close on a socket,
>> invalid socket in the poller, etc). What is far more difficult is
>> figuring out how things got into the known invalid state in the
>> first place. No amount of debug information at the point of the
>> crash is going to help with that.
> 
> Agreed. Given that you have been researching these issues (and
> fixing them -- thanks!) you have a unique perspective: even though
> it's easy for you to determine the cause of a crash when you can
> reproduce it, how about getting a crash report with little other
> context? Would it help you to have more information or is the crash
> report sufficient?

If I can't reproduce it then the fall-back is manual code review to
try and figure out a code path (perhaps with some hints from the OP's
report) that would result in the observed issue. As long as there is
enough info in the crash report (there typically is) to figure out
what is null / invalid then all is good. I haven't yet felt the need
for further context.

>> A hard to reproduce bug in APR that triggers a crash is no more
>> or less difficult to work with than a hard to reproduce bug in
>> NIO that triggers an unexpected socket close.
> 
> In those two cases -- NIO and APR -- is the NIO case any less 
> catastrophic?

Sometimes yes, sometimes no.

> I've been arguing that we should stop the JVM from crashing, but
> locking-up the NIO connector and rendering the server inert is
> roughly the same outcome. Is there actually any utility in stopping
> the JVM from coming down? I guess you could get a thread dump from
> an otherwise hung Tomcat, but probably not much else.

With a lock-up of infinite loop you can find out where you are (as
with APR) but the how you got there is what you really want.

For an error that would otherwise result in a socket closure then a
JVM crash is bad. On the other hand, a JVM crash is a very strong
motivation to fix an issue.

>> You may have noticed that I have slowly been adding debug code to
>> the low level connector code, primarily in the Endpoints and the
>> Protocol implementations. All of this debug code has proven
>> useful in tracking down the type of bug that triggers a crash
>> with APR.
>> 
>> Additional validity checks in the native code provide for a more 
>> graceful failure mode but offer little other benefit as the
>> useful information is more focused on how the current state was
>> achieved rather than what the current state is.
> 
> Agreed. Remember, I was just trying to stop crashes. We have a load
> of crash reports in various versions of tcnative and if the
> problem actually isn't in tcnative, it would be nice to get those
> reported against the components that actually have bugs.

Most crashes in APR/native are going to be bugs in AprEndpoint or less
likely in the APR processor and protocol.

>> I'm -0 on adding additional checks to the native code.
> 
> Noted.
> 
>> I can think of several things that would be more useful: - Better
>> Javadoc for the native methods. I can think of a number of times
>> where better docs would have saved me a fair amount of time 
>> debugging unexpected behaviour.
> 
> Do you mean more documentation about how the method works, or even
> just a simple description of what happens *at all*?

Some examples might help:
- documenting the return codes for the non-blocking reads would have
highlighted the problem that required the 1.1.28 release
- documenting the return codes for removing sockets from the poller
would have highlighted the problem that is going to require the 1.1.29
release
- documenting some of the constraints around using SSL would have
saved me some time when getting SSL and WebSocket working

As I have worked more with APR/native I have discovered various things
that it would have been a great help to have in the docs. I've tried
to remember to document them when I have found them but I've probably
missed some.

>> - Something to turn an APR error response code into meaningful
>> test.
> 
> Can you explain this in a little more detail? For example, an APR
> error code might be "bad socket", but as you say, the circumstances
> of the bug are more important than the error code. How could such a
> code be turned into a test case?

In the last few days I have come across the following error codes:
-70014
-730053

I can look them up to figure what they mean but it would save some
time if the error report included a text version.

>> - Refactoring the connectors so all socket access goes through
>> the SocketWrapper so there is a much smaller set of code to
>> validate.
> 
> I'm guessing you are tackling this task slowly over time.

I am moving slowly in this direction. My ultimate aim is to have the
connector type specific code only in the Endpoint and the
SocketWrapper. No idea if that is possible. It is a longer term goal
for Tomcat 9+ at this point.

At the very least whenever I add functionality to the connectors (e.g.
non-blocking IO) I do enough refactoring that I only have to add the
new stuff once.

Mark


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org

Re: Recent tcnative null-dereference with 8.0.0-RC3 and 7.0.45 [tcnative-1.dll+0x7e23]

Reply via email to