I believe I have an explanation for what is going on that fits both the
reported behaviour and the proposed fix.

Background
==========

OpenSSL tracks a list of the most recent errors for each thread in a
hash map keyed on the thread (int_thread_hash in err.c). Reading and
writing to this hash map is protected by a lock. The hash map is created
and populated lazily.

tc-native calls ERR_clear_error() before every call to
SSL_do_handshake(), SSL_read() and SSL_write(). The call to
ERR_clear_error() either clears the error list for the current thread or
inserts a new empty list into the hash map of the thread is not already
present.

The performance problem was tracked down to threads waiting in
ERR_clear_error() to obtain the write lock for the hash map.

The proposed solution was to call ERR_remove_thread_state() just before
the current Tomcat thread processing the connection is returned to the
thread pool. This method removes the current thread and its associated
error list from the hash map.


Analysis
========

The proposed solution, calling ERR_remove_thread_state(), adds a call
that also obtains the write lock for the hash map. This indicates that
the problem is not delays in obtaining the lock but contention for the
lock because one or more operations taking place within the lock are
taking a long time.

Removing unused threads from the hash map removes the bottleneck. This
points towards the hash map being the source of the problem.

Testing by the OP showed that as soon as a test had been ran that
required ~ 400 concurrent threads performance dropped significantly. It
did not get noticeably worse if the same 400 thread test was run repeatedly.

My testing indicated, on OSX at least, that the thread IDs used in the
hash map were stable and that uncontrolled growth of the hash map was
unlikely to be the cause.

The manner in which thread IDs are generated varies by platform. On
Linux, where this problem was observed, the thread ID is derived from
(is normally equal to) the memory address of the per thread errno
variable. This means that thread IDs tend to be concentrated in a
relatively narrow range of values. For example, in a simple 10 thread
test on OSX thread IDs ranged from 123145344839680 to 123145354387455.

Thread IDs therefore fall with a 10^7 range within a possible range of
1.8x10^19. i.e. a very small, contiguous range.

Hash maps use hashing functions to ensure that entries are (roughly)
evenly distributed between the available buckets. The hash function,
err_state_hash, used for the thread IDs in OpenSSL is threadID * 13.

Supposition
===========

The hash function used (multiple by 13) is insufficient to distribute
the resulting values across multiple buckets because they will still
fall in a relatively narrow band. Therefore all the threads end up in a
single bucket which makes the performance of the hash map poor. This in
turn makes calls to thread_get_item() slow because it does a hash map
lookup. This lookup is performed with the read lock held for the hash
map which in turn will slow down the calls that require the write lock.

Proposal
========

The analysis and supposition above need to be checked by someone with a
better understanding of C than me. Assuming my work is correct, the next
step is to look at possible fixes. I do not believe that patching
OpenSSL is a viable option.

The OpenSSL API needs to be reviewed to see if there is a way to avoid
the calls that require the write lock.

If the write lock cannot be avoided then we need to see if there is a
better place to call ERR_remove_thread_state(). I'd like to fix this
entirely in tc-native but that may mean calling
ERR_remove_thread_state() more frequently which could create its own
performance problems.

Nate - I may have some patches for you to test in the next few days.

Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org

Reply via email to