Chris,

On 9/28/20 02:40, Christopher Schultz wrote:
> Carsten,
>
> On 9/27/20 05:53, Carsten Klein wrote:
>> Any comments on that? Is it worth preparing a PR?
>
> Regular expressions are fairly expensive.

Yes, but my measurements of the HashSet lookups were wrong: a String caches its hash code, so, while measuring in a loop, the hash value gets computed only once. Setting up a fair test is challenging (using new strings in the loop causes memory allocations and GCs). I ended up additionally calling a clone of String's hashCode() method in the loop.
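For reference, here is a minimal sketch of that pitfall (class name and values are mine, purely illustrative); since String caches its hash code in a private field, a naive loop pays for the hashing at most once:

import java.util.HashSet;
import java.util.Set;

public class HashLookupBench {
    public static void main(String[] args) {
        Set<String> allowed = new HashSet<>();
        allowed.add("https://example.org");

        // A distinct instance, so the first contains() must hash it.
        String origin = new String("https://example.org");
        boolean found = false;

        long start = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            // After the first iteration, origin.hashCode() just returns
            // the cached field; the loop measures only the bucket lookup.
            found |= allowed.contains(origin);
        }
        long nsPerOp = (System.nanoTime() - start) / 1_000_000;
        System.out.println(found + ", ~" + nsPerOp + " ns per lookup");
    }
}

(A naive System.nanoTime() loop has its own problems, of course; JMH would be the proper tool.)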

Now, testing an origin with HashSet.contains(origin) (the current solution) takes about 19 ns for a cache miss (empty list in the bucket) and 30 ns for a cache hit on my box.

In contrast, evaluating a regular expression takes about 120 ns; so it is about 6 times slower (NOT 25 times, as stated in my last mail).

Creating a new Matcher instance each time is a significant bottleneck (in my measurement loop): reusing the same Matcher instance and resetting it with a new input string makes the test take only about 75 ns (only about 4 times slower than the HashSet test).
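For reference, the difference boils down to this (pattern and origin are illustrative):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherReuse {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("https://([a-z0-9-]+\\.)*example\\.org");
        String origin = "https://www.example.org";

        // Per-test allocation: a new Matcher instance every time.
        boolean perTest = p.matcher(origin).matches();

        // Reuse: allocate once, reset with each new input string.
        // Caveat: Matcher is not thread-safe, so a shared instance must
        // not be used by concurrent request threads without confinement.
        Matcher m = p.matcher("");
        boolean reused = m.reset(origin).matches();

        System.out.println(perTest + " / " + reused);
    }
}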

So, regular expressions are not as bad as we thought?


> If there is a way to build the code such that some subset of wildcards
> can be serviced without regex (and of course exact matches without using
> regex), then I'm at least a +0 on this.

I never intended to test exact matches with regular expressions. Configured "exact" origins (those without wildcards and not enclosed between slashes) will still be tested with HashSet.contains, of course. So, there's no change (performance-wise) for Tomcat setups that only use "exactly" defined allowed origins.
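For illustration, such a dispatch could look roughly like this (all names are made up; this is not the actual filter code):

import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Sketch: exact origins keep the cheap HashSet path; the compiled
// "inexact" origins are only consulted when that lookup misses.
final class OriginChecker {
    private final Set<String> exactOrigins;      // origins without wildcards
    private final List<Pattern> originPatterns;  // wildcard/regex origins

    OriginChecker(Set<String> exactOrigins, List<Pattern> originPatterns) {
        this.exactOrigins = exactOrigins;
        this.originPatterns = originPatterns;
    }

    boolean isAllowed(String origin) {
        if (exactOrigins.contains(origin)) {     // unchanged fast path
            return true;
        }
        for (Pattern p : originPatterns) {       // slow path on a miss only
            if (p.matcher(origin).matches()) {
                return true;
            }
        }
        return false;
    }
}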

If someone uses the new "inexact" origin specifiers, won't it be understandable that these are more expensive (from a performance point of view)?

> It may seem like over-engineering, but maybe creating a Matcher
> interface and a Factory (I know, I know) which produces Matchers which
> implement the optimal strategy for a particular value would be a good
> way to do this.
>
> A single matcher could be used for really simple values and maybe a
> MultipleMatcher could be used for multiple different value-checks.
> Something like that will likely lead to the highest-performing filter
> processing.

I did some tests on that. Since I don't believe that a (self-made) NFA-based solution could outperform Java's regular expression engine, I was looking for a different (simpler) algorithm.

I started with a rule-driven globbing algorithm that supports

? (any char except "."),
# (any digit, using Character.isDigit),
$ (any letter, using Character.isAlphabetic)

as well as literal character sequences.

(Actually, I followed your suggestion. Your Matchers are my Rules, together with a piece of code in a switch-case block. Using real classes/instances for the matchers requires method invocations and maybe instanceof tests, both of which add extra overhead, so I decided on a more C-like approach.)
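To illustrate the idea (the Rule layout and all names are mine, not the actual code), a compiled pattern becomes an array of Rule beans that a single switch-based loop interprets:

public class RuleMatcher {

    static final class Rule {
        static final int LITERAL = 0, ANY_CHAR = 1, DIGIT = 2, LETTER = 3;
        final int type;
        final String text;   // only used by LITERAL rules
        Rule(int type, String text) { this.type = type; this.text = text; }
    }

    static boolean matches(Rule[] rules, String input) {
        int pos = 0;
        for (Rule rule : rules) {
            switch (rule.type) {
            case Rule.LITERAL:   // literal character sequence
                if (!input.startsWith(rule.text, pos)) return false;
                pos += rule.text.length();
                break;
            case Rule.ANY_CHAR:  // '?': any char except '.'
                if (pos >= input.length() || input.charAt(pos) == '.') return false;
                pos++;
                break;
            case Rule.DIGIT:     // '#': any digit
                if (pos >= input.length() || !Character.isDigit(input.charAt(pos))) return false;
                pos++;
                break;
            case Rule.LETTER:    // '$': any letter
                if (pos >= input.length() || !Character.isAlphabetic(input.charAt(pos))) return false;
                pos++;
                break;
            }
        }
        return pos == input.length();
    }

    public static void main(String[] args) {
        // "log#?.example.org": literal, one digit, one non-dot char, literal
        Rule[] rules = {
            new Rule(Rule.LITERAL, "log"),
            new Rule(Rule.DIGIT, null),
            new Rule(Rule.ANY_CHAR, null),
            new Rule(Rule.LITERAL, ".example.org")
        };
        System.out.println(matches(rules, "log1a.example.org"));  // true
    }
}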

That simple algorithm takes about 42 ns and so is still 2 times slower than the HashSet test. I'm already more than halfway towards supporting the * and ** multi-matching wildcards. That implementation uses non-recursive backtracking, similar to the algorithms described at https://www.codeproject.com/Articles/5163931/Fast-String-Matching-with-Wildcards-Globs-and-Giti

With * and ** partly in place, the time consumption is about 50 ns. The code additions needed to make the algorithm work for the many edge cases will very likely add more nanoseconds, so we may soon end up at 60 ns or even more. That's almost the time required for evaluating a regular expression (without the time needed to create the Matcher instance).
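For reference, the non-recursive backtracking from that article boils down to something like this; I've reduced it to literal characters plus a single '*' here (the '?', '#', '$' rules and '**' are omitted):

public class GlobMatch {

    static boolean matches(String pattern, String input) {
        int p = 0, i = 0;
        int starP = -1, starI = -1;           // backtracking bookmarks
        while (i < input.length()) {
            if (p < pattern.length() && pattern.charAt(p) == input.charAt(i)) {
                p++; i++;                     // literal character matched
            } else if (p < pattern.length() && pattern.charAt(p) == '*') {
                starP = p++;                  // remember the star, try the empty match first
                starI = i;
            } else if (starP >= 0) {
                p = starP + 1;                // backtrack: the star consumes one more char
                i = ++starI;
            } else {
                return false;
            }
        }
        while (p < pattern.length() && pattern.charAt(p) == '*') {
            p++;                              // trailing stars match the empty rest
        }
        return p == pattern.length();
    }

    public static void main(String[] args) {
        System.out.println(matches("https://*.example.org", "https://www.example.org"));  // true
    }
}

Note that this simplified '*' also crosses '.' boundaries; distinguishing * from ** is exactly where the edge cases (and the extra nanoseconds) come from.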

The algorithm is optimized and uses only a few method calls and no OOP constructs (with the Rules, which are like beans specifying what to match next, the whole logic can be implemented in a single method). But it's still not much faster than a Java regular expression test.

I don't believe it's worth (self-)implementing such a rather complex (error-prone) and hard-to-understand (and hard-to-maintain) algorithm if it's not significantly faster than real Java regular expressions.

Anyhow, if performance is not supposed to degrade when wildcards are used in the allowed origins, the goal is not just to beat Java regular expressions, but to get close to the HashSet test (~19 ns). And if real regular expressions (enclosed between slashes) are to be supported as well, none of that helps much for those.

That's why I wanted to combine this with a HashMap-based (LRU) cache, so that regular expressions must only be evaluated if the current request's origin is not yet in the cache. The cache's performance nearly equals that of the HashSet test (depending on whether real LRU is used or not). That way, the time required for testing a regular expression is no longer significant in the average case.
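For illustration, the JDK already provides the LRU part; a minimal sketch based on LinkedHashMap's access order (capacity and thread safety are left open here; LinkedHashMap is not thread-safe, so the filter would need to wrap it, e.g. with Collections.synchronizedMap):

import java.util.LinkedHashMap;
import java.util.Map;

// Maps an origin to its allowed-state and evicts the least-recently-used
// entry once maxEntries is exceeded.
final class OriginLruCache extends LinkedHashMap<String, Boolean> {
    private final int maxEntries;

    OriginLruCache(int maxEntries) {
        super(16, 0.75f, true);   // accessOrder = true yields LRU ordering
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
        return size() > maxEntries;
    }
}

A lookup then degenerates to cache.get(origin), and only a miss has to run the pattern tests (and store the result).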

Finally, we should not only compare the performance of the different matching algorithms; the absolute times should also be compared with the request's overall processing time. We're talking about intervals between ~20 and ~120 nano(!)seconds. I was using Tomcat's access log with the %D placeholder, which reports the request's overall processing time (before Tomcat 10, as an integer number of milliseconds). On my box, a request to a simple servlet echoing some request properties (no database or filesystem access) takes between 0 and 2 milliseconds. Using the average of 1 millisecond (that is, 1,000,000 nanoseconds), every regular expression test adds ~0.012% of extra time to the request.

In other words, we could easily test about 83 regular expressions per request before adding ~1% of extra processing time (and only a few setups will configure that many different allowed origin patterns).
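For reference, that's the standard AccessLogValve in server.xml (the rest of the pattern is just an example):

<!-- %D logs the request's overall processing time; before Tomcat 10
     it is reported as an integer number of milliseconds. -->
<Valve className="org.apache.catalina.valves.AccessLogValve"
       directory="logs" prefix="localhost_access_log" suffix=".txt"
       pattern="%h %l %u %t &quot;%r&quot; %s %b %D" />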

So, I still believe that support for specifying the CORS filter's allowed origins with wildcards (globbing) or with real regular expressions does not (and should not) depend on a new pattern matching algorithm. Java's regular expression engine is, in fact, not that slow (and the new algorithm is probably not significantly faster).

Compared with the request's overall processing time, even with Java's slow but proven regular expression engine, there's enough headroom for testing a few dozen regular expressions without degrading performance by more than one percent.

With a cache, the number of regular expression tests can be reduced even further, so that the allowed-state of the presented origin can be determined with only two HashMap lookups for most requests.

Carsten
