Manybubbles has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/226122

Change subject: WIP: Speed up the regex's recheck phase
......................................................................

WIP: Speed up the regex's recheck phase

Here are timings for 1000 iterations of rechecking the Barack Obama and
Rashidun Caliphate articles from English Wikipedia:

            slow case insensitive took 10474 millis to match /\[\[Category:/
non backtracking case insensitive took 4876 millis to match /\[\[Category:/
 case converting case insensitive took 4581 millis to match /\[\[Category:/
            slow case insensitive took 7692 millis to match /cat/
non backtracking case insensitive took 2477 millis to match /cat/
 case converting case insensitive took 107 millis to match /cat/
              slow case sensitive took 7399 millis to match /\[\[Category:/
  non backtracking case sensitive took 2230 millis to match /\[\[Category:/
              slow case sensitive took 5370 millis to match /cat/
  non backtracking case sensitive took 52 millis to match /cat/

These numbers mean:
1. For case sensitive queries the recheck phase gets between about 3 times
and about 100 times faster. ~2 seconds for 2000 rechecks is still not great
but it's better. 52 milliseconds for the same number of rechecks is zippy.
2. For case insensitive queries the recheck phase gets faster as well, but
the further the match is from the front of the document the more pronounced
the slowdown. Even in the worst case the recheck is still about twice as
fast as it was.
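
To make the "case converting" idea concrete, here is a minimal sketch of one
plausible reading of it: lower-case the pattern and the source once, then run
a plain case-sensitive match, instead of asking the matcher to fold case on
every comparison. This is an assumption for illustration only. It uses
java.util.regex rather than the Lucene automata this change works with, and
the class and method names (CaseConvertingSketch, insensitive, caseConverting,
timeMillis) are hypothetical.

```java
import java.util.Locale;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical illustration; not the code in this change.
class CaseConvertingSketch {
    /** Lets the regex engine fold case on every character comparison. */
    static Predicate<String> insensitive(String regex) {
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        return source -> p.matcher(source).find();
    }

    /**
     * Lower-cases the pattern once and each source once, then matches case
     * sensitively. Naively lower-casing a regex is only safe for patterns
     * like the two in the timings above; it would turn \S into \s.
     */
    static Predicate<String> caseConverting(String regex) {
        Pattern p = Pattern.compile(regex.toLowerCase(Locale.ROOT));
        return source -> p.matcher(source.toLowerCase(Locale.ROOT)).find();
    }

    /** Runs the check the given number of times and returns elapsed milliseconds. */
    static long timeMillis(int iterations, Predicate<String> check, String source) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            check.test(source);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Synthetic stand-in for an article: the match sits at the very end.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20_000; i++) {
            sb.append("Lorem ipsum dolor sit amet. ");
        }
        String source = sb.append("[[Category:Example]]").toString();
        String regex = "\\[\\[Category:";
        System.out.println("insensitive took "
                + timeMillis(1000, insensitive(regex), source) + " millis");
        System.out.println("case converting took "
                + timeMillis(1000, caseConverting(regex), source) + " millis");
    }
}
```

Timings from this sketch are not comparable to the numbers above; it only
shows the shape of the two strategies.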

WIP because it now needs testing in Irish, Turkish, and Greek.
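
A short sketch of why these languages are worth testing once matching depends
on case conversion. It uses only the JDK's String and Character methods and
is not code from this change; the class name LocaleCaseSketch and its helpers
are hypothetical. Turkish has dotted and dotless i, and Greek has a
context-sensitive final sigma. Irish case rules are not covered here.

```java
import java.util.Locale;

// Hypothetical illustration of locale-sensitive lower-casing in the JDK.
class LocaleCaseSketch {
    private static final Locale TURKISH = Locale.forLanguageTag("tr");

    /** Lower-cases using Turkish rules: 'I' becomes dotless U+0131. */
    static String lowerTurkish(String s) {
        return s.toLowerCase(TURKISH);
    }

    /** Lower-cases using locale-neutral rules: 'I' becomes 'i'. */
    static String lowerRoot(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    /** Lower-cases one char at a time, with no locale or context rules. */
    static String lowerPerChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toLowerCase(s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same input, different result depending on the locale used.
        System.out.println(lowerTurkish("TITLE").equals(lowerRoot("TITLE")));
        // U+0130 (capital dotted I) grows to two chars under root rules.
        System.out.println(lowerRoot("\u0130").length());
        // String-level conversion applies the final-sigma rule; the
        // char-by-char conversion cannot, so the results may differ.
        String greek = "\u039F\u0394\u039F\u03A3";
        System.out.println(lowerRoot(greek).equals(lowerPerChar(greek)));
    }
}
```

Whichever conversion the recheck uses, the pattern and the source need to go
through the same one, otherwise a match can be missed.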

Change-Id: I0e14d0436b23425bcdbab65e85a5e6d775cf8e5d
---
M pom.xml
A src/main/java/org/apache/lucene/util/automaton/ContainsCharacterRunAutomaton.java
M src/main/java/org/wikimedia/search/extra/regex/SourceRegexFilter.java
A src/test/java/org/wikimedia/search/extra/regex/SourceRegexFilterRecheckTest.java
M src/test/java/org/wikimedia/search/extra/regex/SourceRegexFilterTest.java
A src/test/resources/Barack Obama.txt
A src/test/resources/README
A src/test/resources/Rashidun Caliphate.txt
8 files changed, 1,706 insertions(+), 20 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/search/extra refs/changes/22/226122/1


-- 
To view, visit https://gerrit.wikimedia.org/r/226122
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I0e14d0436b23425bcdbab65e85a5e6d775cf8e5d
Gerrit-PatchSet: 1
Gerrit-Project: search/extra
Gerrit-Branch: master
Gerrit-Owner: Manybubbles <never...@wikimedia.org>

_______________________________________________
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
