Manybubbles has uploaded a new change for review. https://gerrit.wikimedia.org/r/226122

Change subject: WIP: Speed up the regex's recheck phase
......................................................................

WIP: Speed up the regex's recheck phase

Here are timings for 1000 iterations of rechecking the Barack Obama and
Rashidun Caliphate articles from English Wikipedia:

  slow case insensitive took 10474 millis to match /\[\[Category:/
  non backtracking case insensitive took 4876 millis to match /\[\[Category:/
  case converting case insensitive took 4581 millis to match /\[\[Category:/
  slow case insensitive took 7692 millis to match /cat/
  non backtracking case insensitive took 2477 millis to match /cat/
  case converting case insensitive took 107 millis to match /cat/
  slow case sensitive took 7399 millis to match /\[\[Category:/
  non backtracking case sensitive took 2230 millis to match /\[\[Category:/
  slow case sensitive took 5370 millis to match /cat/
  non backtracking case sensitive took 52 millis to match /cat/

These numbers mean:
1. For case sensitive queries the recheck phase gets somewhere between
about 3 times (7399 to 2230 millis) and about 100 times (5370 to 52
millis) faster. ~2 seconds for 2000 rechecks is still not great but it's
better. 52 milliseconds for the same number of rechecks is zippy.
2. For case insensitive queries the recheck phase gets faster as well, but
the depth of the match from the front of the document causes a more
pronounced slowdown. In the worst case the recheck is still twice as fast
as it once was.

WIP because it now needs testing in Irish, Turkish, and Greek.
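
For illustration only, here is a minimal sketch of the general "case converting" idea: lowercase the source once, then run a case sensitive search. This is not the code in this change, which works on Lucene automata; the class name and the containsIgnoreCase helper are hypothetical, and plain String.contains stands in for the real matcher. It also shows one plausible reason, an assumption on my part, why Turkish needs separate testing: lowercasing is locale sensitive.

```java
import java.util.Locale;

// Hypothetical sketch, not the SourceRegexFilter implementation.
final class CaseConvertingRecheckSketch {

    // Lowercase both sides once, then do a plain case sensitive substring search.
    static boolean containsIgnoreCase(String source, String literal, Locale locale) {
        return source.toLowerCase(locale).contains(literal.toLowerCase(locale));
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");

        // Locale-neutral conversion: "[[CATEGORY:" lowercases to "[[category:".
        System.out.println(
            containsIgnoreCase("[[CATEGORY:PRESIDENTS]]", "[[Category:", Locale.ROOT)); // prints true

        // In the Turkish locale 'I' lowercases to dotless 'ı' (U+0131),
        // so "TITLE" becomes "tıtle" and no longer contains "title".
        System.out.println(containsIgnoreCase("TITLE", "title", Locale.ROOT)); // prints true
        System.out.println(containsIgnoreCase("TITLE", "title", turkish));     // prints false
    }
}
```

Whichever locale is used, the same one has to be applied to both the document and the pattern, otherwise the two sides can disagree on characters like the Turkish dotted and dotless i.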

Change-Id: I0e14d0436b23425bcdbab65e85a5e6d775cf8e5d
---
M pom.xml
A src/main/java/org/apache/lucene/util/automaton/ContainsCharacterRunAutomaton.java
M src/main/java/org/wikimedia/search/extra/regex/SourceRegexFilter.java
A src/test/java/org/wikimedia/search/extra/regex/SourceRegexFilterRecheckTest.java
M src/test/java/org/wikimedia/search/extra/regex/SourceRegexFilterTest.java
A src/test/resources/Barack Obama.txt
A src/test/resources/README
A src/test/resources/Rashidun Caliphate.txt
8 files changed, 1,706 insertions(+), 20 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/search/extra refs/changes/22/226122/1

--
To view, visit https://gerrit.wikimedia.org/r/226122
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I0e14d0436b23425bcdbab65e85a5e6d775cf8e5d
Gerrit-PatchSet: 1
Gerrit-Project: search/extra
Gerrit-Branch: master
Gerrit-Owner: Manybubbles <never...@wikimedia.org>

_______________________________________________
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits