Hi,

recently I ran into serious (as in SERIOUS) performance trouble with 
expensive regexes, and no wonder why.

Here is my scenario:

* XML text documents with a total of 1 million text nodes
* a regex search string built from a huge dictionary of 50,000 strings 
(some of them 50 words long)
* each match must be wrapped in an element (= create a link)

I could drink many cups of tea while waiting for the regex to complete... 
hours... I ran out of tea!

Then I found Mary Holstege's 2021 Balisage paper "Fast Bulk String 
Matching" [1], in which she explores the Aho-Corasick algorithm and 
implements it in XQuery - marvellous! Following her approach, I can build a 
data structure that answers lookups fast, but building that structure is 
itself very slow because of the sheer amount of text it has to be built 
from. So this was not fast enough for my use case - or I may simply not be 
smart enough to apply it correctly :-|
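
For reference, here is roughly the kind of structure I mean - not 
Holstege's actual code, just my own minimal sketch of the trie-building 
step with nested maps (failure links omitted):

(: toy sketch, my own simplification - NOT Holstege's implementation:
   build only the goto trie as nested maps; failure links are left out.
   Each insertion walks one character at a time, so the build grows with
   the total number of characters in the dictionary. :)
declare function local:insert(
  $trie as map(*),
  $chars as xs:string*
) as map(*) {
  if (empty($chars))
  then map:put($trie, '', true()) (: empty key marks end of a pattern :)
  else
    let $c := head($chars)
    let $child := (map:get($trie, $c), map {})[1]
    return map:put($trie, $c, local:insert($child, tail($chars)))
};

let $patterns := ('he', 'she', 'his', 'hers') (: made-up mini dictionary :)
return fold-left($patterns, map {},
  function($trie, $pattern) {
    local:insert($trie,
      for $cp in string-to-codepoints($pattern)
      return codepoints-to-string($cp))
  })

Even this toy version has to touch every character of every dictionary 
entry once, which is presumably where my build time went with 50,000 long 
strings.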

So I tried tinkering with maps, which turned out to give me extraordinary 
performance gains:

* build a map from the string dictionary
* walk through all text nodes one by one
* for each text node, put every combination of its words in word order (I 
need to find strings, not words) into another map - see the tiny 
illustration after this list
* strip punctuation (for word boundaries) and normalize whitespace in both 
maps
* compare the keys of both maps
* give the new reduced string dictionary to the regular regex search
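
A tiny illustration of that word-combination step, using a made-up 
three-word text node - these are exactly the strings that become keys in 
the second map:

(: every contiguous word sequence of "the quick fox", in word order :)
let $tokens := tokenize('the quick fox', ' ')
for $start in 1 to count($tokens)
for $length in 1 to count($tokens) - $start + 1
return string-join(subsequence($tokens, $start, $length), ' ')
(: "the", "the quick", "the quick fox", "quick", "quick fox", "fox" :)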

Comparing the maps does not tell me where in the text my strings are, but 
at least I know whether they are in there at all - to find out where 
exactly (and how they fit the rest of my regex context) I can then run a 
massively reduced regular regex search. Fast!
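
As a side note, the final reduction step in [3] could equally be written 
with distinct-values(); a minimal self-contained sketch with made-up data:

(: deduplicate the surviving strings, join to a small regex, filter :)
let $matches := ('foo bar', 'baz', 'foo bar')
let $textnodes := (text { 'some foo bar here' }, text { 'nothing' })
let $regex := string-join(distinct-values($matches), '|')
return $textnodes[matches(., $regex)]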

I am quite happy BUT I still cannot understand why this is so much faster for 
my sample data:

* plain method: 51323.72 ms
* maps method: 597.94 ms (!)

Is this due to optimizations done by BaseX, or have I simply discovered the 
power of maps?
How would you do it? Is there still room for improvement?
Why does Aho-Corasick not help much in this scenario? Is it because the 
list of strings is simply too massive?
Why is the text-splitting-to-maps approach so much faster?

See the query examples below [2],[3] to better understand what I am trying 
to do (bulk data not included).
The examples do not normalize punctuation; that is only needed for 
completeness.

Best, Daniel

[1] 
http://www.balisage.net/Proceedings/vol26/print/Holstege01/BalisageVol26-Holstege01.html

[2] plain method

let $textnodes :=
  fetch:xml(file:base-dir() || file:dir-separator() || 'example.xml')//text()
let $strings :=
  file:read-text(file:base-dir() || file:dir-separator() || 'strings.txt')
(: strings.txt already is one big '|'-separated alternation,
   so it can serve as the regex directly :)
let $regex := $strings
for $textnode in $textnodes
where matches($textnode, $regex)
return $textnode

[3] maps method

(:~
 : Create a map from the dictionary string.
 : One entry per '|'-separated dictionary entry.
 :
 : @param $string the dictionary as one '|'-separated string
 : @return map of dictionary strings
 :)
declare function local:createMapFromString(
  $string as xs:string
) as map(*) {
  map:merge(
    for $token in tokenize($string, '\|')
    return map:entry($token, $token),
    map { 'duplicates': 'use-first' }
  )
};
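
(: hypothetical mini example of the function above:
   local:createMapFromString('foo|bar')
   returns map { 'foo': 'foo', 'bar': 'bar' } :)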

(:~
 : Create a map from the text nodes.
 : Every combination of words in word order (i.e. every contiguous
 : token subsequence) of each text node becomes a key.
 :
 : @param $textnodes the text node values
 : @return map of word combinations
 :)
declare function local:createMapFromTextnodes(
  $textnodes as xs:string+
) as map(*) {
  map:merge(
    for $node in $textnodes
    let $tokens := tokenize(normalize-space($node), ' ')
    for $start in 1 to count($tokens)
    for $length in 1 to count($tokens) - $start + 1
    return map:entry(
      string-join(subsequence($tokens, $start, $length), ' '), 'x')
  )
};
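
(: hypothetical mini example of the function above:
   local:createMapFromTextnodes('a b')
   returns map { 'a': 'x', 'a b': 'x', 'b': 'x' } :)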

(:~
 : Compare two maps: return the values of $map1 whose keys
 : also occur in $map2.
 :
 : @param $map1 dictionary map
 : @param $map2 text node map
 : @return the dictionary strings that occur in the text
 :)
declare function local:reduceMaps(
  $map1 as map(*),
  $map2 as map(*)
) as xs:string* {
  for $key in map:keys($map1)
  where map:contains($map2, $key)
  return map:get($map1, $key)
};
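
(: hypothetical mini example of the function above:
   local:reduceMaps(map { 'a b': 'a b', 'c': 'c' }, map { 'a b': 'x' })
   returns 'a b' :)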

let $textnodes :=
  fetch:xml(file:base-dir() || file:dir-separator() || 'example.xml')//text()
let $strings :=
  file:read-text(file:base-dir() || file:dir-separator() || 'strings.txt')

let $map_words := local:createMapFromString($strings)
let $map_textnodes := local:createMapFromTextnodes($textnodes)
let $matches := local:reduceMaps($map_words, $map_textnodes)

(: deduplicate the surviving strings and join them to a much smaller regex :)
let $regex := string-join(
  for $match in $matches
  group by $match
  return $match, '|')

for $textnode in $textnodes
where matches($textnode, $regex)
return $textnode

