GitHub user chenlica edited a discussion: Aho Corasick String Matching Algorithm

This discussion saved the content of the old wiki page 
https://github.com/apache/texera/wiki/Aho-Corasick-String-Matching-Algorithm 
(may be dangling).

==========

Aho-Corasick Algorithm is a dictionary-matching algorithm to output all matched 
dictionary entries with a single pass on the input text.     

Wiki Page: https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm    

The linear time complexity can help to speed up the process of dictionary 
matcher on substring matching type.    

Given all the dictionary entries as the input, Dictionary Matcher firstly 
preprocess them to construct an automaton once and save for later data stream 
to match.   
The automaton starts with a prefix trie of all the words to be matched. 
![](http://www.geeksforgeeks.org/wp-content/uploads/aho-corasick1.png)    

Then three main functions including success transactions, failure transactions 
and output matching words for each trie node will be set up using bread first 
search traversal on the trie:    
The success transactions follows the edges in the trie to find the children of 
current trie node.    
The failure transactions set up links between failed string matches and the 
nodes on other branches which share the longest common suffix.     
The output list stores all the words ending at current node and its failure 
node.
![](http://www.geeksforgeeks.org/wp-content/uploads/aho-corasick2.png)        

When executing on the input text, the algorithm traverse the graph by firstly 
following the success transactions to the child node, if it doesn't exist, then 
following the failure transactions to its proper suffix node.        
 
When the algorithm reaches a node with a non-empty output keyword list, it 
outputs all the matched dictionary entires that ends at current character 
position in the input text.


GitHub link: https://github.com/apache/texera/discussions/3951

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: 
[email protected]

Reply via email to