Re: strutils find

2019-03-25 Thread adrianv
@dataman thanks that was a really helpful link.

@araq I tested my simple hash algorithm against brute forth and BM with that 
smart tool. It always wins against brute forth. Against BM it wins always for 
pattern lengths under 16. Given that the BM has always an overhead for the 
first search I would use a heuristic with pattern under 16 or search text under 
100 char.


Re: strutils find

2019-03-21 Thread dataman
There is awesome [String Matching Algorithms Research 
Tool](https://smart-tool.github.io/smart/). 


Re: strutils find

2019-03-21 Thread Araq
Well one easy heuristic is to only use the lookup table if the search 
substring's length is >= 4 and otherwise use the bruteforce algorithm. It's of 
course tricky to figure out a good criterion for a standard library 
implementation. PRs are welcome.


strutils find

2019-03-21 Thread adrianv
playing around with text search (strutils.find) I realized, that the nim find 
implementation uses a Boyer-Moore algorithm. This is nice in many cases, but 
there are scenarios where a naive brute force search is faster (small search 
patterns and/or a simple one shot search like find("nim is a great programming 
language", "great")). The situation gets worse, when a stupid user like me uses 
the find repeatedly without precalculating the SkipTable. - I know stupid 
programmers write stupid code.

Now I tried to find a better find algorithm. After some googling my first stop 
was a Knuth-Morris-Pratt, but in all my benchmarks a simple brute force find 
was better (Maybe someone can show me an artificial search string and pattern 
where KMP shines? ). Than I had a genius idea and "invented" a rolling hash 
search. After implementing it, I googled again and found out that I 
re-"invented" a naive form of the Rabin-Karp algorithm :) But to my 
astonishment my naive implementation works very well. In all my tests, it is 
always better than the naive brute force search. In many cases it beats the BM 
and when it looses it is not so bad. But I want to make some more tests to be 
sure. Can someone show me some real world test where BM shines - my current 
observation is the longer the pattern, the better is BM.

When it turns out like I guess, I would like to suggest the following for the 
strutils find:

  * find a cut off size for patterns where BM is likely to be better than RK
  * for smaller patterns use the RK
  * for longer patterns use the BM



what do you think ?