On Saturday, 8 October 2022 at 01:07:46 UTC, rassoc wrote:
On 10/8/22 00:50, Siarhei Siamashka via Digitalmars-d-learn wrote:
On Friday, 7 October 2022 at 12:19:59 UTC, bachmeier wrote:
python -c "print(('a' * 49 + 'b') * 20000)" > test.lst
That's generating a file with a single line:
```
$> wc -l test.lst
1 test.lst
```
If you insist on having 100K lines in the input data, then you
can try a different test (run it as a shell script):
```bash
python -c "print(('a' * 9999 + 'b') * 89 + '\\n' * 99999 + 'a' * 10000)" > words.txt
cp words.txt test.lst
wc -l words.txt
NEEDLE=`python -c "print('a' * 10000, end='')"`
echo "=== Python ==="
time python search.py "$NEEDLE"
echo "=== Ruby ==="
time ruby search.rb "$NEEDLE"
echo "=== Crystal ==="
time ./search-cr "$NEEDLE"
echo "=== D (LDC) ==="
time ./search-ldc "$NEEDLE"
```
This tests a 1MB file with 100K lines and a 10K-character needle. Results from my computer (with Python 3.10, because earlier versions of Python are slower):
```
100000 words.txt
=== Python ===
1
real 0m0.083s
user 0m0.072s
sys 0m0.010s
=== Ruby ===
1
real 0m0.313s
user 0m0.313s
sys 0m0.000s
=== Crystal ===
1
real 0m1.492s
user 0m1.482s
sys 0m0.010s
=== D (LDC) ===
1
real 1m10.314s
user 1m10.234s
sys 0m0.050s
```
> Going with an appropriate 100k mixed line file and your mentioned needle, D is still quite a bit slower, but the results aren't as drastic
Your "appropriate" file is also entirely artificial and happens to be specifically crafted to work well with the current ".canFind" implementation from Phobos. The primary source of slowness is partial matches, such as the `"int i = "` prefix when searching for the substring `"int i = 0;"` in a string that contains `"int i = 1;"`. Time is wasted comparing the initial 8 characters before encountering a mismatch and bailing out. But when searching in randomly generated gibberish, the chances of encountering long partial matches are very low, at least lower than in real text files.
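To make the cost of partial matches concrete, here is a small illustrative sketch (in Python, not the actual Phobos code) that counts character comparisons in a naive left-to-right substring scan. On an input where every window shares a long prefix with the needle, the comparison count grows like `len(haystack) * len(needle)` instead of staying near `len(haystack)`:

```python
def naive_find(haystack, needle):
    """Naive left-to-right scan; returns (match index or -1, comparisons made)."""
    comparisons = 0
    n, m = len(haystack), len(needle)
    for i in range(n - m + 1):
        j = 0
        # Compare the needle against the window starting at position i.
        while j < m:
            comparisons += 1
            if haystack[i + j] != needle[j]:
                break  # mismatch; a long shared prefix makes this happen late
            j += 1
        if j == m:
            return i, comparisons
    return -1, comparisons

# Adversarial input: every failed window matches 100 characters of the
# needle before the single mismatching character is reached.
hay = "a" * 999 + "b"
needle = "a" * 100 + "b"
pos, cmps = naive_find(hay, needle)
print(pos, cmps)  # comparison count is roughly len(hay) * len(needle)
```

CPython's built-in `str` search avoids this blow-up (it uses a Horspool-style fastsearch and, since 3.10, a two-way algorithm for long needles), which is consistent with the Python one-liner staying fast on the worst-case input above.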
> and nowhere near "two orders of magnitude".
Does the difference really have to be two orders of magnitude for
you to acknowledge that there might be a performance problem in
Phobos? Moreover, you are quoting torhu, who compared two D
implementations compiled by DMD (regex vs. canFind) rather than
standard libraries of different programming languages.
```D
import std;

void main(string[] args) {
    enforce(args.length > 1, "Need a needle argument.");
    auto needle = args[1].toLower;
    File("words.txt").byLine
        .count!(ln => ln.asLowerCase.canFind(needle))
        .writeln;
}
```
It's a nice-looking one-liner implementation in the D language, and everything seems fine. Except that similar one-liners implemented in other programming languages are faster and more versatile (they can handle any input data without catastrophic performance pitfalls).
BTW, changing ".asLowerCase" to ".toLower" introduces an extra memory allocation, but still significantly improves handling of the worst case. Conversion to lowercase is expensive for Unicode characters, and revisiting the same character to repeat such a conversion again and again is bad for performance.