On Saturday, 8 October 2022 at 01:07:46 UTC, rassoc wrote:
On 10/8/22 00:50, Siarhei Siamashka via Digitalmars-d-learn wrote:
On Friday, 7 October 2022 at 12:19:59 UTC, bachmeier wrote:
python -c "print(('a' * 49 + 'b') * 20000)" > test.lst
That's generating a file with a single line:
```
$> wc -l test.lst
1 test.lst
```
If you insist on having 100K lines in the input data, then you
can try a different test (run it as a shell script):
```bash
python -c "print(('a' * 9999 + 'b') * 89 + '\\n' * 99999 + 'a' * 10000)" > words.txt
cp words.txt test.lst
wc -l words.txt
NEEDLE=`python -c "print('a' * 10000, end='')"`
echo "=== Python ==="
time python search.py "$NEEDLE"
echo "=== Ruby ==="
time ruby search.rb "$NEEDLE"
echo "=== Crystal ==="
time ./search-cr "$NEEDLE"
echo "=== D (LDC) ==="
time ./search-ldc "$NEEDLE"
```
This tests a 1MB file with 100K lines and a 10K-character needle. Results from my computer (with Python 3.10, because earlier versions of Python are slower):
```
100000 words.txt
=== Python ===
1
real 0m0.083s
user 0m0.072s
sys 0m0.010s
=== Ruby ===
1
real 0m0.313s
user 0m0.313s
sys 0m0.000s
=== Crystal ===
1
real 0m1.492s
user 0m1.482s
sys 0m0.010s
=== D (LDC) ===
1
real 1m10.314s
user 1m10.234s
sys 0m0.050s
```
> Going with an appropriate 100k mixed line file and your mentioned needle, D is still quite a bit slower, but the results aren't as drastic
Your "appropriate" file is also entirely artificial and happens to be specifically crafted to work well with the current ".canFind" implementation from Phobos. The primary source of slowness is partial matches, such as the `"int i = "` prefix when searching for the substring `"int i = 0;"` in a string that contains `"int i = 1;"`. Time is wasted comparing the initial 8 characters before encountering a mismatch and bailing out. But when searching in randomly generated gibberish, the chances of encountering long partial matches are very low, at least lower than in real text files.
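To make the cost of partial matches concrete, here is a small illustrative sketch (in Python, not the actual Phobos code) that counts character comparisons in a naive left-to-right substring scan. On an input where every window shares a long prefix with the needle, the comparison count grows like `len(haystack) * len(needle)` instead of staying near `len(haystack)`:

```python
def naive_find(haystack, needle):
    """Naive left-to-right scan; returns (match index or -1, comparisons made)."""
    comparisons = 0
    n, m = len(haystack), len(needle)
    for i in range(n - m + 1):
        j = 0
        # Compare the needle against the window starting at position i.
        while j < m:
            comparisons += 1
            if haystack[i + j] != needle[j]:
                break  # mismatch; a long shared prefix makes this happen late
            j += 1
        if j == m:
            return i, comparisons
    return -1, comparisons

# Adversarial input: every failed window matches 100 characters of the
# needle before the single mismatching character is reached.
hay = "a" * 999 + "b"
needle = "a" * 100 + "b"
pos, cmps = naive_find(hay, needle)
print(pos, cmps)  # comparison count is roughly len(hay) * len(needle)
```

CPython's built-in `str` search avoids this blow-up (it uses a Horspool-style fastsearch and, since 3.10, a two-way algorithm for long needles), which is consistent with the Python one-liner staying fast on the worst-case input above.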
> and nowhere near "two orders of magnitude".
Does the difference really have to be two orders of magnitude for
you to acknowledge that there might be a performance problem in
Phobos? Moreover, you are quoting torhu, who compared two D
implementations compiled by DMD (regex vs. canFind) rather than
standard libraries of different programming languages.
```D
import std;

void main(string[] args) {
    enforce(args.length > 1, "Need a needle argument.");
    auto needle = args[1].toLower;
    File("words.txt").byLine
        .count!(ln => ln.asLowerCase.canFind(needle))
        .writeln;
}
```
It's a nice-looking one-liner implementation in the D language, and everything seems fine. Except that similar one-liners implemented in other programming languages are faster and more versatile (they can handle any input data without catastrophic performance pitfalls).
BTW, changing ".asLowerCase" to ".toLower" introduces an extra memory allocation, but still significantly improves handling of the worst case. Conversion to lowercase is expensive for Unicode characters, and revisiting the same character to repeat such a conversion again and again is bad for performance.