Greetings Dlang wizards,
I seek knowledge/understanding of a very frustrating phenomenon
I've experienced over the past several days.
The problem space:
1) Read a list of strings from a file
2) De-duplicate all strings into the subset of unique strings
3) Sort the subset of unique strings by descending length and
then by ascending lexicographic identity
4) Iterate through the sorted subset of unique strings,
identifying smaller sequences with perfect identity to their
largest possible parent string
I have written a Dlang program that performantly progresses
through step #3 above. I used a built-in AA (associative array)
to uniquely de-duplicate the initial set of strings and then used
multiSort(). Performance was good up till this point, especially
with use of the LDC compiler.
Things went sideways at step #4: because multiSort() returns a
SortedRange, I used .array to convert the returned SortedRange
into an array of type string[]. This appeared to work, and
neither DMD nor LDC threw any warnings/errors for doing this.
With the formally returned array, I then attempted to construct a
double foreach loop to iterate through the sorted array of unique
strings and find substring matches.
foreach( i, ref pStr; sortedArr )
{
foreach( j, ref cStr; sortedArr[ i + 1 .. $ ] )
{
if( indexOf( pStr, cStr ) > -1 )
{
// ...
}
}
}
Before adding the code excerpt above, the Dlang program was
taking ~1 second on an input file containing approx. 64,000
strings.
By adding the code above, the program now takes 6 minutes to
complete. An attempt was made to more efficiently perform
ASCII-only substring searching by converting the sorted string[]
into ubyte[][] and then using countUntil() instead of indexOf(),
but this had an effect that was completely opposite to what I had
previously experienced: the program then took over 20 minutes to
complete!
Thus, I am entirely baffled.
My first attempt to solve this problem space used a small Perl
program to perform steps 1 through 3, which would then pipe
intermediate output to a small Dlang program handling only step
#4 using dynamic arrays (no use of AAs) of ubyte[][] with use of
countUntil().
The Dlang code for the nested foreach block above is essentially
near-identical between my two Dlang implementations. Yet, the
second implementation--where I'm trying to solve the entire
problem space in D--has absolutely failed in terms of performance.
Perl+D, ubyte[][], countUntil() :: under 2 seconds
only D, string[], indexOf() :: ~6 minutes
only D, ubyte[][], countUntil() :: >20 minutes
Please advise. This nightmarish experience is shaking my
confidence in using D.