On Tuesday, 26 January 2021 at 18:17:31 UTC, H. S. Teoh wrote:
> Do not do this. Every time you call .array it allocates a new
> array and copies all its contents over. If this code runs
> frequently, it will cause a big performance hit, not to mention
> high GC load.
> The function you're looking for is .release, not .array.
Many thanks for the tip! I look forward to trying this soon. For
reference, the .array call is only performed once.
> That nested loop is an O(n^2) algorithm. Meaning it will slow
> down *very* quickly as the size of the array n increases. You
> might want to think about how to improve this algorithm.
Nice observation, and yes, a nested loop like this is O(n^2) in
the general case.
However, because the input dataset is first reduced to unique
strings and then sorted in descending length, the inner foreach
does not iterate over all of n; it only runs from index i+1
through the end of the array.
So the comparison count is roughly n^2/2; more precisely, it is
(n^2 - n)/2. Asymptotically that is still O(n^2), but the
constant factor is halved.
Further: the original dataset has 64k strings; squaring that
yields roughly 4.1 billion string comparisons.
Once de-duplicated, the dataset shrinks to ~46k strings. At
roughly n^2/2 comparisons, that works out to about 1.06 billion.
So steps 1 through 3 cut the brute-force string comparison count
roughly four-fold on my development dataset.
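For reference, here is a minimal sketch of the triangular loop
pattern described above (the data and names are hypothetical, not
my actual code):

```d
import std.stdio;

void main()
{
    // Stand-in for the de-duplicated, length-sorted dataset.
    string[] words = ["banana", "apple", "nan", "app"];
    size_t comparisons = 0;

    foreach (i, a; words)
    {
        // The inner loop starts at i+1, so each unordered pair
        // is examined exactly once.
        foreach (b; words[i + 1 .. $])
        {
            ++comparisons;
            // ... the actual substring test would go here ...
        }
    }

    // For n items this performs (n^2 - n) / 2 comparisons:
    // here n = 4, so 4 * 3 / 2 = 6.
    writeln(comparisons); // prints 6
}
```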
> Using AA's may not necessarily improve performance. It depends
> on what your code does with it. Because AA's require random
> access to memory, it's not friendly to the CPU cache hierarchy,
> whereas traversing linear arrays is more cache-friendly and in
> some cases will out-perform AA's.
I figured a built-in AA might be an efficient path to performing
unique string de-duplication. If there's a more performant method
available, I'll certainly try it.
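For concreteness, this is roughly what I mean by using a built-in
AA as a set for de-duplication (a sketch with made-up data, not
my actual code):

```d
import std.stdio;

void main()
{
    string[] input = ["apple", "banana", "apple", "cherry", "banana"];

    // The AA serves purely as a set; the value type is irrelevant.
    bool[string] seen;
    string[] unique;
    foreach (s; input)
    {
        if (s !in seen)
        {
            seen[s] = true;
            unique ~= s;
        }
    }

    writeln(unique); // ["apple", "banana", "cherry"]
}
```

If cache behavior turns out to matter, an alternative that stays
linear in memory is to sort the array and then apply `uniq` from
std.algorithm, which drops adjacent duplicates without any
hashing.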
> First of all, you need to use a profiler to identify where the
> hotspots are. Otherwise you could well be pouring tons of
> effort into "optimizing" code that doesn't actually need
> optimizing, while completely missing the real source of the
> problem. Whenever you run into performance problems, do not
> assume you know where the problem is, profile, profile, profile!
Message received. Given that D is the first compiled language
I've semi-seriously dabbled with, I have no real experience with
profiler usage.
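For anyone else in the same boat: as I understand it, DMD ships a
built-in instrumenting profiler, so a first pass needs no external
tooling (flags sketched below; `app.d` is a placeholder filename):

```shell
# Build with instrumentation; the binary writes trace.log on exit,
# listing per-function call counts and timings.
dmd -profile app.d
./app

# GC allocation profiling is a separate mode; it writes
# profilegc.log on exit, showing allocations per call site.
dmd -profile=gc app.d
./app
```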
> Second, you only posted a small fragment of your code, so it's
> hard to say where the problem really is. I can only guess
> based on what you described. If you could post the entire
> program, or at least a complete, compilable and runnable
> excerpt thereof that displays the same (or similar) performance
> problems, then we could better help you pinpoint where the
> problem is.
Yes, if subsequent efforts continue to fail, I'll put together a
complete, compilable, and runnable demo of the issue.
For professional reasons (I no longer work in academia), I cannot
share the original source code, but I can attempt to reproduce
the problem in a minimal, complete form against a public dataset.