On Tuesday, 26 January 2021 at 18:17:31 UTC, H. S. Teoh wrote:
Do not do this. Every time you call .array it allocates a new array and copies all its contents over. If this code runs frequently, it will cause a big performance hit, not to mention high GC load.

The function you're looking for is .release, not .array.

Many thanks for the tip! I look forward to trying this soon. For reference, the .array call is only performed once.
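For anyone following along later, my reading of the suggestion is roughly this (a minimal sketch with placeholder data, not my actual code):

import std.algorithm.sorting : sort;

void main()
{
    string[] words = ["abc", "a", "ab"];

    // sort() sorts the array in place and returns a SortedRange
    // wrapping it.
    auto sorted = words.sort!((a, b) => a.length > b.length);

    // .release hands back the underlying array without allocating,
    // unlike .array, which copies the contents into a new array.
    string[] byLength = sorted.release;
}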

That nested loop is an O(n^2) algorithm. Meaning it will slow down *very* quickly as the size of the array n increases. You might want to think about how to improve this algorithm.

Nice observation, and yes, this would typically be an O(n^2) approach.

However, because the input dataset is first reduced to unique strings and then sorted by descending length, the inner foreach loop does not iterate over all n elements: it runs only from index i + 1 through the end of the array.

Thus, I believe this becomes approximately n^2/2 comparisons. More precisely, it is n(n - 1)/2 = (n^2 - n)/2 comparisons, which is still O(n^2) asymptotically, just with half the constant factor.
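In sketch form (hypothetical names; the real comparison body is elided), the loop structure looks like this:

size_t comparisonCount(string[] strings)
{
    size_t count;
    // Triangular loop: element i is compared only against elements
    // i+1 .. $, so the total is n*(n-1)/2 comparisons.
    foreach (i, a; strings)
        foreach (b; strings[i + 1 .. $])
            ++count; // stand-in for the real comparison of a and b
    return count;
}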

Further: the original dataset has 64k strings. Squaring that yields 4.1 billion string comparisons.

Once de-duplicated, the dataset shrinks to ~46k strings. Applying the roughly n^2/2 estimate at that size yields about 1.06 billion string comparisons.

So, performing steps 1 through 3 reduces the brute-force comparison count roughly four-fold on my test development dataset.
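For the record, the arithmetic behind those figures (a throwaway check, not part of the program):

void main()
{
    import std.stdio : writefln;

    long full  = 64_000L * 64_000L;           // 4_096_000_000, ~4.1 billion
    long dedup = 46_000L * (46_000L - 1) / 2; // 1_057_977_000, ~1.06 billion
    writefln("ratio: %.2f", cast(double) full / dedup); // ~3.87
}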

Using AA's may not necessarily improve performance. It depends on what your code does with it. Because AA's require random access to memory, it's not friendly to the CPU cache hierarchy, whereas traversing linear arrays is more cache-friendly and in some cases will out-perform AA's.

I figured a built-in AA might be an efficient path to performing unique string de-duplication. If there's a more performant method available, I'll certainly try it.
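What I had in mind is roughly the following sketch (placeholder function name; not the original code, which I can't share):

import std.algorithm.sorting : sort;

string[] uniqueByDescendingLength(string[] input)
{
    // Use a built-in AA as a set: amortized O(1) insertion per string.
    bool[string] seen;
    foreach (s; input)
        seen[s] = true;

    // .keys allocates one new array holding the unique strings.
    auto unique = seen.keys;
    unique.sort!((a, b) => a.length > b.length);
    return unique;
}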

First of all, you need to use a profiler to identify where the hotspots are. Otherwise you could well be pouring tons of effort into "optimizing" code that doesn't actually need optimizing, while completely missing the real source of the problem. Whenever you run into performance problems, do not assume you know where the problem is, profile, profile, profile!

Message received. Given that D is the first compiled language I've semi-seriously dabbled with, I have no real experience with profiler usage.
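For my own notes, one starting point I'm aware of is DMD's built-in instrumenting profiler (assuming DMD here; LDC and GDC have their own mechanisms, and a general sampling tool such as perf also works on the compiled binary):

# Recompile with instrumentation, then run the program normally.
dmd -profile app.d
./app
# On exit, the run writes trace.log with per-function call counts
# and timings, which should point at the hotspots.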

Second, you only posted a small fragment of your code, so it's hard to say where the problem really is. I can only guess based on what you described. If you could post the entire program, or at least a complete, compilable and runnable excerpt thereof that displays the same (or similar) performance problems, then we could better help you pinpoint where the problem is.

Yes, I'll look to put together a complete, compilable, and runnable demo of this issue if/when my subsequent attempts continue to fail.

For professional reasons (because I no longer work in academia), I cannot share the original source code for the issue presented here, but I can attempt to reproduce it in a minimally complete form for a public dataset.
