On Sunday, 14 January 2024 at 10:02:58 UTC, Jordan Wilson wrote:
On Saturday, 13 January 2024 at 11:03:42 UTC, Renato wrote:
I like to use a phone encoding problem to determine the strengths and weaknesses of programming languages because this problem is easy enough that I can write solutions in any language in a few hours, but complex enough to exercise lots of interesting parts of the language.

[...]

Hello Renato,

This seems to be quite a lot of calls:
```
======== Timer frequency unknown, Times are in Megaticks ========

  Num          Tree        Func        Per
  Calls        Time        Time        Call

19204964 3761 3756 0 pure nothrow ref @trusted immutable(char)[][] core.internal.array.appending._d_arrayappendcTX!(immutable(char)[][], immutable(char)[])._d_arrayappendcTX(scope return ref immutable(char)[][], ulong)

19204924 8957 3474 0 @safe void dencoder.printTranslations(immutable(char)[][][dencoder.Key], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], immutable(char)[][])
```

This is when using the `words-quarter.txt` input (the `dictionary.txt` input seems to finish much faster, although still slower than `java`/`rust`).

I also used only 100 phone numbers as input.

My final observation is that `words-quarter.txt` contains some 1-letter inputs, (for example, `i` or `m`)...this may result in a large number of encoding permutations, which may explain the high number of recursion calls?

Jordan

The characteristics of the dictionary impact the number of solutions greatly. I explored this in my blog post about [Common Lisp, Part 2](https://renato.athaydes.com/posts/revenge_of_lisp-part-2).

The sheer number of function calls is why this problem can take minutes even without having to allocate much memory or print anything.

**I've managed to improve the D code enough that it is now faster than Common Lisp and the equivalent algorithm in Java.**

It took some profiling to get there, though... thanks to @Anonymouse for the suggestion to use Valgrind. With that, I was able to profile the code nicely (Valgrind works well with D; it doesn't even show mangled names!).
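For anyone wanting to reproduce this, it's just the standard Callgrind workflow (the input redirection is illustrative; adjust for how you run the binary):

```
# collect: writes callgrind.out.<pid> next to the binary
valgrind --tool=callgrind ./dencoder < input.txt > /dev/null

# report: per-function instruction counts (Ir), with demangled D names
callgrind_annotate callgrind.out.<pid>
```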

Here's what I did.

First: the solution using `Int128` was much faster than the `BigInt` solution, as I had already mentioned. But when profiling it, it was clear that for the problems with a very large number of solutions, the GC became a problem:

```
--------------------------------------------------------------------------------
23,044,096,944 (100.0%)  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir                      file:function
--------------------------------------------------------------------------------
7,079,878,197 (30.72%) ???:core.internal.gc.impl.conservative.gc.Gcx.mark!(false, true, true).mark(core.internal.gc.impl.conservative.gc.Gcx.ScanRange!(false).ScanRange) [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
2,375,100,857 (10.31%) ???:dencoder.printTranslations(immutable(char)[][][std.int128.Int128], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], immutable(char)[][])'2 [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
1,971,210,820 ( 8.55%) ???:_aaInX [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
1,922,961,924 ( 8.34%) ???:_d_arraysetlengthT [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
1,298,747,622 ( 5.64%) ???:core.int128.mul(core.int128.Cent, core.int128.Cent) [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
1,134,644,706 ( 4.92%) ???:core.internal.gc.bits.GCBits.setLocked(ulong) [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
849,587,834 ( 3.69%) ???:core.internal.gc.impl.conservative.gc.Gcx.smallAlloc(ulong, ref ulong, uint, const(TypeInfo)) [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
827,407,988 ( 3.59%) ./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms [/usr/lib/x86_64-linux-gnu/libc.so.6]
688,845,027 ( 2.99%) ./string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_unaligned_erms [/usr/lib/x86_64-linux-gnu/libc.so.6]
575,415,884 ( 2.50%) ???:_DThn16_4core8internal2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkMxC8TypeInfoZSQDd6memory8BlkInfo_ [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
562,146,592 ( 2.44%) ???:core.internal.gc.impl.conservative.gc.ConservativeGC.runLocked!(core.internal.gc.impl.conservative.gc.ConservativeGC.mallocNoSync(ulong, uint, ref ulong, const(TypeInfo)), core.internal.gc.impl.conservative.gc.mallocTime, core.internal.gc.impl.conservative.gc.numMallocs, ulong, uint, ulong, const(TypeInfo)).runLocked(ref ulong, ref uint, ref ulong, ref const(TypeInfo)) [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
526,067,586 ( 2.28%) ???:core.internal.spinlock.SpinLock.lock() shared [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
```

So, I [decided to use `std.container.Array`](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/b927bc9cc4e33f9f6ba457d8abc3bccbc17c7f1e) (I had tried `Appender` before, but that didn't really help) to make sure no allocation happens when adding/removing words from the `words` list that is carried through the `printTranslations` calls (that's the hot path).

As you can see in the commit, the code is only slightly less convenient...
and this change made the code run considerably faster for the very long runs!
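To illustrate the pattern (a minimal sketch with made-up names, not the actual dencoder code): the recursion pushes a candidate word with `insertBack` and pops it with `removeBack` on the way out, so every call reuses the same buffer instead of allocating a new GC slice per level:

```d
import std.container.array : Array;
import std.stdio : writeln;

// Hypothetical miniature of the hot path: `words` is one Array shared by
// the whole recursion; push before recursing, pop after.
void walk(ref Array!string words, string[] candidates)
{
    if (candidates.length == 0)
    {
        writeln(words[]); // a "solution" is the current stack of words
        return;
    }
    words.insertBack(candidates[0]); // no GC allocation once capacity exists
    walk(words, candidates[1 .. $]);
    words.removeBack();              // undo before trying the next branch
}

void main()
{
    Array!string words;
    walk(words, ["a", "sample", "phrase"]);
}
```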

Here's the Callgrind profiling data AFTER this change:

```
6,547,316,944 (32.42%) ???:dencoder.printTranslations(immutable(char)[][][std.int128.Int128], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], std.container.array.Array!(immutable(char)[]).Array)'2 [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
5,229,076,596 (25.90%) ???:_aaInX [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
3,871,644,402 (19.17%) ???:core.int128.mul(core.int128.Cent, core.int128.Cent) [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
677,533,800 ( 3.36%) ???:std.int128.Int128.toHash() const [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
543,388,688 ( 2.69%) ???:std.int128.Int128.this(core.int128.Cent) [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
543,388,688 ( 2.69%) ???:std.int128.Int128.this(long, long) [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
542,027,045 ( 2.68%) ???:object.TypeInfo_Struct.getHash(scope const(void*)) const [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
424,849,937 ( 2.10%) ???:object.TypeInfo_Struct.equals(in void, in void) const [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
```

No more GC! As expected, most of the time is now spent calculating hashes for each candidate solution (as it should be). I am not entirely sure what the `_aaInX` call is, but I believe it's the associative array's membership check (the `in` operator)?
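For context, my guess is that the dictionary is an associative array keyed by the encoded number, and the `in` lookup on it is what the runtime implements as `_aaInX`. Something like this (names hypothetical):

```d
import std.int128 : Int128;

void main()
{
    string[][Int128] dict;       // words grouped by their encoded key
    dict[Int128(42)] = ["mix"];

    // The `in` operator is the _aaInX call in the profile: it hashes the
    // key (Int128.toHash via TypeInfo_Struct.getHash) and probes the table.
    if (auto words = Int128(42) in dict)
        assert((*words)[0] == "mix");
}
```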

The D profiler seems to show similar data:

```
======== Timer frequency unknown, Times are in Megaticks ========

  Num          Tree        Func        Per
  Calls        Time        Time        Call

104001600 158 158 0 const pure nothrow @nogc @safe std.int128.Int128 std.int128.Int128.opBinary!("*").opBinary(std.int128.Int128)
402933936 63 63 0 const pure nothrow @property @nogc @safe bool std.typecons.RefCounted!(std.container.array.Array!(immutable(char)[]).Array.Payload, 0).RefCounted.RefCountedStore.isInitialized()
183744166 87 59 0 inout pure nothrow ref @property @nogc return @safe inout(std.container.array.Array!(immutable(char)[]).Array.Payload) std.typecons.RefCounted!(std.container.array.Array!(immutable(char)[]).Array.Payload, 0).RefCounted.refCountedPayload()
103784880 68 38 0 const pure nothrow @nogc @safe std.int128.Int128 std.int128.Int128.opBinary!("+", int).opBinary(const(int))
103784880 99 31 0 pure nothrow ref @nogc @safe std.int128.Int128 std.int128.Int128.opOpAssign!("+", int).opOpAssign(int)
104001600 30 30 0 const pure nothrow @nogc @safe std.int128.Int128 std.int128.Int128.opBinary!("+").opBinary(std.int128.Int128)
61167377 67 29 0 pure nothrow @nogc void std.container.array.Array!(immutable(char)[]).Array.__fieldDtor()
43444575 61 27 0 pure nothrow @nogc @safe void core.internal.lifetime.emplaceRef!(immutable(char)[], immutable(char)[], immutable(char)[]).emplaceRef(ref immutable(char)[], ref immutable(char)[])
48427530 79 27 0 const pure nothrow @property @nogc @safe bool std.container.array.Array!(immutable(char)[]).Array.empty()
61167377 38 27 0 pure nothrow @nogc void std.typecons.RefCounted!(std.container.array.Array!(immutable(char)[]).Array.Payload, 0).RefCounted.__dtor()
43444575 130 24 0 pure nothrow @nogc ulong std.container.array.Array!(immutable(char)[]).Array.Payload.insertBack!(immutable(char)[]).insertBack(immutable(char)[])
61167376 32 23 0 pure nothrow @nogc @safe void std.typecons.RefCounted!(std.container.array.Array!(immutable(char)[]).Array.Payload, 0).RefCounted.__postblit()
```

Anyway, after this change, the code now scaled much better.

But there's not much left to be optimised... except the arithmetic (notice how the `opBinary!("*")` call is near the top).

I know how to make that faster using a strategy similar to the one I used in Rust (you can check my [journey to optimise the Rust code on my blog post](https://renato.athaydes.com/posts/how-to-write-fast-rust-code) - specifically, the `Using packed bytes for more efficient storage` section)... knowing how the encoding works, I knew that the multiplication is not really needed. So, instead of doing this:

```d
n *= 10;      // make room for the next decimal digit
n += c - '0'; // append the digit's value
```

I could do this:

```d
n <<= 4; // a cheap 4-bit shift instead of multiplying by 10
n += c;  // add the raw character; no need to subtract '0'
```

This changes the actual number, but that doesn't matter: the key is still unique for each digit sequence, which is all we need (two digit characters differ by at most 9, and 9 < 16, so `16 * n + c` maps distinct prefix/digit pairs to distinct keys).
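Putting it together, the whole key-building loop becomes shift-and-add only. This is a sketch with made-up names; the leading 1 is my own detail to keep e.g. "07" and "7" distinct, and the actual dencoder code may handle that differently:

```d
import std.int128 : Int128;

// Sketch: build a unique key from a digit string using only shifts and adds.
Int128 keyOf(const(char)[] digits)
{
    auto n = Int128(1); // sentinel so keys of different lengths never collide
    foreach (c; digits)
    {
        n <<= 4; // make room for 4 bits
        n += c;  // injective: two digit chars differ by at most 9 < 16
    }
    return n;
}

void main()
{
    assert(keyOf("12") != keyOf("21"));
    assert(keyOf("7") != keyOf("07"));
}
```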

You can see [the full commit here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/0cbfd41a072718bfb0c0d0af8bb7266471e7e94c).

This improved the performance for sure, even if not by much.

The profiling data after this arithmetic trick looks like this:

Valgrind:

```
4,821,217,799 (35.75%) ???:dencoder.printTranslations(immutable(char)[][][std.int128.Int128], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], std.container.array.Array!(immutable(char)[]).Array)'2 [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
4,241,404,850 (31.45%) ???:_aaInX [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
1,038,733,782 ( 7.70%) ???:core.int128.shl(core.int128.Cent, uint) [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
575,372,270 ( 4.27%) ???:std.int128.Int128.toHash() const [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
461,659,456 ( 3.42%) ???:std.int128.Int128.this(core.int128.Cent) [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
460,297,816 ( 3.41%) ???:object.TypeInfo_Struct.getHash(scope const(void*)) const [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
347,198,989 ( 2.57%) ???:object.TypeInfo_Struct.equals(in void, in void) const [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
288,537,160 ( 2.14%) ???:core.int128.add(core.int128.Cent, core.int128.Cent) [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
```

D Profiler:

```
======== Timer frequency unknown, Times are in Megaticks ========

  Num          Tree        Func        Per
  Calls        Time        Time        Call

29879388 44037 11267 0 void dencoder.printTranslations(immutable(char)[][][std.int128.Int128], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], std.container.array.Array!(immutable(char)[]).Array)
126827950 4306 3599 0 inout pure nothrow ref @property @nogc return @safe inout(std.container.array.Array!(immutable(char)[]).Array.Payload) std.typecons.RefCounted!(std.container.array.Array!(immutable(char)[]).Array.Payload, 0).RefCounted.refCountedPayload()
29879268 7293 3100 0 pure nothrow @nogc ulong std.container.array.Array!(immutable(char)[]).Array.Payload.insertBack!(immutable(char)[]).insertBack(immutable(char)[])
33534720 4896 2780 0 const pure nothrow @property @nogc @safe bool std.container.array.Array!(immutable(char)[]).Array.empty()
29879268 11834 2479 0 pure nothrow @nogc ulong std.container.array.Array!(immutable(char)[]).Array.insertBack!(immutable(char)[]).insertBack(immutable(char)[])
29879268 8702 2279 0 pure nothrow @nogc @safe void std.container.array.Array!(immutable(char)[]).Array.removeBack()
282919518 2045 2045 0 const pure nothrow @property @nogc @safe bool std.typecons.RefCounted!(std.container.array.Array!(immutable(char)[]).Array.Payload, 0).RefCounted.RefCountedStore.isInitialized()
29879268 2534 1859 0 pure nothrow @nogc @safe void core.internal.lifetime.emplaceRef!(immutable(char)[], immutable(char)[], immutable(char)[]).emplaceRef(ref immutable(char)[], ref immutable(char)[])
44511077 3264 1505 0 pure nothrow @nogc void std.container.array.Array!(immutable(char)[]).Array.__fieldDtor()
44511076 2588 1254 0 pure nothrow @nogc scope void std.container.array.Array!(immutable(char)[]).Array.__fieldPostblit()
49758574 1684 1243 0 const pure nothrow @nogc @safe std.int128.Int128 std.int128.Int128.opBinary!("+", char).opBinary(const(char))
49758574 2907 1223 0 pure nothrow ref @nogc @safe std.int128.Int128 std.int128.Int128.opOpAssign!("+", immutable(char)).opOpAssign(ref immutable(char))
49975294 1796 1208 0 pure nothrow ref @nogc @safe std.int128.Int128 std.int128.Int128.opOpAssign!("<<", int).opOpAssign(int)
```

As far as I can tell, the only two remaining bottlenecks are:

* `std.typecons.RefCounted!(Array.Payload, 0).RefCounted.refCountedPayload()`
* the `Int128` implementation!

To fix the former, I would need to implement my own `Array`, I suppose... using ref-counting here seems unnecessary?

But that is a bit beyond what I am willing to do!
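For illustration only, this is roughly the kind of minimal, non-ref-counted stack I mean (an untested sketch, not something I benchmarked):

```d
// A plain grow-only buffer: push/pop are just a length increment/decrement,
// with no RefCounted payload, postblits, or destructor work in the hot path.
struct WordStack
{
    private string[] buf;
    private size_t len;

    void push(string w)
    {
        if (len == buf.length)
            buf.length = buf.length == 0 ? 16 : buf.length * 2; // rare growth
        buf[len++] = w;
    }

    void pop() { assert(len > 0); --len; }

    bool empty() const { return len == 0; }

    inout(string)[] items() inout { return buf[0 .. len]; }
}

unittest
{
    WordStack s;
    s.push("hello");
    assert(s.items == ["hello"]);
    s.pop();
    assert(s.empty);
}
```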

The latter could probably be fixed by not using a numeric key at all, just bytes - but my previous attempt at that didn't give better results. Or perhaps by overriding the hashing for `Int128` (I didn't try that because I assumed the default would already be pretty optimal)?
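If someone wants to try the custom-hash idea, a hypothetical wrapper key might look like this (untried; `data.lo`/`data.hi` are the two 64-bit halves that `Int128` exposes):

```d
import std.int128 : Int128;

// Untried idea: bypass TypeInfo_Struct.getHash with a trivial custom hash.
struct FastKey
{
    Int128 value;

    size_t toHash() const @safe pure nothrow @nogc
    {
        return cast(size_t)(value.data.lo ^ value.data.hi);
    }

    bool opEquals(const FastKey rhs) const @safe pure nothrow @nogc
    {
        return value == rhs.value;
    }
}
```

Declaring the dictionary as `string[][FastKey]` would then skip the virtual `getHash`/`equals` calls seen in the profile - whether that wins anything in practice, I don't know.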

So, I ran out of options and am going to call it done.

[Here's my final D solution](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/0cbfd41a072718bfb0c0d0af8bb7266471e7e94c/src/d/src/dencoder.d).

The code is still readable (I think everyone can agree D is one of the most readable languages of all)!

You can see how it performs in the last comparison run I did ([all data, including a nice chart, in this gist](https://gist.github.com/renatoathaydes/ab5c86b0ea59152693a7236c333ac334)):

```
Proc,Memory(bytes),Time(ms)
===> java -Xms20M -Xmx100M -cp build/java Main
java-Main,3190730752,441
java-Main,3194789888,1440
java-Main,3193692160,2316
java-Main,3194720256,3604
java-Main,3192963072,12861
java-Main,3261128704,31282
===> sbcl --script src/lisp/main.fasl
sbcl,1275994112,83
sbcl,1275998208,1396
sbcl,1275998208,5925
sbcl,1275998208,9752
sbcl,1275998208,64811
sbcl,1276006400,153799
===> ./rust
./rust,23138304,56
./rust,23138304,234
./rust,23138304,1288
./rust,23138304,2444
./rust,9027584,15867
./rust,9027584,36985
===> src/d/dencoder
src/d/dencoder,219041792,67
src/d/dencoder,229904384,1114
src/d/dencoder,229904384,4421
src/d/dencoder,229904384,9087
src/d/dencoder,219041792,51315
src/d/dencoder,219041792,120818
```

## Conclusion

Using `Int128` as a key instead of `BigInt` made the code much faster (around 4x faster).

Using `Array` to avoid too many reallocations with primitive dynamic arrays made the code run much faster for very large numbers of calls (very little difference when running for less than 10s, but at least 2x faster at around 1-minute runs, where the small per-call overhead adds up).

As you can see [in the chart I posted](https://gist.github.com/renatoathaydes/ab5c86b0ea59152693a7236c333ac334), D's speed is close to that of Common Lisp across all runs, though between 10 and 20% faster. It's nowhere near Rust, unfortunately, with Rust being almost 4x faster (please ignore the Java solution, as it uses a Trie instead of the "numeric" algorithm - so it's not a fair comparison). I had expected to get very close to Rust, but that didn't happen... and I just can't see in the profiling data what's causing D to fall so far behind!

On the memory data: D uses much less memory than Java and Common Lisp, but still a lot more than Rust.

If anyone can find any flaw in my methodology, or optimise my code so that it gets a couple of times faster and approaches Rust's performance, I would greatly appreciate that! But for now, my understanding is that the most promising way to get there would be to write D in `betterC` style?!
