On Sunday, 14 January 2024 at 10:02:58 UTC, Jordan Wilson wrote:
On Saturday, 13 January 2024 at 11:03:42 UTC, Renato wrote:
I like to use a phone encoding problem to determine the strengths and weaknesses of programming languages because this problem is easy enough that I can write solutions in any language in a few hours, but complex enough to exercise lots of interesting parts of the language.

[...]

Hello Renato,

This seems to be quite a lot of calls:
```
======== Timer frequency unknown, Times are in Megaticks ========

  Num          Tree        Func        Per
  Calls        Time        Time        Call

19204964 3761 3756 0 pure nothrow ref @trusted immutable(char)[][] core.internal.array.appending._d_arrayappendcTX!(immutable(char)[][], immutable(char)[])._d_arrayappendcTX(scope return ref immutable(char)[][], ulong)

19204924 8957 3474 0 @safe void dencoder.printTranslations(immutable(char)[][][dencoder.Key], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], immutable(char)[][])
```

This is when using the `words-quarter.txt` input (the `dictionary.txt` input seems to finish much faster, although still slower than `java`/`rust`).

I also used only 100 phone numbers as input.

My final observation is that `words-quarter.txt` contains some 1-letter inputs, (for example, `i` or `m`)...this may result in a large number of encoding permutations, which may explain the high number of recursion calls?

Jordan

The characteristics of the dictionary impact the number of solutions greatly. I explored this in my blog post about [Common Lisp, Part 2](https://renato.athaydes.com/posts/revenge_of_lisp-part-2).

The sheer number of function calls is why this problem can take minutes even without having to allocate much memory or print anything.

**I've managed to improve the D code enough that it is now faster than Common Lisp and the equivalent algorithm in Java.**

It took some profiling to get there, though... thanks to @Anonymouse for the suggestion to use Valgrind. With that, I was able to profile the code nicely (Valgrind works well with D; it doesn't even show mangled names!).
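For anyone wanting to reproduce this, it's just the standard Callgrind workflow (the input redirection is illustrative; adjust for how you run the binary):

```
# collect: writes callgrind.out.<pid> next to the binary
valgrind --tool=callgrind ./dencoder < input.txt > /dev/null

# report: per-function instruction counts (Ir), with demangled D names
callgrind_annotate callgrind.out.<pid>
```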

Here's what I did.

First: the solution using `Int128` was much faster than the `BigInt` solution, as I had already mentioned. But when profiling it, it was clear that for the problems with a very large number of solutions, the GC became a problem:

```
--------------------------------------------------------------------------------
23,044,096,944 (100.0%)  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir                      file:function
--------------------------------------------------------------------------------
7,079,878,197 (30.72%) ???:core.internal.gc.impl.conservative.gc.Gcx.mark!(false, true, true).mark(core.internal.gc.impl.conservative.gc.Gcx.ScanRange!(false).ScanRange) [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
2,375,100,857 (10.31%) ???:dencoder.printTranslations(immutable(char)[][][std.int128.Int128], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], immutable(char)[][])'2 [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
1,971,210,820 ( 8.55%) ???:_aaInX [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
1,922,961,924 ( 8.34%) ???:_d_arraysetlengthT [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
1,298,747,622 ( 5.64%) ???:core.int128.mul(core.int128.Cent, core.int128.Cent) [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
1,134,644,706 ( 4.92%) ???:core.internal.gc.bits.GCBits.setLocked(ulong) [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
849,587,834 ( 3.69%) ???:core.internal.gc.impl.conservative.gc.Gcx.smallAlloc(ulong, ref ulong, uint, const(TypeInfo)) [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
827,407,988 ( 3.59%) ./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms [/usr/lib/x86_64-linux-gnu/libc.so.6]
688,845,027 ( 2.99%) ./string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_unaligned_erms [/usr/lib/x86_64-linux-gnu/libc.so.6]
575,415,884 ( 2.50%) ???:_DThn16_4core8internal2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkMxC8TypeInfoZSQDd6memory8BlkInfo_ [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
562,146,592 ( 2.44%) ???:core.internal.gc.impl.conservative.gc.ConservativeGC.runLocked!(core.internal.gc.impl.conservative.gc.ConservativeGC.mallocNoSync(ulong, uint, ref ulong, const(TypeInfo)), core.internal.gc.impl.conservative.gc.mallocTime, core.internal.gc.impl.conservative.gc.numMallocs, ulong, uint, ulong, const(TypeInfo)).runLocked(ref ulong, ref uint, ref ulong, ref const(TypeInfo)) [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
526,067,586 ( 2.28%) ???:core.internal.spinlock.SpinLock.lock() shared [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
```

So, I [decided to use `std.container.Array`](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/b927bc9cc4e33f9f6ba457d8abc3bccbc17c7f1e) (I had tried `Appender` before, but that didn't really help) to make sure no allocation happens when adding/removing words from the `words` list that is carried through the `printTranslations` calls (that's the hot path).

As you can see in the commit, the code is only slightly less convenient...
and this change made the code run considerably faster for the very long runs!
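To illustrate the pattern (a minimal sketch with made-up names, not the actual dencoder code): the recursion pushes a candidate word with `insertBack` and pops it with `removeBack` on the way out, so every call reuses the same buffer instead of allocating a new GC slice per level:

```d
import std.container.array : Array;
import std.stdio : writeln;

// Hypothetical miniature of the hot path: `words` is one Array shared by
// the whole recursion; push before recursing, pop after.
void walk(ref Array!string words, string[] candidates)
{
    if (candidates.length == 0)
    {
        writeln(words[]); // a "solution" is the current stack of words
        return;
    }
    words.insertBack(candidates[0]); // no GC allocation once capacity exists
    walk(words, candidates[1 .. $]);
    words.removeBack();              // undo before trying the next branch
}

void main()
{
    Array!string words;
    walk(words, ["a", "sample", "phrase"]);
}
```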

Here's the Callgrind profiling data AFTER this change:

```
6,547,316,944 (32.42%) ???:dencoder.printTranslations(immutable(char)[][][std.int128.Int128], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], std.container.array.Array!(immutable(char)[]).Array)'2 [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
5,229,076,596 (25.90%) ???:_aaInX [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
3,871,644,402 (19.17%) ???:core.int128.mul(core.int128.Cent, core.int128.Cent) [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
677,533,800 ( 3.36%) ???:std.int128.Int128.toHash() const [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
543,388,688 ( 2.69%) ???:std.int128.Int128.this(core.int128.Cent) [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
543,388,688 ( 2.69%) ???:std.int128.Int128.this(long, long) [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
542,027,045 ( 2.68%) ???:object.TypeInfo_Struct.getHash(scope const(void*)) const [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
424,849,937 ( 2.10%) ???:object.TypeInfo_Struct.equals(in void, in void) const [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
```

No more GC! As expected, most of the time is now spent calculating hashes for each candidate solution (as it should be). I am not entirely sure what the `_aaInX` call is, but I believe it's the associative array's membership check (the `in` operator)?
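For context, my guess is that the dictionary is an associative array keyed by the encoded number, and the `in` lookup on it is what the runtime implements as `_aaInX`. Something like this (names hypothetical):

```d
import std.int128 : Int128;

void main()
{
    string[][Int128] dict;       // words grouped by their encoded key
    dict[Int128(42)] = ["mix"];

    // The `in` operator is the _aaInX call in the profile: it hashes the
    // key (Int128.toHash via TypeInfo_Struct.getHash) and probes the table.
    if (auto words = Int128(42) in dict)
        assert((*words)[0] == "mix");
}
```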

The D profiler seems to show similar data:

```
======== Timer frequency unknown, Times are in Megaticks ========

  Num          Tree        Func        Per
  Calls        Time        Time        Call

104001600 158 158 0 const pure nothrow @nogc @safe std.int128.Int128 std.int128.Int128.opBinary!("*").opBinary(std.int128.Int128)
402933936 63 63 0 const pure nothrow @property @nogc @safe bool std.typecons.RefCounted!(std.container.array.Array!(immutable(char)[]).Array.Payload, 0).RefCounted.RefCountedStore.isInitialized()
183744166 87 59 0 inout pure nothrow ref @property @nogc return @safe inout(std.container.array.Array!(immutable(char)[]).Array.Payload) std.typecons.RefCounted!(std.container.array.Array!(immutable(char)[]).Array.Payload, 0).RefCounted.refCountedPayload()
103784880 68 38 0 const pure nothrow @nogc @safe std.int128.Int128 std.int128.Int128.opBinary!("+", int).opBinary(const(int))
103784880 99 31 0 pure nothrow ref @nogc @safe std.int128.Int128 std.int128.Int128.opOpAssign!("+", int).opOpAssign(int)
104001600 30 30 0 const pure nothrow @nogc @safe std.int128.Int128 std.int128.Int128.opBinary!("+").opBinary(std.int128.Int128)
61167377 67 29 0 pure nothrow @nogc void std.container.array.Array!(immutable(char)[]).Array.__fieldDtor()
43444575 61 27 0 pure nothrow @nogc @safe void core.internal.lifetime.emplaceRef!(immutable(char)[], immutable(char)[], immutable(char)[]).emplaceRef(ref immutable(char)[], ref immutable(char)[])
48427530 79 27 0 const pure nothrow @property @nogc @safe bool std.container.array.Array!(immutable(char)[]).Array.empty()
61167377 38 27 0 pure nothrow @nogc void std.typecons.RefCounted!(std.container.array.Array!(immutable(char)[]).Array.Payload, 0).RefCounted.__dtor()
43444575 130 24 0 pure nothrow @nogc ulong std.container.array.Array!(immutable(char)[]).Array.Payload.insertBack!(immutable(char)[]).insertBack(immutable(char)[])
61167376 32 23 0 pure nothrow @nogc @safe void std.typecons.RefCounted!(std.container.array.Array!(immutable(char)[]).Array.Payload, 0).RefCounted.__postblit()
```

Anyway, after this change, the code now scaled much better.

But there's not much left to be optimised... except the arithmetic (notice how the `opBinary!("*")` call is near the top).

I know how to make that faster using a strategy similar to the one I used in Rust (you can check my [journey to optimise the Rust code on my blog post](https://renato.athaydes.com/posts/how-to-write-fast-rust-code) - specifically, the `Using packed bytes for more efficient storage` section)... knowing how the encoding works, I knew that the multiplication is not really needed. So, instead of doing this:

```d
n *= 10;      // make room for the next decimal digit
n += c - '0'; // append the digit's value
```

I could do this:

```d
n <<= 4; // a cheap 4-bit shift instead of multiplying by 10
n += c;  // add the raw character; no need to subtract '0'
```

This changes the actual number, but that doesn't matter: the key is still unique for each digit sequence, which is all we need (two digit characters differ by at most 9, and 9 < 16, so `16 * n + c` maps distinct prefix/digit pairs to distinct keys).
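Putting it together, the whole key-building loop becomes shift-and-add only. This is a sketch with made-up names; the leading 1 is my own detail to keep e.g. "07" and "7" distinct, and the actual dencoder code may handle that differently:

```d
import std.int128 : Int128;

// Sketch: build a unique key from a digit string using only shifts and adds.
Int128 keyOf(const(char)[] digits)
{
    auto n = Int128(1); // sentinel so keys of different lengths never collide
    foreach (c; digits)
    {
        n <<= 4; // make room for 4 bits
        n += c;  // injective: two digit chars differ by at most 9 < 16
    }
    return n;
}

void main()
{
    assert(keyOf("12") != keyOf("21"));
    assert(keyOf("7") != keyOf("07"));
}
```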

You can see [the full commit here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/0cbfd41a072718bfb0c0d0af8bb7266471e7e94c).

This improved the performance for sure, even if not by much.

The profiling data after this arithmetic trick looks like this:

Valgrind:

```
4,821,217,799 (35.75%) ???:dencoder.printTranslations(immutable(char)[][][std.int128.Int128], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], std.container.array.Array!(immutable(char)[]).Array)'2 [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
4,241,404,850 (31.45%) ???:_aaInX [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
1,038,733,782 ( 7.70%) ???:core.int128.shl(core.int128.Cent, uint) [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
575,372,270 ( 4.27%) ???:std.int128.Int128.toHash() const [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
461,659,456 ( 3.42%) ???:std.int128.Int128.this(core.int128.Cent) [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
460,297,816 ( 3.41%) ???:object.TypeInfo_Struct.getHash(scope const(void*)) const [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
347,198,989 ( 2.57%) ???:object.TypeInfo_Struct.equals(in void, in void) const [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
288,537,160 ( 2.14%) ???:core.int128.add(core.int128.Cent, core.int128.Cent) [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
```

D Profiler:

```
======== Timer frequency unknown, Times are in Megaticks ========

  Num          Tree        Func        Per
  Calls        Time        Time        Call

29879388 44037 11267 0 void dencoder.printTranslations(immutable(char)[][][std.int128.Int128], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], std.container.array.Array!(immutable(char)[]).Array)
126827950 4306 3599 0 inout pure nothrow ref @property @nogc return @safe inout(std.container.array.Array!(immutable(char)[]).Array.Payload) std.typecons.RefCounted!(std.container.array.Array!(immutable(char)[]).Array.Payload, 0).RefCounted.refCountedPayload()
29879268 7293 3100 0 pure nothrow @nogc ulong std.container.array.Array!(immutable(char)[]).Array.Payload.insertBack!(immutable(char)[]).insertBack(immutable(char)[])
33534720 4896 2780 0 const pure nothrow @property @nogc @safe bool std.container.array.Array!(immutable(char)[]).Array.empty()
29879268 11834 2479 0 pure nothrow @nogc ulong std.container.array.Array!(immutable(char)[]).Array.insertBack!(immutable(char)[]).insertBack(immutable(char)[])
29879268 8702 2279 0 pure nothrow @nogc @safe void std.container.array.Array!(immutable(char)[]).Array.removeBack()
282919518 2045 2045 0 const pure nothrow @property @nogc @safe bool std.typecons.RefCounted!(std.container.array.Array!(immutable(char)[]).Array.Payload, 0).RefCounted.RefCountedStore.isInitialized()
29879268 2534 1859 0 pure nothrow @nogc @safe void core.internal.lifetime.emplaceRef!(immutable(char)[], immutable(char)[], immutable(char)[]).emplaceRef(ref immutable(char)[], ref immutable(char)[])
44511077 3264 1505 0 pure nothrow @nogc void std.container.array.Array!(immutable(char)[]).Array.__fieldDtor()
44511076 2588 1254 0 pure nothrow @nogc scope void std.container.array.Array!(immutable(char)[]).Array.__fieldPostblit()
49758574 1684 1243 0 const pure nothrow @nogc @safe std.int128.Int128 std.int128.Int128.opBinary!("+", char).opBinary(const(char))
49758574 2907 1223 0 pure nothrow ref @nogc @safe std.int128.Int128 std.int128.Int128.opOpAssign!("+", immutable(char)).opOpAssign(ref immutable(char))
49975294 1796 1208 0 pure nothrow ref @nogc @safe std.int128.Int128 std.int128.Int128.opOpAssign!("<<", int).opOpAssign(int)
```

As far as I can tell, the only two remaining bottlenecks are:

* `std.typecons.RefCounted!(Array.Payload, 0).RefCounted.refCountedPayload()`
* the `Int128` implementation!

To fix the former, I would need to implement my own `Array`, I suppose... using ref-counting here seems unnecessary?

But that is a bit beyond what I am willing to do!
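For illustration only, this is roughly the kind of minimal, non-ref-counted stack I mean (an untested sketch, not something I benchmarked):

```d
// A plain grow-only buffer: push/pop are just a length increment/decrement,
// with no RefCounted payload, postblits, or destructor work in the hot path.
struct WordStack
{
    private string[] buf;
    private size_t len;

    void push(string w)
    {
        if (len == buf.length)
            buf.length = buf.length == 0 ? 16 : buf.length * 2; // rare growth
        buf[len++] = w;
    }

    void pop() { assert(len > 0); --len; }

    bool empty() const { return len == 0; }

    inout(string)[] items() inout { return buf[0 .. len]; }
}

unittest
{
    WordStack s;
    s.push("hello");
    assert(s.items == ["hello"]);
    s.pop();
    assert(s.empty);
}
```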

The latter could probably be fixed by not using a numeric key at all, just bytes - but my previous attempt at that didn't give better results. Or perhaps by overriding the hashing for `Int128` (I didn't try that because I assumed the default would already be pretty optimal)?
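If someone wants to try the custom-hash idea, a hypothetical wrapper key might look like this (untried; `data.lo`/`data.hi` are the two 64-bit halves that `Int128` exposes):

```d
import std.int128 : Int128;

// Untried idea: bypass TypeInfo_Struct.getHash with a trivial custom hash.
struct FastKey
{
    Int128 value;

    size_t toHash() const @safe pure nothrow @nogc
    {
        return cast(size_t)(value.data.lo ^ value.data.hi);
    }

    bool opEquals(const FastKey rhs) const @safe pure nothrow @nogc
    {
        return value == rhs.value;
    }
}
```

Declaring the dictionary as `string[][FastKey]` would then skip the virtual `getHash`/`equals` calls seen in the profile - whether that wins anything in practice, I don't know.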

So, I ran out of options and am going to call it done.

[Here's my final D solution](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/0cbfd41a072718bfb0c0d0af8bb7266471e7e94c/src/d/src/dencoder.d).

The code is still readable (I think everyone can agree D is one of the most readable languages of all)!

You can see how it performs in the last comparison run I did ([all data, including a nice chart, in this gist](https://gist.github.com/renatoathaydes/ab5c86b0ea59152693a7236c333ac334)):

```
Proc,Memory(bytes),Time(ms)
===> java -Xms20M -Xmx100M -cp build/java Main
java-Main,3190730752,441
java-Main,3194789888,1440
java-Main,3193692160,2316
java-Main,3194720256,3604
java-Main,3192963072,12861
java-Main,3261128704,31282
===> sbcl --script src/lisp/main.fasl
sbcl,1275994112,83
sbcl,1275998208,1396
sbcl,1275998208,5925
sbcl,1275998208,9752
sbcl,1275998208,64811
sbcl,1276006400,153799
===> ./rust
./rust,23138304,56
./rust,23138304,234
./rust,23138304,1288
./rust,23138304,2444
./rust,9027584,15867
./rust,9027584,36985
===> src/d/dencoder
src/d/dencoder,219041792,67
src/d/dencoder,229904384,1114
src/d/dencoder,229904384,4421
src/d/dencoder,229904384,9087
src/d/dencoder,219041792,51315
src/d/dencoder,219041792,120818
```

## Conclusion

Using `Int128` as a key instead of `BigInt` made the code much faster (around 4x faster).

Using `Array` to avoid too many reallocations with primitive dynamic arrays made the code run much faster for very large numbers of calls (very little difference when running for less than 10s, but at least 2x faster at around 1-minute runs, where the small per-call overhead adds up).

As you can see [in the chart I posted](https://gist.github.com/renatoathaydes/ab5c86b0ea59152693a7236c333ac334), D's speed is close to that of Common Lisp across all runs, though between 10 and 20% faster. It's nowhere near Rust, unfortunately, with Rust being almost 4x faster (please ignore the Java solution, as it uses a Trie instead of the "numeric" algorithm - so it's not a fair comparison). I had expected to get very close to Rust, but that didn't happen... and I just can't see in the profiling data what's causing D to fall so far behind!

On the memory data: D uses much less memory than Java and Common Lisp, but still a lot more than Rust.

If anyone can find any flaw in my methodology, or optimise my code so that it gets a couple of times faster and approaches Rust's performance, I would greatly appreciate that! But for now, my understanding is that the most promising way to get there would be to write D in `betterC` style?!
