On 29-12-2021 16:30, Martin Frb via lazarus wrote:



Could you post the full source if you haven't already, for a bit of benchmarking? I just wrote mine off the top of my head, and I assumed 5 instructions per 16 bytes would win every time, but I haven't verified anything yet.
I attached it to my last mail and have attached it again here (3rd procedure, "Utf8LengthAdd").

It is 64-bit only for now (and not cleaned up in any way).

Also, changing "bc >> 7" and "bc and 127"
to "moddiv(bc, 255, full, remain)" might save a few more ms, but that probably needs larger data to benchmark.

If you do work on this, feel free to integrate my code as the baseline for CPUs without SSE.
Otherwise, it might be a bit until I get to it.

First results (on an ageing i7-3770, trunk FPC, -O4 -Cpcoreavx):

fst 781
fst 781
fst 797
fst 766
pop 656
pop 641
pop 640
pop 641
add 562
add 578
add 563
add 594
asm 297
asm 296
asm 297
asm 297

The asm version is nearly fully functional. More importantly, the remaining issues are constant-time, single-instruction fixes, so they shouldn't influence the benchmark for anything other than the shortest sequences.

I'll finish up and post the whole shebang, since more eyes could help; I'm an asm amateur in some regards.
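
For anyone following along without the attachment, the general idea of counting UTF-8 code points several bytes at a time can be sketched roughly as below. This is a minimal illustration in C, not the attached Pascal code or the asm version: the function names utf8_length_scalar and utf8_length_swar are made up for this sketch, and the 8-bytes-per-step trick is only assumed to resemble what a non-SSE baseline might do. It relies on the fact that every UTF-8 continuation byte has the form 10xxxxxx, so the length in code points is the byte count minus the number of continuation bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar reference: count every byte that is NOT a continuation
   byte (10xxxxxx), i.e. every byte that starts a code point. */
static size_t utf8_length_scalar(const char *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += ((unsigned char)s[i] & 0xC0) != 0x80;
    return count;
}

/* Hypothetical 64-bit variant: handles 8 bytes per step without SSE. */
static size_t utf8_length_swar(const char *s, size_t n)
{
    const uint64_t ones = 0x0101010101010101ULL;
    size_t cont = 0;   /* number of continuation bytes seen */
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);   /* unaligned-safe load */
        /* The lowest bit of each byte lane becomes 1 when that byte
           has bit 7 set and bit 6 clear (a continuation byte). */
        uint64_t m = (w >> 7) & ~(w >> 6) & ones;
        /* Horizontal add of the 8 lanes; the sum is at most 8,
           so it cannot overflow the top byte. */
        cont += (size_t)((m * ones) >> 56);
    }

    /* Tail: fewer than 8 bytes left, handle them one at a time. */
    for (; i < n; i++)
        cont += ((unsigned char)s[i] & 0xC0) == 0x80;

    return n - cont;
}
```

Both functions should agree on any input; the scalar one is only there as a reference to check the faster variant against. Neither validates the input, they just count lead bytes.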
--
_______________________________________________
lazarus mailing list
lazarus@lists.lazarus-ide.org
https://lists.lazarus-ide.org/listinfo/lazarus
