Re: [Pharo-project] Fastest utf-8 encoder contest

Henrik Sperre Johansen Wed, 13 Jun 2012 02:40:39 -0700

On 13.06.2012 10:31, Philippe Marschall wrote:

On 06/13/2012 04:44 AM, Igor Stasenko wrote:

Hi, hardcore hackers.
Please take a look at the code and tell me if it can be improved.


The AsmJit snippet below transforms a Unicode integer value
into a 1..4-byte UTF-8 sequence;

then the outer piece of code (which is not yet written) will
accumulate the results of this snippet
to do memory-aligned (4-byte) writes.
That way, if 4 Unicode characters can be encoded into 4 UTF-8 bytes
(which is mostly the case for the latin-1 char range), then there will be
4 memory reads (to read four 32-bit Unicode values) but only a single
memory write (to write four 8-bit UTF-8 encoded values).

The idea is to make utf-8 encoding speed close to memory copying speed :)
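
For readers without the AsmJit snippet at hand, here is a rough Python
sketch of the idea described above. It is not Igor's code: the names
encode_code_point and encode_words are made up for illustration, and the
"4-byte write" is only simulated by counting how many full 4-byte chunks
are flushed to the output.

```python
def encode_code_point(cp):
    """Return the UTF-8 bytes (1 to 4 of them) for one Unicode code point."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % cp)
    if cp < 0x80:                      # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def encode_words(code_points):
    """Encode code points, flushing the output in 4-byte chunks.

    Returns (encoded_bytes, number_of_full_4_byte_writes). The chunked
    flush stands in for the aligned 32-bit store of the outer loop.
    """
    out = bytearray()
    pending = bytearray()              # encoded bytes not yet flushed
    word_writes = 0
    for cp in code_points:
        pending += encode_code_point(cp)
        while len(pending) >= 4:
            out += pending[:4]         # one simulated 4-byte write
            del pending[:4]
            word_writes += 1
    out += pending                     # leftover tail, fewer than 4 bytes
    return bytes(out), word_writes
```

With four ASCII code points, encode_words reads four values and performs
a single 4-byte flush. Note that code points from 0x80 upwards, which
includes the upper half of latin-1, need two bytes each, so the
one-write-per-four-characters case only holds for runs of ASCII.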


In Seaside 3.1 we go one step further. Imagine you have a long
ByteString and only a few non-ASCII characters. We do not want to have to
copy the whole string just to utf-8 encode a few characters, so we
combine the above approach with #next:putAll:startingAt: so that we only
have to encode and copy the non-ASCII characters, everything else is not
copied.


Cheers
Philippe
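
The run-skipping approach Philippe describes can be sketched in Python
roughly as follows. This is not the Seaside code: write_utf8_from_latin1
is a hypothetical name, the input is assumed to be latin-1 bytes (as a
ByteString would hold), and writing a memoryview slice to the stream
plays the role of #next:putAll:startingAt:, i.e. handing over a range of
the source without first building an encoded copy of it.

```python
import io


def write_utf8_from_latin1(source, stream):
    """Write latin-1 encoded bytes `source` to `stream` as UTF-8.

    Runs of ASCII are passed through as slices of the original buffer;
    only bytes >= 0x80 are actually re-encoded (always to two bytes).
    """
    view = memoryview(source)
    run_start = 0                      # start of the current ASCII run
    for i, b in enumerate(source):
        if b >= 0x80:
            if i > run_start:
                stream.write(view[run_start:i])   # untouched ASCII run
            # latin-1 byte value == code point, range 0x80..0xFF
            stream.write(bytes([0xC0 | (b >> 6), 0x80 | (b & 0x3F)]))
            run_start = i + 1
    if run_start < len(source):
        stream.write(view[run_start:])            # trailing ASCII run


buf = io.BytesIO()
write_utf8_from_latin1("na\u00efve caf\u00e9".encode("latin-1"), buf)
encoded = buf.getvalue()
```

For a pure ASCII input the loop never encodes anything and the whole
string goes out in a single write.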

The default TextConverters in both Pharo and Squeak have done something
similar for the last 1 1/2 years, see #nextPutByteString:toStream: (in
Pharo). What Igor describes seems aimed at encoding WideString -> utf8
though, which is still slow with the default converters.

As to the assembly, is leadingChar gone entirely? Otherwise the
branching may fail miserably.
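
To illustrate the leadingChar concern, here is a small Python sketch. It
assumes the layout historically used by Squeak and early Pharo, where
the low 22 bits of a wide Character value hold the code point and a
language tag (leadingChar) sits in the 8 bits above them. That layout,
the constants, and the function names are assumptions made for this
example, not something stated in the thread.

```python
CHAR_CODE_MASK = 0x3FFFFF     # assumed: low 22 bits are the code point
LEADING_CHAR_SHIFT = 22       # assumed: language tag starts at bit 22
LEADING_CHAR_MASK = 0xFF      # assumed: tag is 8 bits wide


def split_tagged_value(value):
    """Split a raw character value into (leading_char, char_code)."""
    leading = (value >> LEADING_CHAR_SHIFT) & LEADING_CHAR_MASK
    return leading, value & CHAR_CODE_MASK


def utf8_length_by_range(value):
    """Pick the UTF-8 length purely by comparing against range limits,
    the way a branch-on-magnitude encoder would."""
    if value < 0x80:
        return 1
    if value < 0x800:
        return 2
    if value < 0x10000:
        return 3
    return 4


tagged_a = (1 << LEADING_CHAR_SHIFT) | 0x41   # 'A' with language tag 1
naive_length = utf8_length_by_range(tagged_a)
leading, code = split_tagged_value(tagged_a)
masked_length = utf8_length_by_range(code)
```

Under these assumptions a tagged 'A' falls into the 4-byte branch unless
the tag is masked off first, which is the kind of misbranching meant
above.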



Cheers,
Henry
