On 13.06.2012 10:31, Philippe Marschall wrote:
On 06/13/2012 04:44 AM, Igor Stasenko wrote:
Hi, hardcore hackers.
Please take a look at the code and tell me if it can be improved.

The AsmJit snippet below transforms a unicode integer value
into a 1..4-byte utf-8 sequence.

Then the outer piece of code (which is not yet written) will
accumulate the results of this snippet
to do memory-aligned (4-byte) writes.
That way, if 4 unicode characters can be encoded into 4 utf-8 bytes
(which is mostly the case for the latin-1 char range), there will be
4 memory reads (to read four 32-bit unicode values) but only a single
memory write (to write four 8-bit utf-8 encoded values).

The idea is to make utf-8 encoding speed close to memory copying speed :)
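
For readers without the AsmJit snippet at hand, here is a minimal Python sketch of the two steps described above: mapping one code point to its 1..4-byte utf-8 sequence, and accumulating the encoded bytes so they can be emitted as whole 32-bit words. It is not the code from this thread; the names utf8_encode_code_point and encode_words are made up for illustration, and little-endian word order is an assumption.

```python
def utf8_encode_code_point(cp):
    """Return the UTF-8 bytes for one Unicode scalar value."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %r" % (cp,))
    if cp < 0x80:          # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def encode_words(code_points):
    """Encode code points and group the output into 32-bit words.

    Returns (words, tail): `words` holds one integer per full group of
    four encoded bytes (little-endian, an assumption of this sketch) and
    `tail` holds the 0..3 leftover bytes that do not fill a word.
    """
    words = []
    pending = bytearray()
    for cp in code_points:
        pending += utf8_encode_code_point(cp)
        while len(pending) >= 4:
            # One "write" per four encoded bytes.
            words.append(int.from_bytes(pending[:4], "little"))
            del pending[:4]
    return words, bytes(pending)
```

With four ASCII code points the loop reads four values and emits exactly one word, which is the ratio described above; wider characters simply fill a word sooner.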

In Seaside 3.1 we go one step further. Imagine you have a long
ByteString with only a few non-ASCII characters. We do not want to have to
copy the whole string just to utf-8 encode a few characters, so we
combine the above approach with #next:putAll:startingAt: so that we only
have to encode and copy the non-ASCII characters; everything else is
written without being copied.
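
A rough Python sketch of that idea, not the Seaside code: scan a Latin-1 byte string, write each ASCII run to the output stream as a slice of the original, and encode only the bytes at or above 0x80. The name write_utf8_from_latin1 is hypothetical, and the memoryview slice merely stands in for what #next:putAll:startingAt: does in Smalltalk.

```python
import io


def write_utf8_from_latin1(data, out):
    """Write the Latin-1 bytes in `data` to binary stream `out` as UTF-8.

    ASCII runs are passed through as slices of the original buffer;
    only bytes >= 0x80 are re-encoded (always two bytes for Latin-1).
    """
    view = memoryview(data)   # slicing a memoryview does not copy
    start = 0                 # start of the current ASCII run
    n = len(data)
    for i in range(n):
        b = data[i]
        if b >= 0x80:
            if i > start:
                out.write(view[start:i])      # flush the pending ASCII run
            out.write(bytes([0xC0 | (b >> 6), 0x80 | (b & 0x3F)]))
            start = i + 1
    if start < n:
        out.write(view[start:n])              # trailing ASCII run
```

For example, writing "naïve café" encoded as Latin-1 into an io.BytesIO issues five writes (three ASCII runs and two encoded characters) instead of one per character.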


Cheers
Philippe


Both Pharo and Squeak default TextConverters have done something similar for the last 1 1/2 years; see (in Pharo) nextPutByteString:toStream:. What Igor describes seems aimed at encoding WideString -> utf8 though, which is still slow with the default converters.

As to the assembly, is leadingChar gone entirely? Otherwise the branching may fail miserably.


Cheers,
Henry
