On Sun, May 21, 2017 at 3:46 PM, Henri Sivonen <hsivo...@hsivonen.fi> wrote:
> I guess instead of looking at the relative slowness and pondering
> acceleration tables, I should measure how much Chinese or Japanese
> text a Raspberry Pi 3 (the underpowered ARM device I have access to
> and that has predictable-enough scheduling to be benchmarkable in a
> usefully repeatable way unlike Android devices) can legacy-encode in a
> tenth of a second or 1/24th of a second without an acceleration table.
> (I posit that with the network roundtrip happening afterwards, no one
> is going to care if the form encode step in the legacy case takes up
> to one movie frame duration. Possibly, the "don't care" allowance is
> much larger.)

Here are numbers from ARMv7 code running on RPi3:

UTF-16 to Shift_JIS: 626000 characters per second or the
human-readable non-markup text of a Wikipedia article in 1/60th of a
second.

UTF-16 to GB18030 (same as GBK for the dominant parts): 206000
characters per second or the human-readable non-markup text of a
Wikipedia article in 1/15th of a second.

UTF-16 to Big5: 258000 characters per second or the human-readable
non-markup text of a Wikipedia article in 1/20th of a second.
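
(Back-of-the-envelope from those numbers: the test articles work out
to roughly 626000 / 60 ≈ 10000 characters for ja, 206000 / 15 ≈ 14000
for zh_cn and 258000 / 20 ≈ 13000 for zh_tw, so the "one article"
yardstick above is on the order of ten thousand characters.)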

Considering that a user usually submits considerably less than a
Wikipedia article's worth of text in a form at a time, I think we can
conclude that, as far as user perception of form submission goes, it's
OK to ship Japanese and Chinese legacy encoders that do a linear
search over decode-optimized data (no encode-specific data structures
at all) and are extremely slow *relative* to UTF-16 to UTF-8 encode
(by a factor of over 200!).
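
To make the approach concrete, here is a rough Rust sketch of what
"linear search over decode-optimized data" means. The table name and
contents are stand-ins rather than the actual encoding_rs internals,
and the Encoding Standard's excluded pointer ranges and other details
are omitted:

// Illustrative sketch only, not the actual encoding_rs code.
// JIS0208_TO_UNICODE stands in for a decode-oriented table indexed by
// JIS X 0208 pointer; a real table has thousands of entries.
const JIS0208_TO_UNICODE: [u16; 3] = [0x4E9C, 0x5516, 0x5B89]; // stand-in data

/// Linear scan of the decode table: O(n) per character, but no
/// encode-specific data structure has to be built or shipped.
fn jis0208_pointer_for(c: char) -> Option<usize> {
    let code = c as u32;
    if code > 0xFFFF {
        return None; // JIS X 0208 only covers the BMP
    }
    JIS0208_TO_UNICODE.iter().position(|&u| u as u32 == code)
}

/// Pointer-to-bytes math as in the Encoding Standard's Shift_JIS
/// encoder (validity checks and excluded ranges omitted).
fn pointer_to_shift_jis(pointer: usize) -> [u8; 2] {
    let lead = pointer / 188;
    let lead_offset = if lead < 0x1F { 0x81 } else { 0xC1 };
    let trail = pointer % 188;
    let trail_offset = if trail < 0x3F { 0x40 } else { 0x41 };
    [(lead + lead_offset) as u8, (trail + trail_offset) as u8]
}

fn main() {
    // With the three-entry stand-in table the pointer (and therefore
    // the byte output) is not the real one; this only shows the shape
    // of the lookup.
    let pointer = jis0208_pointer_for('亜').unwrap();
    let bytes = pointer_to_shift_jis(pointer);
    println!("pointer {} -> bytes {:02X} {:02X}", pointer, bytes[0], bytes[1]);
}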

The test data I used was:
https://github.com/hsivonen/encoding_bench/blob/master/src/wikipedia/zh_tw.txt
https://github.com/hsivonen/encoding_bench/blob/master/src/wikipedia/zh_cn.txt
https://github.com/hsivonen/encoding_bench/blob/master/src/wikipedia/ja.txt

So it's human-authored text, but my understanding is that the
Simplified Chinese version has been machine-mapped from the
Traditional Chinese version. It's therefore possible that some of the
slowness in the Simplified Chinese case is attributable to that
conversion exercising less common characters than would appear if the
text had been authored directly in Simplified Chinese.

Japanese is not fully ideographic and the kana mapping is a matter of
a range check plus offset, which is why the Shift_JIS case is so much
faster.
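
As a concrete (again, illustrative rather than actual encoding_rs)
sketch of that: the hiragana block maps to Shift_JIS with one range
check and one constant offset, and katakana is almost as easy except
that the trail byte has to skip 0x7F partway through the block.

// Illustrative sketch only, not the actual encoding_rs code.
fn hiragana_to_shift_jis(c: char) -> Option<[u8; 2]> {
    let code = c as u32;
    // U+3041 HIRAGANA LETTER SMALL A .. U+3093 HIRAGANA LETTER N
    if (0x3041..=0x3093).contains(&code) {
        // U+3041 encodes as 0x82 0x9F and the rest of the block
        // follows contiguously (the trail byte stays above 0x7F, so
        // there is no gap to step over inside this range).
        let sjis = 0x829F + (code - 0x3041);
        Some([(sjis >> 8) as u8, (sjis & 0xFF) as u8])
    } else {
        // Katakana would be handled the same way, plus the 0x7F skip.
        None
    }
}

fn main() {
    assert_eq!(hiragana_to_shift_jis('あ'), Some([0x82, 0xA0]));
    assert_eq!(hiragana_to_shift_jis('ん'), Some([0x82, 0xF1]));
}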

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/