On Thursday, 17 May 2018 at 05:01:54 UTC, Joakim wrote:
On Wednesday, 16 May 2018 at 20:11:35 UTC, Andrei Alexandrescu wrote:
On 5/16/18 1:18 PM, Joakim wrote:
On Wednesday, 16 May 2018 at 16:48:28 UTC, Dmitry Olshansky wrote:
On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/


Sigh, this reminds me of the old quote about people spending a bunch of time making more efficient what shouldn't be done at all.

Validating UTF-8 is super common; most text protocols and files these days use it, and others have an option to do so.

I'd like our validateUtf to be fast, since right now we do validation every time we decode a string. And THAT is slow. Trying not to validate on decode means most things should be validated on input...
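Something like this minimal sketch is what I mean: run std.utf.validate once at the input boundary, then let later code decode trusting the input. The sample string and error handling are purely illustrative, not how Phobos structures this internally.

import std.stdio : writeln;
import std.utf : validate, UTFException;

void main()
{
    // Pretend this came off the wire; the literal is just an illustration.
    string input = "päivää, 世界";

    // Validate once at the boundary: std.utf.validate throws a
    // UTFException on malformed UTF-8.
    try
    {
        validate(input);
    }
    catch (UTFException e)
    {
        writeln("rejecting invalid UTF-8 input");
        return;
    }

    // Past this point the string is known to be well-formed UTF-8.
    foreach (dchar c; input)
        writeln(c);
}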

I think you know what I'm referring to, which is that UTF-8 is a badly designed format, not that input validation shouldn't be done.

I find this an interesting minority opinion, at least from the perspective of the circles I frequent, where UTF8 is unanimously heralded as a great design. Only a couple of weeks ago I saw Dylan Beattie give a very entertaining talk on exactly this topic: https://dotnext-piter.ru/en/2018/spb/talks/2rioyakmuakcak0euk0ww8/

Thanks for the link, skipped to the part about text encodings, should be fun to read the rest later.

If you could share some details on why you think UTF8 is badly designed and how you believe it could be/have been better, I'd be in your debt!

Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme,

This is not practical, sorry. What happens when your message loses the header? Exactly, the rest of the message is garbled. That's what happened with code-page-based texts when you don't know which code page they were encoded in. It has the additional drawback that mixing languages becomes impossible, or at least very cumbersome. UTF-8 has several properties that are difficult to get with other schemes:

- It is stateless, meaning any byte in a stream always means the same thing; its meaning does not depend on external state or on a previous byte (see the small sketch after this list).

- It can mix any languages in the same stream without acrobatics, and anyone who thinks that mixing languages doesn't happen often should get his head extracted from his rear, because it is very common (check Wikipedia's front page, for example).

- The multi-byte nature of other alphabets is not as bad as people think, because texts in computers do not live on their own; they are generally embedded inside file formats, which more often than not are extremely bloated (XML, HTML, XLIFF, Akoma Ntoso, RTF, etc.). The few extra bytes in the text do not weigh that much.
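To make the stateless point concrete, here is a tiny illustrative sketch, nothing more than the standard UTF-8 bit patterns: each byte's role follows from its own top bits, with no surrounding context needed.

import std.stdio : writefln;

// Classify a single byte purely from its own top bits.
string byteRole(ubyte b)
{
    if ((b & 0x80) == 0x00) return "ASCII (1-byte sequence)";
    if ((b & 0xC0) == 0x80) return "continuation byte";
    if ((b & 0xE0) == 0xC0) return "lead byte of a 2-byte sequence";
    if ((b & 0xF0) == 0xE0) return "lead byte of a 3-byte sequence";
    if ((b & 0xF8) == 0xF0) return "lead byte of a 4-byte sequence";
    return "invalid in UTF-8";
}

void main()
{
    // 'a' is ASCII, 'é' is a 2-byte sequence, '亿' is a 3-byte sequence.
    foreach (ubyte b; cast(const(ubyte)[]) "aé亿")
        writefln("0x%02X: %s", b, byteRole(b));
}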

I'm in charge at the European Commission of the biggest translation memory in the world. It currently handles 30 languages, and without UTF-8 and UTF-16 it would be unmanageable. I still remember when I started there in 2002, when we handled only 11 languages, of which only one used another alphabet (Greek). Everything was based on RTF with code pages and it was a braindead mess. My first job in 2003 was to extend the system to handle the 8 newcomer languages, and with ASCII-based encodings that was completely unmanageable, because every document processed mixes languages and alphabets freely (addresses and names are often written in their original form, for instance).

Two years ago we also implemented support for Chinese. The nice thing was that, thanks to Unicode, we didn't have to change much to do it. The second surprise was the file sizes: Chinese documents were generally smaller than their European counterparts. Yes, CJK requires 3 bytes for each ideogram, but one ideogram generally replaces many letters. The ideogram 亿 replaces "One hundred million", for example; which of the two takes more bytes? So if CJK indeed requires more bytes to encode, it is first of all because it NEEDS many more bits to begin with: there are around 30,000 CJK code points in the BMP alone, and adding the 60,000 that are in the SIP, we need 17 bits just to encode them.
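(For what it's worth, the 17-bit figure checks out: 2^16 = 65,536 is not enough for roughly 30,000 + 60,000 = 90,000 code points, while 2^17 = 131,072 is.)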


as they mostly were, except obviously for the Asian CJK languages. That way, you optimize for the common string, i.e. one that contains a single language or at least no CJK, rather than pessimizing every non-ASCII language by doubling its character width, as UTF-8 does. This UTF-8 issue is one of the first topics I raised in this forum, but as you noted at the time, nobody agreed and I don't want to dredge all that up again.
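To put a rough number on that doubling, a quick and purely illustrative check (arbitrary sample words; code points versus UTF-8 code units):

import std.range : walkLength;
import std.stdio : writefln;
import std.utf : byDchar;

void main()
{
    // Each of these letters is one byte in the old single-byte code pages
    // (ISO 8859-7, KOI8-R) but two bytes in UTF-8.
    foreach (s; ["καλημέρα", "привет"])
        writefln("%s: %s code points, %s UTF-8 bytes",
                 s, s.byDchar.walkLength, s.length);
}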

I have been researching this a bit since then, and the stated goals for UTF-8 at inception were that it _could not overlap with ASCII anywhere for other languages_, to avoid issues with legacy software wrongly processing other languages as ASCII, and to allow seeking from an arbitrary location within a byte stream:

https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
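The seeking goal falls out of continuation bytes being self-identifying: they always match 10xxxxxx. A rough sketch, with a helper name of my own invention, of what resynchronizing from an arbitrary byte offset looks like:

import std.stdio : writeln;

// From any byte offset, scan forward to the next code point boundary;
// no earlier context is needed.
size_t nextBoundary(const(ubyte)[] data, size_t pos)
{
    while (pos < data.length && (data[pos] & 0xC0) == 0x80)
        ++pos;                 // skip continuation bytes
    return pos;                // first byte of the next code point (or end)
}

void main()
{
    auto bytes = cast(const(ubyte)[]) "héllo, мир";
    // Byte 2 is the middle of 'é'; the next boundary is the 'l' at byte 3.
    writeln(nextBoundary(bytes, 2));
}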

I have no dispute with these priorities at the time, as they were optimizing for the institutional and tech realities of 1992, as Dylan also notes, and UTF-8 is actually a nice hack given those constraints. What I question is whether those priorities are at all relevant today, when billions of smartphone users regularly use non-ASCII text, and these tech companies are the largest private organizations on the planet, i.e. they have the resources to design a new transfer format. I see basically no relevance for the streaming requirement today, as I noted in this forum years ago, but I can see why it might have been considered important in the early '90s, before packet-based networking protocols had won.

I think a header-based scheme would be _much_ better today, and the reason I know Dmitry knows that is that I have discussed it with him privately over email: I plan to prototype a format like that in D. Even if UTF-8 is already fairly widespread, something like that could be useful as a better intermediate format for string processing, and maybe someday it could replace UTF-8 too.
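Just to make "header-based" a little more concrete, here is a deliberately rough sketch of the general shape I have in mind; every name, enum member, and layout choice below is made up for illustration and is not a settled design:

import std.stdio : writeln;

// One header byte names the script block; the payload is then one byte
// per character for single-script text. Entirely made up for illustration.
enum Script : ubyte
{
    latin,
    greek,
    cyrillic,
    cjk,        // would need two payload bytes per character
}

struct HeaderString
{
    Script script;
    immutable(ubyte)[] payload;
}

// Decode a single-byte run given the block's starting code point
// (simplified; real script blocks are not this tidy).
dchar[] decodeRun(HeaderString s, dchar blockStart)
{
    dchar[] result;
    foreach (b; s.payload)
        result ~= cast(dchar)(blockStart + b);
    return result;
}

void main()
{
    immutable(ubyte)[] offsets = [0x31, 0x32, 0x33];
    auto run = HeaderString(Script.greek, offsets);
    writeln(decodeRun(run, '\u0380'));   // prints αβγ
}

The hard parts, of course, are exactly what this sketch waves away: how runs of different scripts are delimited and how CJK and rarer scripts are handled.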

