"H. S. Teoh" <hst...@quickfur.ath.cx> wrote in message news:mailman.2179.1335486409.4860.digitalmar...@puremagic.com... > > Have you seen U+9598? It's an insanely convoluted glyph composed of > *three copies* of an already extremely complex glyph. > > http://upload.wikimedia.org/wikipedia/commons/3/3c/U%2B9F98.png > > (And yes, that huge thing is supposed to fit inside a SINGLE > character... what *were* those ancient Chinese scribes thinking?!) >
Yikes!

>
>> For example, I have my font size in Windows Notepad set to a
>> comfortable value. But when I want to use hiragana or katakana, I have
>> to go into the settings and increase the font size so I can actually
>> read it (Well, to what *little* extent I can even read it in the first
>> place ;) ). And those kana's tend to be among the simplest CJK
>> characters.
>>
>> (Don't worry - I only use Notepad as a quick-n-dirty scrap space,
>> never for real coding/writing).
>
> LOL... love the fact that you felt obligated to justify your use of
> notepad. :-P
>

Heh, any usage of Notepad *needs* to be justified. For example, it has an
undo buffer of exactly ONE change. And the stupid thing doesn't even handle
Unix-style newlines. *Everything* handles Unix-style newlines these days,
even on Windows. Windows *BATCH* files even accept Unix-style newlines, for
god's sakes! But not Notepad. It is nice in its leanness and
no-nonsense-ness, but it desperately needs some updates. At least it
actually supports Unicode, though (which I find somewhat surprising).
'Course, this is all XP. For all I know, maybe they've finally updated it
in MS OSX, erm, I mean Vista and Win7...

>
>> > So we really need all four lengths. Ain't unicode fun?! :-)
>> >
>>
>> No kidding. The *one* thing I really, really hate about Unicode is the
>> fact that most (if not all) of its complexity actually *is* necessary.
>
> We're lucky the more imaginative scribes of the world have either been
> dead for centuries or have restricted themselves to writing fictional
> languages. :-) The inventions of the dead ones have been codified and
> simplified by the unfortunate people who inherited their overly complex
> systems (*cough*CJK glyphs*cough), and the inventions of the living ones
> are largely ignored by the world due to the fact that, well, their
> scripts are only useful for writing fictional languages. :-)
>
> So despite the fact that there are still some crazy convoluted stuff out
> there, such as Arabic or Indic scripts with pair-wise substitution rules
> in Unicode, overall things are relatively tame. At least the
> subcomponents of CJK glyphs are no longer productive (actively being
> used to compose new characters by script users) -- can you imagine the
> insanity if Unicode had to support composition by those radicals and
> subparts? Or if Unicode had to support a script like this one:
>
> http://www.arthaey.com/conlang/ashaille/writing/sarapin.html
>
> whose components are graphically composed in, shall we say, entirely
> non-trivial ways (see the composed samples at the bottom of the page)?
>

That's insane! And yet, very very interesting...

>>
>> While I find that very intersting...I'm afraid I don't actually
>> understand your suggestion :/ (I do understand FSM's and how they
>> work, though) Could you give a little example of what you mean?
> [...]
>
> Currently, std.uni code (argh the pun!!)

Hah! :)

> is hand-written with tables of
> which character belongs to which class, etc.. These hand-coded tables
> are error-prone and unnecessary. For example, think of computing the
> layout width of a UTF-8 stream. Why waste time decoding into dchar, and
> then doing all sorts of table lookups to compute the width? Instead,
> treat the stream as a byte stream, with certain sequences of bytes
> evaluating to length 2, others to length 1, and yet others to length 0.
>
> A lexer engine is perfectly suited for recognizing these kinds of
> sequences with optimal speed.
> The only difference from a real lexer is that instead of spitting out
> tokens, it keeps a running total (layout) length, which is output at
> the end.
>
> So what we should do is to write a tool that processes Unicode.txt (the
> official table of character properties from the Unicode standard) and
> generates lexer engines that compute various Unicode properties
> (grapheme count, layout length, etc.) for each of the UTF encodings.
>
> This way, we get optimal speed for these algorithms, plus we don't need
> to manually maintain tables and stuff, we just run the tool on
> Unicode.txt each time there's a new Unicode release, and the correct
> code will be generated automatically.
>

I see. I think that's a very good observation, and a great suggestion. In
fact, I'd imagine it'd be considerably simpler than a typical lexer
generator. Much less of the fancy regexy-ness would be needed. Maybe put
together a pull request if you get the time...?
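
Just to check that I'm picturing the same thing: something like the
following? Totally untested sketch; the function name and the width rules
are made up for illustration, and the real branches/tables would of course
be generated from the Unicode data files rather than written by hand.

import std.stdio;

// Rough sketch only: walk the raw UTF-8 bytes and add a layout width per
// lead byte, without ever decoding to dchar. Assumes well-formed UTF-8 and
// uses deliberately crude width rules:
//   ASCII                        -> width 1
//   U+0300..U+036F (combining)   -> width 0
//   U+3000..U+9FFF (mostly CJK)  -> width 2
//   everything else              -> width 1
size_t layoutWidth(const(ubyte)[] s)
{
    size_t width = 0;
    size_t i = 0;
    while (i < s.length)
    {
        immutable b = s[i];
        if (b < 0x80)                       // 1-byte sequence (ASCII)
        {
            width += 1;
            i += 1;
        }
        else if ((b & 0xE0) == 0xC0)        // 2-byte sequence
        {
            // U+0300..U+036F encode as 0xCC 0x80 .. 0xCD 0xAF
            immutable b2 = s[i + 1];
            immutable combining = (b == 0xCC) || (b == 0xCD && b2 <= 0xAF);
            width += combining ? 0 : 1;
            i += 2;
        }
        else if ((b & 0xF0) == 0xE0)        // 3-byte sequence
        {
            // Lead bytes 0xE3..0xE9 cover U+3000..U+9FFF -> treat as wide
            width += (b >= 0xE3 && b <= 0xE9) ? 2 : 1;
            i += 3;
        }
        else                                // 4-byte sequence
        {
            width += 1;
            i += 4;
        }
    }
    return width;
}

void main()
{
    writeln(layoutWidth(cast(const(ubyte)[]) "hello"));  // 5
    writeln(layoutWidth(cast(const(ubyte)[]) "日本語"));   // 6: three double-width glyphs
}

The generated version would presumably be a proper table-driven state
machine instead of these hand-rolled range checks, but hopefully that gets
the idea across.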