Re: [Freedos-devel] ASCII to unicode table

Christian Masloch Wed, 01 Dec 2010 08:33:05 -0800

> You would need an Input Method driver which lets you type
> complex key sequences or combinations to type in a language
> which has more than the usual few dozen chars of alphabet.


Yes. The (keyboard) input and (screen) output appear to be the most
complicated exercise here. DBCS or UTF-8 support inside other programs  
would appear less complicated - as far as I know, DOSLFN properly supports  
DBCS. (UTF-8 appears to be easier than DBCS, but I didn't look into the  
details of the latter.)

> In addition, you get a sort of graceful degradation: Tools
> which are not Unicode-aware would treat the strings as if
> they use some unknown codepage. So such tools would think
> that AndrXX where XX is an encoding for an accented e has 6
> characters but at least you can still see the "Andr" in it.
>
> In the other direction, if you accidentally put in a text
> with Latin1 or codepage 858 / 850 encoding, you get AndrY
> where Y is the codepage style encoding of the accented "e"
> and the Y and possibly one char after it would be shown in
> a broken way by a CON driver which expects UTF8 instead.

Arguably, the UTF-8 "compatibility" is worse here: if the string is actually
encoded in some single-byte code page (not DBCS or UTF-8), displaying it in
another code page replaces each non-ASCII character with exactly one wrong
character of the active code page. With UTF-8, non-ASCII characters are
encoded as multi-byte sequences, so a single code-point shows up as several
wrong characters of the active code page.
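To illustrate the difference, here is a minimal Python sketch. The choice of
Latin-1 as the "wrong" source encoding and code page 850 as the active code
page is only an assumption for the example; any pair of differing single-byte
code pages behaves the same way.

```python
text = "André"

# Case 1: the bytes are Latin-1, but the display assumes code page 850.
# Each non-ASCII character turns into exactly one (wrong) character.
as_latin1 = text.encode("latin-1")
shown_latin1_as_cp850 = as_latin1.decode("cp850")

# Case 2: the bytes are UTF-8, but the display assumes code page 850.
# The accented e is two bytes, so it turns into two (wrong) characters.
as_utf8 = text.encode("utf-8")
shown_utf8_as_cp850 = as_utf8.decode("cp850")

print(len(text), len(shown_latin1_as_cp850), len(shown_utf8_as_cp850))
# prints: 5 5 6
```

In both cases the "Andr" prefix survives, but only the single-byte case keeps
the character count intact.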

> I do not understand the "codepoints are 24 bit numbers"
> issue. Unicode chars with numbers above 65535 are very
> exotic in everyday languages

That is why I said it's not that important.

> If you mean UTF8,

No. That would not make sense. A code-point is usually written like  
"U+0038", with 4 to 6 hexadecimal digits that give you the numeric value  
of that code-point. The "character set", Unicode, defines code-points. The  
encoding, UTF-8, defines how (almost) arbitrary numeric values are to be  
encoded into a stream of bytes. UTF-8 support easily scales to support all  
currently reserved code-points which do not fit into a 16-bit number, if  
the underlying interface supports them. (A 21-bit number is large enough  
for all code-points.)
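As a rough sketch of the distinction between code-points (numbers) and the
UTF-8 encoding (bytes), the following hand-written encoder maps a numeric
code-point to 1 to 4 bytes. The function name utf8_encode is made up for this
example; it is not taken from any existing FreeDOS code.

```python
def utf8_encode(cp):
    """Encode one Unicode code-point (an integer) as UTF-8 bytes."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code-point: %r" % (cp,))
    if cp < 0x80:                      # 7 bits: plain ASCII, 1 byte
        return bytes([cp])
    if cp < 0x800:                     # up to 11 bits: 2 bytes
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # up to 16 bits: 3 bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # up to 21 bits: 4 bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# U+0038 is the digit "8"; U+00E9 is the accented e.
print(utf8_encode(0x0038), utf8_encode(0x00E9))
# The largest code-point, U+10FFFF, fits into 21 bits.
print((0x10FFFF).bit_length())
```

Code-points above U+FFFF only add the fourth branch; nothing else in the
encoder changes.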

> I think Mac / Office sometimes might use
> one of the UTF16 encodings but otherwise they are not
> so widespread.

Don't forget FAT's long file names ;-)
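For context: VFAT long file name entries store the name as 16-bit
little-endian code units (UCS-2, treated as UTF-16 by newer implementations),
13 units per directory entry. A small sketch of what that means for a name;
the helper lfn_units is hypothetical and only shows the code units, not the
on-disk entry layout.

```python
def lfn_units(name):
    """Return the 16-bit code units a long file name would be stored as."""
    raw = name.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little")
            for i in range(0, len(raw), 2)]

units = lfn_units("André.txt")
print(["%04X" % u for u in units])

# Each long file name directory entry holds 13 code units (round up).
entries_needed = -(-len(units) // 13)
print(entries_needed)
```

A code-point above U+FFFF takes two such units (a surrogate pair), which is
where the 16-bit storage stops being one unit per character.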

Regards,
Christian
