> Should the translation be "accurate" or should it be "useful"?

I think it should be accurate for file systems. A "useful" translation
is a good concept for displaying output (maybe even that of the DIR
command) but not for actually working with the file system. Keyboard
input can't map one key to several characters at once (unless you
randomly (-; decide which one to use), so input handling should use
one-to-one translation too.

> From a technical perspective, you will also at a minimum need to concern
> yourself with translating strings vs. translating single characters
> (UniCode strings can/should include an Endian-defining character at the
> beginning, as well as needing to define how the length of the string is
> determined), UTF-8 vs. UTF-16 vs. UTF-32, and Big- vs. Little-endian.
> None of this is trivial, and I think this is WAY too complicated to be
> in the kernel -- it should be a separate program/driver.

UTF-8 is independent of byte-order. The exact encoding (and byte-order)  
should always either be implicit (in the interface's or format's  
definition) or be marked in some way. The definition of a string's length  
(possibly number of bytes/words/dwords, number of code-points, number of  
"characters") need not be addressed by such an interface. If there is a  
need for a buffer or string length (see below) a new interface should just  
define that all "length" fields/parameters give the length in bytes.
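
To illustrate the two points above, here is a rough Python sketch (Python
is only used for demonstration, the sample text is arbitrary, and none of
this is a proposed DOS interface). It shows that UTF-8 yields one byte
sequence regardless of byte order while UTF-16 yields two, and that
"length" differs depending on whether bytes or code-points are counted:

```python
# Illustration only: the same three code-points in several encodings.
text = "A\u00e9\u20ac"  # 'A', U+00E9 (e with acute), U+20AC (euro sign)

utf8 = text.encode("utf-8")        # no byte-order variants exist
utf16le = text.encode("utf-16-le")  # little-endian 16-bit units
utf16be = text.encode("utf-16-be")  # big-endian 16-bit units

# The byte-order mark U+FEFF is one way to "mark" the byte order.
bom_le = "\ufeff".encode("utf-16-le")
bom_be = "\ufeff".encode("utf-16-be")

print("code-points:", len(text))
print("UTF-8   :", utf8.hex(" "), "->", len(utf8), "bytes")
print("UTF-16LE:", utf16le.hex(" "), "->", len(utf16le), "bytes")
print("UTF-16BE:", utf16be.hex(" "), "->", len(utf16be), "bytes")
print("BOM LE/BE:", bom_le.hex(" "), "/", bom_be.hex(" "))
```

Defining every length field in bytes sidesteps the question of which of
these counts is "the" length.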

If there were a DOS (kernel) interface, it should probably accept a single
character (usually one byte, two bytes for DBCS) encoded in the currently
selected code page and return a Unicode code-point. All code-points fit
into a 24-bit (= 3-byte) number (the highest one is U+10FFFF, which needs
21 bits), though such an interface could be limited to Unicode's BMP,
i.e. 16-bit numbers (= words), like the DOSLFN/VC tables. Of course there
should be an "accurate" reverse interface which accepts a 24-bit (or
16-bit) number and returns a one- or two-byte character in the current
code page if one exists for that Unicode code-point.
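
A minimal sketch of what such a pair of single-character calls could look
like, written in Python purely for illustration. The function names, the
bmp_only switch and the use of Python's own cp437/cp932 tables are my
assumptions; a real DOS interface would use registers and the kernel's
or a driver's tables instead:

```python
from typing import Optional

def char_to_codepoint(ch: bytes, codepage: str = "cp437") -> int:
    """Hypothetical forward call: one code-page character -> code-point."""
    s = ch.decode(codepage)  # strict decoding, raises on invalid input
    if len(s) != 1:
        raise ValueError("input must be exactly one character")
    return ord(s)

def codepoint_to_char(cp: int, codepage: str = "cp437",
                      bmp_only: bool = True) -> Optional[bytes]:
    """Hypothetical "accurate" reverse call: exact mapping or nothing."""
    if bmp_only and cp > 0xFFFF:
        return None  # outside the BMP, not representable in 16 bits
    try:
        return chr(cp).encode(codepage)  # strict: no best-fit fallback
    except UnicodeEncodeError:
        return None  # no character for this code-point in the code page

print(hex(char_to_codepoint(b"\x81")))                # u-umlaut in CP437
print(codepoint_to_char(0x00FC))                      # and back again
print(codepoint_to_char(0x20AC))                      # euro: not in CP437
print(hex(char_to_codepoint(b"\x82\xa0", "cp932")))   # two-byte DBCS input
```

The reverse call returns nothing rather than a look-alike character,
which is what makes it usable for file names.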

Notably, some code pages might contain characters that should map to  
several code-points and some code-points might require more than two bytes  
when represented in the current code page's encoding. A string translation  
interface might therefore be more appropriate. (As an aside, this would  
solve the need for a DBCS kludge because multi-byte mappings could be  
supported intrinsically.) In this case, the interface should define
exactly which Unicode encoding to use (UTF-8, -16BE, -16LE, -32BE,
-32LE); applications have to figure out on their own what encoding their
data uses.
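
As a rough sketch of the string variant (again Python for illustration
only; the function names and the encoding labels are hypothetical, and
Python's codecs stand in for whatever tables a driver would ship), the
caller names the Unicode encoding explicitly and all lengths are simply
byte counts of the buffers:

```python
# Explicit list of Unicode encodings the hypothetical interface accepts.
UNICODE_ENCODINGS = {
    "UTF-8": "utf-8",
    "UTF-16BE": "utf-16-be",
    "UTF-16LE": "utf-16-le",
    "UTF-32BE": "utf-32-be",
    "UTF-32LE": "utf-32-le",
}

def codepage_to_unicode(data: bytes, codepage: str, encoding: str) -> bytes:
    """Translate a code-page string into the named Unicode encoding."""
    return data.decode(codepage).encode(UNICODE_ENCODINGS[encoding])

def unicode_to_codepage(data: bytes, encoding: str, codepage: str) -> bytes:
    """Strict reverse translation; raises if a character has no mapping."""
    return data.decode(UNICODE_ENCODINGS[encoding]).encode(codepage)

name = b"gr\x81n"  # "gr", u-umlaut, "n" in CP437
as_utf8 = codepage_to_unicode(name, "cp437", "UTF-8")
as_utf16le = codepage_to_unicode(name, "cp437", "UTF-16LE")
print(len(name), len(as_utf8), len(as_utf16le))  # lengths in bytes

# Mixed single-byte/double-byte input needs no special casing here.
mixed = codepage_to_unicode(b"A\x82\xa0", "cp932", "UTF-8")
print(mixed.hex(" "))
```

Because the input is a whole string, one- and two-byte characters (and
mappings to several code-points) are handled by the same call.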

Regards,
Christian
