Hi Christian,

>> Should the translation be "accurate" or should it be "useful"?
That depends a lot on which languages we are talking about.

For DISPLAYING already existing strings, such as file names on some USB stick made by somebody using Linux, MacOS or Windows: if your language is "something Latin", you can get reasonable results with a simplified display which just drops the accent from a character whenever your current codepage does not have the needed accented char but has a similar unaccented one. If you try the same with Russian, you will at least have to switch to a Cyrillic codepage, or maybe have both active at the same time (VGA supports dual codepages: 512 chars). But if our imaginary USB stick contains the Anime collection of your Japanese friend, any attempt to display the file names in any Western or Cyrillic codepage will look really bad.

In the other direction, you may want to GENERATE strings in Unicode. Of course KEYB, MKEYB and similar support switched and local codepages. I assume that DOSLFN, KEYB and DISPLAY can signal each other to let you use a suitable layout and codepage to give your files Cyrillic names, display them in the right way and read/write file names as UTF8 on your USB stick... Somebody should check the documentation for more details ;-).

Yet again, try the same with ASIAN languages: you would need an Input Method driver which lets you type complex key sequences or combinations to enter a language which has more than the usual few dozen chars of alphabet. For CJK languages, you typically also need a wide font; the usual 8 or 9 pixels of width will usually not be enough. So you probably end up using a graphics mode CON driver or some similar system, probably with a relatively big font with at least 100s of different character shapes in RAM, maybe in XMS.

> UTF-8 is independent of byte-order. The exact encoding (and byte-order)
> should always either be implicit (in the interface's or format's
> definition) or be marked in some way. The definition of a string's length
> (possibly number of bytes/words/dwords, number of code-points, number of
> "characters") need not be addressed by such an interface. If there is a
> need for a buffer or string length (see below) a new interface should just
> define that all "length" fields/parameters give the length in bytes.

I would also vote for UTF8: it keeps ASCII strings unchanged, and strings with only a few non-ASCII chars, e.g. strings with accented chars in them, only get a few bytes longer. In addition, you get a sort of graceful degradation: tools which are not Unicode-aware would treat the strings as if they used some unknown codepage. Such a tool would think that AndrXX, where XX is the UTF8 encoding of an accented e, has 6 characters, but at least you can still see the "Andr" in it. In the other direction, if you accidentally feed in a text with Latin-1 or codepage 858 / 850 encoding, you get AndrY, where Y is the codepage style encoding of the accented "e", and the Y and possibly one char after it would be shown in a broken way by a CON driver which expects UTF8 instead.

As you already say, for BETTER compatibility you always have to be aware of whether your string uses UTF8 or codepage encoding. In theory you could also support DBCS or UTF16-LE or similar, but I would vote against those. This awareness means that you know how to RENDER the string (e.g. switch fonts or the mode of the CON driver, or use built-in rendering as in Blocek), how many CHARACTERS and BYTES long the string is, and what ONE CHARACTER is, for example for sorting or when you replace/edit a char.
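To illustrate the CHARACTERS versus BYTES point, here is a minimal sketch in plain C (just an illustration, not tied to any DOS driver or API; the function name is made up). It counts characters in a UTF8 string by skipping the 10xxxxxx continuation bytes, so the "Andr" plus accented e example comes out as 6 bytes but 5 characters:

    #include <stdio.h>
    #include <string.h>

    /* Count Unicode code points in a UTF-8 string by skipping
     * continuation bytes (10xxxxxx). Sketch only: it does not
     * validate the byte sequence. */
    static size_t utf8_chars(const char *s)
    {
        size_t n = 0;
        for (; *s; s++)
            if (((unsigned char)*s & 0xC0) != 0x80) /* not a continuation */
                n++;
        return n;
    }

    int main(void)
    {
        /* "Andr" plus accented e: 0xC3 0xA9 is the UTF-8 form of U+00E9 */
        const char *name = "Andr\xC3\xA9";
        printf("bytes: %u, chars: %u\n",
               (unsigned)strlen(name), (unsigned)utf8_chars(name));
        /* prints: bytes: 6, chars: 5 */
        return 0;
    }

A real implementation would of course also have to validate the sequences and decide what to do with broken ones.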
As said, UTF8 has relatively graceful degradation, but you still want explicit support for the more heavy uses like text editors, playlists, file managers and similar :-)

I do not understand the "codepoints are 24 bit numbers" issue. Unicode chars with numbers above 65535 are very exotic in everyday languages, so I would not even start to support them in DOS. If you mean UTF8, then what you get is 2 bytes for characters from U+0080 to U+07FF and 3 bytes for characters from U+0800 to U+FFFF, so only for chars with numbers above 65535 would you need 4 (or, in the original UTF8 scheme, even more) bytes to encode one character :-) (see the sketch in the P.S. below)

> define what Unicode encoding to use (UTF-8, -16BE, -16LE, -32BE, -32LE)

Luckily UTF8 is quite common, compact and byte order independent. I think Mac / Office sometimes might use one of the UTF16 encodings, but otherwise those are not so widespread. The UTF32 encodings are even VERY rare.

> apps have to figure out on their own what encoding their data uses.

That hopefully only affects text editors ;-)

Regards, Eric
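P.S.: Here is the byte length rule above as a minimal sketch in plain C (again just an illustration, not code from any existing driver; the names are made up). It encodes one code point into the standard 1, 2, 3 or 4 byte UTF8 form:

    #include <stdio.h>

    /* Encode one code point as UTF-8 into buf, returning the number
     * of bytes written (1..4). Sketch only: no handling of surrogate
     * values or code points above U+10FFFF. */
    static int utf8_encode(unsigned long cp, unsigned char buf[4])
    {
        if (cp < 0x80) {            /* U+0000..U+007F: 1 byte  */
            buf[0] = (unsigned char)cp;
            return 1;
        } else if (cp < 0x800) {    /* U+0080..U+07FF: 2 bytes */
            buf[0] = 0xC0 | (unsigned char)(cp >> 6);
            buf[1] = 0x80 | (unsigned char)(cp & 0x3F);
            return 2;
        } else if (cp < 0x10000) {  /* U+0800..U+FFFF: 3 bytes */
            buf[0] = 0xE0 | (unsigned char)(cp >> 12);
            buf[1] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
            buf[2] = 0x80 | (unsigned char)(cp & 0x3F);
            return 3;
        } else {                    /* U+10000 and up: 4 bytes */
            buf[0] = 0xF0 | (unsigned char)(cp >> 18);
            buf[1] = 0x80 | (unsigned char)((cp >> 12) & 0x3F);
            buf[2] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
            buf[3] = 0x80 | (unsigned char)(cp & 0x3F);
            return 4;
        }
    }

    int main(void)
    {
        unsigned char buf[4];
        unsigned long cps[] = { 0x41, 0xE9, 0x444, 0x20AC, 0x1F600 };
        int i;
        for (i = 0; i < 5; i++)
            printf("U+%05lX needs %d byte(s)\n",
                   cps[i], utf8_encode(cps[i], buf));
        /* U+00041: 1, U+000E9: 2, U+00444: 2, U+020AC: 3, U+1F600: 4 */
        return 0;
    }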