Re: [Freedos-devel] ASCII to unicode table
Hi Alex,

> * in some cases best readability is the target [best user experience]
> * in some cases exact string representation is the target [copy+paste, debug]
> * in some other cases you simply want to be fast [viewing text/binary files]
...
> for exact representations any sort of escaped character sequences might be
> used.
> readability instead requires different substitution rules.
...
> for the fast case you don't care about accuracy but only
> send e.g. a dot to the console for anything not simply
> convertible - that's how hex editors have done it for ages.

That is interesting, but I also wonder: Which size of font do people want, and where do they want to process Unicode? In file contents, file names, URLs? Only in a special app (e.g. the Blocek Unicode text editor for DOS) or everywhere? Do they also want to type Unicode? Or maybe use some sort of popup char table to enter Unicode? Or just not type it?

Eric

--
Protect Your Site and Customers from Malware Attacks
Learn about various malware tactics and how to avoid them. Understand malware threats, the impact they can have on your business, and how you can protect your company and customers by using code signing.
http://p.sf.net/sfu/oracle-sfdevnl
___
Freedos-devel mailing list
Freedos-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freedos-devel
Re: [Freedos-devel] ASCII to unicode table
My personal summary on the practical aspect is:

* in some cases best readability is the target
* in some cases exact string representation is the target
* in some other cases you simply want to be fast

In the first case you are probably aiming for the best user experience. In the second case you are targeting e.g. debugging or copy&paste on the shell. In the third case you are probably listing a text or binary file on the console.

For exact representations any sort of escaped character sequences might be used. Readability instead requires different substitution rules. For this it is possible, but not in all cases equally desirable, to change character sets and fonts of the displaying console. And probably the most determining factor: reversing the substitution rules will be ambiguous in most cases - only a few cases (e.g. the German char set with only some 7 extra characters) have no ambiguities. So save yourself a brainer and never do backwards translations.

For the fast case you don't care about accuracy but only send e.g. a dot to the console for anything not simply convertible - that's how hex editors have done it for ages.

regards, Alex.
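The "fast" case Alex describes can be sketched in a few lines of C. This is an illustrative sketch, not code from any FreeDOS tool: every byte that is not plain printable ASCII is shown as a dot, exactly as hex editors have always done.

```c
#include <assert.h>

/* "Fast" console display, hex-editor style: keep printable ASCII
   bytes as-is and substitute a dot for everything else, without
   attempting any accurate character-set translation. */
char display_byte(unsigned char b)
{
    return (b >= 0x20 && b < 0x7F) ? (char)b : '.';
}
```

Accuracy is deliberately sacrificed here; both control bytes and high (codepage-dependent) bytes collapse to the same dot.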
Re: [Freedos-devel] ASCII to unicode table
is DPMI out of the question?

From: Eric Auer
To: freedos-devel@lists.sourceforge.net
Sent: Wed, December 1, 2010 2:59:56 PM
Subject: Re: [Freedos-devel] ASCII to unicode table

> [...]
Re: [Freedos-devel] ASCII to unicode table
Hi Christian,

Using UTF-8 CON with a codepage based app or vice versa is worse for block graphics than just using the wrong codepage, as not only the shape but also the number of displayed characters will change: Everything outside of basic ASCII takes 2 or more bytes in UTF-8, so display on codepage-CON will show block graphics "too wide". In the other direction, sequences can contain invalid groups or invalid start bytes, so trying to show block graphics from codepage apps on UTF8-CON will typically show as many or fewer "bad char" chars than the number of block graphics chars that the app wanted to display.

>> A possible workaround would be dosver-style, to make
>> a per-app decision who uses Unicode.

Because DOS is not multitasking, you do not have to put status flags in the PSP... You just switch to codepage (or whatever default you want) mode when anything exits, and switch to UTF8 mode (...) when either an app starts which you know to be UTF8 tolerant or when a modern app explicitly switches to UTF8 mode.

You are right that a TSR pop-up would not fit in that scheme, BUT as far as I remember, pop-ups always write to the VGA directly, so they cannot use the UTF8 CON. If the UTF8 CON uses graphics mode to render text (because otherwise you can only keep a small 512 char font in hardware), it is very possible that you will not see your TSR pop-up at all.

> I'd propose to use a new interface instead - this new interface
> then always uses UTF-8, the normal one will use code pages (or
> reject CP-dependent characters). (Of course using only ASCII it
> doesn't matter which interface you use.)
What that could mean is having a UNICODE$ char device, similar to the existing MORE$ device which you already know: It forwards text to CON but waits for a keypress after every 25 line breaks, so MORE$ (moresys) shows text immediately, where MORE (the app) has to wait for all text to arrive before starting to show any of it (because DOS does not have real | pipelines...). Well... Coming back from this blatant ad ;-)

A driver which provides this UNICODE$ device could either do a "best effort" translation of incoming text to whatever the current codepage is at that moment, or it could do a graphical rendering of the text. In the latter case, you can only show text while the VGA is in graphics mode, which is acceptable for classic CON (just slow) but which, as said, will break your TSR pop-up text.

> The DOS LFN API works with code page encoded strings.

Wow. Well, at least the DOS LFN directory item data is based on Unicode already. So it could have been worse.

Eric
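A "best effort" translation as Eric describes it could look roughly like the sketch below. The function and the tiny reverse table are hypothetical (no such driver API exists); only the three CP437 mappings shown are real. One UTF-8 sequence is decoded, the code-point is looked up in the current codepage, and anything unmappable or malformed becomes '?':

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reverse lookup for the current codepage (here: three
   real CP437 entries as examples); '?' when no best-effort match. */
static unsigned char cp_from_unicode(uint32_t cp)
{
    if (cp < 0x80)    return (unsigned char)cp;  /* ASCII maps 1:1 */
    if (cp == 0x00E9) return 0x82;               /* e-acute in CP437 */
    if (cp == 0x00FC) return 0x81;               /* u-umlaut in CP437 */
    return '?';
}

/* Decode one UTF-8 sequence starting at s[*i], advance *i past it,
   and return the best-effort codepage byte to send to CON. */
unsigned char best_effort_byte(const unsigned char *s, size_t *i)
{
    unsigned char b = s[*i];
    uint32_t cp;
    int cont;
    if (b < 0x80)                   { cp = b;        cont = 0; }
    else if (b >= 0xC0 && b < 0xE0) { cp = b & 0x1F; cont = 1; }
    else if (b >= 0xE0 && b < 0xF0) { cp = b & 0x0F; cont = 2; }
    else if (b >= 0xF0 && b < 0xF8) { cp = b & 0x07; cont = 3; }
    else { (*i)++; return '?'; }    /* invalid start byte */
    (*i)++;
    while (cont--) {
        if ((s[*i] & 0xC0) != 0x80) return '?';  /* invalid group */
        cp = (cp << 6) | (s[*i] & 0x3F);
        (*i)++;
    }
    return cp_from_unicode(cp);
}
```

Note how an invalid start byte or a broken continuation group yields one "bad char" each, which is exactly why codepage block graphics fed to a UTF8-CON come out as a differing number of characters.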
Re: [Freedos-devel] ASCII to unicode table
> You would need an Input Method driver which lets you type
> complex key sequences or combinations to type in a language
> which has more than the usual few dozen chars of alphabet.

Yes. The (keyboard) input and (screen) output appear to be the most complicated exercise here. DBCS or UTF-8 support inside other programs would appear less complicated - as far as I know, DOSLFN properly supports DBCS. (UTF-8 appears to be easier than DBCS, but I didn't look into the details of the latter.)

> In addition, you get a sort of graceful degradation: Tools
> which are not Unicode-aware would treat the strings as if
> they use some unknown codepage. So such tools would think
> that AndrXX where XX is an encoding for an accented e has 6
> characters but at least you can still see the "Andr" in it.
>
> In the other direction, if you accidentally put in a text
> with Latin1 or codepage 858 / 850 encoding, you get AndrY
> where Y is the codepage style encoding of the accented "e"
> and the Y and possibly one char after it would be shown in
> a broken way by a CON driver which expects UTF8 instead.

Arguably, the UTF-8 "compatibility" is worse here: with the actual encoding in any code page (not DBCS or UTF-8), displaying the string in another code page will replace each non-ASCII character by one random character of the active code page. With UTF-8, non-ASCII characters are encoded as multi-byte sequences - resulting in several random characters of the active code page where actually only one code-point is encoded.

> I do not understand the "codepoints are 24 bit numbers"
> issue. Unicode chars with numbers above 65535 are very
> exotic in everyday languages

That is why I said it's not that important.

> If you mean UTF8,

No. That would not make sense. A code-point is usually written like "U+0038", with 4 to 6 hexadecimal digits that give you the numeric value of that code-point. The "character set", Unicode, defines code-points.
The encoding, UTF-8, defines how (almost) arbitrary numeric values are encoded into a stream of bytes. UTF-8 support easily scales to all currently reserved code-points, including those which do not fit into a 16-bit number, if the underlying interface supports them. (A 21-bit number is large enough for all code-points.)

> I think Mac / Office sometimes might use
> one of the UTF16 encodings but otherwise they are not
> so widespread.

Don't forget FAT's long file names ;-)

Regards, Christian
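The bytes-versus-code-points distinction in this exchange (several bytes of a UTF-8 sequence encode one code-point, so naive tools miscount "André" as 6 characters) can be sketched as follows. This is an illustrative helper, not part of any discussed interface, and it assumes its input is valid UTF-8:

```c
#include <assert.h>
#include <stddef.h>

/* Count code-points in a UTF-8 string by skipping the 10xxxxxx
   continuation bytes: every code-point has exactly one byte that
   is NOT of the form 10xxxxxx. */
size_t utf8_chars(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)  /* not a continuation byte */
            n++;
    return n;
}
```

A Unicode-unaware tool using strlen() would report 6 for "Andr\xC3\xA9"; counting code-points gives 5.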
Re: [Freedos-devel] ASCII to unicode table
> Combined with, for example, a UTF-8 enabled Super-NANSI to
> make the step from strings to their display, of course. The
> problem would be loss of "ASCII" art block graphics in apps
> which are not using Unicode.

But that happens for some code pages anyway. (For example, CPs 858 and 850 drop some of the CP 437 block graphics. CPs that need more characters probably drop all of them.)

> A possible workaround would be
> dosver-style, to make a per-app decision who uses Unicode.
>
> [...]
>
> Some old apps will only use ASCII anyway which is the same
> for real ASCII and for UTF8 but some others will assume a
> codepage (often 437) to be active. The block graphics and
> other chars from the non-ASCII half of any codepage differ
> in encoding from UTF8 so, as said, any display or similar
> driver would need some way to switch between "classic code
> page mode" and "UTF8 rendering mode". It could switch on
> UTF8 based on explicit request from a modern app or based
> on app name for old but known compatible apps... It would
> switch off UTF8 when any app exits (int 21.4c / 21.31...).

I don't like such an approach. You would have to keep the current status in a PSP field. And even then, pop-up TSRs might *interrupt* the currently running process (without switching the PSP or saving/restoring other fields). One of the TSRs I regularly use displays its pop-up using block graphics.

I'd propose to use a new interface instead - this new interface then always uses UTF-8; the normal one will use code pages (or reject CP-dependent characters). (Of course using only ASCII it doesn't matter which interface you use.)

> If yes, I do
> assume that the LFN API already is explicit about whether
> UTF8 or rather codepage style encoding should be used?

The DOS LFN API works with code page encoded strings.

Regards, Christian
Re: [Freedos-devel] ASCII to unicode table
On Wed, 1 Dec 2010, Eric Auer wrote:

> Compatible apps would be apps which only display ASCII out
> of themselves and which make no serious assumptions about
> one byte being equal to one character. A good example are
> MORE and TYPE: If you TYPE an UTF8 text with a special CON
> driver which expects and renders UTF8, it will simply work
> because TYPE passes the text file 1:1 and only uses plain
> ASCII for built-in messages, if any. A good counter example
> are PG and EDIT: They make the byte-is-character assumption
> for scrolling (in particular horizontal scrolling) and EDIT
> uses block graphics chars of codepages. So you have to put
> your CON driver in NON-Unicode mode while using EDIT or PG.

Kind of like "chev us" or "chev jp" in DOS/V.

> Or is the idea to have "Unicode everywhere", even in the
> PrintScreen hotkey, TREE, Undelete, the volume label for
> SYS / FORMAT / VOL / LABEL, tools like FIND or DEBUG...?

Prolly. Though for PrtSc, isn't that what GRAPHICS is for?

-uso.
Re: [Freedos-devel] ASCII to unicode table
Hi Christian,

>> Should the translation be "accurate" or should it be "useful"?

That depends a lot on which languages we are talking about. For DISPLAYING already existing strings, such as file names on some USB stick made by somebody using Linux, MacOS or Windows: if your language is "something Latin", you can get reasonable results with a simplified display which just drops accents from characters if your current codepage does not have the needed accented char but has a similar one. If you try the same with Russian, you will at least have to switch to a Cyrillic codepage, or maybe have both active at the same time (VGA supports dual codepages: 512 chars). But if our imaginary USB stick contains the Anime collection of your Japanese friend, any attempt to display the file names in any western or Cyrillic codepage will look really bad.

In the other direction, you may want to GENERATE strings in Unicode. Of course KEYB, MKEYB and similar support switched and local codepages. I assume that DOSLFN, KEYB and DISPLAY can signal each other to let you use a suitable layout and codepage to give your files Cyrillic names, display them in the right way and read/write file names as UTF8 on your USB stick... Somebody should check the documentation for more details ;-).

Yet again, try the same with ASIAN languages: You would need an Input Method driver which lets you type complex key sequences or combinations to type in a language which has more than the usual few dozen chars of alphabet. For CJK languages, you typically also need a wide font; the usual 8 or 9 pixels of width will usually not be enough. So you probably end up using a graphics mode CON driver or a similar system, probably with a relatively big font with at least 100s of different character shapes in RAM, maybe XMS.

> UTF-8 is independent of byte-order. The exact encoding (and byte-order)
> should always either be implicit (in the interface's or format's
> definition) or be marked in some way.
> The definition of a string's length
> (possibly number of bytes/words/dwords, number of code-points, number of
> "characters") need not be addressed by such an interface. If there is a
> need for a buffer or string length (see below) a new interface should just
> define that all "length" fields/parameters give the length in bytes.

I would also vote for UTF8: It keeps ASCII strings unchanged, and strings with only a few non-ASCII chars will only get a few bytes longer, e.g. strings with accented chars in them.

In addition, you get a sort of graceful degradation: Tools which are not Unicode-aware would treat the strings as if they use some unknown codepage. So such tools would think that AndrXX, where XX is an encoding for an accented e, has 6 characters, but at least you can still see the "Andr" in it. In the other direction, if you accidentally put in a text with Latin1 or codepage 858 / 850 encoding, you get AndrY, where Y is the codepage style encoding of the accented "e", and the Y and possibly one char after it would be shown in a broken way by a CON driver which expects UTF8 instead.

As you already say, for BETTER compatibility you always have to be aware whether your string uses UTF8 or codepage encoding. In theory you could also support DBCS or UTF16-LE or similar, but I would vote against those. This awareness will mean that you know how to RENDER the string (e.g. switch fonts or mode of the CON driver, or use a built-in rendering as in Blocek) and how many CHARACTERS and BYTES the string is long and what is ONE CHARACTER, for example for sorting or when you replace/edit a char. As said, UTF8 has relatively graceful degradation, but you still want explicit support for more heavy uses like text editors, playlists, file managers and similar :-)

I do not understand the "codepoints are 24 bit numbers" issue. Unicode chars with numbers above 65535 are very exotic in everyday languages, so I would not even start to support them in DOS.
If you mean UTF8, then what you get is 2 bytes for characters from U+0080 to U+07FF and 3 bytes for characters from U+0800 to U+FFFF - so only for chars with numbers above 65535 would you need 4 or even more bytes to UTF8 encode one character :-)

> define what Unicode encoding to use (UTF-8, -16BE, -16LE, -32BE, -32LE)

Luckily UTF8 is quite common, compact and byte order independent. I think Mac / Office sometimes might use one of the UTF16 encodings, but otherwise they are not so widespread. The UTF32 encodings are even VERY rare.

> apps have to figure out on their own what encoding their data uses.

That hopefully only affects text editors ;-)

Regards, Eric
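The byte counts Eric lists can be made concrete with a small encoder. A sketch (the function name is made up; for valid Unicode the 4-byte form up to U+10FFFF is the maximum, even though the original UTF-8 design reserved longer forms):

```c
#include <assert.h>
#include <stdint.h>

/* Encode one code-point as UTF-8: 1 byte up to U+007F, 2 bytes up to
   U+07FF, 3 bytes up to U+FFFF, 4 bytes up to U+10FFFF.
   Returns the number of bytes written, or 0 if out of range. */
int utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;  /* beyond U+10FFFF: not a valid code-point */
}
```

ASCII stays a single unchanged byte, which is the compatibility property the whole thread keeps coming back to.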
Re: [Freedos-devel] ASCII to unicode table
Hi Christian,

> Just noticing that this grows quite large. If someone finds this
> unbearable for this list, please speak up to let me know I should cut down
> the off-topic stuff on my public mails!

No problem :-) I would hope that people talk more about the "big font" approaches - having either a big Unicode font in XMS or maybe a 512 char double code page in the VGA card... Combined with, for example, a UTF-8 enabled Super-NANSI to make the step from strings to their display, of course. The problem would be the loss of "ASCII" art block graphics in apps which are not using Unicode. A possible workaround would be dosver-style, to make a per-app decision who uses Unicode.

I do not think that you could trust the data for this. Even on Linux, where Unicode is quite common now, usage of the BOM is rare. People try to keep their set of apps consistent to use either UTF-8 everywhere or Latin1 everywhere, or (preferred) use whichever the LANG etc environment variables select at the moment the app starts. Given that DOS has many old unmaintained apps, you will have to accept mixing in DOS: Some old apps will only use ASCII anyway, which is the same for real ASCII and for UTF8, but some others will assume a codepage (often 437) to be active. The block graphics and other chars from the non-ASCII half of any codepage differ in encoding from UTF8, so, as said, any display or similar driver would need some way to switch between "classic code page mode" and "UTF8 rendering mode". It could switch on UTF8 based on an explicit request from a modern app, or based on the app name for old but known compatible apps... It would switch off UTF8 when any app exits (int 21.4c / 21.31...).

Compatible apps would be apps which only display ASCII out of themselves and which make no serious assumptions about one byte being equal to one character.
A good example are MORE and TYPE: If you TYPE a UTF8 text with a special CON driver which expects and renders UTF8, it will simply work, because TYPE passes the text file 1:1 and only uses plain ASCII for built-in messages, if any. A good counter-example are PG and EDIT: They make the byte-is-character assumption for scrolling (in particular horizontal scrolling), and EDIT uses block graphics chars of codepages. So you have to put your CON driver in NON-Unicode mode while using EDIT or PG.

As a general question - I would really like to know for WHICH APPS people want to have Unicode support. Is this only about proper display of playlists in MPXPLAY and of CD, USB or local accented filenames in any file manager? Is the issue also in general command.com style activity, probably depending on DOSLFN being present? If yes, I do assume that the LFN API already is explicit about whether UTF8 or rather codepage style encoding should be used? Are text editors also a case which should support Unicode, and if yes, why do you not use for example Blocek then? Or is the idea to have "Unicode everywhere", even in the PrintScreen hotkey, TREE, Undelete, the volume label for SYS / FORMAT / VOL / LABEL, tools like FIND or DEBUG...?

Eric
Re: [Freedos-devel] ASCII to unicode table
> I think your attitude is not very constructive. We have to keep this
> idea as simple as possible or nobody implements it.

I think some of that is important, even if you only want to implement a simple translation. Besides, of course it isn't very constructive to *discuss* an idea. Go use DOSLFN's source (free/PD) and implement an interface if you want to be constructive; there should be enough pointers here by now.

> I think it is not needed to make tables UNICODE to ASCII.
> It is sufficient to make ASCII to UNICODE.

Please be specific; I think what you are saying is not what you mean. I assume that when you say "ASCII" you mean "current code page", because ASCII to Unicode (and the reverse) translation doesn't require any table at all. Strictly, ASCII contains a set of 128 codes - these all have the same numeric value as the associated Unicode code-points.

You might be proposing that the implementation should be, as Bret put it, "accurate" - i.e. it should only map exact matches, ignoring "pairs of characters that look similar enough" (Bret's "useful"). The literal sense of your words is that the implementation should be unable (!) to look up what a particular Unicode code-point should be mapped to in the current code page (only accurate matches). This is undesirable, as it would unnecessarily hinder many applications.

> Simple table - on one side 256 bytes - on second side 256 words.
> That is all.

You actually need only 128 words for what you have in mind - the lower 128 word table entries can be dropped, because the ASCII characters/bytes always map directly to Unicode code-points. The byte table (containing the associated byte in the current code page) can be dropped entirely, because its contents would just count upward. (That table format matches what DOSLFN uses for simple (256-character) code pages. DBCS mapping needs to be a lot more complicated. Though you might not care, I suggest consulting DOSLFN's source if one is interested in DBCS mapping.)
As I mentioned, with a table consisting of (16-bit) words for the Unicode side, you cannot map all Unicode code-points. Granted, this is not very important in practice.

Regards, Christian
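Christian's 128-word table layout can be sketched as follows. Only the three CP437 entries shown are real mappings; the rest of the table is elided here (unset entries fall back to U+FFFD in the lookup), and the names are illustrative, not DOSLFN's actual format:

```c
#include <assert.h>
#include <stdint.h>

/* Upper half of a simple codepage-to-Unicode table: bytes < 0x80 are
   ASCII and map 1:1 to Unicode, so only 128 words are needed.
   Entries 0x80..0x82 are real CP437 mappings; the rest is elided. */
static const uint16_t cp437_hi[128] = {
    [0x00] = 0x00C7,  /* byte 0x80: LATIN CAPITAL LETTER C WITH CEDILLA */
    [0x01] = 0x00FC,  /* byte 0x81: LATIN SMALL LETTER U WITH DIAERESIS */
    [0x02] = 0x00E9,  /* byte 0x82: LATIN SMALL LETTER E WITH ACUTE */
    /* ... remaining 125 entries would follow in a real table ... */
};

uint16_t cp_to_unicode(unsigned char b)
{
    if (b < 0x80)
        return b;  /* ASCII: identical code-point, no table needed */
    /* elided entries are zero; report U+FFFD (replacement char) */
    return cp437_hi[b - 0x80] ? cp437_hi[b - 0x80] : 0xFFFD;
}
```

The 16-bit words are exactly why such a table cannot reach code-points above U+FFFF, as noted above.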
Re: [Freedos-devel] ASCII to unicode table
I think your attitude is not very constructive. We have to keep this idea as simple as possible or nobody implements it.

I think it is not needed to make tables UNICODE to ASCII. It is sufficient to make ASCII to UNICODE. Simple table - on one side 256 bytes - on second side 256 words. That is all.
Re: [Freedos-devel] ASCII to unicode table
> UniCode is not the panacea it's purported to be. No, but you have to give them that it's certainly an improvement. >> UTF-8 is independent of byte-order. The exact encoding (and byte-order) >> should always either be implicit (in the interface's or format's >> definition) or be marked in some way. > > I don't think there is a way to automatically determine the encoding from > the data itself, Yes, you cannot reliably automatically determine encoding. That's why I said you should *know* what data you deal with. (Automatic determination of encoding is a serious problem in dealing with plain text files, but that need not concern a kernel code translation interface such as the one I have in mind.) > and the only way to determine the byte-order (assuming it's > not UTF-8, not a single character, and is unknown from the context) is to > include the special BOM (Byte Order Mark) character as the first > character > of the string. Yes. > In fact, according to the UniCode spec, if the BOM is not > included and the byte-order is not clear from the context, you're > supposed > to assume big-endian. I don't know about that. But I guess that is the case if you say so. > For file system and similar applications, the interface could just always > assume a specific format (probably either UTF-8 or UTF-16LE). Yes. For example, the (in)famous FAT "long" file names are stored in UTF-16LE. Their length is determined by their ASCIZ ("UTF-16LZ") nature ie they are terminated by a 16-bit word of the value zero. If a file system interface (such as Int21/Int21.71) was to be made Unicode-capable I would probably use UTF-8. (Particularly because of the ASCII compatibility, where only characters >= 80h ("codepage-dependent" so to speak) represent code-points >= U+0080.) > For a > general-purpose interface, though, you should be able to handle all > different kinds of possibilities (including things like "UTF-24" and > "UTF-64"). UTF-24 would be pretty funny. (FAT24 is an actual idea I had. 
Would work well enough.) Even theoretically, UTF-64 doesn't make a lot of sense: a 24-bit (let alone 32-bit) encoding can already represent more values than are currently reserved for all Unicode code-points. Alignment of each single code-point is no particularly good reason to unnecessarily double (you might speak of "bloat" (-; ) the space required to store any given string. 64-bit alignment of the whole string can still be achieved by storing an unused dword behind the actual string if it contains an odd number of dwords; accesses can be aligned by always accessing a whole qword then selecting the appropriate dword and discarding the other. > Also, even though you're dealing with DOS doesn't necessarily > mean everything will be little-endian -- it depends on the source of the > data. Certain hardware interfaces (like SCSI) are inherently big-endian, > and data downloaded from external sources can be almost anything. Yeah. > Another possibility is what my UNI2ASCI program does, which is accept > strings terminated with a specific character (in my case, the UniCode NUL > character, conceptually similar to ASCIIZ). A general-purpose program > should provide more than one way to define a string's length. I guess specifying the length in bytes is good enough. If you want to provide such an interface NUL-terminated (or CP/M-style dollar-terminated (-; ) strings, write a wrapper function which counts the number of non-NUL bytes/words/tri-bytes/dwords/qwords before passing the string to that interface. For non-UTF-8 Unicode encodings, a number of bytes not divisible by the length of the expected units (2, 3, 4, 8) could just cause an error. Generally speaking, error handling is important. Correct UTF-8 validation isn't pretty though. > If you limit > input to only certain encodings or byte-orders or string/character types, > then it ceases to be "general-purpose". 
> Maybe a general-purpose program is
> not what we're really talking about here, but I think one needs to be
> developed.

Yes, yes. I don't think a general-purpose translation program is what was
initially suggested (correct me though).

Regards,
Christian

Just noticing that this grows quite large. If someone finds this unbearable
for this list, please speak up to let me know I should cut down the
off-topic stuff on my public mails!

--
Increase Visibility of Your 3D Game App & Earn a Chance To Win $500! Tap
into the largest installed PC base & get more eyes on your game by
optimizing for Intel(R) Graphics Technology. Get started today with the
Intel(R) Software Partner Program. Five $500 cash prizes are up for grabs.
http://p.sf.net/sfu/intelisp-dev2dev
___
Freedos-devel mailing list
Freedos-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freedos-devel
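Christian's aside above that "correct UTF-8 validation isn't pretty" is easy to illustrate: a strict validator must reject truncated sequences, overlong encodings, UTF-16 surrogates and values above U+10FFFF, which forces per-lead-byte range checks. The following is an editor's illustrative sketch (not code from any program in this thread); the ranges follow the Unicode standard's table of well-formed byte sequences.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Strict UTF-8 validation: rejects truncated sequences, overlong
    encodings, UTF-16 surrogates and code points above U+10FFFF."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                     # 1-byte sequence (ASCII)
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:            # 2-byte; C0/C1 would be overlong
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:                  # 3-byte, guard against overlong
            need, lo, hi = 2, 0xA0, 0xBF
        elif b == 0xED:                  # 3-byte, exclude surrogates
            need, lo, hi = 2, 0x80, 0x9F
        elif 0xE1 <= b <= 0xEF:          # remaining 3-byte leads
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xF0:                  # 4-byte, guard against overlong
            need, lo, hi = 3, 0x90, 0xBF
        elif 0xF1 <= b <= 0xF3:          # 4-byte
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:                  # 4-byte, cap at U+10FFFF
            need, lo, hi = 3, 0x80, 0x8F
        else:
            return False                 # 80..C1, F5..FF: never a lead byte
        if i + need >= n:                # not enough continuation bytes
            return False
        if not lo <= data[i + 1] <= hi:  # first continuation: special range
            return False
        if any(not 0x80 <= data[i + j] <= 0xBF for j in range(2, need + 1)):
            return False
        i += need + 1
    return True
```

Note how the first continuation byte needs a different range depending on the lead byte; that asymmetry is most of what makes the check "not pretty".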
Re: [Freedos-devel] ASCII to unicode table
Christian Masloch wrote:
>
> I think it should be accurate for file systems. Such a "useful"
> translation is a good concept for displaying output (maybe even that of
> the DIR command) but not for actually working with the file system.
> Keyboard input can't map one key to several characters at once (unless
> you randomly (-; decide which one to use) so input handling should use
> one-to-one translation too.
>

Agreed. Just further fuel to the fire that both types of translations are
needed (depending on the specific application, even if the application is
"the kernel"), and that this is not a trivial matter. Unicode is not the
panacea it's purported to be.

Christian Masloch wrote:
>
> UTF-8 is independent of byte-order. The exact encoding (and byte-order)
> should always either be implicit (in the interface's or format's
> definition) or be marked in some way.
>

I don't think there is a way to automatically determine the encoding from
the data itself, and the only way to determine the byte-order (assuming
it's not UTF-8, not a single character, and is unknown from the context) is
to include the special BOM (Byte Order Mark) character as the first
character of the string. In fact, according to the Unicode spec, if the BOM
is not included and the byte-order is not clear from the context, you're
supposed to assume big-endian.

For file system and similar applications, the interface could just always
assume a specific format (probably either UTF-8 or UTF-16LE). For a
general-purpose interface, though, you should be able to handle all
different kinds of possibilities (including things like "UTF-24" and
"UTF-64"). Also, even though you're dealing with DOS doesn't necessarily
mean everything will be little-endian -- it depends on the source of the
data. Certain hardware interfaces (like SCSI) are inherently big-endian,
and data downloaded from external sources can be almost anything.
Christian Masloch wrote:
>
> The definition of a string's length (possibly number of
> bytes/words/dwords, number of code-points, number of "characters") need
> not be addressed by such an interface. If there is a need for a buffer
> or string length (see below) a new interface should just define that all
> "length" fields/parameters give the length in bytes.
>

Another possibility is what my UNI2ASCI program does, which is to accept
strings terminated with a specific character (in my case, the Unicode NUL
character, conceptually similar to ASCIIZ). A general-purpose program
should provide more than one way to define a string's length. If you limit
input to only certain encodings or byte-orders or string/character types,
then it ceases to be "general-purpose". Maybe a general-purpose program is
not what we're really talking about here, but I think one needs to be
developed.

Bret

--
View this message in context:
http://old.nabble.com/ASCII-to-unicode-table-tp3031p30341668.html
Sent from the FreeDOS - Dev mailing list archive at Nabble.com.
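The BOM mechanism discussed above can be sketched as follows. The hedge is built into the discussion itself: without a BOM the encoding genuinely cannot be detected from the data, so this helper only reports what an explicit BOM says and returns None otherwise (per the spec, the caller would then assume big-endian). Function name and table are invented for illustration.

```python
# BOM patterns, longest first: the UTF-32LE BOM (FF FE 00 00) starts with
# the UTF-16LE BOM (FF FE), so order matters when matching prefixes.
BOMS = [
    (b"\x00\x00\xFE\xFF", "UTF-32BE"),
    (b"\xFF\xFE\x00\x00", "UTF-32LE"),
    (b"\xEF\xBB\xBF",     "UTF-8"),
    (b"\xFE\xFF",         "UTF-16BE"),
    (b"\xFF\xFE",         "UTF-16LE"),
]

def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is
    no BOM (in which case the encoding must be known from the context)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```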
Re: [Freedos-devel] ASCII to unicode table
> Should the translation be "accurate" or should it be "useful"?

I think it should be accurate for file systems. Such a "useful" translation
is a good concept for displaying output (maybe even that of the DIR
command) but not for actually working with the file system. Keyboard input
can't map one key to several characters at once (unless you randomly (-;
decide which one to use) so input handling should use one-to-one
translation too.

> From a technical perspective, you will also at a minimum need to concern
> yourself with translating strings vs. translating single characters
> (Unicode strings can/should include an Endian-defining character at the
> beginning, as well as needing to define how the length of the string is
> determined), UTF-8 vs. UTF-16 vs. UTF-32, and Big- vs. Little-endian.
> None of this is trivial, and I think this is WAY too complicated to be
> in the kernel -- it should be a separate program/driver.

UTF-8 is independent of byte-order. The exact encoding (and byte-order)
should always either be implicit (in the interface's or format's
definition) or be marked in some way.

The definition of a string's length (possibly number of bytes/words/dwords,
number of code-points, number of "characters") need not be addressed by
such an interface. If there is a need for a buffer or string length (see
below) a new interface should just define that all "length"
fields/parameters give the length in bytes.

If there were a DOS (kernel) interface, it should probably accept a single
character (usually one byte, two bytes for DBCS) encoded in the currently
selected code page and return a Unicode code-point. All code-points fit
into a 24-bit (= 3-byte) number; though such an interface can be limited to
Unicode's BMP (16-bit numbers (= words)) like the DOSLFN/VC tables.
Of course there should be an "accurate" reverse interface which accepts a
24-bit (or 16-bit) number and returns a one- or two-byte character in the
current code page if one exists for that Unicode code-point. Notably, some
code pages might contain characters that should map to several code-points,
and some code-points might require more than two bytes when represented in
the current code page's encoding. A string translation interface might
therefore be more appropriate. (As an aside, this would solve the need for
a DBCS kludge because multi-byte mappings could be supported
intrinsically.) In this case, the interface should exactly define what
Unicode encoding to use (UTF-8, -16BE, -16LE, -32BE, -32LE) - applications
have to figure out on their own what encoding their data uses.

Regards,
Christian
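In sketch form, the "accurate" forward and reverse translation outlined above could look like the following. The few CP437 entries shown are real mappings, but the table is deliberately tiny; a real implementation would hold all 128 high characters, and the function names are invented for illustration.

```python
# Tiny excerpt of a CP437-to-Unicode table; a full table would hold 128
# entries for codepage bytes 80h-FFh (bytes 00h-7Fh are plain ASCII).
CP437_HIGH = {
    0x80: 0x00C7,  # C with cedilla
    0x81: 0x00FC,  # u with diaeresis
    0x82: 0x00E9,  # e with acute
    0xE1: 0x00DF,  # sharp s
}

def cp_to_unicode(byte: int) -> int:
    """'Accurate' forward translation: codepage byte -> code point."""
    if byte < 0x80:
        return byte                    # ASCII range maps to itself
    return CP437_HIGH[byte]            # KeyError = table entry not loaded

# Reverse table built from the forward one, so the round trip is exact.
UNICODE_TO_CP437 = {cp: b for b, cp in CP437_HIGH.items()}

def unicode_to_cp(codepoint: int):
    """Reverse translation; None if the code point has no representation
    in the current codepage (the caller decides what to do then)."""
    if codepoint < 0x80:
        return codepoint
    return UNICODE_TO_CP437.get(codepoint)
```

Because both directions come from the same table, the translation stays one-to-one; the many-to-one "useful" case discussed in the thread needs a separate, lossy table.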
Re: [Freedos-devel] ASCII to unicode table
I think there's an even larger question than the technical implementation;
in summary: should the translation be "accurate" or should it be "useful"?
Officially, I believe there is a precise one-to-one relationship between
ASCII and Unicode, but there are dozens of Unicode characters that "look
like" each ASCII character. In my UNI2ASCI program (included with my USB
drivers) the translation tables, perhaps, go overboard. If it receives a
Unicode character that looks (to me) to be close enough to one of the
ASCII characters, or a "string" of ASCII characters, that I think it can
be "reasonably" represented on screen, it gets translated. UNI2ASCI only
translates one way (Unicode to ASCII), only works with Code Page 437, and
is not one-to-one (a single Unicode character may be translated into a
"string" of ASCII characters).

Keeping this type of translation table totally in memory is probably
impractical because of the amount of memory that would be needed. However,
I think this type of translation should at least be an option available to
the user.

***

From a technical perspective, you will also at a minimum need to concern
yourself with translating strings vs. translating single characters
(Unicode strings can/should include an Endian-defining character at the
beginning, as well as needing to define how the length of the string is
determined), UTF-8 vs. UTF-16 vs. UTF-32, and Big- vs. Little-endian. None
of this is trivial, and I think this is WAY too complicated to be in the
kernel -- it should be a separate program/driver.

Bret

--
View this message in context:
http://old.nabble.com/ASCII-to-unicode-table-tp3031p30335092.html
Sent from the FreeDOS - Dev mailing list archive at Nabble.com.
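The kind of "useful" lookalike translation described above could be sketched like this. The substitutions here are illustrative guesses, not UNI2ASCI's actual tables; note that a single code point may become a string of ASCII characters, and that the result is not reversible ("..." could equally have been three literal dots), which is exactly the ambiguity raised elsewhere in the thread.

```python
# Lossy Unicode -> ASCII lookalike substitutions (illustrative only).
LOOKALIKES = {
    0x2018: "'",    # left single quotation mark
    0x2019: "'",    # right single quotation mark
    0x201C: '"',    # left double quotation mark
    0x201D: '"',    # right double quotation mark
    0x2026: "...",  # horizontal ellipsis: one code point -> three chars
    0x00BD: "1/2",  # vulgar fraction one half
}

def useful_translate(text: str, fallback: str = ".") -> str:
    """Render text as displayable ASCII: pass ASCII through, substitute
    known lookalikes, and fall back to a dot (the hex-editor trick)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        else:
            out.append(LOOKALIKES.get(cp, fallback))
    return "".join(out)
```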
Re: [Freedos-devel] ASCII to unicode table
On Sat, 27 Nov 2010, Eric Auer wrote:
> You could even have a separately loaded CON driver that
> keeps a full unicode font in XMS (with some caching of
> recently used sections in faster memory maybe?).

That would be something like what DOS/V does. It switches to VGA 640x480
mode and emulates the standard 80x25 console (with a larger 8x19/16x19
font), with the font stored in XMS. It would be quite difficult, I suppose,
to implement, and all the TUI software would break unless it was coded to
expect the possibility of a console being run in Unicode mode, so it would
be necessary to be able to turn it on and off at will (again, like DOS/V).

-uso.
Re: [Freedos-devel] ASCII to unicode table
> Now programs do it themselves by looking into their own datafiles with a
> .TBL extension. Look at DOSLFN or Volkov Commander 4.99. They have a few
> files like cp852uni.tbl, cp866uni.tbl and so on.
> It is a very good solution but the problem is that there is no way now
> to determine which file should be used.

At least DOSLFN queries DOS for the currently used codepage and tries to
load that table. This query is in its Int21 handler so it will catch
codepage changes and try to load the new table then.

> It fully relies on manual configuration.

No.

> Another point is that ASCII-Unicode conversion should be somewhat
> treated by the OS, I think. I think it is not smart if every Unicode
> program has its own TBL library. There should be one somewhere in the
> FreeDOS directories.

Yes.

> So how to solve it?
> * let the user call the function for international info, and by the
> returned codepage manually decide which .TBL file to use?

As currently done by DOSLFN.

> * should .TBL files be in the LANG or NLSPATH environment variable?

A centralized location might be useful. It might also be possible to
create a file format where several tables share one file. I think such a
format could be a COUNTRY.SYS extension without breaking other users of
that file.

> * somehow extend the kernel function for international info to say
> which .TBL files to use?
> * preload the .TBL into memory in COUNTRY initialization and even more
> extend international info to provide ASCII-Unicode conversion?

Both would be useful. Such a table (if limited to Unicode's BMP, as
DOSLFN's format currently is) needs 256 bytes plus some info like what
codepage the currently loaded table corresponds to.

Regards,
Christian
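Assuming a raw layout of 128 little-endian words covering codepage bytes 80h-FFh (which matches the 256-byte figure quoted above; the actual DOSLFN .tbl file format may differ in header details), consuming such a table could be sketched as:

```python
import struct

def load_tbl(raw: bytes):
    """Parse a DOSLFN-style codepage-to-Unicode table body: 128
    little-endian words giving the BMP code point for each codepage byte
    80h-FFh. Layout assumed for illustration; real files may carry
    additional header data."""
    if len(raw) != 256:
        raise ValueError("expected 128 UTF-16LE entries (256 bytes)")
    return list(struct.unpack("<128H", raw))

def to_unicode(cp_bytes: bytes, table) -> str:
    """Translate a codepage-encoded byte string to a Unicode string:
    bytes below 80h are ASCII, the rest go through the loaded table."""
    return "".join(chr(b) if b < 0x80 else chr(table[b - 0x80])
                   for b in cp_bytes)
```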
Re: [Freedos-devel] ASCII to unicode table
Hi Ladislav,

> I think we should discuss how to implement unicode.

There already is some interface for double-byte chars in DOS, which we
could implement. However, it was made for Chinese as far as I remember and
needed support by more drivers even if you had a DBCS-enabled DOS version.

> In fact only one small thing is necessary: we need a mechanism
> for translating unicode chars into ASCII chars and vice versa.

Technically speaking, that translation is "chars 0-7f of unicode are
ASCII, the rest are not". What you probably mean is "for any _other_
unicode char, if char 80-ff of the current font codepage looks similar
enough, display that char"... Which is very limited, given that unicode
has 1000s of chars while any codepage only has max 128 non-ASCII chars for
you.

You could also support unicode for strings which are relevant for DOS...
That would probably mean that you allow UTF8 in filenames. You could even
use it without changing the kernel, if it is okay for you that search
wildcards match a byte and not necessarily a character. The rest would
depend on the ability of your CON driver to show UTF8 properly as far as
the current font allows.

You could even have a separately loaded CON driver that keeps a full
unicode font in XMS (with some caching of recently used sections in faster
memory maybe?). Note that many programs do not use CON, in particular if
they want to have user interfaces with fancy layout. For example text
editors do not normally use CON in DOS, but you could have one which uses
CON and needs NANSI. Actually you would want an UTF8-enabled super NANSI
:-)

> Now programs do it themselves by looking into their own datafiles with a
> .TBL extension. Look at DOSLFN or Volkov Commander 4.99. They have a few
> files like cp852uni.tbl, cp866uni.tbl and so on.

As said above, that only allows you to display very few unicode chars -
those which happen to be supported by your current codepage font. Still
useful, of course.
Be aware that UTF8, or unicode in general, needs more bytes per character,
so outside the LFN world, file names can reach their limit at fewer than
8+3 chars. But then, it is easy to load DOSLFN.

> It is a very good solution but the problem is that there is no
> way now to determine which file should be used.

There is. DISPLAY has an interface to query the codepage.

> It fully relies on manual configuration.

See 2 lines above this.

> Another point is that ASCII-unicode conversion should be somewhat
> treated by the OS, I think. I think it is not smart if every unicode
> program has its own TBL library. There should be one somewhere in the
> FreeDOS directories.

See above - but you could have some translation service. You could even
have that UTF8 super NANSI described above, but your soft then needs to
understand the PRINCIPLE of UTF8. In other words, it has to understand in
which way a sequence of two or more bytes can still mean only one
character, which can be important for layout and search.

> So how to solve it?
> * let the user call the function for international info, and by the
> returned codepage manually decide which .TBL file to use?

Such functions are available, yes.

> * should .TBL files be in the LANG or NLSPATH environment variable?

Probably better to have a new variable for those, if any.

> * somehow extend the kernel function for international info to say
> which .TBL files to use?

I would not put that in the kernel. Better in a driver.

> * preload the .TBL into memory in COUNTRY initialization and even more
> extend international info to provide ASCII-unicode conversion?

As above, if anything, this should be handled by a driver.

Regards, Eric
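The "PRINCIPLE of UTF8" Eric refers to, that a sequence of two or more bytes can still mean only one character, comes down to classifying the lead byte. A minimal sketch (assuming mostly well-formed input; the names are hypothetical):

```python
def utf8_seq_len(lead: int) -> int:
    """Bytes occupied by a UTF-8 sequence, judged from its lead byte.
    Returns 0 for a continuation byte, which is never a valid lead."""
    if lead < 0x80:
        return 1        # 0xxxxxxx: plain ASCII
    if lead < 0xC0:
        return 0        # 10xxxxxx: continuation byte
    if lead < 0xE0:
        return 2        # 110xxxxx
    if lead < 0xF0:
        return 3        # 1110xxxx
    return 4            # 11110xxx

def char_count(data: bytes) -> int:
    """Characters in a UTF-8 byte string - what layout and search code
    needs instead of len(data). A stray continuation byte is counted as
    one (bad) character rather than crashing."""
    count = i = 0
    while i < len(data):
        i += max(utf8_seq_len(data[i]), 1)
        count += 1
    return count
```

This is why wildcard matching per byte (as suggested above for the kernel-unchanged approach) differs from matching per character: `?` on bytes can match half of a multi-byte sequence.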
[Freedos-devel] ASCII to unicode table
I think we should discuss how to implement unicode. In fact only one small
thing is necessary: we need a mechanism for translating unicode chars into
ASCII chars and vice versa.

Now programs do it themselves by looking into their own datafiles with a
.TBL extension. Look at DOSLFN or Volkov Commander 4.99. They have a few
files like cp852uni.tbl, cp866uni.tbl and so on. It is a very good
solution but the problem is that there is no way now to determine which
file should be used. It fully relies on manual configuration.

Another point is that ASCII-unicode conversion should be somewhat treated
by the OS, I think. I think it is not smart if every unicode program has
its own TBL library. There should be one somewhere in the FreeDOS
directories.

So how to solve it?
* let the user call the function for international info, and by the
returned codepage manually decide which .TBL file to use?
* should .TBL files be in the LANG or NLSPATH environment variable?
* somehow extend the kernel function for international info to say which
.TBL files to use?
* preload the .TBL into memory in COUNTRY initialization and even more
extend international info to provide ASCII-unicode conversion?