Variations of UTF-16 (was: Re: UNICODE BOMBER STRIKES AGAIN)
Mark Davis [EMAIL PROTECTED] wrote:

You can determine that that particular text is not legal UTF-32*, since there would be illegal code points in any of the three forms. IF you exclude null code points, again heuristically, that also excludes UTF-8 and almost all non-Unicode encodings. That leaves UTF-16, 16BE, 16LE as the only remaining possibilities. So look at those:
1. In UTF-16LE, the text is perfectly legal Ken.
2. In UTF-16BE or UTF-16, the text is the perfectly legal 䬀攀渀.
Thus there are two legal interpretations of the text, if the only thing you know is that it is untagged. IF you have some additional information, such as that it could not be UTF-16LE, then you can limit it further.

OK, let me try to understand this again. I'm sorry, you guys should know that I'm not just trying to be a gadfly, but despite my efforts I am still confused over whether an unlabeled, BOM-free sequence may or may not be treated as little-endian UTF-16.

I think what Mark is saying is that, given Ken's byte sequence:

    0x4B 0x00 0x65 0x00 0x6E 0x00

and some reason (heuristics, knowledge of platform, divine guidance, etc.) to believe that this is Unicode text represented in some flavor of UTF-16, I have my choice of:

(a) treating it as either UTF-16BE or UTF-16 and decoding it as U+4B00 U+6500 U+6E00 (䬀攀渀), or
(b) treating it as UTF-16LE and decoding it as U+004B U+0065 U+006E (Ken),

*BUT* I must not *call* the sequence UTF-16, since that term is officially reserved for BOM-marked text which can be either little- or big-endian, or BOMless text which must be big-endian.

Is that what I have been missing all along? It's perfectly OK for the text to be encoded and decoded this way, so long as nobody actually calls it UTF-16? If so, then I've probably been arguing over nothing.

-Doug Ewell
Fullerton, California
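A small sketch of the two readings Doug lists, using Python's built-in codecs. The bytes are Ken's example from above; the behavior of the BOM-less "utf-16" codec noted in the last comment is Python-specific, not a statement about the standard.

    # Ken's untagged byte sequence.
    data = bytes([0x4B, 0x00, 0x65, 0x00, 0x6E, 0x00])

    # Reading (b): UTF-16LE gives "Ken".
    print(data.decode("utf-16-le"))   # Ken

    # Reading (a): UTF-16BE (or BOM-less UTF-16 read big-endian) gives three CJK ideographs.
    print(data.decode("utf-16-be"))   # U+4B00 U+6500 U+6E00

    # Python's "utf-16" codec assumes the native byte order when no BOM is
    # present, so the result of this line depends on the platform.
    print(data.decode("utf-16"))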
Re: unidata is big
andreas palsson wrote:

Hi. I would just like to know if someone could give me a tip on how to structure all the unicode-information in memory? All the UNIDATA does contain quite a bit of information and I can't see any obvious method which is memory-efficient and gives fast access.

You might want to evaluate some of the open source libraries mentioned under Enabled Products on the Unicode site. For my own lib (http://www.let.uu.nl/~Theo.Veenker/personal/projects/ucp/) I've created a separate table builder tool for each property or mapping. The tools organize data in planes, and for each plane all possible trie setups are determined (about 80 combinations of one-, two- or three-stage tables). Then the cheapest setup is used. This still requires over 230 KB to store all data (except character names and comments) from the following files: UnicodeData.txt, EastAsianWidth.txt, LineBreak.txt, ArabicShaping.txt, Scripts.txt, Blocks.txt, SpecialCasing.txt, CaseFolding.txt, BidiMirroring.txt, PropList.txt, DerivedCoreProperties.txt, DerivedNormalizationProperties.txt, and DerivedJoiningType.txt. For some mappings I've stored 32-bit code points where 16-bit would have been enough, but I decided API uniformity is more important than memory efficiency.

I wouldn't bother too much about memory efficiency; it's irrelevant these days. Even your mobile phone has enough memory to store all Unicode data 10..20 times. Same thing for lookup speed. All you have to do to get it fast is to wait (a few seasons).

Theo
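For readers wondering what a multi-stage table setup looks like in practice, here is a toy sketch of a two-stage lookup. The block size and property values are made up for illustration; this is not Theo's actual layout.

    # Toy two-stage (trie-like) table for a per-code-point property.
    # Stage 1 maps a block index to a row in stage 2; identical rows are
    # shared between blocks, which is where the compression comes from.

    BLOCK_SHIFT = 8                      # 256 code points per block
    BLOCK_MASK = (1 << BLOCK_SHIFT) - 1

    # Hypothetical property values for a handful of code points.
    raw = {0x0041: 1, 0x0061: 2, 0x4E00: 3}

    def build_tables(raw, limit=0x110000, default=0):
        stage2_rows = {}                 # row contents -> row index
        stage1, stage2 = [], []
        for block in range(limit >> BLOCK_SHIFT):
            base = block << BLOCK_SHIFT
            row = tuple(raw.get(base + i, default) for i in range(1 << BLOCK_SHIFT))
            if row not in stage2_rows:
                stage2_rows[row] = len(stage2)
                stage2.append(row)
            stage1.append(stage2_rows[row])
        return stage1, stage2

    def lookup(stage1, stage2, cp):
        return stage2[stage1[cp >> BLOCK_SHIFT]][cp & BLOCK_MASK]

    stage1, stage2 = build_tables(raw)
    print(lookup(stage1, stage2, 0x0041))   # 1
    print(len(stage2))                      # far fewer distinct rows than blocks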
Whence UniData.txt? (was Re: unidata is big)
Theo's comment leads me to a question I've pondered recently.

Assumptions:
- Many apps, from independent sources, need to access the Unicode character data.
- A lot of these apps aren't overly concerned with the slight overhead of parsing the data as needed from Unicode-supplied data files directly. Similarly, such apps benefit from being able to easily upgrade to new Unicode releases by simply replacing the data files.
- It isn't very user-friendly for every such app to store its own private copy of the character data files when a single shared copy would take up less space and be easier to maintain.

It would seem to me that there is some value in establishing either (1) a standard location where programs can expect to find (or install) a local copy of the Unicode data files, or (2) a standard way to discover where such a local copy of these files exists. My preference would be (2), which would make it easy to configure a network of machines to share a single copy of the data files. Something as simple as an environment variable could work if developers were to agree on its name and semantics. (I understand there may be different mechanisms for different platforms, but it would be even better if a standard mechanism were cross-platform.)

So, are there any conventions for this evolving? Or would anyone like to propose one?

Bob

On 24/04/2002 09:26:55 Theo Veenker wrote: andreas palsson wrote: I wouldn't bother too much about memory efficiency; it's irrelevant these days. Even your mobile phone has enough memory to store all Unicode data 10..20 times. Same thing for lookup speed. All you have to do to get it fast is to wait (a few seasons). Theo
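A sketch of the kind of discovery mechanism Bob describes. The variable name UNIDATA_DIR and the fallback directories are hypothetical, not an existing convention.

    import os

    # Hypothetical search order: an environment variable first, then a few
    # shared locations, then a per-user private copy as a last resort.
    CANDIDATE_DIRS = [
        os.environ.get("UNIDATA_DIR"),            # hypothetical variable name
        "/usr/local/share/unicode/UNIDATA",
        "/usr/share/unicode/UNIDATA",
        os.path.expanduser("~/.unidata"),
    ]

    def find_ucd_file(name="UnicodeData.txt"):
        for d in CANDIDATE_DIRS:
            if d and os.path.isfile(os.path.join(d, name)):
                return os.path.join(d, name)
        raise FileNotFoundError(name)

    print(find_ucd_file())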
Re: Variations of UTF-16 (was: Re: UNICODE BOMBER STRIKES AGAIN)
below

— Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
http://www.macchiato.com

- Original Message -
From: Doug Ewell [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]; [EMAIL PROTECTED]
Cc: Kenneth Whistler [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Tuesday, April 23, 2002 23:02
Subject: Variations of UTF-16 (was: Re: UNICODE BOMBER STRIKES AGAIN)

Mark Davis [EMAIL PROTECTED] wrote: You can determine that that particular text is not legal UTF-32*, since there would be illegal code points in any of the three forms. IF you exclude null code points, again heuristically, that also excludes UTF-8 and almost all non-Unicode encodings. That leaves UTF-16, 16BE, 16LE as the only remaining possibilities. So look at those: 1. In UTF-16LE, the text is perfectly legal Ken. 2. In UTF-16BE or UTF-16, the text is the perfectly legal 䬀攀渀. Thus there are two legal interpretations of the text, if the only thing you know is that it is untagged. IF you have some additional information, such as that it could not be UTF-16LE, then you can limit it further.

OK, let me try to understand this again. I'm sorry, you guys should know that I'm not just trying to be a gadfly, but despite my efforts I am still confused over whether an unlabeled, BOM-free sequence may or may not be treated as little-endian UTF-16. I think what Mark is saying is that, given Ken's byte sequence: 0x4B 0x00 0x65 0x00 0x6E 0x00 and some reason (heuristics, knowledge of platform, divine guidance, etc.) to believe that this is Unicode text represented in some flavor of UTF-16, I have my choice of: (a) treating it as either UTF-16BE or UTF-16 and decoding it as U+4B00 U+6500 U+6E00 (䬀攀渀), or (b) treating it as UTF-16LE and decoding it as U+004B U+0065 U+006E (Ken), *BUT* I must not *call* the sequence UTF-16, since that term is officially reserved for BOM-marked text which can be either little- or big-endian, or BOMless text which must be big-endian.

Yes, assuming the BUT clause applies to (b). That is, the untagged byte sequence 0x4B 0x00 0x65 0x00 0x6E 0x00 could be
(a) U+4B00 U+6500 U+6E00 (䬀攀渀): UTF-16BE or UTF-16
(b) U+004B U+0065 U+006E (Ken): UTF-16LE
(c) U+004B U+0000 U+0065 U+0000 U+006E U+0000 (K, NUL, e, NUL, n, NUL): ASCII, UTF-8, CP-1252, etc.
(d) ...: EBCDIC

If I really wanted to find out all the things it could be, I could run it through the 700+ converters in ICU and capture all the cases that don't detect illegal byte sequences. Except that the vast majority of these are very unlikely because they would produce nulls in the code point sequence.

Is that what I have been missing all along? It's perfectly OK for the text to be encoded and decoded this way, so long as nobody actually calls it UTF-16? If so, then I've probably been arguing over nothing.

Not really arguing, just exploring the issues. But one key is that if you are in an environment where untagged data is being exchanged (a bad idea, anyway), *and* the convention for that environment is to use the BOM (in either UTF-8, UTF-16, or UTF-32), thus excluding the possibility of the explicit LE or BE forms, then that would further winnow down the number of possible interpretations of untagged text. In this case, that would select the (a) interpretation.

One real problem we have is that the *encoding form* UTF-16 and the *encoding scheme* UTF-16 are very different, but have the same name. If we had an explicit name for one or the other, that would help to reduce the confusion.
(We also don't have a name to distinguish the BOMed UTF-8 from the unBOMed, but that seems to cause less confusion.) -Doug Ewell Fullerton, California
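Mark's winnowing step (discard any interpretation that decodes with errors or yields NUL code points) can be sketched like this. The short candidate list stands in for the 700+ ICU converters he mentions.

    data = bytes([0x4B, 0x00, 0x65, 0x00, 0x6E, 0x00])

    # A stand-in candidate list; ICU would offer hundreds of converters.
    candidates = ["utf-8", "utf-16-le", "utf-16-be", "ascii", "cp1252"]

    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue                # illegal byte sequence: ruled out
        if "\u0000" in text:
            continue                # heuristically ruled out: NUL code points
        print(name, "->", text)

    # Only the UTF-16LE and UTF-16BE readings survive the filter.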
Re: Variations of UTF-16 (was: Re: UNICODE BOMBER STRIKES AGAIN)
Mark Davis [EMAIL PROTECTED] wrote:

I must not *call* the sequence UTF-16, since that term is officially reserved for BOM-marked text which can be either little- or big-endian, or BOMless text which must be big-endian.

Yes, assuming the BUT clause applies to (b). That is, the untagged byte sequence 0x4B 0x00 0x65 0x00 0x6E 0x00 could be
(a) U+4B00 U+6500 U+6E00 (䬀攀渀): UTF-16BE or UTF-16
(b) U+004B U+0065 U+006E (Ken): UTF-16LE
(c) U+004B U+0000 U+0065 U+0000 U+006E U+0000 (K, NUL, e, NUL, n, NUL): ASCII, UTF-8, CP-1252, etc.
(d) ...: EBCDIC

Yes, that's what I meant to say.

Not really arguing, just exploring the issues. But one key is that if you are in an environment where untagged data is being exchanged (a bad idea, anyway),

But not all mechanisms for exchanging data allow tagging. (Bumper sticker: UNTAGGED TEXT HAPPENS)

Here's what caused me to exhume this discussion. Ken made a joke: -- K '\0' e '\0' n '\0' (which I enjoyed) in response to the UNICODE BOMBER STRIKES AGAIN satire about blank squares infiltrating otherwise good text. This representation of Ken in untagged, little-endian UTF-16, misinterpreted as a sequence of 8-bit characters, corresponds to Mark's example (c) above. It *is* a misinterpretation, right? You're not really supposed to read this sequence of six bytes as K '\0' e '\0' n '\0'. That was the whole joke. And in fact, there is only one correct interpretation in this example (that is, only one interpretation that matches the sender's intent), and that is U+004B U+0065 U+006E. I contend that U+4B00 U+6500 U+6E00, whether it makes sense semantically in Chinese or not, is just as incorrect in this context as an ASCII, EBCDIC, FIELDATA, or BOCU-1 reading.

Note that everything I said before about this example is true:
- there is no BOM
- there is no external tagging as UTF-16LE (or anything else)
- we don't know the native byte orientation of the sender's machine

There's a lot of text like this out there, not all of which is intended as jokes or even illustrations. The Unix and Linux world is very opposed to the use of BOM in plain-text files, and if they feel that way about UTF-8 they probably feel the same about UTF-16.

Note also that heuristics in an example like this can be deceiving. A famous heuristic that applies to this example is to notice that every other byte is 0, and therefore treat the text as UTF-16LE. But a different heuristic could mislead: one could take the big-endian interpretation (U+4B00 U+6500 U+6E00), notice that all of these characters are CJK ideographs, and use that to deduce (incorrectly) that the text should be UTF-16BE. What if the text were reversed ('\0' K '\0' e '\0' n)? The latter heuristic would suggest that the text should be UTF-16LE. Heuristics are not perfect, but sometimes they're all we've got.

So Ken's joke is encoded in BOMless, little-endian, non-externally-tagged UTF-16. It's a perfectly legal Unicode representation, but we can't call it UTF-16 because that term implies big-endian. This sounds legalistic, sort of like the warnings on the Unicode Web site about the correct use of the word Unicode. But at least I think I understand the issues a little better, and so the exploration effort paid off.

-Doug Ewell
Fullerton, California
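A sketch of the zero-byte heuristic Doug mentions, assuming mostly Latin-script text; the function name and the bare majority comparison are illustrative only.

    def guess_utf16_byte_order(data):
        """Very rough guess, assuming the text is mostly Latin-script UTF-16."""
        if len(data) < 2 or len(data) % 2:
            return None
        even_zeros = sum(1 for i in range(0, len(data), 2) if data[i] == 0)
        odd_zeros  = sum(1 for i in range(1, len(data), 2) if data[i] == 0)
        if odd_zeros > even_zeros:
            return "UTF-16LE"      # zeros in the high bytes of little-endian units
        if even_zeros > odd_zeros:
            return "UTF-16BE"
        return None                # no basis for a guess (e.g. CJK-heavy text)

    print(guess_utf16_byte_order(b"K\x00e\x00n\x00"))   # UTF-16LE
    print(guess_utf16_byte_order(b"\x00K\x00e\x00n"))   # UTF-16BE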
variations of UTF-16/UTF-32 and browsers' interpretation (was Re:browsers and unicode surrogates)
On Mon, 22 Apr 2002, Stefan Persson wrote:

I haven't added plane 1 characters yet (Tex let me do that, thanks!). However, my test pages can be used to test how various web browsers interpret various forms of UTF-16 and UTF-32, with or without BOM and with or without external info (such as a MIME charset in the HTTP Content-Type header). This is not of practical importance/interest (UTF-8 is much less ambiguous and better supported than UTF-16/32 by various web browsers), but it's interesting nonetheless because the way various forms of UTF-16/32 have to be interpreted has been discussed recently.

- Original Message -
From: [EMAIL PROTECTED]
Sent: den 22 april 2002 20:24

Thank you for this tip. I didn't know this and ended up 'cluttering' my filenames with charset suffixes at http://jshin.net/i18n/utftest.

The following pages display Korean text:
* All UTF-16 with BOM
* All UTF-32LE with BOM
* UTF-16LE without BOM, encoding specified as UTF-16

The following pages are displayed as Latin-1 gibberish, ASCII displayed properly:
* UTF-16 without BOM, encoding specified as UTF-16LE, UTF-16BE, or not specified at all
* All UTF-32BE
* All UTF-32LE without BOM

This page is misinterpreted as UTF-16LE without line breaking:
* UTF-16BE without BOM, encoding specified as UTF-16

I'm using IE 5.5 under Windows 98.

Thank you for your test result. MS IE 5.5 seems to *ignore* the MIME charset specified in the HTTP header. It appears to rely *solely* on the presence of the BOM. If no BOM is present, it assumes the platform byte order. Is this behavior compatible with what Mark and Ken described as to how to interpret various forms of UTF-16 and UTF-32 last week and this week again? It doesn't seem to be. The way Mozilla interprets various forms of UTF-16|32 appears to be more in line with what Mark and Ken have written, although there are some issues to be resolved as well. It'll be interesting to see how Opera does.

Here's the test result with Mozilla 0.9.9 on ix86 Linux (that is, the platform byte order is the same as in your case).

* The following pages always get displayed as intended:
- All UTF-16's and UTF-32's with MIME charset (*with* the endianness at the end, i.e. UTF-32(LE|BE), UTF-16(LE|BE)) specified in the HTTP header, regardless of the endianness and the presence of BOM (in UTF-32 pages, the BOM is NOT ignored and is rendered as a 'ZWNBS' enclosed by a dotted square): 8 cases
- UTF-16BE with BOM but without MIME charset specified: 1 case
- UTF-16BE and UTF-32BE without BOM but MIME charset specified as UTF-16 and UTF-32: 2 cases
- UTF-16BE and UTF-32BE with BOM but MIME charset specified as UTF-16 and UTF-32: 2 cases

* For the following pages, auto-detection sometimes works but not always:
- UTF-16LE and UTF-32LE with BOM but without MIME charset specified: 2 cases
- UTF-32BE with BOM but without MIME charset specified: 1 case

* The following pages are recognized as Latin-1. US-ASCII characters are rendered correctly with one or three hollow boxes before or after each of them, depending on the endianness (BE/LE) and the size (16/32):
- UTF-16LE and UTF-32LE without BOM and without MIME charset (2 cases)
- UTF-16BE and UTF-32BE without BOM and without MIME charset (2 cases)

* The following pages are recognized as UTF-16BE and UTF-32BE:
- UTF-16LE and UTF-32LE without BOM but with MIME charset specified as UTF-16 and UTF-32 (2 cases)
- UTF-16LE and UTF-32LE with BOM but with MIME charset specified as UTF-16 and UTF-32 (2 cases)

Jungshik Shin
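The interpretation rules Mark and Ken have described amount to a precedence roughly like the following sketch; it is a simplification covering UTF-16 only, and, as the tests above show, real browsers deviate from it.

    def choose_decoder(mime_charset, data):
        """Rough precedence: explicit LE/BE label, then BOM, then the scheme default."""
        if mime_charset in ("UTF-16LE", "UTF-16BE"):
            return mime_charset            # explicit label wins; a leading FF FE / FE FF is content
        if data.startswith(b"\xff\xfe"):
            return "UTF-16LE"              # BOM says little-endian
        if data.startswith(b"\xfe\xff"):
            return "UTF-16BE"              # BOM says big-endian
        if mime_charset == "UTF-16":
            return "UTF-16BE"              # the scheme default without a BOM is big-endian
        return None                        # untagged, BOM-less: fall back to heuristics

    print(choose_decoder("UTF-16", b"\xff\xfeK\x00"))   # UTF-16LE
    print(choose_decoder("UTF-16", b"\x00K\x00e"))      # UTF-16BE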
Re: Variations of UTF-16 (was: Re: UNICODE BOMBER STRIKES AGAIN)
On Wed, Apr 24, 2002 at 09:00:17AM -0700, Doug Ewell wrote: The Unix and Linux world is very opposed to the use of BOM in plain-text files, and if they feel that way about UTF-8 they probably feel the same about UTF-16.

Why? The problems with a BOM in UTF-8 have to do with it being an ASCII-compatible encoding. (I'd guess that if there are any Unixes that use EBCDIC, the same problems would apply to UTF-EBCDIC.) Pretty much the only reason one would use UTF-16 is to be compatible with a foreign system, and then you use the conventions of that system.

Also, look at the output of file:

    n2404r.doc: Microsoft Office document data
    file.utf8:  UTF-8 Unicode English text
    file.utf16: Little-endian UTF-16 Unicode English character data
    file.iso:   data
    file_list:  ASCII text

There are basically two categories here: data or text. But UTF-16 is not considered text; it's considered data, like a Word file. Most Unix users would treat a UTF-16 encoded file the same way: as a format to be converted from, or edited in a word processor only.

-- David Starner - [EMAIL PROTECTED]
It's not a habit; it's cool; I feel alive. If you don't have it you're on the other side. - K's Choice (probably referring to the Internet)
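The "convert it first" approach David describes can be as simple as the sketch below; the file names are made up, and the shell equivalent would be iconv -f UTF-16 -t UTF-8.

    # Re-encode a UTF-16 file as UTF-8 so ordinary line-oriented tools can use it.
    # "file.utf16" and "file.utf8" are just example names.
    with open("file.utf16", "r", encoding="utf-16") as src:   # a BOM, if any, is consumed
        text = src.read()
    with open("file.utf8", "w", encoding="utf-8") as dst:
        dst.write(text)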
Re: Whence UniData.txt? (was Re: unidata is big)
[EMAIL PROTECTED] wrote:

Theo's comment leads me to a question I've pondered recently. Assumptions: Many apps, from independent sources, need to access the Unicode character data. A lot of these apps aren't overly concerned with the slight overhead of parsing the data as needed from Unicode-supplied data files directly. Similarly, such apps benefit from being able to easily upgrade to new Unicode releases by simply replacing the data files. It isn't very user-friendly for every such app to store its own private copy of the character data files when a single shared copy would take up less space and be easier to maintain. It would seem to me that there is some value in establishing either (1) a standard location where programs can expect to find (or install) a local copy of the Unicode data files, or (2) a standard way to discover where such a local copy of these files exists. My preference would be (2), which would make it easy to configure a network of machines to share a single copy of the data files. Something as simple as an environment variable could work if developers were to agree on its name and semantics.

For applications that eat raw UCD files, this shouldn't be too difficult to achieve. Any well designed app will/should have some parameter or env. variable that you can set (no?). But for apps/libraries that like their UCD files cooked it is a different story, because there is no recommended binary format for representing (compact) Unicode character data. Personally I would appreciate seeing such a recommendation, including your point (2). However, apps/libs which enrich the character data with custom properties would still need their own copy of the data.

The subject reminds me of the TZ database. Here you have a large text-based database containing information on time zones and daylight saving times. You can compile the data into a binary format by running a utility included with the tz sources. Well, they don't give any recommendation on where to store the (text and/or binary) data, but at least there is a 'standard' format, which allows for sharing data. Would be nice to have something like this for the UCD.

(I understand there may be different mechanisms for different platforms, but it would be even better if a standard mechanism were cross-platform). So, are there any conventions for this evolving? Or would anyone like to propose one?

Please, go ahead :o)

Theo
RE: Variations of UTF-16 (was: Re: UNICODE BOMBER STRIKES AGAIN)
Why? The problems with a BOM in UTF-8 have to do with it being an ASCII-compatible encoding.

Err, no. That's not the point, AFAIK. The point is that traditionally in UNIX there hasn't been any sort of marker or tag in the beginning, UNIX files being flat streams of bytes. The UNIX toolset has been built with this principle in mind. No metadata in the files. BOM breaks this.

cat file1 file2 file3 > file4 would leave three BOMs in file4, two of them in the middle. wc -c file1 would have to skip the BOM so as not to give a wrong byte count. sort -o file5 file1 would have to strip the BOM from file1 (but put it back into file5?). And so forth.

If you have a multifork filesystem, you can do tagging like this easily, since the real payload doesn't get mixed with the metadata. But traditional UNIX filesystems are not multifork.
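To make the breakage concrete, here is a sketch of the extra work a BOM-tolerant cat would have to do (work the traditional tools do not do); the function name is illustrative.

    import sys

    BOMS = (b"\xef\xbb\xbf", b"\xff\xfe", b"\xfe\xff")   # UTF-8, UTF-16LE, UTF-16BE

    # Concatenate files, dropping a leading BOM from every input so the
    # output doesn't end up with stray BOMs in the middle.
    def bom_aware_cat(paths, out=sys.stdout.buffer):
        for path in paths:
            with open(path, "rb") as f:
                chunk = f.read()
            for bom in BOMS:
                if chunk.startswith(bom):
                    chunk = chunk[len(bom):]
                    break
            out.write(chunk)

    bom_aware_cat(sys.argv[1:])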
RE: UNICODE BOMBER STRIKES AGAIN
You can determine that that particular text is not legal UTF-32*, since there would be illegal code points in any of the three forms. IF you exclude null code points, again heuristically, that also excludes UTF-8, and almost all non-Unicode encodings. That leaves UTF-16, 16BE, 16LE as the only remaining possibilities. So look at those: 1. In UTF-16LE, the text is perfectly legal Ken. 2. In UTF-16BE or UTF-16, the text is the perfectly legal 䬀攀渀. Thus there are two legal interpretations of the text, if the only thing you know is that it is untagged. IF you have some additional information, such as that it could not be UTF-16LE, then you can limit it further.

Actually, I also think that without any external information about the encoding except that it is some UTF-16, it *has to* be interpreted as being most significant byte first. I agree that it could be either UTF-16LE or UTF-16BE/UTF-16, but in the absence of any other information, at this point in time, it is ruled by the text of 3.1 C3 of TUS 3.0, and the reader has no choice but to declare it UTF-16.

Now what about auto-detection in relation to this conformance clause? Readers that first try to be smart by auto-detecting encodings could of course pick any of these as the 'auto-detected' one. Does that violate 3.1 C3's interpretation of bytes? I would say that as long as the auto-detector is seen as a separate process/step, one can get away with it, since by the time you look at the bytes to process the data, their encoding has been set by the auto-detector.

YA
L2 / UTC document register updated
The document register has been updated again... http://www.unicode.org/L2/L-curdoc.htm

Several new documents:
L2/02-150 Status of Mapping between Characters of ISO 5426-2...
L2/02-151 Comparison of Characters of ISO 6861 and Those Proposed...
L2/02-152 Status of Mapping between Characters of ISO 8957 - Table 2...
L2/02-153 Status of Mapping between Characters of ISO 10574...
L2/02-154 Draft minutes of WG 2 meeting 41 (Singapore)
L2/02-155 Proposal to add 1 Hanja code of D P R of Korea... (1)
L2/02-156 Proposal to add 1 Hanja code of D P R of Korea... (2)
L2/02-157 Status in Myanmar on n2033
L2/02-158 WG2 - Pre-Meeting M42 Action Items List

Zipdocs for 121-140 are also available. http://www.unicode.org/L2/Zipdocs/zipdocs.htm

Rick
Re: Variations of UTF-16 (was: Re: UNICODE BOMBER STRIKES AGAIN)
On Wed, Apr 24, 2002 at 01:37:39PM -0400, [EMAIL PROTECTED] wrote: Err, no. That's not the point, AFAIK. The point is that traditionally in UNIX there hasn't been any sort of marker or tag in the beginning, UNIX files being flat streams of bytes. The UNIX toolset has been built with this principle in mind. No metadata in the files. BOM breaks this.

Not at all true. Look at the head of a PNM file, a quintessentially Unix file format. PNM, MP3 or PNG files all have metadata identifying them, and don't break under Unix systems.

wc -c file1 would have to skip the BOM so as not to give a wrong byte count. sort -o file5 file1 would have to strip the BOM from file1 (but put it back into file5?)

The wrong byte count? wc -c file1 is basically meaningless on a Unicode file, but at least you can assume it gives the _byte count_ (including extraneous things like BOMs). More importantly, how do these programs handle newlines? wc -l counts the number of \x0A's in the file; sort splits the file based on \x0A. This will produce nothing of value on a UTF-16 file. They could be changed to work with UTF-16, but they won't be, as UTF-8 works just fine.

The point about file calling it data, not text, was just this: you can't expect to throw UTF-16 through text tools and get a meaningful result. That's why UTF-8 was created. The only sane thing to do with a UTF-16 file on Unix is to treat it as binary data, just like you would a word-processor file. (Which are stunningly non-Unix, but coming nonetheless. Probably for the best, though.)

-- David Starner - [EMAIL PROTECTED]
It's not a habit; it's cool; I feel alive. If you don't have it you're on the other side. - K's Choice (probably referring to the Internet)
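David's newline point can be seen directly: in UTF-16LE a line feed is the byte pair 0A 00, so a tool that splits on the single byte 0x0A cuts code units in half. A small sketch:

    text = "one\ntwo\n"
    utf16le = text.encode("utf-16-le")   # b'o\x00n\x00e\x00\n\x00t\x00w\x00o\x00\n\x00'

    # A byte-oriented tool splits on 0x0A alone, leaving the trailing 0x00
    # of each line feed glued onto the start of the next "line".
    pieces = utf16le.split(b"\n")
    print(pieces)   # [b'o\x00n\x00e\x00', b'\x00t\x00w\x00o\x00', b'\x00']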
Re: UNICODE BOMBER STRIKES AGAIN
Unfortunately, the language in C3.1 is a bit archaic; it is referring specifically to the UTF-16 encoding scheme. If you know you are working with UTF-16, and you have no other information, then you do have to use big-endian. If, however, you only know that it is one of UTF-16BE, UTF-16LE, or UTF-16 (plain), then there are more choices. Similarly, if you know that the text is limited to one of UTF-32LE or UTF-16LE, then you actually know that the text must be little-endian.

Mark

— Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
http://www.macchiato.com

- Original Message -
From: Yves Arrouye [EMAIL PROTECTED]
To: 'Mark Davis' [EMAIL PROTECTED]; Doug Ewell [EMAIL PROTECTED]; [EMAIL PROTECTED]
Cc: Kenneth Whistler [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Wednesday, April 24, 2002 10:39
Subject: RE: UNICODE BOMBER STRIKES AGAIN

You can determine that that particular text is not legal UTF-32*, since there would be illegal code points in any of the three forms. IF you exclude null code points, again heuristically, that also excludes UTF-8, and almost all non-Unicode encodings. That leaves UTF-16, 16BE, 16LE as the only remaining possibilities. So look at those: 1. In UTF-16LE, the text is perfectly legal Ken. 2. In UTF-16BE or UTF-16, the text is the perfectly legal 䬀攀渀. Thus there are two legal interpretations of the text, if the only thing you know is that it is untagged. IF you have some additional information, such as that it could not be UTF-16LE, then you can limit it further.

Actually, I also think that without any external information about the encoding except that it is some UTF-16, it *has to* be interpreted as being most significant byte first. I agree that it could be either UTF-16LE or UTF-16BE/UTF-16, but in the absence of any other information, at this point in time, it is ruled by the text of 3.1 C3 of TUS 3.0 and the reader has no choice but to declare it UTF-16. Now what about auto-detection in relation to this conformance clause? Readers that first try to be smart by auto-detecting encodings could of course pick any of these as the 'auto-detected' one. Does that violate 3.1 C3's interpretation of bytes? I would say that as long as the auto-detector is seen as a separate process/step, one can get away with it, since by the time you look at the bytes to process the data, their encoding has been set by the auto-detector.

YA
Re: Variations of UTF-16 (was: Re: UNICODE BOMBER STRIKES AGAIN)
On Wed, 24 Apr 2002, David Starner wrote: On Wed, Apr 24, 2002 at 09:00:17AM -0700, Doug Ewell wrote: The Unix and Linux world is very opposed to the use of BOM in plain-text files, and if they feel that way about UTF-8 they probably feel the same about UTF-16.

The reason we're not so fond of UTF-8 with BOM is that it 'breaks' a lot of time-honored Unix command line text-processing tools. The simplest example is concatenating multiple files with 'cat'. With a BOM at the beginning, the following doesn't work as intended:

    $ cat f1 f2 f3 f4 | sort | uniq | sed '' > f5

Sure, by typing a couple more commands (enclosing 'cat' in a 'for' loop, for instance), we can work around that, but ...

Why? The problems with a BOM in UTF-8 have to do with it being an ASCII-compatible encoding. (I'd guess that if there are any Unixes that use EBCDIC, the same problems would apply to UTF-EBCDIC.) Pretty much the only reason one would use UTF-16 is to be compatible with a foreign system, and then you use the conventions of that system.

I totally agree with you. We don't expect text tools to work on files in UTF-16 the same way as we would expect them to work on files in UTF-8 or other ASCII-compatible encodings.

Jungshik Shin
Re: Variations of UTF-16 (was: Re: UNICODE BOMBER STRIKES AGAIN)
Doug Ewell scripsit: The Unix and Linux world is very opposed to the use of BOM in plain-text files, and if they feel that way about UTF-8 they probably feel the same about UTF-16. I doubt it. The trouble with BOMizing is that it makes ASCII not a subset of UTF-8, but ASCII cannot be a subset of UTF-16 anyhow. (I mean at the byte level, of course.) -- John Cowan [EMAIL PROTECTED] http://www.reutershealth.com I amar prestar aen, han mathon ne nen,http://www.ccil.org/~cowan han mathon ne chae, a han noston ne 'wilith. --Galadriel, _LOTR:FOTR_
Re: variations of UTF-16/UTF-32 and browsers' interpretation (wasRe: browsers and unicode surrogates)
Following is the result for Mac OS X and the OmniWeb browser on http://jshin.net/i18n/utftest.

5 cases of proper display:
+Y, BE, 16, UTF-16 and UTF-16BE
+Y, LE, 16, UTF-16
+N, BE, 16, UTF-16 and UTF-16BE

All the rest showed only the ASCII correctly.
Re: variations of UTF-16/UTF-32 and browsers' interpretation (was Re: browsers and unicode surrogates)
At 01:42 +0100 2002-04-25, Michael Everson wrote: On http://jshin.net/i18n/utftest/bom_utf16be.utf16.html under OS X you don't see just question marks, though -- you see the Last Resort font showing that Korean characters not present in the font are in the text. Awesome. In OmniWeb at least. (Forgot to mention it.) OmniWeb does a very nice job on http://www.evertype.com/standards/iso15924/document/scriptbib.html by the way (I know I have to edit the Arabic still, and no, Omniweb doesn't order it properly in RTL though it does try to apply shaping behaviour.) -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: Variations of UTF-16
On Wed, Apr 24, 2002 at 05:12:43PM -0700, Jonathan Coxhead wrote: But a BOM in every UTF-16 plain text file would make this completely hopeless. If we ever think we might want to do UNIX-style text processing on UTF-16, we have to resist that! So the Unix people, because they might someday want to use UTF-16 plain text (why? may as well go to UTF-32), should object to somebody else using a BOM on a file format they actually use? -- David Starner - [EMAIL PROTECTED] It's not a habit; it's cool; I feel alive. If you don't have it you're on the other side. - K's Choice (probably referring to the Internet)
Re: Variations of UTF-16/UTF-32 and browsers interpretation
At 01:42 +0100 2002-04-25, Michael Everson wrote: On http://jshin.net/i18n/utftest/bom_utf16be.utf16.html under OS X you don't see just question marks, though -- you see the Last Resort font showing that Korean characters not present in the font are in the text. Awesome. Not sure which font is doing it (Code2000 perhaps), but I can see all of them. In OmniWeb at least. (Forgot to mention it.) OmniWeb does a very nice job on http://www.evertype.com/standards/iso15924/document/scriptbib.html If you try it with Mozilla I think the Arabic will come out much better (but disable the font Code2000 first if you have it.)
Re: Variations of UTF-16
{{ But a BOM in every UTF-16 plain text file would make this completely hopeless. If we ever think we might want to do UNIX-style text processing on UTF-16, we have to resist that! }}

If you're going to take the trouble of making text tools 16-bit aware, then you can afford to make them BOM-aware too.

type a.txt b.txt c.txt > d.txt on Windows 2000, assuming that they are all UTF-16 (with an FFFE at the beginning of each, as is usual in MS-Windows Unicode files), strips every BOM except the first, so that d.txt has only the usual one initial FFFE. So it's not an immovable obstacle.

Concerning text files: nearly all of the plain-text Unicode I've ever seen is in UTF-8. However, the ubiquitous MS-Office documents, from Office 2000 onwards, are all in UTF-16 (little-endian, without BOM).