Char Set Detector
Hi,

Can anybody point me to a path or URL where I can get a character set detector that can identify the encoding of a text, at least for Shift_JIS and Big5? I downloaded the one from Mozilla and built it. It works fine for UTF-8 but fails for Shift_JIS and Big5 in some cases. I'm working on HP Unix. If anybody is doing the same kind of work, please let me know.

Thanks,
-Yogesh
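As a cross-check on a detector's answer, one simple fallback is to test which candidate encodings can decode the bytes at all. The sketch below is only a validity filter, not the statistical method the Mozilla detector uses, and it assumes Python's built-in codecs; the function name `candidate_encodings` is made up for illustration. Several encodings will often accept the same bytes, so the result narrows the choice rather than settling it.

```python
def candidate_encodings(data, encodings=("utf-8", "shift_jis", "big5")):
    """Return the encodings under which `data` decodes without error."""
    matches = []
    for name in encodings:
        try:
            data.decode(name)  # strict mode raises on invalid byte sequences
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches


# Hypothetical sample: Japanese text encoded as Shift_JIS.
sample = "日本語のテキスト".encode("shift_jis")
print(candidate_encodings(sample))
```

Pure ASCII input passes every one of these codecs, which is why a validity check alone cannot replace frequency-based detection.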
RE: missing .GIF's for ideographs on unicode.org?
Richard wrote: Erik, I think you are correct. The link should be like so: http://www.unicode.org/cgi-bin/refglyph?24-2 I'm guessing this just hasn't been implemented yet.

I swear I've seen the glyph on this page before. When I looked at it in the PDF I immediately recognized it, and I'm not a Chinese-speaking kind of guy.
Re: [Private Use Area] Audio Description, Subtitle, Signing
Michael Everson raises some interesting points.

William. If CENELEC wishes to standardize a set of icons, they will do so. If they have a need to interchange data using those icons, they will (if they are wise) come to us and ask to encode them. If they want to use the Private Use Area before they do that, they will.

Perhaps I may explain the situation?

The European Commission asked CENELEC to conduct a project about establishing a process to implement interactive television in the European Community. A consultancy was asked to produce a report. CENELEC organized a forum for the report to be discussed and also arranged an open meeting which was held in Brussels on 12 March 2003. The report was made available in the forum before the meeting. I did not attend the meeting, though I did post some comments into the forum before the meeting. A list of the people due to attend the meeting was published in the forum prior to the meeting. Most, though not all, are representing organizations. After the meeting a later version of the report was produced. The report suggests that various aspects of the necessary work be done by various existing committees and standards bodies.

The forum remains in use. One participant recently put forward the idea of agreeing on logos for Audio Description, Subtitle, Signing and provided a link to the page which I mentioned earlier in this thread. I added the suggestion of adding the symbols into the built-in font of DVB-MHP televisions. I suggested the desirability of regular Unicode code point allocations. I mentioned the time scale and I mentioned the Private Use Area code points for the symbols, that is, one code point for each of Audio Description, Subtitle, Signing, not one code point for each of the logos being considered. I suggested some specific code points. I like to think that I quite clearly stated my interest in choosing those suggested code points so as not to clash with my suggestions for other uses of Private Use Area code points.
For the avoidance of doubt, my suggestions for using Private Use Area code points for eutocode graphics do not need to be accepted by a standards body as they are only meaningful for use in text files which customize those Java programs which recognize them as they do not need to access the built-in font of the television set and they are only used in programs where the eutocode graphics system is used. There are glyphs for authoring-time, but no glyphs are needed for run-time use as the codes activate graphics features at run-time. So the specific suggestion for code points for Audio Description, Subtitle, Signing is within the forum. Now, I am unsure at present as to how the various committees and standards bodies are to proceed, yet I have sought to place my ideas in the forum in the hope that they will reach the agendas of the meetings. I also, some time later, posted the code point suggestions here. I did this with the intention of making the suggested code points more widely known. Please don't tell us all about it over and over again, as you have done. Well, I made one post at the start of this thread. After that I have just responded to comments which have been made. If you want to talk to CENELEC, do so. Please stop trying to peddle your PUA schemes for CENELEC to us. Well, I had not thought of it as peddling! I maintain the ConScript Unicode Registry, which contains PUA assignments. I do not promulgate those on this list. (Apart from that fun testing of the Phaistos implementation some time ago.) Well, that is a matter for you. Actually, I rather enjoy reading your postings and would indeed be pleased if you did post them in this forum. However, I do understand your point, though as this mailing list has rules which encourage discussions amongst users of the Unicode Standard, I feel that you are being somewhat harsh in your criticism. Anyway, you only joined in to a thread about the Phaistos Disc script, you did not start it as I seem to remember. 
So you were responding to an enquiry and, indeed, I feel, being very helpful in adding Phaistos Disc script into the ConScript Unicode Registry. It will be fascinating to observe what happens if an archaeological dig somewhere, maybe nowhere near where the Phaistos Disc was discovered, produces a lot of items with Phaistos Disc script upon them. Then instead of Unicode being ready, there will then be a long wait for a regular Unicode implementation, when it could have been done years ago! The taboo of discussing PUA code points which some people have causes lots of unnecessary problems. For example, the Unicode Consortium is so taboo-avoiding about mentioning PUA assignments that when I had some time ago heard that Microsoft used part of the PUA in a special way in symbol fonts I had enormous difficulty in finding out where it was located! Also, I seem to remember a long time ago trying to find out where it was that I had seen Tengwar encoded in the
Re: missing .GIF's for ideographs on unicode.org?
On Thursday, July 17, 2003, at 12:00 AM, Richard Cook wrote: I'm guessing this just hasn't been implemented yet. You are guessing correctly. Once some of the dust settles from my day job, I expect I can get to this. == John H. Jenkins [EMAIL PROTECTED] [EMAIL PROTECTED] http://homepage.mac.com/jhjenkins/
Re: [Private Use Area] Audio Description, Subtitle, Signing
William spilled another ocean of digital ink. Found bobbing in that ocean was the comment: Roozbeh and I assigned two unencoded characters for Afghanistan to the PUA, and we encourage implementors to use them until such time as the characters are encoded. Yes. ... Now that at least one of them has been approved for encoding by the Unicode Technical Committee there is now a long period of waiting during which Private Use Area encoded data can be produced. This does seem unfortunate and for individual symbols such as these I would hope that the people who are in charge of Standards might like to consider asking if the United Nations and the World Trade Organization could perhaps arrange for some faster way of achieving agreement. *rolls eyes* It seems rather unlikely that getting the United Nations and the World Trade Organization involved in trying to amend JTC1 standards directives would be a recipe for speeding anything up. :-) It does seem so very slow for the twenty-first century with so many electronics communications facilities! Why does legacy data have to build up and resolving the problem take so long for just a few symbols? Because amending and updating a standard is effectively the same task whether it involves 1 additional character or 181 additional characters. There are a large number of stages, approvals, reviews, and other tasks involved -- which are there for a reason, to ensure the stability and orderly maintenance of the standard. I would have thought that with a reasonable infrastructure that those two code points could have been formally added into regular Unicode and ISO within a couple of weeks. The whole idea of adding a couple code points this week and then a couple more next week, and then another next month, and so on is, well, just nuts. It would destroy effective version control and would create a situation where implementers were unsure just what was in the standard and when it would change further. 
It would *damage* the standard rather than improve anything. A character encoding standard is not just a laundry-list registration of characters that people happen to notice this week. As such, it is not advisable to create a mechanism whereby new characters are noticed, approved, and registered on a weekly basis. An ocean of digital ink! I like that phrase. As well as producing the oceans, clearly. That person added that people have been telling me for a long time that PUA codes are not suitable for interchange. Not suitable for *public* interchange, because, by definition, in public interchange the receiver will not be a party to whatever *private* agreement defines their usage, and so will not be able to interpret them. That puzzles me, because I thought that it was alright to interchange Private Use Area codes if there is an agreement as to their meaning in a particular situation. Yes, a *private* agreement for *private* interchange. That, as Michael tried to tell you, is why we call them *private* use characters. Also, Unicode 3.0 mentions the possibility of publication of Private Use Area assignments Anyone is free to publish anything they wish, including lists of PUA assignments. in the section on the Private Use Area. But the Unicode Consortium will not publish such lists in the Unicode Standard or on its website in any official way. So what is the official position please? I just stated it. If you want chapter and verse: All code points in the blocks of private-use characters in the Unicode Standard are permanently designated for private use--no assignment to a particular, standard set of characters will ever be endorsed or documented by the Unicode Consortium for any of these code points. -- The Unicode Standard, Version 4.0, Section 15.7, Private-Use Characters, p. 
398, 2003 [forthcoming] This is important to me because I have been proceeding in the belief that suggesting three Private Use Area code points for use in interactive television systems is entirely proper and compliant with Unicode and the ISO standard. It is. But other participants on this email list have been telling you that they are not interested in your *particular* use of private use characters. --Ken
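For anyone who needs to know whether a piece of text depends on such a private agreement, the private-use code points can be found mechanically. This is a minimal sketch, assuming Python's standard `unicodedata` module; the helper name `find_private_use` is made up for illustration. It relies on the general category `Co`, which covers U+E000..U+F8FF and the private-use ranges in planes 15 and 16.

```python
import unicodedata


def find_private_use(text):
    """Return (index, 'U+XXXX') pairs for every private-use character in text."""
    hits = []
    for index, ch in enumerate(text):
        # General category "Co" marks private-use code points.
        if unicodedata.category(ch) == "Co":
            hits.append((index, "U+%04X" % ord(ch)))
    return hits


print(find_private_use("a\ue000b"))
```

A receiver that finds any hits knows only that some private agreement is needed to interpret them; nothing in the text itself says which one.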
Re: [Private Use Area] Audio Description, Subtitle, Signing
At 17:01 +0100 2003-07-17, William Overington wrote: Michael Everson raises some interesting points. William. If CENELEC wishes to standardize a set of icons, they will do so. If they have a need to interchange data using those icons, they will (if they are wise) come to us and ask to encode them. If they want to use the Private Use Area before they do that, they will. Perhaps I may explain the situation?

No, thank you. If CENELEC wants to propose characters to the Unicode Standard, they can contact us. I'd be interested in helping, if they had a good case. But I'm not looking for extra work right now.

Now, I have never heard of the MES-2 whatever that is. However, I do not have deep knowledge of the various standards which exist. Could you possibly say some more about MES-2 please.

A.4.2 282 MES-2

282 MES-2 is specified by the following ranges of code positions as indicated for each row.

Rows  Positions (cells)
00    20-7E A0-FF
01    00-7F 8F 92 B7 DE-EF FA-FF
02    18-1B 1E-1F 59 7C 92 BB-BD C6-C7 C9 D8-DD EE
03    74-75 7A 7E 84-8A 8C 8E-A1 A3-CE D7 DA-E1
04    00-5F 90-C4 C7-C8 CB-CC D0-EB EE-F5 F8-F9
1E    02-03 0A-0B 1E-1F 40-41 56-57 60-61 6A-6B 80-85 9B F2-F3
1F    00-15 18-1D 20-45 48-4D 50-57 59 5B 5D 5F-7D 80-B4 B6-C4 C6-D3 D6-DB DD-EF F2-F4 F6-FE
20    13-15 17-1E 20-22 26 30 32-33 39-3A 3C 3E 44 4A 7F 82 A3-A4 A7 AC AF
21    05 16 22 26 5B-5E 90-95 A8
22    00 02-03 06 08-09 0F 11-12 19-1A 1E-1F 27-2B 48 59 60-61 64-65 82-83 95 97
23    02 10 20-21 29-2A
25    00 02 0C 10 14 18 1C 24 2C 34 3C 50-6C 80 84 88 8C 90-93 A0 AC B2 BA BC C4 CA-CB D8-D9
26    3A-3C 40 42 60 63 65-66 6A-6B
FB    01-02
FF    FD

-- Michael Everson * * Everson Typography * * http://www.evertype.com
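The row-and-cell notation used for MES-2 expands to code points in a straightforward way: the row is the high byte and each cell, or range of cells, is the low byte. Here is a minimal sketch of that expansion; the helper name `parse_row` is made up for illustration, and only three rows are copied from the listing to keep the example short.

```python
def parse_row(row_hex, cells):
    """Expand one line such as ('23', '02 10 20-21 29-2A') into code points."""
    row = int(row_hex, 16)
    code_points = set()
    for item in cells.split():
        first, _, last = item.partition("-")
        low = int(first, 16)
        high = int(last, 16) if last else low  # a single cell is a range of one
        for cell in range(low, high + 1):
            code_points.add((row << 8) | cell)
    return code_points


# Three rows taken from the listing above, not the full subset.
sample_rows = {
    "00": "20-7E A0-FF",
    "23": "02 10 20-21 29-2A",
    "FB": "01-02",
}

repertoire = set()
for row_hex, cells in sample_rows.items():
    repertoire |= parse_row(row_hex, cells)

print(len(repertoire), 0x2329 in repertoire)
```

Holding the repertoire as a set of integers makes membership checks and comparisons against other subsets simple.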
Re: Article on Unicode in Globalization Insider
On Wed, Jul 16, 2003 at 01:01:30PM -, [EMAIL PROTECTED] wrote: http://www.lisa.org/archive_domain/newsletters/2003/ 3.2/lommel_unicode.html This link seems to be broken. I get a message *Our apologies* *The page you requested is not available.* I guess you just have to combine the whole URL properly into one line. it works for me (lynx). -- Thomas E. Dickey [EMAIL PROTECTED] http://invisible-island.net ftp://invisible-island.net
Re: missing .GIF's for ideographs on unicode.org?
Ostermueller, Erik wrote: At unicode.org, when I click this link, http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=2 I'm expecting to see a little square GIF that displays U+2. Instead, I see N/A. This has now been fixed. Thank you for pointing out the error. The code was only showing glyphs = 0x, even though the glyph database has all of the plane 2 Han glyphs in it. Rick
Re: About the European MES-2 subset (was: PUA Audio Description, Subtitle, Signing)
On Thursday, July 17, 2003 9:23 PM, Michael Everson [EMAIL PROTECTED] wrote: At 17:01 +0100 2003-07-17, William Overington wrote: Now, I have never heard of the MES-2 whatever that is. However, I do not have deep knowledge of the various standards which exist. Could you possibly say some more about MES-2 please.

282 MES-2 is specified by the following ranges of code positions as indicated for each row.

Rows: Positions (cells)
00: 20-7E A0-FF
01: 00-7F 8F 92 B7 DE-EF FA-FF
02: 18-1B 1E-1F 59 7C 92 BB-BD C6-C7 C9 D8-DD EE
03: 74-75 7A 7E 84-8A 8C 8E-A1 A3-CE D7 DA-E1
04: 00-5F 90-C4 C7-C8 CB-CC D0-EB EE-F5 F8-F9
1E: 02-03 0A-0B 1E-1F 40-41 56-57 60-61 6A-6B 80-85 9B F2-F3
1F: 00-15 18-1D 20-45 48-4D 50-57 59 5B 5D 5F-7D 80-B4 B6-C4 C6-D3 D6-DB DD-EF F2-F4 F6-FE
20: 13-15 17-1E 20-22 26 30 32-33 39-3A 3C 3E 44 4A 7F 82 A3-A4 A7 AC AF
21: 05 16 22 26 5B-5E 90-95 A8
22: 00 02-03 06 08-09 0F 11-12 19-1A 1E-1F 27-2B 48 59 60-61 64-65 82-83 95 97
23: 02 10 20-21 29-2A
25: 00 02 0C 10 14 18 1C 24 2C 34 3C 50-6C 80 84 88 8C 90-93 A0 AC B2 BA BC C4 CA-CB D8-D9
26: 3A-3C 40 42 60 63 65-66 6A-6B
FB: 01-02
FF: FD

As most of these characters are canonically decomposable, shouldn't this list also include the decomposed characters? Why is row 03 so restricted? Shouldn't it include those accents and diacritics that are used by other characters once canonically decomposed? Or does it imply that MES-2 is only supposed to use strings in NFC form?

Also, is this list closed under existing character properties, such as NFKD decompositions and case mappings?

-- Philippe. Spam not tolerated: any unsolicited message will be reported to your Internet service providers.
Re: About the European MES-2 subset (was: PUA Audio Description, Subtitle, Signing)
282 MES-2 is specified by the following ranges of code positions as indicated for each row...

Philippe Verdy asked: As most of these characters are canonically decomposable, shouldn't this list also include the decomposed characters? Why is row 03 so restricted? Shouldn't it include those accents and diacritics that are used by other characters once canonically decomposed? Or does it imply that MES-2 is only supposed to use strings in NFC form?

MES-2, like the rest of the Multilingual European Subsets, is a CEN construct. See the CEN Workshop Agreement, CWA 13873:2000, posted at Michael Everson's site: http://www.evertype.com/standards/iso10646/pdf/cwa13873.pdf

Among other things, that CWA states: "This CWA does *not* specify any encoding of the European Subsets." So conceptually it is more like a repertoire listing.

MES-2 is formally listed in 10646 as one of the normative subsets there, but since 10646 has no concepts of decomposition, normalization, or equivalence, the fact that MES-2 contains precomposed characters but not their decompositions or the relevant combining accents is formally irrelevant.

The Unicode Standard does not make subsets a normative construct for that standard and doesn't even mention MES-2. Conformance to 10646 doesn't require you to make use of its subsets, but if anyone is worried about the articulation of the standards, the Unicode Standard itself formally consists of Subset 305 of 10646:2003, namely the UNICODE 4.0 subset -- the subset which contains *all* of the encoded characters of 10646:2003.

Think of the Multilingual European Subsets as a way for people in Europe associated with standards organizations and governments to try to communicate with software vendors regarding which user characters they want to ensure are supported by their software.
The CWA 13873 contains some questionable presuppositions about how software vendors are actually proceeding to roll out their Unicode support, but the intent of the CWA is clear: It is estimated that implementing the full character set of the UCS may be costly in the first stages of UCS use, and that many manufacturers will implement in subset-stages. To ensure that a common subset usable to the vast majority of European users be available for a reasonable price, and as a guide to manufacturers, it will be helpful to specify, to users and procurers of systems, European subsets of the UCS encompassing the characters for use in European languages as well as other frequently used and specialist characters. Also, is this list under full closure with existing character properties, like NFKD decompositions, and case mappings? MES-2 is clearly *not* closed under NFD, NFKD, or NFKC normalizations. Although less obvious, it is also not closed under NFC normalization. For example, it includes the angle brackets U+2329, U+232A, but not their canonical equivalents, U+3008, U+3009. There are also some characters outside the MES-2 repertoire where NFC(x) *is* in the MES-2 repertoire. Singleton canonical equivalences like U+212B ANGSTROM SIGN come to mind, for example. I haven't checked on case mappings and case foldings, but would not be too surprised to find an anomaly or two there, as well. MES-2 was not designed by the UTC, nor did it take any of these considerations into account. It is not really an appropriate construct for the Unicode Standard. A more meaningful way to think of it is: if you want to sell software in Europe, you better be able to input and display all the characters we Europeans have in this list. --Ken
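The two NFC examples above are easy to reproduce. This is a minimal sketch, assuming Python's standard `unicodedata` module; the helper name `nfc_escapes` is made up for illustration, and the small sample set stands in for the full MES-2 repertoire.

```python
import unicodedata


def nfc_escapes(repertoire):
    """Map each member whose NFC form leaves the repertoire to the outside code points."""
    escapes = {}
    for cp in sorted(repertoire):
        nfc = unicodedata.normalize("NFC", chr(cp))
        outside = [ord(c) for c in nfc if ord(c) not in repertoire]
        if outside:
            escapes[cp] = outside
    return escapes


# Three MES-2 members: the two angle brackets and U+00C5.
sample = {0x2329, 0x232A, 0x00C5}

for cp, outside in nfc_escapes(sample).items():
    print("U+%04X -> %s" % (cp, " ".join("U+%04X" % o for o in outside)))

# The reverse case: U+212B is outside MES-2, but its NFC form U+00C5 is inside.
print(unicodedata.normalize("NFC", "\u212b") == "\u00c5")
```

Running the same function over the full repertoire would list every member whose NFC form falls outside the subset.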
Re: About the European MES-2 subset (was: PUA Audio Description, Subtitle, Signing)
On Friday, July 18, 2003 2:18 AM, Kenneth Whistler [EMAIL PROTECTED] wrote: MES-2 was not designed by the UTC, nor did it take any of these considerations into account. It is not really an appropriate construct for the Unicode Standard. A more meaningful way to think of it is: if you want to sell software in Europe, you better be able to input and display all the characters we Europeans have in this list.

I interpret it this way: MES-2 is a collection of characters independent of their actual encoding. To support MES-2 in a Unicode-compliant application, extra characters need to be added, notably if the minimum requirement for information interchange is the NFC form used by XML and HTML related standards.

It would be interesting to inform CEN about how MES-2 can be documented to comply with all normative Unicode algorithms. The minimum is to ensure the NFC closure of this subset, which would have done better to exclude compatibility characters that have singleton canonical decompositions, and which should now add the missing NFC forms.

For obvious reasons, the subset should also be closed under case mappings, but not necessarily under compatibility decompositions, nor need it include the characters required for the NFD form (notably combining diacritics, which may be added only in applications that can process and recompose them when querying the precomposed characters supported in fonts).

Do the default TrueType fonts for Windows (Times New Roman, Arial and Courier New) support the whole MES-2 repertoire, including on Windows 95 without Uniscribe installed and used?

In practice, MES-2 support will always need additional characters to ensure the minimum closures, and ISO 10646 should work with CEN to fix their set in a revision.

-- Philippe. Spam not tolerated: any unsolicited message will be reported to your Internet service providers.
Putting Unicode to Work
Although some list members may already be aware of these pages, because there are still very few web sites today that present text using a variety of Unicode ranges for purposes other than 'display testing', I thought the entire list should know about the Unicode pages on the Hot Peach Pages/EarthWords site: 1. Languages A to I (http://www.hotpeachpages.net/lang/indexu.html) 2. Languages J to Z (http://www.hotpeachpages.net/lang/index2u.html) 3. Quick definition of DV (http://www.hotpeachpages.net/lang/defnu.html) (There is also an information page about how to 'access' Unicode at http://www.hotpeachpages.net/a/characters.html.) This non-profit web site was intended as an international resource from its inception, and has recently incorporated Unicode in order to further that goal. Of the millions of web sites on the internet, it may in fact be 'the' one that currently makes the most use of Unicode ranges in furtherance of 'non-whimsical', 'non-technical' purposes. Any suggestions for improvement would be welcomed.