RE: FW: Using Unicode Characters in ASCII Streams
Asmus Freytag wrote:

From: [EMAIL PROTECTED] [...] we are a manufacturer of time and attendance terminals which are transferring data using 8-bit character streams [...] Now here is my question: Is there a method to add any Unicode character to an 8-bit ASCII stream? [...]

There are three or four options for forcing Unicode into an 8-bit format. a) Use UTF-8. This preserves ASCII, but the characters above 127 are different from Latin-1. [...] Of these four approaches, d) uses the least space, a) is the most widely supported in plain text files [...] All four require that the receiver understand that format, but a) is considered one of the three equivalent Unicode Encoding Forms and is therefore standard.

I'd like to stress that being standard implies that UTF-8 is supported out of the box by many word processors and text editors, on many operating systems. This is important because, normally, the localized text messages to be sent to embedded terminals are contained in ordinary text files, prepared on a standard personal computer. Often, the person who physically edits the message is a free-lance translator who knows nothing about the technical details of the embedded terminal.

So, sticking to UTF-8 may simplify the task of preparing and distributing localized messages. E.g., when you want to go Russian, you just hire a Russian translator and ask him to please submit the files in UTF-8. If (s)he has some expertise with text files, (s)he will need no further clarification, and will send proper UTF-8. Otherwise, if (s)he doesn't understand and submits the file in some other standard encoding, you just pick one of the many existing encoding-conversion programs and turn the file into UTF-8. On the other hand, using a proprietary format also implies implementing proprietary utilities for the personal computer: text editors, viewers, converters, etc.
Moreover, if UTF-8 support is to be added to the embedded terminal, it is easy to find the relevant code already implemented and tested in good C language. A proprietary format, on the other hand, must be designed, implemented and tested from scratch. _ Marco
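Marco's point about UTF-8's suitability for an 8-bit stream can be demonstrated in a few lines. This is a sketch using Python's standard library; the message strings are invented examples, not taken from any real terminal protocol:

```python
# Why UTF-8 suits an 8-bit ASCII stream: ASCII bytes pass through
# unchanged, and every non-ASCII character becomes a sequence of
# bytes in the range 0x80-0xFF, so an ASCII-transparent protocol
# keeps working. The strings below are invented examples.
ascii_msg = "BADGE 42 OK"
russian_msg = "Привет"  # a localized message a translator might supply

utf8_ascii = ascii_msg.encode("utf-8")
utf8_russian = russian_msg.encode("utf-8")

# ASCII text is byte-for-byte identical in UTF-8.
assert utf8_ascii == ascii_msg.encode("ascii")

# Non-ASCII characters never produce bytes below 0x80.
assert all(b >= 0x80 for b in utf8_russian)

# And the stream decodes back losslessly.
assert utf8_russian.decode("utf-8") == russian_msg
```

The same properties hold in any language's UTF-8 codec, which is why the receiving terminal can treat the stream as opaque 8-bit data until it needs to render it.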
New documents
New WG2 documents are available:

N2410  Revised proposal to encode the Limbu script in the UCS.
       http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2410.pdf
       (This document provides corrections to the Limbu set under ballot)

N2411  Proposal to add two Greek letters for Bactrian to the UCS.
       http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2411.pdf

-- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: Key E00 (was: (no subject))
At 02:24 -0500 2002-02-06, [EMAIL PROTECTED] wrote: ISO keyboards have the section-sign (§) key, next to the 1 key above the tab key on the left of the keyboards. Some US keyboards (for instance the Mac PowerBook G3) don't have this key, but instead have the grave key there, while on the ISO keyboard the grave key is down next to the z. -- Michael Everson *** Everson Typography *** http://www.evertype.com
FW: Bar codes using unicode
Found that somewhat old e-mail from Clive, but the web site is still there ... Good luck, Arnold

-Original Message- From: Hohberger, Clive [mailto:[EMAIL PROTECTED]] Sent: Friday, May 11, 2001 5:34 AM To: '[EMAIL PROTECTED]' Subject: Bar codes using unicode

Speaking as a member of the AIM bar code standards committee, there are two new bar codes which support Unicode. 93i (designed by Sprague Ackey of Intermec) is a linear, error-correcting barcode issued as an AIM International Technical Standard, and it encodes Unicode 2.0/2.1. For an overview, see: http://www.aimglobal.org/standards/symbinfo/93i_overview.htm

Ultracode(R) and Color Ultracode (designed by me; Zebra Technologies Corporation) are 2-dimensional error-correcting symbologies in the AIM standards process. The Ultracode symbology is a constant-height, variable-length two-dimensional linear matrix using 9-cell-high x 2-cell-wide tiles containing 283 different values (originally 47). Ultracode can encode either 8-bit, multi-byte or the full 21-bit Unicode 3-series character sets. Because of the unique way in which characters are encoded, there is little difference in symbol length when either 8-bit or Unicode encoding is used, with either Latin or non-Latin characters such as Chinese, Japanese and Korean. UTF-8 is the default input/output. Black & white Ultracode is scheduled for completion this year... Color Ultracode in 2002. Anyone wishing a copy of the current Ultracode draft spec should contact me offline ([EMAIL PROTECTED]) Clive
Answers about Unicode history
Here is a summary of all the answers I received to my historical questions. Sorry for the length of this post, but I think that many people will find this worth reading. Thanks again to all the people who took the time to reply. _ Marco

--- --- --- ---

Q: When did the Unicode project start, and who started it?

A: [Magda Danish] I am currently working on a few web pages that talk about the Unicode history.

A: [Mark Davis] While we will continue to flesh out and improve these pages, the initial versions are publicly available, under Historical Data on: http://www.unicode.org/unicode/consortium/consort.html

A: [Kenneth Whistler] The short answer is that Joe Becker (Xerox) and Lee Collins (Apple) were highly instrumental in getting the ball rolling on this, and the preliminary work they did, primarily on Han unification, dated from 1987. However, the Unicode project had many beginnings -- many points where you could mark a milestone in its early development. And the Unicode Consortium celebrated a number of 10-year anniversaries, starting from 1998 and continuing through last year.

A: [Joseph Becker] Don't forget Mark Davis (then of Apple), who was more than highly instrumental in getting the ball rolling! And, don't forget my Unicode '88 manifesto, which was the clear intentional inception of Unicode as a specific initiative. I drafted it in February 1988 after the enthusiastic reception of my Unicode proposal at Uniforum, its final draft being August 1988. Since the Consortium has in fact handed it out as marking the start of Unicode, I think its mention might be clarified in our official history, which currently says: "September 1988 ... Becker later presents paper on Unicode to ISO WG2."

A: [Nelson H.F. Beebe] I remember reading this article more than 15 years ago, and being impressed by the possibilities that it represented:

@String{j-SCI-AMER = "Scientific American"}

@Article{Becker:1984:MWP,
  author =       "Joseph D. Becker",
  title =        "Multilingual Word Processing",
  journal =      j-SCI-AMER,
  volume =       "251",
  number =       "1",
  pages =        "96--107",
  month =        jul,
  year =         "1984",
  CODEN =        "SCAMAC",
  ISSN =         "0036-8733",
  bibdate =      "Tue Feb 18 10:44:43 MST 1997",
  bibsource =    "Compendex database",
  abstract =     "The advantages of computerized typing and editing are
                 now being extended to all the living languages of the
                 world. Even a complex script such as Japanese or Arabic
                 can be processed.",
  acknowledgement = ack-nhfb # " and " # ack-rc,
  affiliationaddress = "Xerox Office Systems Div, Palo Alto, CA, USA",
  classification = "723",
  journalabr =   "Sci Am",
  keywords =     "Character Sets; data processing; word processing",
}

It was followed up by this more formal one:

@String{j-CACM = "Communications of the ACM"}

@Article{Becker:1987:AWP,
  author =       "Joseph D. Becker",
  title =        "{Arabic} word processing",
  journal =      j-CACM,
  volume =       "30",
  number =       "7",
  pages =        "600--610",
  month =        jul,
  year =         "1987",
  CODEN =        "CACMA2",
  ISSN =         "0001-0782",
  bibdate =      "Thu May 30 09:41:10 MDT 1996",
  bibsource =    "http://www.acm.org/pubs/toc/",
  URL =          "http://www.acm.org/pubs/toc/Abstracts/0001-0782/28570.html",
  acknowledgement = ack-nhfb,
  keywords =     "algorithms; design; documentation; human factors;
                 measurement",
  review =       "ACM CR 8902-0084",
  subject =      "{\bf H.4.1}: Information Systems, INFORMATION SYSTEMS
                 APPLICATIONS, Office Automation, Word processing. {\bf
                 J.5}: Computer Applications, ARTS AND HUMANITIES,
                 Linguistics. {\bf I.7.1}: Computing Methodologies, TEXT
                 PROCESSING, Text Editing, Languages.",
}

The latter is not in unicode.bib, but will soon be.

--- --- --- ---

Q: Is it true that Han unification was the core of Unicode, and that the idea of a universal encoding came afterwards?

A: [Kenneth Whistler] The effort by Xerox and Apple to do a Han unification was key to the motivation that eventually led to a serious effort to actually *do* Unicode and then to establish the Unicode Consortium to standardize and promote it. However, the idea of a universal encoding predated that considerably.
In some respects the Xerox Character Code Standard (XCCS) was a serious attempt at providing a universal character encoding (although it did not include a unified Han encoding, only Japanese kanji). XCCS 2.0 (1980) contained, in addition to Japanese kanji: Latin (with IPA), Hiragana, Bopomofo, Katakana, Greek, Cyrillic, Runic, Gothic, Arabic, Hebrew, Georgian, Armenian, Devanagari, Hangul jamo, and a wide variety of symbols. The early Unicoders mined XCCS 2.0 heavily for the early drafts of Unicode 1.0, and always regarded it as the prototype for a universal encoding. Additionally, you have to consider that the beginning of the ISO project for a Multi-octet Universal Character Set (10646) predated the formal establishment of Unicode. Part of the impetus for the serious work to standardize Unicode was, of course, discontent with the then architecture of the early drafts of 10646.

--- --- --- ---

Q: Who invented the name "Unicode", and when?

A: [Kenneth Whistler] This one has a definitive
RE: Bar codes using unicode
Thanks, Arnold. Good time for an update... The public-domain specification (an AIM International Technical Standard) for black/white Ultracode(R) will be released sometime this year, probably around 4Q2002. We will have prototype encoding (UTF-8 to symbol graphic) and codeword-to-UTF-8 software available for anyone to try in 2Q2002. I'll send out notice of availability of the current draft spec and software to the Unicode list as soon as it is ready. Color Ultracode uses the same internal data-encoding engine as monochrome Ultracode, but with 1x9 colored tiles rather than 2x9 B/W tiles. This spec will probably run about 6-9 months later. Both versions support UTF-8 input/output and can address all code points on Planes 0-16. Internally, the encoding has sophisticated compaction modes for decimal numerics and delimited numeric strings, 7-bit ASCII/ISO 646, a wide range of 8-bit character sets (such as the ISO 8859-x series, etc., using 8-bit I/O), Unicode single-row alphabetic languages, and BMP and multiplanar CJKV encoding. It is a Reed-Solomon error-correcting barcode designed for rough-service and high-damage applications. It can be printed and direct-marked using almost any technology, including a stencil and spray can! Several companies have expressed interest in developing imaging scanners for it. Zebra Technologies will have it available in its bar code printers by the end of the year. Anyone with a potential application or who wants to do a field trial should contact me off-line. I'll be delighted to help!
Cheers, Clive

* Clive P Hohberger, PhD VP, Technology Development Director of Patent Affairs Zebra Technologies Corporation 333 Corporate Woods Parkway Vernon Hills IL 60061-3109 USA Voice: +1 847 793 2740 FAX: +1 847 793 5573 Cellular: +1 847 910 8794 E-mail: [EMAIL PROTECTED]
Re: A few questions about decomposition, equivalence and rendering
JC It's pretty much a given that a normalization form that meddles with
JC plain ASCII text isn't going to get used.

I had to think about it, but it does make sense.

JC The U+1Fxx ones are the spacing compatibility equivalents,

Compatibility with whom?

Juliusz
Re: A few questions about decomposition, equivalence and rendering
Thanks a lot for the explanations.

KW There is no good reason to invent composite combining marks
KW involving two accents together. (In fact, there are good reasons
KW *not* to do so.) The few that exist, e.g. U+0344, cause
KW implementation problems and are discouraged from use.

What are those problems? As long as they have canonical decompositions, won't such precomposed characters be discarded at normalisation time, hopefully during I/O? (I'm not arguing in favour of precomposed characters; I'm just saying that my gut instinct is that we have to deal with normalisation anyway, and hence they don't complicate anything further; I'd be curious to hear why you think otherwise.)

As far as I can tell, there is nothing in the Unicode database that relates a ``modifier letter'' to the associated punctuation mark.

KW Correct. They are viewed as distinct classes.

does anyone [have] a map from mathematical characters to the Geometric Shapes, Misc. Symbols and Dingbats that would be useful for rendering?

KW As opposed to the characters themselves? I'm not sure what you
KW are getting at here.

The user invokes a search for ``f o g'' (the composite of g with f), and she entered U+25CB WHITE CIRCLE. The document does contain the required formula, but encoded with U+2218 RING OPERATOR. The user's input was arguably incorrect, but I hope you'll agree that the search should match. Or: I'm rendering a document that contains U+2218. The current font doesn't contain a glyph associated with this codepoint, but it has a perfectly good glyph for U+25CB. The rendering software should silently use the latter. Analogous examples can be made for the ``modifier letters''. I'll mention that I do understand why these are encoded separately[1], and I do understand why and how they will behave differently in a number of situations.
I am merely noting that there are applications (useful-in-practice search, rendering) where they may be identified or at least related, and I am wondering whether people have already compiled the data necessary to do so. Thanks again, Juliusz

[1] Off-topic: I have mixed feelings about the inclusion of STICS. On the one hand, it's great to at last have a standardised encoding for math characters; on the other, I feel it is based on very different encoding principles than the rest of Unicode.
Re: Key E00 (was: (no subject))
Apple calls what I have on my desk an ISO extended keyboard. It came with my Cube. It has the section key next to the 1, and the grave key next to the z. My Powerbook has the grave key next to the 1, and no key next to the z. -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: Cherokee accent
Here is the response I got from the Cherokee Nation, to whom I cc'd my original question about the Cherokee accent mark. So, is this a candidate for encoding? -8-begin forwarded message-8- The accent is to be used on the syllable with the accent when pronouncing the word, just like an accent is used in the pronunciation key of an English word. Thank you for your inquiry, LISA wadulisi Name: Lisa Stopp wadulisi dinalewisda Resource Coordinator for the Arts Cultural Resource Center Cherokee Nation PO Box 948, Tahlequah, OK 74465 918-458-6170 fax 918-458-6172 E-mail: Lisa Stopp [EMAIL PROTECTED] Date: 02/07/2002 Time: 10:19:54 -Doug Ewell Fullerton, California (address will soon change to dewell at adelphia dot net)
Re: Key E00 (was: (no subject))
In a message dated 2002-02-06 3:39:14 Pacific Standard Time, [EMAIL PROTECTED] writes: ISO keyboards have the section-sign (§) key, next to the 1 key above the tab key on the left of the keyboards. Some US keyboards (for instance the Mac PowerBook G3) don't have this key, but instead have the grave key there, while on the ISO keyboard the grave key is down next to the z. My draft copy of ISO/IEC 9995-3, acquired from: http://iquebec.ifrance.com/cyberiel/sc35wg1/SC35N0233_9995-3.pdf shows SECTION SIGN on key C02, level 2 of the common secondary group, and GRAVE ACCENT on key C12, level 1 on both the complementary Latin and common secondary groups. (Note that C12 is frequently relocated to B00, down next to the 'z' as you indicated.) In the complementary Latin group, key E00 is ASTERISK (level 1) and PLUS SIGN (level 2), while in the common secondary group it is NOT SIGN (level 1) and SOFT HYPHEN (level 2). Which ISO keyboard are you referring to? I'm not trying to be argumentative; I just got done implementing a lot of keyboards, and none of them had SECTION SIGN on key E00, so I'm curious. For those unfamiliar with ISO 9995 terminology, please refer to the above document as well as: http://iquebec.ifrance.com/cyberiel/sc35wg1/SC35N0232_9995-2.pdf and John Cowan's explanation from yesterday. -Doug Ewell Fullerton, California (address will soon change to dewell at adelphia dot net)
Re: Cherokee accent
At 12:01 -0500 2002-02-06, [EMAIL PROTECTED] wrote: Here is the response I got from the Cherokee Nation, to whom I cc'd my original question about the Cherokee accent mark. So, is this a candidate for encoding? I think I'll talk to her and see. -- Michael Everson *** Everson Typography *** http://www.evertype.com
New Unicode Encoding/Compression: BOCU-1
Hello, Mark Davis and I developed a concrete, MIME-friendly version of the BOCU algorithm that we presented earlier (http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html). We have a summary and spec with sample code at http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/conversion/bocu1/bocu1.html BOCU-1: A MIME-compatible application of the Binary Ordered Compression for Unicode base algorithm. ... BOCU-1 combines the wide applicability of UTF-8 with the compactness of SCSU. It is useful for short strings and maintains code point order. ... stateful ... Feedback is welcome. Best regards, markus
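The compactness claim rests on difference coding: consecutive characters from the same script have nearby code points, so their differences are small. The sketch below illustrates only that core idea; it is NOT the published BOCU-1 byte format (see the linked spec for the real algorithm), and the starting baseline is a simplification:

```python
# Toy illustration of the idea behind BOCU (NOT the real BOCU-1 byte
# format): represent each code point as its difference from the
# previous one. Text within a single script yields mostly small
# numbers, which the real algorithm packs into single bytes.
def code_point_diffs(s):
    prev = 0x40  # simplified fixed baseline; BOCU-1's state is richer
    diffs = []
    for ch in s:
        cp = ord(ch)
        diffs.append(cp - prev)
        prev = cp
    return diffs

# Cyrillic text: one large jump into the script, then tiny deltas.
diffs = code_point_diffs("абв")
assert diffs == [0x430 - 0x40, 1, 1]
```

Binary code point order is preserved because the differences are taken against the same running state for any two strings being compared, which is the property that makes BOCU-compressed strings sort like their uncompressed forms.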
Re: Unicode and Security
On Wednesday, February 6, 2002, at 11:12 AM, Lars Kristan wrote: Maybe digitally signed messages and bank accounts are not that good an example, since people would be more careful there. Another case where this may get exploited will be domain names, once Unicode is allowed there. While www.example.com may be a company I trust, www.example.com with a Cyrillic 'a' in it may be a hacker (and no, I did not imply he/she would be from a country that uses Cyrillic) trying to get me to visit the site.

Right, but the problem right now is that people are typing things like www.whitehouse.com instead of www.whitehouse.gov (or, for that matter, www.unicode.com). How likely is it that someone will accidentally type www.s$B'Q(Bmple.com instead of www.sample.com?

The original focus was on digital signatures, and I still don't get the objection. Because I don't know *precisely* what bytes Microsoft Word or Adobe Acrobat use, do I refuse to sign documents they create? Is that the idea? I mean, good heavens, I don't even know *precisely* what bytes Mail.app is going to use for this email. Should I refuse to sign it?

== John H. Jenkins [EMAIL PROTECTED] [EMAIL PROTECTED] http://homepage.mac.com/jenkins/
IUC20 talk Querying XML Documents by Paul Cotton Jonathan Robie
The slides from the IUC20 talk titled "Querying XML Documents", given by Paul Cotton and Jonathan Robie, are now available at: http://www.w3.org/2002/01/xquery-unicode.pdf

Misha Wolf
RE: Unicode and Security
Well, I was tempted to join the discussion for a while now, but one of the things that stopped me was that I didn't quite understand why it was so focused on the bidi stuff. Making a certain portion of the text look like something else should be easier than that. OK, invisible non-spacing glyphs would be just one more method, I guess. I was thinking of replacing some characters with their look-alikes (probably even rendered from the same data in a font), like using U+0430 instead of U+0061 (Cyrillic 'a' instead of Latin 'a').

Maybe digitally signed messages and bank accounts are not that good an example, since people would be more careful there. Another case where this may get exploited will be domain names, once Unicode is allowed there. While www.example.com may be a company I trust, www.example.com with a Cyrillic 'a' in it may be a hacker (and no, I did not imply he/she would be from a country that uses Cyrillic) trying to get me to visit the site.

Yes, it's a fraud. And I want to thank John for pointing that out. But we're making it a hell of a lot easier now. In ASCII, all one could try was www.examp1e.com and a couple of other tricks; maybe 10 tricks in ASCII, some more in the case of Latin-1. How many are there with Unicode? Uh, a million?

Well, nothing wrong with Unicode, of course. It just means that there will need to be an option in your browser to reject any site without a digital certificate, and perhaps it will need to be turned on by default. So, there are ways to fight this (and I am afraid relying on the police will not do it), but maybe these things should be well in place before someone gets a chance to exploit the new ways. Just a thought.

Regards, Lars

-Original Message- From: John Hudson [mailto:[EMAIL PROTECTED]] Sent: Wednesday, February 06, 2002 01:54 To: Unicode List Subject: Re: Unicode and Security

At 09:39 2/5/2002, John H. Jenkins wrote: Y'know, I must confess to not following this thread at all.
Yes, it is impossible to tell from the glyphs on the screen what sequence of Unicode characters was used to generate them. Just *how*, exactly, is this a security problem? I was wondering the same thing. I can make an OpenType font that uses contextual substitution to replace the phrase 'The licensee also agrees to pay the type designer $10,000 every time he uses the lowercase e' with a series of invisible non-spacing glyphs. Of course, the backing store will contain my dastardly hidden clause, and that is the text the unwitting victim will electronically sign. Hahahaha, he laughed maniacally! This has nothing to do with encoding, does not rely on difficult and totally improbable manipulation of a bidirectional algorithm and, most relevantly, is *not* a security problem in the OpenType font specification. It is an example of fraud. I suppose if there were a software solution to all such dangers, we wouldn't need police, felony charges, the court system, prisons, or any of the other things we rely on to protect honest people against dishonest ones.

John Hudson

Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED]

... es ist ein unwiederbringliches Bild der Vergangenheit, das mit jeder Gegenwart zu verschwinden droht, die sich nicht in ihm gemeint erkannte. ... every image of the past that is not recognized by the present as one of its own concerns threatens to disappear irretrievably. Walter Benjamin
Re: A few questions about decomposition, equivalence and rendering
Juliusz continued:

KW There is no good reason to invent composite combining marks
KW involving two accents together. (In fact, there are good reasons
KW *not* to do so.) The few that exist, e.g. U+0344, cause
KW implementation problems and are discouraged from use.

What are those problems? As long as they have canonical decompositions, won't such precomposed characters be discarded at normalisation time, hopefully during I/O? (I'm not arguing in favour of precomposed characters; I'm just saying that my gut instinct is that we have to deal with normalisation anyway, and hence they don't complicate anything further; I'd be curious to hear why you think otherwise.)

Perhaps I overstated the case slightly. It is true enough that if you are working with normalized data, U+0344 gets normalized away:

% egrep 0344 NormalizationTest-3.2.0d6.txt
0344;0308 0301;0308 0301;0308 0301;0308 0301; # ... COMBINING GREEK DIALYTIKA TONOS

and you just end up with an otherwise typical sequence of combining marks. However, the complication is in the statement of the algorithm, where you end up having to talk about (and include in your tables) the Non-Starter Decompositions. See CompositionExclusions.txt, which has a special section mentioning just these four oddballs:

# (4) Non-Starter Decompositions
# These characters can be derived from the UnicodeData file
# by including all characters whose canonical decomposition consists
# of a sequence of characters, the first of which has a non-zero
# combining class.
# These characters are simply quoted here for reference.
#
# 0344 COMBINING GREEK DIALYTIKA TONOS
# 0F73 TIBETAN VOWEL SIGN II
# 0F75 TIBETAN VOWEL SIGN UU
# 0F81 TIBETAN VOWEL SIGN REVERSED II

Note also that all four of these characters get "use of this character is discouraged" notes in the Unicode names list. These characters also result in a problematical edge case for processing of the tables for the Unicode Collation Algorithm to provide proper weightings.
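Ken's observation that U+0344 "gets normalized away" is easy to check with Python's unicodedata module (a minimal sketch; any conformant NFC/NFD implementation behaves the same way):

```python
import unicodedata

# U+0344 COMBINING GREEK DIALYTIKA TONOS canonically decomposes to
# U+0308 U+0301 and, being a non-starter decomposition, is never
# re-composed, so it disappears under both NFD and NFC.
assert unicodedata.normalize("NFD", "\u0344") == "\u0308\u0301"
assert unicodedata.normalize("NFC", "\u0344") == "\u0308\u0301"

# Applied after a base letter, NFD yields an ordinary mark sequence.
assert unicodedata.normalize("NFD", "\u03B9\u0344") == "\u03B9\u0308\u0301"
```

Note that the normalized output contains no trace of U+0344, which is exactly why the character complicates the algorithm's tables more than it helps any producer of text.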
does anyone [have] a map from mathematical characters to the Geometric Shapes, Misc. Symbols and Dingbats that would be useful for rendering?

KW As opposed to the characters themselves? I'm not sure what you
KW are getting at here.

The user invokes a search for ``f o g'' (the composite of g with f), and she entered U+25CB WHITE CIRCLE. The document does contain the required formula, but encoded with U+2218 RING OPERATOR. The user's input was arguably incorrect, but I hope you'll agree that the search should match. Or: I'm rendering a document that contains U+2218. The current font doesn't contain a glyph associated with this codepoint, but it has a perfectly good glyph for U+25CB. The rendering software should silently use the latter. Analogous examples can be made for the ``modifier letters''. I'll mention that I do understand why these are encoded separately[1], and I do understand why and how they will behave differently in a number of situations. I am merely noting that there are applications (useful-in-practice search, rendering) where they may be identified or at least related, and I am wondering whether people have already compiled the data necessary to do so.

I don't think so -- at least not officially within the Unicode Consortium. This is concerned with shape similarities that go beyond the kind of character folding implicit in the Unicode Collation Algorithm. The Unicode names list provides a considerable number of cross-references for similarly-shaped characters and confusables, but this is, of course, far short of a detailed listing that could be used as the basis of a specification for shape-based folding for search purposes.

--Ken
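The data Juliusz asks about would look something like a folding table consulted before comparison. The sketch below is hypothetical: the single pair in it is an illustrative pick from his own example, not an official Unicode mapping:

```python
# Hypothetical shape-based folding for search: map visually similar
# characters to one representative before comparing strings. The table
# entry is illustrative only, not from any official Unicode data file.
FOLD = {
    "\u2218": "\u25CB",  # RING OPERATOR -> WHITE CIRCLE
}

def fold_shapes(text):
    return "".join(FOLD.get(ch, ch) for ch in text)

# A query typed with U+25CB now matches a document using U+2218.
assert fold_shapes("f \u2218 g") == fold_shapes("f \u25CB g")
```

Building a complete table of this kind is exactly the hard part; the names-list cross-references Ken mentions are a starting point, but they are informative annotations, not machine-readable data.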
RE: Unicode and Security
Well, nothing wrong with Unicode of course. Just means that there will need to be an option in your browser to reject any site without a digital certificate, and perhaps it will need to be turned on by default. So,

Nothing prevents sites running frauds from getting a certificate matching their name. If the price of certificates drops, or if the fraud has good enough margins, it will not even be a big inconvenience. YA
Re: Unicode and Security
On Wed, Feb 06, 2002 at 07:12:19PM +0100, Lars Kristan wrote: Well, I was tempted to join the discussion for a while now, but one of the things that stopped me was that I didn't quite understand why it was so focused on the bidi stuff.

Because it can have a dramatic effect, whereas swapping in look-alikes has no effect on the displayed text.

Yes, it's a fraud. And I want to thank John for pointing that out. But we're making it a hell of a lot easier now. In ASCII, all one could try was www.examp1e.com and a couple of other tricks, but it was maybe 10 tricks in ASCII, some more in case of Latin 1. How many are there with Unicode? U, a million?

How often does it matter? I can see registrars not registering stuff that is obviously an attempt to defraud, but you won't get there if you type it in yourself. It's easier for someone to set up a forged Microsoft link, but it's easy to check that. Rather than everything being digitally signed, just checking whether a name mixes scripts and popping up a warning will catch most of the cases. You could color-code the major scripts with confusables . . .

-- David Starner - [EMAIL PROTECTED], dvdeug/jabber.com (Jabber) Pointless website: http://dvdeug.dhis.org What we've got is a blue-light special on truth. It's the hottest thing with the youth. -- Information Society, Peace and Love, Inc.
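David's suggestion of "checking if it's multiscript" can be sketched with the standard library. This is a rough stand-in: it keys off the first word of each character's Unicode name, whereas a production checker should use the real Script property (the function name and the name-prefix trick are my own, not from any standard API):

```python
import unicodedata

def scripts_in_label(label):
    # Crude script detection: take the first word of each character's
    # Unicode name (LATIN, CYRILLIC, ...). A real implementation should
    # use the Unicode Script property instead of this approximation.
    scripts = set()
    for ch in label:
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts

# "example" with a Cyrillic U+0430 in place of Latin "a" mixes scripts
# and could trigger the warning David describes.
assert scripts_in_label("example") == {"LATIN"}
assert scripts_in_label("ex\u0430mple") == {"LATIN", "CYRILLIC"}
```

A browser or registrar hook would warn whenever `len(scripts_in_label(name)) > 1` for a label that visually resembles a known single-script name.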
Re: Unicode and Security
At 11:54 AM 2/6/2002 -0700, John H. Jenkins wrote: The original focus was on digital signatures, and I still don't get the objection. Because I don't know *precisely* what bytes Microsoft Word or Adobe Acrobat use, do I refuse to sign documents they create? Is that the idea? I mean, good heavens, I don't even know *precisely* what bytes Mail.app is going to use for this email. Should I refuse to sign it?

I don't think the main issue is whether or not you should sign it. I think the main issue the original poster tried to raise is that, as the recipient of such a signed document, he is not persuaded he should trust it. This is a serious issue, although, as several have noted, not a Unicode-only one. No one doubts the security of the encryption algorithms used for signing. But the issue of trust is critical. In the analog world, people are expected to read and understand documents, and in general, the world's legal systems are set up to recognize that a signature (or stamp or seal or whatever) is binding evidence that such care was taken (even if it wasn't really taken). In the digital world, individual behavior and legal processes both may not be so well formed to support the technology of digital signatures. I believe this is what the original point was. IANAL, but enforceability of such a kluged, digitally-signed document seems in doubt. There is a long history of that type of contract support in our US legal systems, and probably others as well. There will surely be difficulties adapting it to the digital domain, but I think the basis for support is already there.

Anyway, it is not, but maybe should be, well known that the purpose of digital signatures is to verify who the sender is, and to verify that the document has not been changed in transit. That it might contain tricky language or information is an important thing to note, but the reader still needs to view the document's contents with the same skeptical eye as if it were on paper. Just as the Unicode bidi algorithm makes no claims of reversibility, digital signing algorithms make no claim that the signed contents are correct, or even useful.
Oops!
The ALA/LC romanization tables are at: lcweb.loc.gov/catdir/cpso/roman.html (not .../romanization.html as in my earlier note) Sorry, Jim Agenbroad ( [EMAIL PROTECTED] ) It is not true that people stop pursuing their dreams because they grow old, they grow old because they stop pursuing their dreams. Adapted from a letter by Gabriel Garcia Marquez. The above are purely personal opinions, not necessarily the official views of any government or any agency of any. Addresses: Office: Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Sys.Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A. Home: Phone: 301 946-7326; US mail: Box 291, Garrett Park, MD 20896.
ALA/LC Romanization Tables on the Web
Wednesday, February 6, 2002 The scanned pages of the 1997 ALA/LC romanization tables are now available on the Web: http://lcweb.loc.gov/catdir/cpso/romanization.html Note that in lieu of the Wade-Giles pages there is a note that pinyin guidelines are pending. Regards, Jim Agenbroad ( [EMAIL PROTECTED] )
Re: A few questions about decomposition, equivalence and rendering
Kenneth Whistler wrote: ... See CompositionExclusions.txt, which has a special section mentioning just these four oddballs:

# (4) Non-Starter Decompositions
# These characters can be derived from the UnicodeData file
# by including all characters whose canonical decomposition consists
# of a sequence of characters, the first of which has a non-zero
# combining class.

Shouldn't that say "a sequence of two characters"? Taken literally, this definition includes characters with a canonical decomposition that is a single combining character. (To forestall the obvious objection: no, using the plural does not imply more than one; any decomposition is a sequence of characters.)

-- David Hopwood [EMAIL PROTECTED]
Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01