RE: FW: Using Unicode Characters in ASCII Streams

2002-02-06 Thread Marco Cimarosti

Asmus Freytag wrote:
  From: [EMAIL PROTECTED]
[...]
  we are a manufacturer of time and attendance terminals which
  are transferring data using 8-bit character streams
[...]
  Now here is my question: Is there a method to add any
  Unicode character to an 8-bit ASCII stream?
[...]
 
 There are three or four options for forcing Unicode into an 
 8-bit format.
 
 a) Use UTF-8. This preserves ASCII, but the characters above 127
 are different from Latin-1.
 
[...]
 
 Of these four approaches, d) uses the least space, a) is the 
 most widely supported in plain text files [...]
 
 All four require that the receiver can understand that 
 format, but a) is considered one of the three
 equivalent Unicode Encoding Forms and therefore standard.

I'd like to stress that this being standard implies that UTF-8 is supported
out-of-the-box by many word processors and text editors, on many operating
systems.

This is important because, normally, the localized text messages to be sent
to embedded terminals are contained in normal text files, prepared on a
standard personal computer. Often, the person who physically edits the
message is a free-lance translator who knows nothing about the technical
details of the embedded terminal.

So, sticking to UTF-8 may simplify the task of preparing and distributing
localized messages.

E.g., when you want to go Russian, you just hire a Russian translator and
ask him or her to please submit the files in UTF-8. If (s)he has some
expertise with text files, (s)he will need no further clarification, and
will send proper UTF-8. Otherwise, if (s)he doesn't understand and submits
the file in some other standard encoding, you just pick one of the many
existing conversion programs and turn the file into UTF-8.

On the other hand, using proprietary formats also implies implementing
proprietary utilities for the personal computer: text editors, viewers,
converters, etc.

Moreover, if UTF-8 support is to be added to the embedded terminal, it is
easy to find the relevant code already implemented and tested in portable C.
A proprietary format, on the other hand, must be designed, implemented and
tested from scratch.

_ Marco




New documents

2002-02-06 Thread Michael Everson

New WG2 documents are available:

N2410 Revised proposal to encode the Limbu script in the UCS.
   http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2410.pdf
(This document provides corrections to the Limbu set under ballot)

N2411 Proposal to add two Greek letters for Bactrian to the UCS.
   http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2411.pdf

-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: Key E00 (was: (no subject))

2002-02-06 Thread Michael Everson

At 02:24 -0500 2002-02-06, [EMAIL PROTECTED] wrote:

ISO keyboards have the section-sign (§) key, next to the 1 key 
above the tab key on the left of the keyboards. Some US keyboards 
(for instance the Mac PowerBook G3) don't have this key, but instead 
have the grave key there, while on the ISO keyboard the grave key 
is down next to the z.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




FW: Bar codes using unicode

2002-02-06 Thread Winkler, Arnold F

Found that somewhat old e-mail from Clive, but the web site is still there
...
Good luck  
Arnold

-Original Message-
From: Hohberger, Clive [mailto:[EMAIL PROTECTED]]
Sent: Friday, May 11, 2001 5:34 AM
To: '[EMAIL PROTECTED]'
Subject: Bar codes using unicode


Speaking as a member of the AIM bar code standards committee: there are two
new bar codes which support Unicode.

93i (designed by Sprague Ackey of Intermec) is a linear, error-correcting
bar code that has been issued as an AIM International Technical Standard, and it encodes
Unicode 2.0/2.1. For an overview, see:
http://www.aimglobal.org/standards/symbinfo/93i_overview.htm

Ultracode(R) and Color Ultracode (designed by me; Zebra Technologies
Corporation) are 2-dimensional error-correcting symbologies in the AIM
standards process. The Ultracode symbology is a constant-height,
variable-length two-dimensional linear matrix using 9-cell high x 2-cell
wide tiles containing 283 different values (originally 47). Ultracode can
encode either 8-bit, multi-byte or the full 21-bit Unicode 3-series
character sets. Because of the unique way in which characters are encoded,
there is little difference in symbol length whether 8-bit or Unicode
encoding is used, with either Latin or non-Latin characters such as
Chinese, Japanese and Korean. UTF-8 is the default input/output. Black &
white Ultracode is scheduled for completion this year... Color Ultracode in 2002.

Anyone wishing a copy of the current Ultracode draft spec should contact me
offline ([EMAIL PROTECTED]) 
Clive




Answers about Unicode history

2002-02-06 Thread Marco Cimarosti

Here is a summary of all the answers I received to my historical
questions.

Sorry for the length of this  post, but I think that many people will find
this worth reading. Thanks again to all the people who took the time to
reply.

_ Marco


--- --- --- ---
Q: When did the Unicode project start, and who started it?

A: [Magda Danish]
I am currently working on a few web pages that talk about the Unicode
history.

A: [Mark Davis]
While we will continue to flesh out and improve these pages, the initial
versions are publicly  available, under Historical Data on:
http://www.unicode.org/unicode/consortium/consort.html

A: [Kenneth Whistler]
The short answer is that Joe Becker (Xerox) and Lee Collins (Apple) were
highly instrumental in  getting the ball rolling on this, and the
preliminary work they did, primarily on Han unification,  dated from 1987.
However, the Unicode project had many beginnings -- many points where you
could mark a milestone in its early development. And the Unicode Consortium
celebrated a number of  10-year anniversaries, starting from 1998 and
continuing through last year.

A: [Joseph Becker]
Don't forget Mark Davis (then of Apple), who was more than highly
instrumental in getting the ball  rolling!
And, don't forget my Unicode '88 manifesto, which was the clear
intentional inception of  Unicode as a specific initiative. I drafted it in
February 1988 after the enthusiastic reception of  my Unicode proposal at
Uniforum, its final draft being August 1988. Since the Consortium has in
fact handed it out as marking the start of Unicode, I think its mention
might be clarified in our official history, which currently says:
"September 1988 ... Becker later presents paper on Unicode to ISO WG2."

A: [Nelson H.F. Beebe]
I remember reading this article more than 15 years ago, and being impressed
by the possibilities  that it represented:
@String{j-SCI-AMER = "Scientific American"}

@Article{Becker:1984:MWP,
  author =        "Joseph D. Becker",
  title =         "Multilingual Word Processing",
  journal =       j-SCI-AMER,
  volume =        "251",
  number =        "1",
  pages =         "96--107",
  month =         jul,
  year =          "1984",
  CODEN =         "SCAMAC",
  ISSN =          "0036-8733",
  bibdate =       "Tue Feb 18 10:44:43 MST 1997",
  bibsource =     "Compendex database",
  abstract =      "The advantages of computerized typing and editing are
                   now being extended to all the living languages of the
                   world. Even a complex script such as Japanese or
                   Arabic can be processed.",
  acknowledgement = ack-nhfb # " and " # ack-rc,
  affiliationaddress = "Xerox Office Systems Div, Palo Alto, CA, USA",
  classification = "723",
  journalabr =    "Sci Am",
  keywords =      "Character Sets; data processing; word processing",
}
It was followed up by this more formal one:
@String{j-CACM = Communications of the ACM}
@Article{Becker:1987:AWP,
author = Joseph D. Becker,
title = {Arabic} word processing,
journal = j-CACM,
volume = 30,
number = 7,
pages = 600--610,
month = jul,
year = 1987,
CODEN = CACMA2,
ISSN = 0001-0782,
bibdate = Thu May 30 09:41:10 MDT 1996,
bibsource = http://www.acm.org/pubs/toc/;,
URL = http://www.acm.org/pubs/toc/Abstracts/0001-0782/28570.html;,
acknowledgement = ack-nhfb,
keywords = algorithms; design; documentation; human factors; measurement,
review = ACM CR 8902-0084,
subject = {\bf H.4.1}: Information Systems, INFORMATION SYSTEMS
APPLICATIONS, Office Automation,  Word processing. {\bf J.5}: Computer
Applications, ARTS AND HUMANITIES, Linguistics. {\bf I.7.1}:  Computing
Methodologies, TEXT PROCESSING, Text Editing, Languages.,}
The latter is not in unicode.bib, but will soon be.


--- --- --- ---
Q: Is it true Han Unification was the core of Unicode, and the idea of a
universal encoding came afterwards?

A: [Kenneth Whistler]
The effort by Xerox and Apple to do a Han unification was key to the
motivation that eventually led  to a serious effort to actually *do* Unicode
and then to establish the Unicode Consortium to  standardize and promote it.
However, the idea of a universal encoding predated that considerably.  In
some respects the Xerox Character Code Standard (XCCS) was a serious attempt
at providing a  universal character encoding (although it did not include a
unified Han encoding, but only Japanese  kanji). XCCS 2.0 (1980) contained,
in addition to Japanese kanji: Latin (with IPA), Hiragana,  Bopomofo,
Katakana, Greek, Cyrillic, Runic, Gothic, Arabic, Hebrew, Georgian,
Armenian, Devanagari,  Hangul jamo, and a wide variety of symbols. The early
Unicoders mined XCCS 2.0 heavily for the  early drafts of Unicode 1.0, and
always regarded it as the prototype for a universal encoding.
Additionally, you have to consider that the beginning of the ISO
project for a Multi-octet  Universal Character Set (10646) predated the
formal establishment of Unicode. Part of the impetus  for the serious work
to standardize Unicode was, of course, discontent with the then architecture
of the early drafts of 10646.


--- --- --- ---
Q: Who invented the name Unicode, and when?

A: [Kenneth Whistler]
This one has a definitive 

RE: Bar codes using unicode

2002-02-06 Thread Hohberger, Clive

Thanks, Arnold.
Good time for an update...

The public domain specification (an AIM International Technical Standard) for
black/white Ultracode(R) will be released sometime this year, probably
around 4Q2002. We will have prototype encoding (UTF-8 to symbol graphic) and
codeword-to-UTF-8 software available for anyone to try in 2Q2002. I'll send
out notice of availability of the current draft spec and software to the
Unicode list as soon as it is ready.

Color Ultracode uses the same data encoding internal engine as monochrome
Ultracode but 1x9 colored tiles rather than 2x9 B/W tiles. This spec will
probably run about 6-9 months later.

Both versions support UTF-8 input/output and can address all codepoints on
Code Planes 0-16. Internally, the encoding has sophisticated compaction
modes for decimal numerics and delimited numeric strings, 7-bit ASCII/ISO
646, a wide range of 8-bit character sets (such as the ISO 8859/x series,
etc.; using 8-bit I/O), Unicode single-Row alphabetic languages, BMP and
multiplanar CJKV encoding.

It is a Reed-Solomon error-correcting barcode designed for rough service and
high damage applications. It can be printed and direct marked using almost
any technology, including a stencil and spray can! Several companies have
expressed interest in developing imaging scanners for it. Zebra
Technologies will have it available in its bar code printers by the end of
the year.

Anyone with a potential application or who wants to do a field trial should
contact me off-line. I'll be delighted to help!
Cheers,
Clive


*
Clive P Hohberger, PhD
VP, Technology Development
   Director of Patent Affairs
Zebra Technologies Corporation
333 Corporate Woods Parkway
Vernon Hills IL 60061-3109 USA

Voice:  +1 847 793 2740
FAX:+1 847 793 5573
Cellular:   +1 847 910 8794
E-mail: [EMAIL PROTECTED]








Re: A few questions about decomposition, equvalence and rendering

2002-02-06 Thread Juliusz Chroboczek

JC It's pretty much a given that a normalization form that meddles with
JC plain ASCII text isn't going to get used.

I had to think about it, but it does make sense.

JC The U+1Fxx ones are the spacing compatibility equivalents,

Compatibility with what?

Juliusz




Re: A few questions about decomposition, equvalence and rendering

2002-02-06 Thread Juliusz Chroboczek

Thanks a lot for the explanations.

KW There is no good reason to invent composite combining marks
KW involving two accents together. (In fact, there are good reasons
KW *not* to do so.) The few that exist, e.g. U+0344, cause
KW implementation problems and are discouraged from use.

What are those problems?  As long as they have canonical
decompositions, won't such precomposed characters be discarded at
normalisation time, hopefully during I/O?

(I'm not arguing in favour of precomposed characters; I'm just saying
that my gut instinct is that we have to deal with normalisation
anyway, and hence they don't complicate anything further; I'd be
curious to hear why you think otherwise.)

 As far as I can tell, there is nothing in the Unicode database that
 relates a ``modifier letter'' to the associated punctuation mark.

KW Correct. They are viewed as distinct classes.

 does anyone [have] a map from mathematical characters to the
 Geometric Shapes, Misc. symbols and Dingbats that would be useful
 for rendering?

KW As opposed to the characters themselves? I'm not sure what you
KW are getting at here.

The user invokes a search for ``f o g'' (the composite of g with f),
and she entered U+25CB WHITE CIRCLE.  The document does contain the
required formula, but encoded with U+2218 RING OPERATOR.  The user's
input was arguably incorrect, but I hope you'll agree that the search
should match.

I'm rendering a document that contains U+2218.  The current font
doesn't contain a glyph associated to this codepoint, but it has a
perfectly good glyph for U+25CB.  The rendering software should
silently use the latter.

Analogous examples can be made for the ``modifier letters''.

I'll mention that I do understand why these are encoded separately[1],
and I do understand why and how they will behave differently in a
number of situations.  I am merely noting that there are applications
(useful-in-practice search, rendering) where they may be identified or
at least related, and I am wondering whether people have already
compiled the data necessary to do so.

Thanks again,

Juliusz

[1] Offtopic: I have mixed feelings on the inclusion of STICS.  On the
one hand it's great to at last have a standardised encoding for math
characters, on the other I feel it is based on very different encoding
principles than the rest of Unicode.




Re: Key E00 (was: (no subject))

2002-02-06 Thread Michael Everson

Apple calls what I have on my desk an ISO extended keyboard. It came 
with my Cube. It has the section key next to the 1, and the grave key 
next to the z. My Powerbook has the grave key next to the 1, and no 
key next to the z.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: Cherokee accent

2002-02-06 Thread DougEwell2

Here is the response I got from the Cherokee Nation, to whom I cc'd my 
original question about the Cherokee accent mark.

So, is this a candidate for encoding?

-8-begin forwarded message-8-

  The accent is to be used on the syllable with the accent when pronouncing
  the word, just like an accent is used in the pronunciation key of an English
  word.
  
  Thank you for your inquiry,
  LISA
  wadulisi
  
  Name: Lisa Stopp  wadulisi dinalewisda
  Resource Coordinator for the Arts
  Cultural Resource Center
  Cherokee Nation
  PO Box 948, Tahlequah, OK 74465
  918-458-6170
  fax 918-458-6172
  E-mail: Lisa Stopp [EMAIL PROTECTED]
  Date: 02/07/2002
  Time: 10:19:54

-Doug Ewell
 Fullerton, California
 (address will soon change to dewell at adelphia dot net)



Re: Key E00 (was: (no subject))

2002-02-06 Thread DougEwell2

In a message dated 2002-02-06 3:39:14 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

 ISO keyboards have the section-sign (§) key, next to the 1 key 
 above the tab key on the left of the keyboards. Some US keyboards 
 (for instance the Mac PowerBook G3) don't have this key, but instead 
 have the grave key there, while on the ISO keyboard the grave key 
 is down next to the z.

My draft copy of ISO/IEC 9995-3, acquired from:

http://iquebec.ifrance.com/cyberiel/sc35wg1/SC35N0233_9995-3.pdf

shows SECTION SIGN on key C02, level 2 of the common secondary group, and 
GRAVE ACCENT on key C12, level 1 on both the complementary Latin and common 
secondary groups.  (Note that C12 is frequently relocated to B00, down next 
to the 'z' as you indicated.)

In the complementary Latin group, key E00 is ASTERISK (level 1) and PLUS SIGN 
(level 2), while in the common secondary group it is NOT SIGN (level 1) and 
SOFT HYPHEN (level 2).

Which ISO keyboard are you referring to?  I'm not trying to be 
argumentative; I just got done implementing a lot of keyboards, and none of 
them had SECTION SIGN on key E00, so I'm curious.

For those unfamiliar with ISO 9995 terminology, please refer to the above 
document as well as:

http://iquebec.ifrance.com/cyberiel/sc35wg1/SC35N0232_9995-2.pdf

and John Cowan's explanation from yesterday.

-Doug Ewell
 Fullerton, California
 (address will soon change to dewell at adelphia dot net)




Re: Cherokee accent

2002-02-06 Thread Michael Everson

At 12:01 -0500 2002-02-06, [EMAIL PROTECTED] wrote:
Here is the response I got from the Cherokee Nation, to whom I cc'd my
original question about the Cherokee accent mark.

So, is this a candidate for encoding?

I think I'll talk to her and see.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




New Unicode Encoding/Compression: BOCU-1

2002-02-06 Thread Markus Scherer

Hello,

Mark Davis and I developed a concrete, MIME-friendly version of the BOCU algorithm 
that we presented earlier 
(http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html).

We have a summary and spec with sample code at 
http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/conversion/bocu1/bocu1.html

 BOCU-1:
 A MIME-compatible application of the
 Binary Ordered Compression for Unicode base algorithm.

 ... BOCU-1 combines the wide applicability of UTF-8
 with the compactness of SCSU.
 It is useful for short strings and
 maintains code point order. ... stateful ...

Feedback is welcome.

Best regards,
markus





Re: Unicode and Security

2002-02-06 Thread John H. Jenkins

On Wednesday, February 6, 2002, at 11:12 AM, Lars Kristan wrote:

 Maybe digitally signed messages and bank accounts are not that good of an
 example, since people would be more careful there. Another case where this
 may get exploited will be domain names, once Unicode is allowed there. 
 While
 www.example.com may be a company I trust, www.example.com with a Cyrillic
 'a' in it may be a hacker (and no, I did not imply he/she would be from a
 county that uses Cyrillic) trying to get me to visit the site.


Right, but the problem right now is that people are typing things like
www.whitehouse.com instead of www.whitehouse.gov (or, for that matter,
www.unicode.com).  How likely is it that someone will accidentally type
www.sаmple.com (with a Cyrillic 'а') instead of www.sample.com?

The original focus was on digital signatures, and I still don't get the
objection.  Because I don't know *precisely* what bytes Microsoft Word or
Adobe Acrobat use, do I refuse to sign documents they create?  Is that the
idea?  I mean, good heavens, I don't even know *precisely* what bytes
Mail.app is going to use for this email.  Should I refuse to sign it?

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/


IUC20 talk Querying XML Documents by Paul Cotton Jonathan Robie

2002-02-06 Thread Misha . Wolf

The slides from the IUC20 talk titled Querying XML Documents,
given by Paul Cotton and Jonathan Robie, are now available at:
   http://www.w3.org/2002/01/xquery-unicode.pdf

Misha Wolf






-- --
Visit our Internet site at http://www.reuters.com

Any views expressed in this message are those of  the  individual
sender,  except  where  the sender specifically states them to be
the views of Reuters Ltd.




RE: Unicode and Security

2002-02-06 Thread Lars Kristan

Well, I was tempted to join the discussion for a while now, but one of the
things that stopped me was that I didn't quite understand why it was so
focused on the bidi stuff.

To make a certain portion of the text look like something else should be
easier than that. OK, invisible non-spacing glyphs would be just one more
method, I guess. I was thinking of replacing some characters with their
look-alikes (probably even rendered from the same data in a font), like
using U+0430 instead of U+0061 (Cyrillic 'a' instead of Latin 'a').

Maybe digitally signed messages and bank accounts are not that good of an
example, since people would be more careful there. Another case where this
may get exploited will be domain names, once Unicode is allowed there. While
www.example.com may be a company I trust, www.example.com with a Cyrillic
'a' in it may be a hacker (and no, I did not imply he/she would be from a
county that uses Cyrillic) trying to get me to visit the site.

Yes, it's a fraud. And I want to thank John for pointing that out. But we're
making it a hell of a lot easier now. In ASCII, all one could try was
www.examp1e.com and a couple of other tricks, but it was maybe 10 tricks in
ASCII, some more in the case of Latin 1. How many are there with Unicode? Uh,
a million?

Well, nothing wrong with Unicode of course. Just means that there will need
to be an option in your browser to reject any site without a digital
certificate, and perhaps it will need to be turned on by default. So, there
are ways to fight this (and I am afraid relying on police will not do it),
but maybe these things should be well in place before someone gets a chance
to exploit the new ways.


Just a thought.


Regards,

Lars


 -Original Message-
 From: John Hudson [mailto:[EMAIL PROTECTED]]
 Sent: Wednesday, February 06, 2002 01:54
 To: Unicode List
 Subject: Re: Unicode and Security
 
 
 At 09:39 2/5/2002, John H. Jenkins wrote:
 
 Y'know, I must confess to not following this thread at all.  Yes, it is
 impossible to tell from the glyphs on the screen what sequence of Unicode
 characters was used to generate them.  Just *how*, exactly, is this a
 security problem?
 
 I was wondering the same thing.
 
 I can make an OpenType font that uses contextual substitution to
 replace the phrase 'The licensee also agrees to pay the type designer
 $10,000 every time he uses the lowercase e' with a series of invisible
 non-spacing glyphs. Of course, the backing store will contain my dastardly
 hidden clause and that is the text the unwitting victim will
 electronically sign. Hahahaha, he laughed maniacally!
 
 This has nothing to do with encoding, does not rely on difficult and
 totally improbable manipulation of a bidirectional algorithm and, most
 relevantly, is *not* a security problem in the OpenType font
 specification. It is an example of fraud. I suppose if there was a
 software solution to all such dangers, we wouldn't need police, felony
 charges, the court system, prisons, or any of the other things we rely
 on to protect honest people against dishonest.
 
 John Hudson
 
 Tiro Typeworkswww.tiro.com
 Vancouver, BC [EMAIL PROTECTED]
 
 ... es ist ein unwiederbringliches Bild der Vergangenheit,
 das mit jeder Gegenwart zu verschwinden droht, die sich
 nicht in ihm gemeint erkannte.
 
 ... every image of the past that is not recognized by the
 present as one of its own concerns threatens to disappear
 irretrievably.
Walter Benjamin
 




Re: A few questions about decomposition, equvalence and rendering

2002-02-06 Thread Kenneth Whistler

Juliusz continued:

 KW There is no good reason to invent composite combining marks
 KW involving two accents together. (In fact, there are good reasons
 KW *not* to do so.) The few that exist, e.g. U+0344, cause
 KW implementation problems and are discouraged from use.
 
 What are those problems?  As long as they have canonical
 decompositions, won't such precomposed characters be discarded at
 normalisation time, hopefully during I/O?
 
 (I'm not arguing in favour of precomposed characters; I'm just saying
 that my gut instinct is that we have to deal with normalisation
 anyway, and hence they don't complicate anything further; I'd be
 curious to hear why you think otherwise.)

Perhaps I overstated the case slightly. It is true enough that if
you are working with normalized data, U+0344 gets normalized away:

% egrep 0344 NormalizationTest-3.2.0d6.txt
0344;0308 0301;0308 0301;0308 0301;0308 0301; # ... COMBINING GREEK DIALYTIKA TONOS

and you just end up with an otherwise typical sequence of combining marks.

However, the complication is in the statement of the algorithm,
where you end up having to talk about (and include in your tables)
the Non-Starter Decompositions. See CompositionExclusions.txt, which
has a special section mentioning just these four oddballs:

# 
# (4) Non-Starter Decompositions
# These characters can be derived from the UnicodeData file
# by including all characters whose canonical decomposition consists
# of a sequence of characters, the first of which has a non-zero
# combining class.
# These characters are simply quoted here for reference.
# 

# 0344 COMBINING GREEK DIALYTIKA TONOS
# 0F73 TIBETAN VOWEL SIGN II
# 0F75 TIBETAN VOWEL SIGN UU
# 0F81 TIBETAN VOWEL SIGN REVERSED II

Note also that all four of these characters get "use of this character
is discouraged" notes in the Unicode names list.

These characters also result in a problematical edge case for
processing of the tables for the Unicode Collation Algorithm to
provide proper weightings.

  does anyone [have] a map from mathematical characters to the
  Geometric Shapes, Misc. symbols and Dingbats that would be useful
  for rendering?
 
 KW As opposed to the characters themselves? I'm not sure what you
 KW are getting at here.
 
 The user invokes a search for ``f o g'' (the composite of g with f),
 and she entered U+25CB WHITE CIRCLE.  The document does contain the
 required formula, but encoded with U+2218 RING OPERATOR.  The user's
 input was arguably incorrect, but I hope you'll agree that the search
 should match.
 
 I'm rendering a document that contains U+2218.  The current font
 doesn't contain a glyph associated to this codepoint, but it has a
 perfectly good glyph for U+25CB.  The rendering software should
 silently use the latter.
 
 Analogous examples can be made for the ``modifier letters''.
 
 I'll mention that I do understand why these are encoded separately[1],
 and I do understand why and how they will behave differently in a
 number of situations.  I am merely noting that there are applications
 (useful-in-practice search, rendering) where they may be identified or
 at least related, and I am wondering whether people have already
 compiled the data necessary to do so.

I don't think so -- at least not officially within the Unicode
Consortium. This is concerned with shape similarities that go
beyond the kind of character folding implicit in the Unicode
Collation Algorithm.

The Unicode names list provides a considerable number of cross-references
for similarly-shaped characters and confusables, but this is, of
course, far short of a detailed listing that could be used as
the basis of a specification for shape-based folding for search
purposes.

--Ken





RE: Unicode and Security

2002-02-06 Thread Yves Arrouye

 Well, nothing wrong with Unicode of course. Just means that there will
 need
 to be an option in your browser to reject any site without a digital
 certificate, and perhaps it will need to be turned on by default. So,

Nothing prevents sites running frauds from getting a certificate matching
their name. If the price of certificates drops, or if the fraud has good
enough margins, it will not even be a big inconvenience.

YA





Re: Unicode and Security

2002-02-06 Thread David Starner

On Wed, Feb 06, 2002 at 07:12:19PM +0100, Lars Kristan wrote:
 Well, I was tempted to join the discussion for a while now, but one of the
 things that stopped me was that I didn't quite understand why it was so
 focused on the bidi stuff.

Because it can have a dramatic effect, whereas changing look-alikes has
no effect on the displayed text.
 
 Yes, it's a fraud. And I want to thank John for pointing that out. But we're
 making it a hell of a lot easier now. In ASCII, all one could try was
 www.examp1e.com and a couple of other tricks, but it was maybe 10 tricks in
 ASCII, some more in case of Latin 1. How many are there with Unicode? U,
 a million?

How often does it matter? I can see registrars not registering stuff that
was obviously an attempt to defraud, but you won't get there if you type
it in yourself. It's easier for someone to set up a forged Microsoft
link, but it's easy to check that. Rather than requiring everything to be
digitally signed, just checking whether a name mixes scripts and popping
up a warning will catch most of the cases. You could color-code the major
scripts with confusables . . .
 
-- 
David Starner - [EMAIL PROTECTED], dvdeug/jabber.com (Jabber)
Pointless website: http://dvdeug.dhis.org
What we've got is a blue-light special on truth. It's the hottest thing 
with the youth. -- Information Society, Peace and Love, Inc.




Re: Unicode and Security

2002-02-06 Thread Barry Caplan

At 11:54 AM 2/6/2002 -0700, John H. Jenkins wrote:
The original focus was on digital signatures, and I still don't get the 
objection.  Because I don't know *precisely* what bytes Microsoft Word or 
Adobe Acrobat use, do I refuse to sign documents they create?  Is that the 
idea?  I mean, good heavens, I don't even know *precisely* what bytes Mail.
app is going to use for this email.  Should I refuse to sign it?


I don't think the main issue is whether or not you should sign it. I think 
the main issue the original poster tried to raise is that, as the recipient 
of such a signed document, he is not persuaded he should trust it.

This is a serious issue, although as several have noted, not a Unicode-only 
one. No one doubts the security of the encryption algorithms used for 
signing. But the issue of trust is critical.

In the analog world, people are expected to read and understand documents, and 
in general, the world's legal systems are set up to recognize that a 
signature (or stamp or seal or whatever) is binding evidence that such care 
was taken (even if it wasn't really taken). In the digital world, 
individual behavior and legal processes may not yet be well enough formed to 
support the technology of digital signatures. I believe this is what the 
original point was.

IANAL, but the enforceability of such a kludged, digitally-signed document seems 
in doubt. There is a long history of supporting that type of contract in the 
US legal system, and probably others as well. There will surely be 
difficulties adapting it to the digital domain, but I think the basis for 
support is already there.

Anyway, it is not well known, but maybe should be, that the purpose of 
digital signatures is to verify who the sender is, and to verify that the 
document has not been changed in transit. That it might contain tricky 
language or information is an important thing to note, but the reader still 
needs to read the document's contents with the same skeptical eye as if 
it were not signed. Just as the Unicode bi-di algorithm makes no claims of 
reversibility, digital signing algorithms make no claim that the signed 
contents are correct, or even useful.
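To make that last point concrete, here is a sketch using an HMAC as a
stand-in for a real asymmetric signature (the key and message are invented
for illustration; note the embedded RIGHT-TO-LEFT OVERRIDE):

```python
import hmac
import hashlib

key = b"shared-secret"  # hypothetical signing key
# The message contains U+202E RIGHT-TO-LEFT OVERRIDE, which may render
# misleadingly -- but that is invisible to the signing algorithm.
message = "Pay the bearer \u202e000,001\u202c dollars".encode("utf-8")

tag = hmac.new(key, message, hashlib.sha256).hexdigest()

# Verification only proves the bytes are unmodified and came from the
# key holder; it says nothing about what the text means or displays as.
ok = hmac.compare_digest(tag, hmac.new(key, message, hashlib.sha256).hexdigest())
print(ok)  # True
```

The signature verifies perfectly even though the rendered text may mislead
the human reader, which is exactly the gap being discussed here.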







Oops!

2002-02-06 Thread James E. Agenbroad

The ALA/LC romanization tables are at: lcweb.loc.gov/catdir/cpso/roman.html
(not .../romanization.html as in my earlier note)

 Sorry,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 It is not true that people stop pursuing their dreams because they
grow old, they grow old because they stop pursuing their dreams. Adapted
from a letter by Gabriel Garcia Marquez.
 The above are purely personal opinions, not necessarily the official
views of any government or any agency thereof.
 Addresses: Office: Phone: 202 707-9612; Fax: 202 707-0955; US
mail: I.T.S. Sys.Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, 
Washington, D.C. 20540-9334 U.S.A.
Home: Phone: 301 946-7326; US mail: Box 291, Garrett Park, MD 20896.  





ALA/LC Romanization Tables on the Web

2002-02-06 Thread James E. Agenbroad

 Wednesday, February 6, 2002
The scanned pages of the 1997 ALA/LC romanization tables are now available
on the Web:  http://lcweb.loc.gov/catdir/cpso/romanization.html
Note that in lieu of the Wade-Giles pages there is a note that pinyin
guidelines are pending.  

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: A few questions about decomposition, equvalence and rendering

2002-02-06 Thread David Hopwood

-----BEGIN PGP SIGNED MESSAGE-----

Kenneth Whistler wrote:
 ... See CompositionExclusions.txt, which
 has a special section mentioning just these four oddballs:
 
 # 
 # (4) Non-Starter Decompositions
 # These characters can be derived from the UnicodeData file
 # by including all characters whose canonical decomposition consists
 # of a sequence of characters, the first of which has a non-zero
 # combining class.

Shouldn't that say, a sequence of two characters? Taken literally
this definition includes characters with a canonical decomposition
that is a single combining character.

(To forestall the obvious objection, no, using the plural does not
imply more than one: any decomposition is a sequence of characters.)
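Hopwood's reading can be checked mechanically. A quick sketch against
Python's bundled Unicode database (a convenient stand-in for parsing
UnicodeData.txt directly, which is what the excerpt describes):

```python
import unicodedata

# Find characters whose canonical decomposition begins with a character
# of non-zero canonical combining class -- the "non-starter
# decompositions" described in CompositionExclusions.txt.
nonstarter = {}
for cp in range(0x10000):                      # the BMP suffices here
    decomp = unicodedata.decomposition(chr(cp))
    if not decomp or decomp.startswith("<"):   # skip compatibility mappings
        continue
    parts = [chr(int(h, 16)) for h in decomp.split()]
    if unicodedata.combining(parts[0]) != 0:
        nonstarter[cp] = len(parts)

# Read literally, the rule also matches singleton decompositions to a
# combining character (e.g. U+0340 -> U+0300), which is Hopwood's point.
for cp, n in sorted(nonstarter.items()):
    print(f"U+{cp:04X}: decomposes to {n} character(s)")
```

With "two characters" in the wording, singletons such as U+0340 would be
excluded and only the four oddballs with two-character decompositions
(U+0344 among them) would remain.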

- -- 
David Hopwood [EMAIL PROTECTED]

Home page  PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5  0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip


-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBPGCgFTkCAxeYt5gVAQFdYAf+JOHLD7dAfZgPT7vAid+Ttt9ojgR3dMUv
tkxu7pC1eqx0h1u9yBkwv42S7r3M41ha6dvwCrlKlxPT1H8nPj+CWP4nhRcWeDxF
8fK+Plk0FxmIedksAXL1vbPbCI5Vf36/O3OFN++oLurGdf+DuA1lZ0WC191njW6V
/+rqRjCPKwSz8UiftLrF9EApjHaSwHH5skO9OZIrbocsfGU44pl3SsJIB0HsjxU4
GAp+HbABJ+67EDH8KtUAa0lHEBKHRoC4a1KWLuFV7E1uLCGH8X2fVbAOYX/jIHEU
I8W9gJDebquu/Vnph3AIlW9MVO1hALWqB80ngZtHBYDbkT9zSXRRQg==
=+vHx
-----END PGP SIGNATURE-----