date:20030814

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Philippe Verdy

From: John Cowan [EMAIL PROTECTED]
 Peter Kirk scripsit:
  On 13/08/2003 11:09, Philippe Verdy wrote:
 
  ... For this reason, defective
  combining sequences (combining characters without a leading base
  character) should be forbidden (invalid for XML).
  
  
  If there is even the remotest possibility of this happening, we need to
  know quickly!

 As a member of the XML Core Working Group of the W3C, I can assure you that
 there is not even the remotest possibility of it.

OK, "forbidden" is possibly excessive.
Do you prefer the term "strongly discouraged", in favor of a new encoding
that could be used by applications concerned with security and parsing
issues?
If no such new encoding is proposed, XML Core WG members could at least
discuss how to solve the security problems. There may be solutions that
I have not thought of...

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Philippe Verdy

- Original Message - 
From: Jon Hanna [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, August 14, 2003 1:49 PM
Subject: RE: Questions on ZWNBS - for line initial holam plus alef

  I do agree: an XML document could require the use at some place of a
  given attribute or element. If this attribute name follows the element
  name after a line break, which gets changed into a space during parsing,
  forcing XML parsers to treat SPACE+combining as an unbreakable grapheme
  cluster acting like a letter would have the effect of creating a new
  element name, which may violate the element name identity. Now suppose
  that the attribute name contains a colon: you have created a custom
  namespace name, under which you can add any element you like, even if
  this was forbidden by the content-model of the reference schema.

 1. SPACE is treated blindly as a SPACE by XML. String + space + combining
 + string would not be treated as a single token, no matter how that space
 was introduced. That's what you were complaining about in the first place
 (as far as I can make out).
 2. While nmtokens can begin with a combining character, names cannot, nor
 can they contain spaces.
 3. This would in no way change the content-model. So even if the above two
 points didn't hold, they would only sneak the document past something which
 performed validation before parsing(!), and where the content-model was
 already pretty loose (so it didn't complain about the unrecognised
 attribute).
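
Point 2 above can be checked against a real parser. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` (backed by expat) as the XML 1.0 parser; the element and attribute names are made up for illustration. It shows that a combining mark directly after the whitespace in a start tag does not fuse with that whitespace: the document is simply rejected, whereas the same mark inside a name is accepted.

```python
import xml.etree.ElementTree as ET


def is_well_formed(doc: str) -> bool:
    """Return True if the parser accepts doc as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# Attribute name on a new line: the line break is plain whitespace.
plain = '<e\nattr="1"/>'

# Same, but the attribute "name" starts with U+0301 COMBINING ACUTE ACCENT.
# A combining mark is a name character but not a name-start character.
leading_mark = '<e\n\u0301attr="1"/>'

# The same mark in the middle of a name is allowed.
inner_mark = '<e a\u0301ttr="1"/>'

for label, doc in [("plain", plain),
                   ("leading mark", leading_mark),
                   ("inner mark", inner_mark)]:
    print(label, is_well_formed(doc))
```
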

 You've just discovered a way to disguise one document that isn't well-formed
 as a different document that isn't well-formed. l33t!

  So this would invalidate existing documents, or create holes allowing
  insertion of arbitrary XML content, if the XML application is not
  validating extremely strictly the element names (the pair namespace+
  name) and exclude completely from processing any unrecognized
  element (including all its content and attributes).

 This argument is not on friendly terms with the concept of causality.

  This would be a
  breach in the content model which may have been validated and tested
  for security in another layer of the document encoding process (notably
  when XML documents are created from templates, such as XSL
  processors, or custom C source using simple template substitution).

 Testing validity without testing well-formedness is not possible.

  So for me the sequence SPACE+combining should not be acceptable
  as a valid grapheme cluster within element names or attribute names,

 As it already isn't.

  and thus would need to be excluded from NMTOKEN. The correct
  way to do it is to consider it NOT A LETTER, but a symbol (Sk),
  exactly like other spacing diacritics, which are already invalid in
  NMTOKEN.

 Wait a second. That was my justification for why the fact that
 space+combining is ALREADY prohibited from NMTOKEN shouldn't be considered
 a failure on the part of XML to allow for freedom of choice with the
 strings used for NMTOKENs. Now you actually want to introduce this (already
 existent) feature.

  There still remains the unresolved question of grapheme clusters
  that could span the starting < or ending > or /> of tags, or
  the leading & of an entity reference.

 No there isn't. What goes before <, >, /> or & isn't a problem, since those
 are all non-combining characters and begin a new unit for any sort of
 processing that treats more than one code point as a unit. What goes after
 < or & has to be a name (not an nmtoken) and as such is already prohibited
 from beginning with a combiner. What goes after > is already dealt with by
 Charmod, and even if you ignore Charmod, apart from the possibility of
 normalisation turning the sequence U+003E, U+0338 into U+226F (a
 possibility that is well noted) it still isn't going to hurt.
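
The normalisation hazard mentioned here is easy to reproduce. A small sketch, assuming Python's standard `unicodedata` module as the normaliser: a GREATER-THAN SIGN (U+003E) followed by COMBINING LONG SOLIDUS OVERLAY (U+0338) composes under NFC into the single character NOT GREATER-THAN, so a tag-closing delimiter would disappear if a document were normalised as plain text after serialisation.

```python
import unicodedata

# '>' closing a tag, immediately followed by U+0338 as the first
# character of the element content.
sequence = ">\u0338"

composed = unicodedata.normalize("NFC", sequence)

print(len(sequence), len(composed))
print("U+%04X" % ord(composed), unicodedata.name(composed))

# NFD restores the original two code points.
restored = unicodedata.normalize("NFD", composed)
```
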

One note: in Unicode, grapheme clusters (considered unbreakable) are more
than just combining sequences! Look at CGJ, WJ, ZWJ, ...
So what comes after or *before* a base character may affect the parsing
of grapheme clusters!

As the well-formedness of XML documents goes even before its validity
(which is optional, but required in some applications that need to parse
the DOM-tree or InfoSet rather than), this impacts the way Unicode can
be used (read it as embedded) within XML. Depending on where this
encoded text is used (NMTOKENs, text elements, attribute values,...)
the embedding constraints will be different, but in my opinion anonymous
text elements and attribute values should both use the same encoding
capabilities as they both can (should be able to) represent any kind
of valid Unicode plain text.

As SPACE is handled differently in attribute values, this causes a
problem for SPACE+NSM (considered valid but with imprecise properties
for now).

The constraints are less severe in anonymous text elements, as there
exist several techniques (including CDATA sections) to represent them.
In fact, XML will consider each text element or attribute value as an

RE: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Jon Hanna

 OK, it's safe, but it is a misuse of Unicode. As space plus combining
 character is a unit in Unicode, it should be treated as a unit by higher
 level protocols. If higher level protocols are allowed to do arbitrary
 things within Unicode units, there is no end to the possible confusion.
 See for example, from Unicode 4.0 chapter 3:

 C7 A process shall interpret a coded character representation according
 to the character semantics established by this standard, if that process
 does interpret that coded character representation.

If this is not the case (I'm not entirely sure this bans what XML does with
spaces) then all we would need is a change so that rather than a de facto
ban on space+combining within names and nmtokens we would have an explicit
ban on the same; then we'd all be happy, except possibly for some sadistic
XML application designer who was planning to use that combination out of
ill-will towards his or her colleagues.

Re: Handwritten EURO sign

2003-08-14 Thread Michael Everson

At 23:35 +0200 2003-08-05, Pim Blokland wrote:

I have absolutely no idea what you are talking about.

You are lucky not to have to put up with bad English like "five euro
and six cent", living in the Netherlands and speaking Dutch as you
do. See http://www.evertype.com/standards/euro if you wish to learn
more about a disaster in language planning.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

RE: Conflicting principles

2003-08-14 Thread Kent Karlsson


  Anyway, John J, what code are we talking about that has to work from
  the positions of the combining marks back to the underlying
  representation? Are you talking about OCR?
 
 
 No, the issue is more how to start from a base form and work forward to
 encompass the whole series of characters which need to be treated as
 one in certain processes, which can include cursor movement, hit
 testing, display, line breaking, collation, normalization.

Collation isn't really based on combining sequences (even though UTS 10
specifies a certain spanning over non-blocking (combining)
characters).
Note in particular the following entry in the CTT (and with different
syntax in the UTS 10 tables):
U0E4D_0E32 S0E33;BASE;MIN;U0E33 % THAI CHARACTER SARA AM
(and a similar one for Lao). This is a collation entry for a
contraction of a combining mark followed(!) by (formally) a
base character. (I'm not really sure what the true logical sequence
would be, though.)
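
The unusual shape of that contraction (a combining mark followed by a base character) mirrors the compatibility decomposition of SARA AM in the Unicode Character Database. A short sketch, assuming Python's standard `unicodedata` module, which exposes that data:

```python
import unicodedata

sara_am = "\u0e33"  # THAI CHARACTER SARA AM

# Raw decomposition field from UnicodeData.txt.
print(unicodedata.decomposition(sara_am))

# NFKD applies the compatibility mapping.
parts = unicodedata.normalize("NFKD", sara_am)
for ch in parts:
    # First a nonspacing mark (Mn), then a letter (Lo).
    print("U+%04X" % ord(ch), unicodedata.category(ch), unicodedata.name(ch))
```
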

/kent k

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Philippe Verdy

From: Peter Kirk [EMAIL PROTECTED]
 I note that there is no line break opportunity in space, NBSP. But is
 there one after the space in space, RLM, NBSP? If so, RLM, NBSP,
 combining character has a third advantage, that it gives the right line
 break opportunity when this sequence is word initial, which it wouldn't
 do without the RLM.

Why make things so complicated, when a new base character with
the needed properties would be much simpler and easier to support
in implementations?

What is wrong with encoding new recommended alternatives
to SPACE or NBSP, i.e. an invisible symbol, an invisible LTR letter,
and an invisible RTL letter? This way we can fix some issues in the current
text of the UAXes, but recommend that new writers use a new base
character which will behave correctly, without those overly complex
hacks that users and implementers won't understand.

Re: Handwritten EURO sign (off topic?)

2003-08-14 Thread Peter Kirk

On 14/08/2003 09:54, Michael Everson wrote:


Lepton in Greek was accepted from the beginning.


Leptó pl leptá.
The same word as the original widow's mite (Mark 12:42). Probably worth 
even less now!

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Diacriticals and descents in upper case (was: Re: Caron / Hacek?)

2003-08-14 Thread Anto'nio Martins-Tuva'lkin

On 2003.06.12, 18:38, Philippe Verdy [EMAIL PROTECTED] wrote:

 Capital letters simply don't use ascents or descents, and thus they
 occupy a *smaller* space than the lowercase letters.

Some upper case letters commonly (i.e. in some typical fonts) have
descents, especially, though not only, in italic style:

U+0047 : LATIN CAPITAL LETTER G
U+004A : LATIN CAPITAL LETTER J
U+0051 : LATIN CAPITAL LETTER Q
U+005A : LATIN CAPITAL LETTER Z
U+01B7 : LATIN CAPITAL LETTER EZH
U+0396 : GREEK CAPITAL LETTER ZETA
U+0414 : CYRILLIC CAPITAL LETTER DE
U+0423 : CYRILLIC CAPITAL LETTER U
U+0426 : CYRILLIC CAPITAL LETTER TSE
U+0429 : CYRILLIC CAPITAL LETTER SHCHA
U+046E : CYRILLIC CAPITAL LETTER KSI

 In some cases, there is no space in the font point size to put some
 upper diacritics above the letter, and the diacritic will almost
 always be written after the base character, sometimes with a distinct
 glyph, if the printed lines must fit in narrow lines (to save paper in
 books).

This is indeed the current practice in Czech and Slovak, as said in the
thread, but it's completely out of fashion to do so for, at least,
Portuguese. Nineteenth-century books do have E for UC e-acute, but that
has been replaced by É in all quality media for quite a long time.

 the color of a font is not what you think:

Note that what I think may not be what you think I think... ;-)

--   .
António MARTINS-Tuválkin,   |  ()|
[EMAIL PROTECTED]   ||
R. Laureano de Oliveira, 64 r/c esq. |
PT-1885-050 MOSCAVIDE (LRS)  Não me invejo de quem tem   |
+351 934 821 700 carros, parelhas e montes   |
http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe   |
http://pagina.de/bandeiras/  a água em todas as fontes   |

Re: Unicode 4.0 is online at last!

2003-08-14 Thread Rick McGowan

Peter Kirk suggested...

 Interesting and a little embarrassing that Unicode's own documentation
 is not Unicode compatible!

I don't think it's very embarrassing... The Unicode consortium after all  
doesn't produce book editing and typesetting software, we use other  
peoples' software.

I think it's rather amazing that we can now actually produce a PDF of the  
entire book. This is incredibly better than the situation ten years ago.

In any case, perhaps you can suggest a Unicode-conformant authoring
tool that is up to the task of editing and typesetting the standard itself?
It must have at least the capability of FrameMaker 6 (i.e., tables,
figures, sectioning, table of contents, index, etc.) whilst implementing the
full standard, including all scripts... Even the ones that would be newly
defined in the next version... ;-)

Rick

Re: Handwritten EURO sign (off topic?)

2003-08-14 Thread Patrick Andries


-  Message d'origine - 
De: Marco Cimarosti [EMAIL PROTECTED]

 Anto'nio Martins-Tuva'lkin wrote:
  After all the euro is a common currency and its figures should be
  written in a common way.

 Why?


Very good question. Multilingual countries like Belgium or Canada already
were or are writing the same amounts using different cultural conventions
depending on the language of the text where they appear.

Otherwise, I'm personally quite flexible if only one convention is used and
imposed upon all, as long as it is the French one ;-)

P. Andries
- o - 0 - o -
Unicode en français
http://pages.infinit.net/hapax
(Traduction de l'UTR 20 en cours)

Re: Compatibility decompositions

2003-08-14 Thread Kenneth Whistler

John Cowan asked:

 I realize that existing compatibility decompositions are a rag-bag,
 especially those marked with the generic compat tag rather than one
 of the specific tags such as font, initial, or super.  I wonder
 what principles, if any, can be enunciated for giving a newly introduced
 character a compatibility decomposition at the present time?

Fortunately, I have just the material to hand to answer such a question --
a file listing all the additions to Unicode 3.2 and Unicode 4.0. We
can look in those tea leaves and divine the probable intentions of
the UTC, based on a pretty good sampling of 2000+ recent character 
additions.

03F9;GREEK CAPITAL LUNATE SIGMA SYMBOL;Lu;<compat> 03A303F2;

  Reason: uppercase of U+03F2, which has a compatibility mapping

1D2C;MODIFIER LETTER CAPITAL A;Lm;<super> 0041;
...
1D61;MODIFIER LETTER SMALL CHI;Lm;<super> 03C7;

  Reason: analogy to existing superscript modifier letters

1D62;LATIN SUBSCRIPT SMALL LETTER I;Ll;<sub> 0069;
...
1D6A;GREEK SUBSCRIPT SMALL LETTER CHI;Ll;<sub> 03C7;

  Reason: analogy to existing superscript modifier letters
  (but these are *sub*script)

2047;DOUBLE QUESTION MARK;Po;<compat> 003F 003F;

  Reason: analogy to existing U+2048..U+2049

2057;QUADRUPLE PRIME;Po;<compat> 2032 2032 2032 2032;

  Reason: analogy to existing U+2033..U+2034

205F;MEDIUM MATHEMATICAL SPACE;Zs;<compat> 0020;

  Reason: analogy to existing fixed-width spaces

2071;SUPERSCRIPT LATIN SMALL LETTER I;Ll;<super> 0069;

  Reason: analogy to existing U+207F superscript n

213D;DOUBLE-STRUCK SMALL GAMMA;Ll;<font> 03B3;
...
2149;DOUBLE-STRUCK ITALIC SMALL J;Ll;<font> 006A;

  Reason: analogy to existing font variant letterlike symbols

2A0C;QUADRUPLE INTEGRAL OPERATOR;Sm;<compat> 222B 222B 222B 222B;

  Reason: analogy to existing U+222C..U+222D

2A74;DOUBLE COLON EQUAL;Sm;<compat> 003A 003A 003D;
2A75;TWO CONSECUTIVE EQUALS SIGNS;Sm;<compat> 003D 003D;
2A76;THREE CONSECUTIVE EQUALS SIGNS;Sm;<compat> 003D 003D 003D;

  Reason: symbols were explicitly representing sequences of
 elements, but were single entities in the math entity set

309F;HIRAGANA DIGRAPH YORI;Lo;<vertical> 3088 308A;
30FF;KATAKANA DIGRAPH KOTO;Lo;<vertical> 30B3 30C8;

  Reason: vertical ligated variants of Japanese syllable sequences

321D;PARENTHESIZED KOREAN CHARACTER OJEON;So;<compat> 0028 110B 1169 110C 1165 11AB 0029;
321E;PARENTHESIZED KOREAN CHARACTER O HU;So;<compat> 0028 110B 1169 1112 116E 0029;
3250;PARTNERSHIP SIGN;So;<square> 0050 0054 0045;

  Reason: analogy with all the rest of the existing squared
 compatibility characters originating in Korean standards

3251;CIRCLED NUMBER TWENTY ONE;No;<circle> 0032 0031;21
...
32BF;CIRCLED NUMBER FIFTY;No;<circle> 0035 0030;50

  Reason: analogy with existing circled number characters

32CC;SQUARE HG;So;<square> 0048 0067;
...
33FF;SQUARE GAL;So;<square> 0067 0061 006C;

  Reason: analogy with the rest of the existing squared
 compatibility characters originating in Korean standards

FDFC;RIAL SIGN;Sc;<isolated> 0631 06CC 0627 0644;

  Reason: explicit request in the proposal to provide decomposition,
 approved by the committees

FE47;PRESENTATION FORM FOR VERTICAL LEFT SQUARE BRACKET;Ps;<vertical> 005B;
FE48;PRESENTATION FORM FOR VERTICAL RIGHT SQUARE BRACKET;Pe;<vertical> 005D;

  Reason: analogy with existing vertical form variants

FF5F;FULLWIDTH LEFT WHITE PARENTHESIS;Ps;<wide> 2985;;*;;;
FF60;FULLWIDTH RIGHT WHITE PARENTHESIS;Pe;<wide> 2986;;*;;;

  Reason: analogy with existing fullwidth characters

1D4C1;MATHEMATICAL SCRIPT SMALL L;Ll;<font> 006C;

  Reason: analogy with the rest of the math alphanumerics

And then there are canonical equivalences added:

2ADC;FORKING;Sm;2ADD 0338;;not independent;;;

  Reason: analogy with the other negated math symbols (and
 allowable under Unicode stability policies because the
 base character U+2ADD was encoded at the same time)

FA30;CJK COMPATIBILITY IDEOGRAPH-FA30;Lo;4FAE;
...
FA6A;CJK COMPATIBILITY IDEOGRAPH-FA6A;Lo;983B;

  Reason: analogy with the treatment of all the other Han
 compatibility characters.
 
So you can see from this that the overwhelming reason for providing
a compatibility (or canonical) decomposition for a newly encoded
character is analogy with the treatment of existing characters
which are arguably just like the character newly encoded.

The reason for that is *consistency* in the standard. It would
be less useful to have some characters treated one way for
decompositions and others (inexplicably, from the point of
view of implementers) treated another.
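
The tagged decompositions listed above can be inspected programmatically. A minimal sketch, assuming Python's standard `unicodedata` module (which ships a later Unicode version than 4.0, so values for other characters may differ from the list); the helper name `decomposition_tag` is made up for illustration:

```python
import unicodedata


def decomposition_tag(ch: str):
    """Split a character's decomposition field into (tag, code points).

    Returns (None, []) when there is no decomposition, and uses the
    label "canonical" when the field carries no <tag>.
    """
    field = unicodedata.decomposition(ch)
    if not field:
        return None, []
    parts = field.split()
    if parts[0].startswith("<"):
        return parts[0].strip("<>"), parts[1:]
    return "canonical", parts


for ch in ["\u2047", "\u2057", "\u2071", "\u2adc", "A"]:
    tag, mapping = decomposition_tag(ch)
    print("U+%04X" % ord(ch), tag, " ".join(mapping))
```
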

 In particular, is it sufficient that the character strongly resembles an
 existing character or combination of characters, but for one or another
 reason needs to be distinct from it?

I don't think strong resemblance to an existing character is enough.
There were plenty of examples among the math symbols of symbols

RE: Pre-orders of The Unicode Standard, Version 4.0

2003-08-14 Thread Magda Danish (Unicode)

 -Original Message-
 From: John Cowan [mailto:[EMAIL PROTECTED] 
 Sent: Thursday, August 14, 2003 10:20 AM
 To: Magda Danish (Unicode)
 Cc: Unicode Core List; [EMAIL PROTECTED]
 Subject: Re: Pre-orders of The Unicode Standard, Version 4.0

 Thanks.  Is the Unicode Consortium in any way benefited (or
 disadvantaged) if non-members order through it rather than through
 Amazon or BN?

The Unicode Consortium has an Associate agreement with both Amazon and
BN so we do benefit from members and/or non-members purchasing the book
through either of them, as long as they follow the link (to Amazon or
BN) from the Unicode website.

Magda

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Mark Davis

Peter, in XML you really don't want to use attributes for any general
text; there are too many restrictions on the content. For example, we
never put translatable text into them. Attributes should really be
treated more like sequences of symbols, with a constrained syntax.

This is also not in violation of the Unicode conformance clause. A
space plus combining character is a unit in some sense. That is, it is
a combining character sequence (and grapheme cluster). However, there is
no clause that says that such units cannot be changed, or that any
particular sequence of characters cannot be changed; operations such as
case mapping or normalization do just that: they change characters.

There are restrictions on what can be changed *if* a process purports
to not modify the text (C10). But an XML parser is certainly capable
of interpreting a sequence A B, and deciding that it wants to change A
to C. If the parser interpreted the 0x0041 in UTF-16 as a Z or a Greek
Alpha, *that* would be a violation of C7. But interpreting a space as
a space, then deciding to modify it, is perfectly legit.

Mark
__
http://www.macchiato.com
  Eppur si muove 

- Original Message - 
From: Peter Kirk [EMAIL PROTECTED]
To: John Cowan [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Wednesday, August 13, 2003 05:09
Subject: Re: Questions on ZWNBS - for line initial holam plus alef


 On 12/08/2003 20:28, John Cowan wrote:

 Peter Kirk scripsit:
 
 
 
 2) In attribute values, LF, CR, and TAB characters are normalized to
 spaces.   Not relevant here.
 
 
 This would be relevant if it is legal for the character after LF, CR,
 and TAB to be a combining mark. Is this legal? In this case what was
 previously a defective (but legal) combining sequence would turn into a
 non-defective one, but the intended whitespace would be lost.
 
 
 
 The point is that there is no such thing as an *intended* line break in
 an attribute value; it will *always* be translated to a space before
 the application sees it.  (More exactly, line-break characters can
 be inserted into attribute values, but only with the use of a numeric
 character reference such as &#xA;.)
 
 
 Sorry, I'm confused. Are you saying that the input processing will
 translate line breaks into spaces within attribute values, unless
 inserted as &#xA;? Well, I suppose this is fair enough as it is up to
 the user not to enter garbage.
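
The attribute-value normalization discussed in this exchange can be observed directly. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` as the parser; the element and attribute names are made up for illustration:

```python
import xml.etree.ElementTree as ET

# A literal line feed inside an attribute value is normalized to a space.
literal = ET.fromstring('<e a="x\ny"/>').get("a")

# A numeric character reference survives as a real line feed.
escaped = ET.fromstring('<e a="x&#xA;y"/>').get("a")

# A combining mark after the literal line feed ends up after a space:
# the defective sequence becomes SPACE + U+0301.
marked = ET.fromstring('<e a="x\n\u0301y"/>').get("a")

print(repr(literal), repr(escaped), repr(marked))
```
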

 
 
 Not just a rendering glitch, I suspect. If the combining character is
 combined with the separating space, the space loses many of its
 separating functions, and perhaps keeps a confusing subset of them with
 all sorts of possibilities of error.
 
 
 
 The space(s) will be used to separate individual tokens at processing
 time.  No spacing diacritic (either single-character or space+combining)
 is permitted in a NMTOKEN.
 
 
 OK if this is clearly illegal, but this might restrict use of some
 languages in NMTOKEN. Would NBSP + combining be allowed?

 
 
 At best tokens beginning with
 combining characters will be unusable. At worst they will crash the
 implementation (and count on someone trying deliberately to do that!).
 
 
 
 In effect, the combining character will constitute a defective combining
 sequence at the beginning of the individual token.
 
 Stepping away from the letter of the standard for a moment, there is
 no real reason to begin a NMTOKEN with a combining character.  It is
 only allowed as a result of the miscegenation of SGML concepts with
 Unicode ones.
 
 In SGML's original design of tokens, they consisted of letters and digits
 (and a few punctuation marks, which functioned as letters).  There were
 four kinds: a NUMBER could contain only digits, a NAME could not begin
 with a digit, a NUTOKEN had to begin with a digit, and a NMTOKEN had no
 restrictions.  ID and IDREF had the same syntax as NAME with additional
 semantics.  Later, the categories letter and digit were generalized,
 by redefining the concrete syntax, to be whatever you wanted, and were
 renamed name-start and name characters (technically, a name character
 was a letter *or* a digit).
 
 When SGML was simplified to produce XML, only NMTOKEN, the most general
 type of token, was kept.  However, in order to keep the semantics of
 letter and digit in the Unicode world, letter was extended to be any
 letter and digit to be any digit *or* combining character.  That worked
 well for ID and IDREF, since treating combining characters as part of
 digit prevented them from appearing first, as was only sensible.
 
 Unfortunately, NMTOKENs, since there were no restrictions, became able
 to begin with a combining character, though that made no real sense.
 To write in a restriction would make it impossible to specify XML's
 concrete syntax in SGML terms, which did not allow for three different
 classes of characters within tokens.  So we wound up with a basically
 useless capability that if used will only cause trouble.
 
 
 
 There is some

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Philippe Verdy

From: Peter Kirk [EMAIL PROTECTED]

 There is some potential for real trouble here, if one process outputs an
 NMTOKEN starting with a combining character preceded by a separating
 space, or something else which is changed into a space, and another
 process takes the new space plus combining character as a unit and so
 doesn't recognise the separation. Any hackers and virus programmers
 reading this will soon start flooding the Internet with tokens beginning
 with combining characters in the hope of crashing implementations or
 finding back doors. Of course this wouldn't have been a problem if
 Unicode had never defined space plus combining character as legal and
 meaningful. But this is not my problem!

I do agree: an XML document could require the use at some place of a
given attribute or element. If this attribute name follows the element
name after a line break, which gets changed into a space during parsing,
forcing XML parsers to treat SPACE+combining as an unbreakable grapheme
cluster acting like a letter would have the effect of creating a new
element name, which may violate the element name identity. Now suppose
that the attribute name contains a colon: you have created a custom
namespace name, under which you can add any element you like, even if
this was forbidden by the content-model of the reference schema.

So this would invalidate existing documents, or create holes allowing
insertion of arbitrary XML content, if the XML application is not
validating extremely strictly the element names (the pair namespace+
name) and exclude completely from processing any unrecognized
element (including all its content and attributes). This would be a
breach in the content model which may have been validated and tested
for security in another layer of the document encoding process (notably
when XML documents are created from templates, such as XSL
processors, or custom C source using simple template substitution).

So for me the sequence SPACE+combining should not be acceptable
as a valid grapheme cluster within element names or attribute names,
and thus would need to be excluded from NMTOKEN. The correct
way to do it is to consider it NOT A LETTER, but a symbol (Sk),
exactly like other spacing diacritics, which are already invalid in
NMTOKEN.

There still remains the unresolved question of grapheme clusters
that could span the starting < or ending > or /> of tags, or
the leading & of an entity reference. For this reason, defective
combining sequences (combining characters without a leading base
character) should be forbidden (invalid for XML).

So there remains an unsolved conflict here: defective combining
sequences cause security or validity problems in XML documents,
and a non-defective SPACE+combining sequence also causes
security problems. There's no secure choice to represent
spacing diacritics which are not already encoded in a precomposed
form...

Re: Unicode 4.0 is online at last!

2003-08-14 Thread Peter Kirk

On 11/08/2003 17:37, Kenneth Whistler wrote:

Well, I've been promising that good things would come
to those who wait. ;-)
At last, the Unicode website has been updated with the
online chapters for Unicode 4.0. See:
http://www.unicode.org/versions/Unicode4.0.0/

Or just go to the Unicode 4.0 link from the home page.

Enjoy.

--Ken

P.S. Just FYI, Peter K., now it is o.k. for everyone to come
back from their August Unicode vacations. Let the
textual criticism begin!


 

The documentation is great, but I have had some problems copying text 
from it  (with Acrobat Reader 5), in particular with text in small 
capitals  e.g. Unicode character names. For example, I get the following 
from p.44:

The sequence of Unicode characters U+0061 a 
   + U+0308 !  + U+0075 u   

 unambiguously encodes u not a.

I mentioned this on another list, and  received the following as part of 
a reply from an expert on PDF format:

For example, here is some text copied and pasted from the Unicode
Standard, p.44, http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf:
   

	Interesting choice, since this document was NOT produced 
using a Unicode-aware authoring tool - they used FrameMaker 6, which 
doesn't do Unicode.

	FrameMaker was able to pass enough information into Acrobat 
Distiller so that SOME of the fonts used have ToUnicode tables - but 
they appear to be limited to symbol fonts and a few extra glyphs...

	Therefore, without this information in the PDF, Acrobat is 
(understandably) unable to properly extract Unicode-based information 
from the document.
 

Interesting and a little embarrassing that Unicode's own documentation 
is not Unicode compatible!

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Pre-orders of The Unicode Standard, Version 4.0

2003-08-14 Thread John Cowan

Magda Danish (Unicode) scripsit:

 To order, please use the book order form at
 http://www.unicode.org/book/bookform.html

Thanks.  Is the Unicode Consortium in any way benefited (or disadvantaged)
if non-members order through it rather than through Amazon or BN?

-- 
John Cowan  [EMAIL PROTECTED]  www.ccil.org/~cowan  www.reutershealth.com
If he has seen farther than others,
it is because he is standing on a stack of dwarves.
--Mike Champion, describing Tim Berners-Lee (adapted)

Re: Colourful scripts and Aramaic

2003-08-14 Thread Michael Everson

At 13:12 -0700 2003-08-07, Peter Kirk wrote:

Well, it seems to me that in the case of the Aramaic proposal we 
don't even have that. We have an archaic version of the script which 
is now used mainly for Hebrew, and which many scholars still call 
Aramaic (in distinction from paleo-Hebrew) although Unicode calls it 
Hebrew. The Aramaic glyphs are almost all recognisably the same as 
or slight variants on the Hebrew ones. And Hebrew script is already 
used, uncontroversially, for large corpora of Aramaic e.g. in the 
Talmud. Why a new script for the few surviving examples of ancient 
Aramaic in this script?

People. It's the widespread offshoot used throughout the Middle East 
that spawned Brahmic and Uighur and other scripts. It isn't 
necessarily the thing you think is confined to three scraps of 
papyrus or whatever. We aren't working actively on this now. We don't 
have an active proposal. We have something roadmapped, and I for one 
don't want to spend time right now defending its roadmapping to you 
apart from what is in my earlier paper on Semitic scripts. Could you 
turn off the fire alarms?
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Unicode 4.0.1 Beta period now starting

2003-08-14 Thread Rick McGowan

The beta period for Unicode 4.0.1 has now started. Detailed information is  
available on the beta page:

http://www.unicode.org/versions/beta.html

Beta versions of Unicode 4.0.1 data files are now available for public
comment here:

http://www.unicode.org/Public/4.0-Update1/

This is the first update of Unihan.txt since Unicode 3.2, and it includes  
a large number of corrections and additions. There are several other minor  
changes to other data files.

The beta period closes on August 18, 2003. Since time is short, developers  
are asked to please focus quickly on the data file review if you have not  
yet done so.

Beta period comments will be reviewed by the Unicode Technical Committee
at the upcoming meeting starting August 25, 2003.

If you have any feedback on any of the beta files, please submit it by
August 18, 2003. You can submit feedback via the online reporting page
here:

http://www.unicode.org/reporting.html

Note: If you are a liaison representative, please forward this message as  
appropriate within your organization.

[hebrew] Re: Roadmap---Mandaic, Early Aramaic, Samaritan

2003-08-14 Thread Michael Everson

Elaine,

I really, really, really don't have time to debug your 
dissatisfaction with the use of the word Aramaic in the Roadmaps. 
This is NOT something anyone is working actively on right now. When a 
proposal comes forth, there will be evidence in it that can be picked 
at.

In actuality, one could make a very good case that all extant Semitic/
extended Aramaic-Moabite-Amorite-Yaudic-Hebrew etc. type alphabetic scripts
between the earliest Sinaitic / Wadi El-Hol---and middle Parthian
are font variants
We are not going to encode Phoenician and Samaritan and Palmyrene as 
font variants of Hebrew. If you want to write those languages in 
Hebrew script, do so.

Any border(s) you draw will be either completely artificial or mostly
artificial.  That's the problem.
The borders we draw are based on the analyses of script experts.

I gather that you are a font person, fascinated by the aesthetic 
pleasure of wondrous shapes.
I am a lot more than that.

I am a database person, concerned with minimizing unnecessary font 
variation, which may interfere with future overworked Semitic 
retrieval engines.
You will never be at a greater disadvantage than a Sanskritist is, 
considering that the Rg Veda can be written in a dozen or so scripts.

 The Mandaic and Samaritan scripts apparently
 enjoy at least some modern liturgical use.
Yes, they do!   But the Samaritan is also heavily used within
Jewish studies  /  Biblical studies communities.  The Samaritans
also use their shapes in private correspondence.
Then we shall encode them.

  of Aramaic script to encode has not been looked at carefully. Indeed
 we have no current proposals which are well-advanced at this time.
I'm responding now because this may be the only time period where
Hebraists interact with Unicode. Carpe diem.
Hebraists are discussing concerns about METEG and things. You're 
responding about things which don't even have formal proposals to 
respond to. If you want me to start working on encoding other early 
Semitic scripts, please give generously to the Script Encoding 
Initiative and ask for prioritization. Failing that, I will be 
working on things which have higher priority (and more complete 
proposals) at present, like Coptic, Saurashtra, Nuskhuri, Buginese, 
N'Ko, Ol Chiki, Avestan and Pahlavi, and so on.

  I am responding at great length to the Roadmap proposals
 for the Semitic dialects Mandaic, Early Aramaic, and
 Samaritan. 

 We are proposing to encode scripts, not languages.
Yes, that is your take on it.  But scripts are frozen language,
not the liquid language of speech or the gaseous language of
poetry..  You encode scripts so we can manipulate languages
We encode scripts so that we can represent texts. And we will do it, 
as we have, to the best of our ability, but not by lumping everything 
together just because it makes things easy for database programmers.

Best regards,
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Kenneth Whistler

Peter responded to Mark:

 On 05/08/2003 14:40, Mark Davis wrote:
 
 Where did you get the notion that space is not a base character? And
 base characters include those that are not control or format
 characters. Space is neither one.
 
 The standard specifically states in a number of places that to exhibit
 a combining mark in isolation you use a space (or NBSP).
 
 Mark
 __
 http://www.macchiato.com
 ►  “Eppur si muove” ◄
 
   
 
 I got this from the Unicode Standard 4.0, as quoted by Jim Allan:

*Mis*quoted by Jim Allan.

 
  In http://www.unicode.org/book/preview/ch03.pdf the space characters 
  in general are given class Zs:
 
   Zs, Zl, and Zp are considered format characters, but their 
  membership in the Z (separator) class takes precedence over their 
  membership in the Cf class, because the General Category assigns only 
  a single value to each character. 

That piece of text is *NOT* a quotation from Chapter 3 of Unicode
4.0. Go to that URL and search for it yourself.

It is quoted from Chapter 4 of Unicode *3.0*, p. 88, in the
discussion of General Category in Section 4.5, General Category --
Normative in Part. The corresponding paragraph has been deleted
from the relevant section in Unicode 4.0, precisely because the
standard now precisely defines format control characters as
{Cf, Zl, Zp} but *ex*cluding Zs. See p. 25 in:

http://www.unicode.org/book/preview/ch02.pdf

 
  So the various space characters (class Zs) are also classified as 
  format characters.
 
  From http://www.unicode.org/book/ch04.pdf:
 
   _D13  Base character:_ a character that does not graphically 
  combine with preceding character, and that is neither control nor a 
  format character. 
 
  Accordingly, by definition, spaces are not base characters.

This conclusion is false. As Mark indicated, SPACE (and NBSP) are
base characters, and have been treated as such in terms of
diacritic application since Unicode 1.0 was published:

By convention, diacritical marks used by the Unicode encoding
scheme may be exhibited in (apparent) isolation by applying
them to U+0020 SPACE or to U+00A0 NON-BREAKING SPACE. This
might be done, for example, when talking about the diacritical 
mark itself as a mark, rather than using it in its normal way
in text.
 -- Unicode 1.0, p. 19 [1991]
 
And that *is* an accurate quote from the standard. In Unicode 4.0
that text survives as:

By convention, diacritical marks used by the Unicode Standard
may be exhibited in (apparent) isolation by applying
them to U+0020 SPACE or to U+00A0 NON-BREAKING SPACE. This tactic
might be employed, for example, when talking about the diacritical 
mark itself as a mark, rather than using it in its normal way
in text.
 -- Unicode 4.0, p. 46 [2003]

I'd say the intent of the UTC and the Unicode Standard in this
regard has always been rather clear and has stayed
unchanged for quite some time.
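
The property claims in this exchange can be checked mechanically. What follows is a minimal sketch, assuming Python's standard unicodedata module (whose data tracks whatever Unicode version the interpreter ships, not necessarily 4.0); the helper name describe is made up for illustration.

```python
import unicodedata

SPACE, NBSP, ACUTE = "\u0020", "\u00a0", "\u0301"

def describe(ch):
    """Return (code point, General Category, canonical combining class)."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.combining(ch))

# SPACE and NBSP are Zs (space separators), not Cf (format) or Cc (control).
print(describe(SPACE))
print(describe(NBSP))
# COMBINING ACUTE ACCENT is a nonspacing mark (Mn).
print(describe(ACUTE))

# The conventional way to exhibit the mark in apparent isolation.
isolated = SPACE + ACUTE
# NFC leaves the sequence as two code points ...
print(unicodedata.normalize("NFC", isolated) == isolated)
# ... and the spacing U+00B4 ACUTE ACCENT is only compatibility-equivalent to it.
print(unicodedata.normalize("NFKD", "\u00b4") == isolated)
```

Nothing here says how wide the result should be drawn; that stays with the font and the rendering engine.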

--Ken

RE: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Jon Hanna

the
 solution with
 SPACE is really tricky due to the special treatment of SPACE notably
 in HTML, SGML, XML

I disagree. There are a few different things that happen with whitespace in
such technologies. Some of these only apply to elements that do not allow
any character data apart from whitespace to appear directly within them, and
hence are not an issue here. Some happen at relatively high level of
processing, e.g. rendering (not parsing) of HTML, and as such should
correctly process spaces combined with combining characters.

There are only two theoretical problems that I can see here, the first is
that a whitespace character other than space gets converted to space by
attribute value normalisation, and that this changes the meaning of the text
in some way. This could only occur if the combining character were the first
character in a line of text, which is quite a nonsensical construct to begin
with.
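
The first case can be reproduced with a few lines. This is only a sketch, assuming Python's standard xml.etree.ElementTree parser; the element and attribute names (item, label) are invented for the example.

```python
import xml.etree.ElementTree as ET

COMBINING_ACUTE = "\u0301"

# A literal line break inside the attribute value, directly followed by a
# combining mark, i.e. the mark is the first character of a line.
doc = f'<item label="first\n{COMBINING_ACUTE}second"/>'

label = ET.fromstring(doc).get("label")

# Attribute value normalisation has replaced the line break with SPACE,
# so the mark now follows U+0020.
print([f"U+{ord(c):04X}" for c in label[4:8]])
```

The same line break in element content would be reported as a line feed, which is why the issue is specific to attribute values.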

The other would be with names, qnames, nmtokens and such. These are not
normal textual content; they are human-readable constructs that are based on
normal text because that makes it easier for some developers to work at a
plain-text level (if they speak the natural language that the human-readable
constructs were based on). Support for the linguistic oddity of a diacritic
divorced from the context in which it would normally exist would have little
justification in this place except for fulfilling the general goal of
completeness. Completeness is a laudable aim of course, but extreme
edge-cases need only be brought in if they are both safe and cheap. Anyone
designing an XML application who frequently considers isolated diacritics as
the most natural choice in part of such tokens probably needs to take a
couple of weeks holidays before continuing the design. Of course some of the
characters that could be considered to be precomposed isolated diacritics
are banned from use in nmtokens anyway.

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Doug Ewell

Peter Kirk peter dot r dot kirk at ntlworld dot com wrote:

 Point taken. But when different fonts and rendering engines give
 different results because the standard is unclear or ambiguous, that
 is a matter for the discussion here. And when conforming fonts and
 rendering engines fail to give the required results, that may also be
 because of a deficiency in the standard.

Or it may not.  It may be a deficiency in the level of Unicode support
afforded by the fonts and rendering engines.  It may simply reflect a
difference between your requirements and what the standard promises,
and doesn't promise.

 It seems that many rendering engines give to the sequence space,
 combining mark the width normally assigned to a space. Is this
 actually what the standard suggests?

The standard doesn't say anything about width in this case.  It leaves
it up to the display engine, which is as it should be.

 I have identified a need to display combining marks with no extra
 width, only the width required by the mark. Should the sequence space,
 combining mark do what I want, or shouldn't it? If so, this needs to
 be spelled out so that rendering engines know what they are supposed
 to do. If not, there may be a need for a new character. This is a
 deficiency in the standard, not in the rendering engines.

When the specific alignment of isolated glyphs is important to me, I use
markup.  I'm a big supporter of plain text, as many members of this list
know, but the exact spacing of isolated combining marks seems like a
layout issue to me.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

RE: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Kent Karlsson


  there is no such thing as NFD decompositions.
 
 Sorry for the confusion. Still even with a NFKD decomposition, 

And there is no such thing as NFKD decomposition either.
It goes as follows, in steps:

1. Canonical and compatibility decomposition mappings (one-step),
   and canonical classes.

2. Canonical and compatibility full/recursive decompositions
and canonical reordering. The compatibility (full) decompositions
make use of both the canonical and compatibility
decomposition mappings.

3. Canonical and compatibility equivalences.

4. The four Unicode normal forms (NFD, NFC, NFKD, and NFKC).

Please don't turn it upside down, that's only confusing!
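
The four layers can be told apart in code. The following is a rough sketch, assuming Python's standard unicodedata module (data from whatever Unicode version the interpreter ships); canonically_equivalent is a helper invented for the illustration.

```python
import unicodedata

# Step 1: one-step decomposition mappings and canonical combining classes.
print(unicodedata.decomposition("\u01d5"))  # canonical mapping, one step only
print(unicodedata.decomposition("\ufb01"))  # compatibility mapping, tagged <compat>
print(unicodedata.combining("\u0301"), unicodedata.combining("\u0323"))

# Step 2: full (recursive) decomposition plus canonical reordering.
full = unicodedata.normalize("NFD", "\u01d5")
print([f"U+{ord(c):04X}" for c in full])
reordered = unicodedata.normalize("NFD", "a\u0301\u0323")
print([f"U+{ord(c):04X}" for c in reordered])

# Step 3: canonical equivalence, tested here by comparing the full
# canonical decompositions of both strings.
def canonically_equivalent(a, b):
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

print(canonically_equivalent("\u00e9", "e\u0301"))
print(canonically_equivalent("\ufb01", "fi"))

# Step 4: the normal forms themselves.
for form in ("NFD", "NFC", "NFKD", "NFKC"):
    print(form, [f"U+{ord(c):04X}" for c in unicodedata.normalize(form, "\ufb01\u00e9")])
```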

Ok, the formal definition of equivalences and normal forms
are a bit backwards in The Unicode standard, defining NFD
(in practice, though not the name) before the equivalences.
Normally, a normal form is defined as a particular representative
element in an equivalence class...

But there is no need to aggravate the backwardsness into
cyclicity.

...
 It's true that not all (only most)  combining non-spacing
 characters have a non-combining spacing counterpart.

Only a *few* g.c. Mn characters have spacing counterparts!

/kent k

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Noah Levitt

According to the docs at
http://www.microsoft.com/typography/otfntdev/indicot/other.htm,
uniscribe renders combining marks in isolation when they are
applied to SPACE + ZWJ. (Without the ZWJ, it uses a dotted
circle.) Perhaps this is an acceptable solution to the
people calling for a new character.

  Combining marks and signs that appear in text not in
  conjunction with a valid consonant base are considered
  invalid. Uniscribe displays these marks using the fallback
  rendering mechanism defined in the Unicode Standard
  (section 5.12, 'Rendering Non-Spacing Marks' of the
  Unicode Standard 3.1), i.e. positioned on a dotted circle. 

  Please note that to render a sign standalone (in apparent
  isolation from any base) one should apply it on a space
  (see section 2.5 'Combining Marks' of the Unicode
  Standard). Uniscribe requires a ZWJ to be placed between
  the space and a mark for them to combine into a standalone
  sign.

Noah

RE: Newbie Question - what are all those duplicated characters FOR?

2003-08-14 Thread Jill . Ramonsky


Ah, now you're making assumptions about me which are not, in fact, valid.
I'm not quite sure exactly what you mean by the text, but I own a copy of
The Unicode Standard Version 3.0 and have read it pretty much in entirety.
I have also read almost everything I could find on the unicode.org web site.
In none of these sources have I found an answer to this question. It was for
this very reason that I joined this forum, thinking Aha! Maybe someone
THERE might know the answer.

So, Michael, perhaps you might be so kind as to give the URL of the text
to which you refer (or even the page number in the 3.0 book). If I find such
a text, I will most certainly read it.

Stefan has effectively dealt with SOME of my confusion, but questions
remain. For example: between 1D49C (mathematical script capital A) and
1D49E (mathematical script capital C) we find 1D49D (reserved). What is it
reserved for? I am aware that codepoint 212C is script capital B, but why
does that justify leaving a hole in the codepoint space? Why not just omit
mathematical script capital B without leaving a hole? (i.e. why not just
go straight from A to C?).
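
The layout in question can be inspected directly. This is a small sketch, assuming Python's standard unicodedata module; it shows what is encoded around the hole and does not claim to state the committee's rationale.

```python
import unicodedata

rows = []
for cp in (0x1D49C, 0x1D49D, 0x1D49E, 0x212C):
    ch = chr(cp)
    # unicodedata.name() returns the default when the code point has no name.
    rows.append((f"U+{cp:04X}", unicodedata.category(ch), unicodedata.name(ch, None)))

for row in rows:
    print(row)

# Both the plane-1 letter and the BMP letter fold to plain ASCII under NFKD.
print(unicodedata.normalize("NFKD", chr(0x1D49C)), unicodedata.normalize("NFKD", chr(0x212C)))

# One observable consequence of the hole: offset from U+1D49C equals
# offset in the alphabet, so U+1D49D sits where "B" would go.
print(chr(ord("A") + (0x1D49D - 0x1D49C)))
```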

More questions. From E0020 to E007E we have tag space through to tag
tilde. These are copies of the Basic Latin block at 0020. I still don't
know what they are for. I am, however, VERY keen to learn, and so would
really appreciate it if someone could tell me, or indeed point me in the
direction of the text which explains it.
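
The structural relationship between the two blocks is easy to confirm, even though it does not by itself answer what the tags are for. A brief sketch, assuming Python's standard unicodedata module; TAG_OFFSET and tag_to_ascii are names made up for the example.

```python
import unicodedata

TAG_OFFSET = 0xE0000  # distance between a tag character and its Basic Latin twin

def tag_to_ascii(ch):
    """Map a tag character in U+E0020..U+E007E to the ASCII character it mirrors."""
    return chr(ord(ch) - TAG_OFFSET)

for cp in (0xE0020, 0xE0041, 0xE007E):
    ch = chr(cp)
    print(f"U+{cp:05X}", unicodedata.category(ch), unicodedata.name(ch), repr(tag_to_ascii(ch)))
```

The General Category reported for these code points is Cf (format), which already suggests they are not meant to be displayed as ordinary letters.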

Thanking you in advance for your help,
Jill


-Original Message-
From: Michael Everson [mailto:[EMAIL PROTECTED]
Sent: Friday, August 08, 2003 6:54 PM
To: [EMAIL PROTECTED]
Subject: Re: Newbie Question - what are all those duplicated characters
FOR?


At 17:46 +0100 2003-08-08, [EMAIL PROTECTED] wrote:
I'm reasonably sure that this question reflects my own ignorance, rather
than some problem with the standard, but nonetheless, I am confused.

Read the text. Don't just read the code charts.
-- 
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: Conflicting principles

2003-08-14 Thread Peter Kirk

On 07/08/2003 13:57, John Cowan wrote:

Kent Karlsson scripsit:

 

4) Encode the vowel signs as combining characters, after
   the base characters they logically follow. Consider them as
   double [width] combining characters, that happen to
   have no ink above/below the character they apply to,
   but (like double width combining characters) have ink
   over/under the glyph for the base character that follows.
   

Cool. ...

Agreed!

... But an immediate problem comes to mind: what if there is a
line break between the two base characters?
 

What if there is a line break between the two characters joined by a 
double width combining character?

Are arbitrary line breaks in the middle of words actually permitted 
anyway? Presumably any line breaking property of the first base 
character of the pair is cancelled anyway. That leaves a problem only if 
the second base character has a line break before possibility. Well, 
that could just be treated as one of the sequences we were discussing 
yesterday, not illegal Unicode but its rendering is undefined.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: AL32UTF8 Vs UTF8

2003-08-14 Thread John Cowan

Jay Chandru scripsit:

 I wanted to know the differences between AL32UTF8 and UTF8. My database (oracle) 
 will be in AL32UTF8 format. Will the applications that require multibyte characters 
 work as they are functioning in UTF8 format.

The Oracle UTF8 format is really CESU-8, whereas the AL32UTF8 format is
true UTF-8.  The difference shows up in characters beyond U+FFFF, which
are represented with six bytes in UTF8 format, four bytes in AL32UTF8
format.

UTF8 format is not very interoperable, whereas AL32UTF8 format is.
I recommend that you use the latter.
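
The byte-count difference can be seen with a short sketch. It assumes only the CESU-8 scheme as generally described (each UTF-16 surrogate written as a three-byte sequence); cesu8_encode is a hypothetical helper, not an Oracle API, and nothing here has been checked against an actual Oracle database.

```python
def cesu8_encode(text):
    """Encode text the CESU-8 way: supplementary characters become a
    UTF-16 surrogate pair, and each surrogate is written as 3 bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp <= 0xFFFF:
            # BMP characters are encoded exactly as in UTF-8.
            out += ch.encode("utf-8")
        else:
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            for surrogate in (high, low):
                # "surrogatepass" lets Python emit the 3-byte form of a surrogate.
                out += chr(surrogate).encode("utf-8", "surrogatepass")
    return bytes(out)

sample = "\U00010400"  # a supplementary-plane character
print(len(sample.encode("utf-8")), sample.encode("utf-8").hex())
print(len(cesu8_encode(sample)), cesu8_encode(sample).hex())
```

For text that stays within the BMP the two encodings produce identical bytes, which is why the difference often goes unnoticed.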

-- 
John Cowan  [EMAIL PROTECTED]  www.ccil.org/~cowan  www.reutershealth.com
If I have seen farther than others, it is because I am surrounded by dwarves.
--Murray Gell-Mann

Re: Conflicting principles

2003-08-14 Thread Michael Everson

At 01:18 +0200 2003-08-09, Philippe Verdy wrote:

Such breaks in the middle of a multiple-width diacritic exist in some 
notations, and are not considered horrible typography. Just look 
at musical notations where an upper horizontal parenthesis
is used to group some elements [...]
Music setting is not typesetting, and that kind of music 
representation is outside of the scope of the Unicode Standard.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Curtis Clark

on 2003-08-06 15:24 Doug Ewell wrote:
I'm not a typographer (intelligent or otherwise), but I'm having a tough
time seeing how Section 2.10 *requires* fonts and rendering engines to
give a space-plus-combining-diacritic combination the exact minimum
width of the diacritic alone, or to leave equal space before and after
such a combination.  All I think it is saying is that, for example, the
combination i-plus-tilde may be wider than i alone, because tilde is
wider than i.
Considering that one approach is to use opentype to map a letter plus 
diacritical to a single glyph, an obvious solution would be to include 
space + diacritical combos in that table. An important font issue, but a 
font issue nonetheless.

--
Curtis Clark  http://www.csupomona.edu/~jcclark/
Mockingbird Font Works  http://www.mockfont.com/

24th Unicode Conference - Last week to $SAVE with early-bird rates!

2003-08-14 Thread Tex Texin

REGISTER THIS WEEK AND SAVE
   ON
  EARLY-BIRD CONFERENCE AND HOTEL RATES!

 Are you falling behind?  Version 4.0 of the Unicode Standard is here!
 Software and Web applications can now support more languages with
 greater efficiency and lower cost.  Do you need to find out how? Do
 you need to be more competitive around the globe?  Is your software
 upward-compatible with version 4.0?  Does your staff need
 internationalization training?

 Learn about software and Web internationalization and the new Unicode
 Standard, including its latest features and requirements.  This is
 the only event endorsed by the Unicode Consortium.  The conference
 will be held September 3-5, 2003 in Atlanta, Georgia and is
 completely updated.


Twenty-fourth Internationalization and Unicode Conference (IUC24)
 Unicode, Internationalization, the Web: Powering Global Business

 http://www.unicode.org/iuc/iuc24
September 3-5, 2003
   Atlanta, Georgia, USA

NEWS
 
  Visit the Conference Web site ( http://www.unicode.org/iuc/iuc24 )
   to check the updated Conference program and register.  To help you
   choose Conference sessions, we've included abstracts of talks and
   speakers' biographies.

  Attend the Showcase to find out more about products supporting
   the Unicode Standard, and products and services that can help you
   globalize/localize your software, documentation and Internet content.

  Be an Exhibitor! Show off your product at the premier technical
   conference worldwide for both software and Web internationalization.
   See: http://www.unicode.org/iuc/iuc24/showcase.html

  To find out about, and register for the TILP Breakfast Meeting and
   Roundtable, organized by The Institute of Localisation Professionals,
   and taking place at the same venue on September 4, 7:00 a.m.-9:00 a.m.,
   See: http://www.tilponline.org/events/diary.shtml 
   or
   http://www.unicode.org/iuc/iuc24


 KEYNOTES: Keynote speakers for IUC24 are well-known authors in the
 Internationalization and Localization industries:

 Donald De Palma, President, Common Sense Advisory, Inc., and author
 of Business Without Borders: A Strategic Guide to Global Marketing,
 and Richard Gillam, author of Unicode Demystified: A Practical
 Programmer's Guide to the Encoding Standard and a former columnist
 for C++ Report.

 TUTORIALS:  The redeveloped and enhanced Unicode 4.0 Tutorial is
 taught by Dr. Asmus Freytag, one of the major contributors to the
 standard, and extensively experienced in implementing real-world
 Unicode applications.  Structured into 3 independent modules, you
 can attend just the overview, or only the most advanced material.
 Tutorials in Web Internationalization, non-Latin scripts, and more,
 are offered in parallel and taught by recognized industry experts.

 CONFERENCE TRACKS:  Gain the competitive edge! Conference sessions
 provide the most up-to-date technical information on standards, best
 practices, and recent advances in the globalization of software and
 the Internet.  Panel discussions and the friendly atmosphere allow
 you to exchange ideas and ask questions of key players in the 
 internationalization industry.

 WHO SHOULD ATTEND?: If you have a limited training budget, this is
 the one Internationalization conference you need.  Send staff that
 are involved in either Unicode-enabling software, or internationalization
 of software and the Internet, including: managers, software engineers,
 systems analysts, font designers, graphic designers, content developers,
 Web designers, Web administrators, technical writers, and product
 marketing personnel.

CONFERENCE WEB SITE, PROGRAM and REGISTRATION

   The Conference Program and Registration form are available at the
   Conference Web site:
  http://www.unicode.org/iuc/iuc24

CONFERENCE SPONSORS

   Agfa Monotype Corporation
   Basis Technology Corporation
   ClientSide News L.L.C.
   Oracle Corporation
   World Wide Web Consortium (W3C)
   XenCraft

GLOBAL COMPUTING SHOWCASE

   Visit the Showcase to find out more about products supporting the
   Unicode Standard, and products and services that can help you
   globalize/localize your software, documentation and Internet content.

   Sign up for the Exhibitors' track as part of the Conference.
   For more information, please see:
   http://www.unicode.org/iuc/iuc24/showcase.html

Exhibitors to date:

   Agfa Monotype Corporation
   ASET International Services Corporation
   Basis Technology
   ClientSide News L.L.C.
   LingoPort, Inc.
   Multilingual Computing, Inc.
   The Symbio Group
   The Institute of Localisation Professionals

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Philippe Verdy

On Sunday, August 10, 2003 9:30 AM, Mark Davis [EMAIL PROTECTED] wrote:

  As for oe-ligature, the
  French representative to WG3 (or its predecessor) said that France
  could live without it.
 
 Even worse; the story I heard was that the committee had planned from
 the start to have Œ and œ in positions D7 and F7, but that late in the
 process the representative from France objected, so they replaced them
 by × and ÷. That would certainly explain why these symbols are in the
 middle of a batch of letters...

It's true that in French these are really ligatures, and not plain letters,
meaning that this is mostly a standard typographic convention, rather
than orthographic. The national AFNOR may have opted for this solution
thinking that these holes would have benefited other languages
commonly used in Europe, and there were probably other candidate
characters that finally got encoded in a separate ISO-8859-* set.

I don't know which compromise was reached, but the original DEC VT set
also had holes at those positions. It's just strange that the ISO working
group opted for those two characters at D7 and F7, when there could
have been a pair of characters coded for Finnish, or Catalan (like the
dotted L which is still coded with a separate middle dot symbol instead
of a true diacritic, and that renders quite poorly with ISO-8859-1 and
even with Windows 1252). Well, French and Catalan writers have lived
with those encoded sequences, and fixed the rendering using ligating
rules in their renderers or fonts (or used the oe/OE ligatures in
Windows1252).

I just suspect that the French objection on oe/OE was related to the
fear of modifying keyboards that were previously created based on
the French version of ISO646, where such ligature could not be coded.
Since then, the AFNOR version of ISO646-FR has been simplified to
remove the tricky combining sequences built with BACKSPACE,
like C+BACKSPACE+COMMA to code a C WITH CEDILLA, as they
were no longer necessary with a more universally used 8-bit set (7-bit
sets have survived only within Teletex/Videotex standards, built in
accordance with ISO646 with SS2 sequences to encode non-spacing
diacritics *before* the base character with which they combine, to
match the keyboard input order based on dead keys for combining
diacritics, and this 7-bit set is probably the only one remaining in
large use today for French, with ISO646-FR now nearly extinct
in favor of ISO646-US/ASCII)

-- 
Philippe.
Spam not tolerated: any unsolicited message will be
reported to your Internet service providers.

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread John Cowan

Jon Hanna scripsit:

 If this is not the case (I'm not entirely sure this bans what XML does with
 spaces) then all we would need is a change so that rather than a de facto
 ban on space+combining within names and nmtokens we would have an explicit
 ban on the same; then we'd all be happy, except possibly for some sadistic
 XML application designer that was planning on use that combination out of
 ill-will towards his or her colleagues.

Space in any case is not allowed in a token.

There are far worse conformance problems than this anyway, notably the
fact that canonical equivalence is not respected in XML names: a start-tag
that is decomposed and an end-tag that is composed (or vice versa) will not
match.
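
The mismatch is easy to provoke. Here is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser and unicodedata module; the helper parses is invented for the example.

```python
import unicodedata
import xml.etree.ElementTree as ET

def parses(doc):
    """Return True if the parser accepts doc as well-formed."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

composed = "\u00e9"     # e with acute as one code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

# Start-tag decomposed, end-tag composed: canonically equivalent names,
# but the parser compares them code point by code point.
mixed = f"<{decomposed}>text</{composed}>"
print(parses(mixed))

# Normalising the whole document first makes the two names identical.
print(parses(unicodedata.normalize("NFC", mixed)))
```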

-- 
The Imperials are decadent, 300 pound   John Cowan [EMAIL PROTECTED]
free-range chickens (except they have   http://www.reutershealth.com
teeth, arms instead of wings and        http://www.ccil.org/~cowan
dinosaurlike tails).--Elyse Grasso

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Peter Kirk

On 05/08/2003 14:40, Mark Davis wrote:

Where did you get the notion that space is not a base character? And
base characters include those that are not control or format
characters. Space is neither one.
The standard specifically states in a number of places that to exhibit
a combining mark in isolation you use a space (or NBSP).
Mark
__
http://www.macchiato.com
  Eppur si muove 
 

I got this from the Unicode Standard 4.0, as quoted by Jim Allan:

In http://www.unicode.org/book/preview/ch03.pdf the space characters 
in general are given class Zs:

 Zs, Zl, and Zp are considered format characters, but their 
membership in the Z (separator) class takes precedence over their 
membership in the Cf class, because the General Category assigns only 
a single value to each character. 

So the various space characters (class Zs) are also classified as 
format characters.

From http://www.unicode.org/book/ch04.pdf:

 _D13  Base character:_ a character that does not graphically 
combine with preceding character, and that is neither control nor a 
format character. 

Accordingly, by definition, spaces are not base characters.


--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Michael Everson

At 01:30 +0200 2003-08-10, Philippe Verdy wrote:
Whatever you think, the SPACE+diacritic is still a hack, and 
certainly not a canonical equivalent (including for its properties), 
of the existing spacing diacritics, which also do not fit all usages 
because they are symbols.
It is the formally specified way to represent what you say you want 
to represent. If an implementation doesn't do that nicely enough, 
complain to the implementors. (This has already been suggested to 
you.)
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

RE: Conflicting principles

2003-08-14 Thread Jon Hanna

what code are we talking about that has to work from the
 positions of the combining marks back to the underlying representation?

Such code is not just common and widespread, it is practically ubiquitous.
The principle of base characters always coming first is used:

Whenever you need to calculate the size of a visual representation of a
string.
Whenever you need to move a caret, or locate the caret position closest to a
cursor position.
Whenever you perform normalisation.
Whenever you insert a substring which may not begin with a base character
into another string.
Whenever you need to guarantee that a portion of streamed text is
sufficiently complete that operations on it won't have to be redone when
more characters are received.
Whenever you need to examine the properties of a character which may change
if combined (e.g. breaking properties can be changed when combined).

This is not code that couldn't necessarily be rewritten to allow cases where
combining marks preceded base characters (though it may become considerably
more complicated, frightfully so in some cases, which in turn would lead
some developers to neglect full support for the scripts that used this new
feature). It is code that is all over the place, much of it would be hard to
track down, and generally unless coders have all nicely isolated the process
of locating combining sequences (and you just know some of them haven't)
it's going to be a mess trying to upgrade.
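
To make the dependency concrete, here is the sort of routine that tends to be scattered through such code. It is a deliberately simplified sketch, assuming Python's standard unicodedata module; combining_sequences is an invented name, and the logic is far cruder than real text-boundary segmentation.

```python
import unicodedata

def combining_sequences(text):
    """Split text into runs of one leading character plus the marks after it.

    Relies on the base-first principle: a mark (General Category M*) is
    always attached to whatever came before it, never to what follows.
    """
    sequences = []
    for ch in text:
        if sequences and unicodedata.category(ch).startswith("M"):
            sequences[-1] += ch   # mark joins the preceding sequence
        else:
            sequences.append(ch)  # anything else starts a new sequence
    return sequences

print(combining_sequences("e\u0301a"))  # mark follows its base
print(combining_sequences(" \u0301"))   # mark applied to SPACE
print(combining_sequences("\u0301a"))   # defective sequence: mark with no base
```

If marks were allowed to precede their base, the single backward-looking test in the loop would no longer be enough, which is the upgrade cost being described.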

This doesn't say we should automatically dismiss any proposal to change the
principle, but it does weigh heavily against any such process.

RE: Does Unicode 3.1 take care of all characters of 'Hong Kong Supplimentary Character Set - 2001' (HKSCS-2001) ?

2003-08-14 Thread Kent Karlsson


Aren't the replies about Unicode 3.2 (or maybe 4.0) rather than 3.1?

 1651 - Supplimentary Plane 2 -  \2e80 - \u2f00

Plane 2 covers U+20000 to U+2FFFF, and is not in the BMP (= Plane 0).
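
The plane of a code point is simply its value divided by 0x10000, so the quoted range is quick to check. A tiny sketch in Python; the function name plane is chosen for the example.

```python
def plane(cp):
    """Return the plane number (0 = BMP) of a code point."""
    return cp >> 16

for cp in (0x2E80, 0x2F00, 0x20000, 0x2FFFF):
    print(f"U+{cp:04X} is in plane {plane(cp)}")
```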

/kent k

Re: Display of Isolated Nonspacing Marks (problems with UAX#29)

2003-08-14 Thread Peter Kirk

On 10/08/2003 18:44, Doug Ewell wrote:

Has it occurred to anyone yet that the very *concept* of spacing
diacritics is a hack?  Spacing diacritics are used to conduct a sort of
meta-discussion about characters, as in "A base character o is combined
with an acute accent ´ to create ó."  They are not part of the normal
writing systems of most natural languages.
It is as if I were describing the two typical glyphs used for lower-case
g, the one with one bowl and the one with two bowls, but actually
showing the separate, constituent pieces of the glyphs instead of using
words to describe them.  They are interesting things to talk about, but
not necessarily things that need to be encoded in plain text.
-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/


 

They are indeed interesting things to talk about, and many people do 
talk about them, and they appear in many texts (including the Unicode 
Standard!). The goal of Unicode is to define characters which people 
use, and that must include documents about languages e.g. dictionaries, 
tutorials, discussions of writing systems etc - which, put together, 
form a significant proportion of publishing output. They are indeed 
meta-content but that does not disqualify them from being  plain text. 
Spacing diacritics clearly come into the category of characters which 
people use, and so should be defined, and properly and  unambiguously 
so, by the standard.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Peter Kirk

On 08/08/2003 09:54, Jim Allan wrote:

...

It certainly makes sense that in the case of space characters that 
have a defined width that this width is innate to the definition of 
the character and in such a case should take precedence over the width 
of the normally non-spacing combining character.

I would welcome clear instructions by Unicode on this point where 
either result would be useful in order that applications may be 
expected to produce results that are consistent with each other. :-)
Agreed!

I would think it would be consistent with Unicode for an application 
to shrink the width of normal space followed by a diacritic such as a 
single overdot as exact formatting behavior is not defined in such cases.
Well, is a space followed by a diacritic actually a space, or is it the 
same code point reused or overloaded "By convention" (to quote the 
standard) for a logically distinct purpose? Some of the discussions here 
have implied the latter. Either way, the best clarification would be to 
add a character whose explicit function is to form non-spacing variants 
of diacritics.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Unicode 4.0 is online at last!

2003-08-14 Thread Lisa Moore





My congratulations to Ken, Julie, and Eric!  For those who might not know,
this trio (especially Eric with the online bit) get our unadulterated love
and appreciation...Lots of difficulties on the road to online Unicode 4.0
:-) !!

Lisa


- Forwarded by Lisa Moore/Santa Teresa/IBM on 08/11/2003 09:54 PM -

From:     Kenneth Whistler [EMAIL PROTECTED]
Sent by:  [EMAIL PROTECTED]
To:       [EMAIL PROTECTED]
cc:       [EMAIL PROTECTED]
Date:     08/11/2003 05:37 PM
Subject:  Unicode 4.0 is online at last!
Please respond to Kenneth Whistler




Well, I've been promising that good things would come
to those who wait. ;-)

At last, the Unicode website has been updated with the
online chapters for Unicode 4.0. See:

http://www.unicode.org/versions/Unicode4.0.0/

Or just go to the Unicode 4.0 link from the home page.

Enjoy.

--Ken

P.S. Just FYI, Peter K., now it is o.k. for everyone to come
back from their August Unicode vacations. Let the
textual criticism begin!

Re: Roadmap---Mandaic, Early Aramaic, Samaritan

2003-08-14 Thread Michael Everson

Elaine,

I disagree with you.

Just because Semitic languages *can* be represented in the Hebrew 
script does not mean that every script is just a font variant of the 
Hebrew script.

There are genetic relationships of the development of the scripts 
which are involved in our analysis so far. There are also user 
community concerns. The Mandaic and Samaritan scripts apparently 
enjoy at least some modern liturgical use. The question of what kind 
of Aramaic script to encode has not been looked at carefully. Indeed 
we have no current proposals which are well-advanced at this time. 
But I am not disposed to removing them from the Roadmap at this time 
on foot of the reasons you give.

I am responding at great length to the Roadmap proposals
for the Semitic dialects Mandaic, Early Aramaic, and
Samaritan.  BTW, the larger phylum for these dialects
is called Afroasiatic.
We are proposing to encode scripts, not languages.

Samaritan is a Hebrew dialect, still used today in Israel
in worship/liturgy and probably elsewhere in the Middle East,
with a series of different vowel and other marks, many of
them derived from Arabic.
And a set of base letter glyphs which differs strongly from Hebrew.

But Afroasiatic--Aramaic, Syriac, Mandaic, Egyptian, Somali,
Hausa, Hebrew, Samaritan, Amorite, Yaudic, Tigrinya, Arabic,
Berber, Moabite, Amorite, Coptic--has not fared as well
as CJKV.
That is because CJK is a moneymaker, and resources are not available 
to those who would like to work on the scripts used by these 
languages.

So here's the problem, which seems to me a clear
language engineering situation:  there are VOLUMINOUS
amounts of material in Egyptian and Akkadian that could be
computerized.  The Hebrew Bible has 1,000 pages of
Hebrew and Aramaic, the Talmud has at least 40,000 pages
of Aramaic and Hebrew.  There's also quite a bit of Ugaritic,
a unique alphabet.
Yes, we know.

But for the Early Aramaic, which can be perfectly
represented in modern Hebrew square script, there are maybe
3 pages of mostly tiny scraps of text, if that much.  For many
of the scraps the question is:  what language is this, actually?--
Aramaic or something else?  But you are proposing a
completely unnecessary script for 3 pages of material, and
make an overworked search engine go through those 3 pages
in a different way than the work it does for the
other thousands of pages of Aramaic in the 6 other scripts.
We are talking about the Aramaic that was enormously widespread and 
was the basis for a number of other scripts. Perhaps Early Aramaic 
is not what it should be called. (Indeed the Roadmap doesn't name it 
so.)

Mandaic is easily represented by Hebrew + one extra letter.
There is more material here, but there is no problem in
seeing it as a variant font.
There is as far as I am concerned.

Samaritan is a Hebrew font variant with interesting different
sets of vowel points.  There's no reason to computerize it
separately, despite the exotic shapes.
I think there is.

Every scrap of early alphabetic Semitic material has different
letter shapes.  It never did become anything like a standard.
Many of these scripts had type designed for them. Scholars did not 
always use Hebrew to represent all of it, nor should they have.

It may be some time before proposals to encode these appear. You and 
others will have an opportunity to examine them.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: Handwritten EURO sign (off topic?)

2003-08-14 Thread Stefan Persson

James H. Cloos Jr. wrote:

 Anto'nio == Anto'nio Martins-Tuva'lkin [EMAIL PROTECTED] writes:

Anto'nio (Let alone the validity of things
Anto'nio like k€, c€ etc.)
I'm sure things like m€, k€, M€ and even G€ will come into use,
though I expect more will use them in front of the digits.
I certainly use m$, k$ et al, and regularly see others use them.
m€ and m$ would be millieuros and millidollars.  How could anyone need 
anything like that?  And why use c$ and c€, wouldn't ¢ be just as good?

Stefan

Re: Conflicting principles

2003-08-14 Thread Kenneth Whistler

John Cowan asked:

 I would like to ask the old farts^W^Wrespected elders of the UTC
 which principle they consider more important, abstractly speaking:
 the principle that combining marks always follow their base characters
 (a typographical principle), or that text is stored, with a few minor
 exceptions, in phonetic order (a lexicographical principle).

As may often be the case in such hypothetical questions, I
think there is a false dichotomy presumed here.

The principle of the order of combining marks results from the
need to resolve the following architectural question for the
standard:

   Does a combining mark apply to the base character that
   precedes it or to the base character that follows it?
   
   In other words, does á = 0061, 0301 or does á = 0301, 0061?
   
There can only be one right answer to that question, while having
a coherent, interoperable character encoding standard.

The choice that the Unicode architects made on this principle in
1989 is sacrosanct and inviolable.
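
The base-then-mark convention can be checked with a short sketch. It assumes Python's standard unicodedata module; the variable names are illustrative only and nothing here comes from the standard's own text.

```python
import unicodedata

# Base first, then the mark: LATIN SMALL LETTER A + COMBINING ACUTE ACCENT.
base_then_mark = "\u0061\u0301"
# The reverse order: the acute accent has no preceding base to attach to.
mark_then_base = "\u0301\u0061"

# NFC composes base + mark into the precomposed U+00E1 ...
composed = unicodedata.normalize("NFC", base_then_mark)
# ... but leaves the mark-first sequence as two separate code points.
uncomposed = unicodedata.normalize("NFC", mark_then_base)

print([f"U+{ord(c):04X}" for c in composed])
print([f"U+{ord(c):04X}" for c in uncomposed])
```

Only the first ordering is treated as equivalent to the precomposed letter, which is what makes the stored form interchangeable.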

The principle of logical order of encoding results from the
need to resolve the following architectural question for the
standard:

   Is a right-to-left script encoded in visual order in
   the backing store or in phonetic (= logical) order?
   
   In other words, is tsava spelled 05E6, 05D1, 05D0 or
   05D0, 05D1, 05E6?
   
There can only be one right answer to that question, while having
a coherent, interoperable character encoding standard.

The choice that the Unicode architects made on this principle in
1989 is sacrosanct and inviolable.
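
As a rough illustration of logical versus visual order, the sketch below stores the word in logical order and derives a left-to-right glyph order by plain reversal. The helper names (describe, naive_visual_order) are hypothetical, and the reversal is a toy that only handles a run made entirely of right-to-left letters; it is not the Unicode Bidirectional Algorithm.

```python
import unicodedata

# "tsava" in logical (reading) order: TSADI, BET, ALEF.
logical = "\u05E6\u05D1\u05D0"

def describe(text):
    """Return (code point, character name, bidi class) for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c), unicodedata.bidirectional(c))
            for c in text]

def naive_visual_order(text):
    """Left-to-right glyph order for a run that is purely right-to-left.

    Toy reversal only; mixed-direction text is returned unchanged.
    """
    if text and all(unicodedata.bidirectional(c) == "R" for c in text):
        return text[::-1]
    return text

for row in describe(logical):
    print(row)
print([f"U+{ord(c):04X}" for c in naive_visual_order(logical)])
```

The backing store keeps the first form; producing the second is left to the rendering process.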

Everything else is just working out the details for making actual
script encodings consistent in the context of those overarching
principles. The status of a character as combining or not is
up for grabs, depending on the analysis of a script's behavior
and how it should be represented. And the layout for actual
display of rendered texts does not, and never has, slavishly
followed logical order in lockstep.

Again, everyone, if you haven't already, go back and meditate
some more on the fundamental mandala of Unicode: Figure 2-3,
Unicode Character Code to Rendered Glyphs, which illustrates
both issues of combining mark order with respect to base
character and general logical order of characters as applied
to a particular script encoding (Devanagari).

And don't miss the following piece of text associated with that
figure:

   "The Unicode Standard documents the default relationship
   between character sequences and glyphic appearance for the
   purpose of ensuring that the same text content can be
   stored with the same, and therefore interchangeable,
   sequence of character codes."
   
This should, IMO, be put up on a pedestal and have the spotlights
shined on it. This is the *fundamental* obligation of a character
encoding standard. If you cannot accomplish this, then you just
have a bunch of charts full of pretty pictures, and everyone is
on their own for trying to figure out how to communicate with
anybody else using them.

 As someone or other said, I believe that hitherto -- *hitherto,* mark
 you -- [we have] entirely overlooked the existence of, well, scripts
 that might cause a conflict between these esteemed principles.

The reason why the UTC should tackle the encoding of Tengwar
is not so much because it would help in the publication of Elvish
poetry, but because confronting the architectural issues
it poses for encoding would make an excellent tutorial case
for how the two principles of combining mark order and
logical order impact the task of coming up with an appropriate
encoding for a complex script. And it would starkly illustrate
the fact that an appropriate character encoding does not
necessarily directly reflect the phonological structure of
a language as represented by that script.

--Ken

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Philippe Verdy

From: Jon Hanna [EMAIL PROTECTED]


 I was saying that it wouldn't be sensible to begin a line with a
 combining diacritic, since that combining diacritic would be combining
 with a newline character which it's difficult to think of any possible
 sensible meaning for.

A newline is a control with a whitespace property and a line-breaking
behavior. It must not combine with a combining diacritic, according to
the UAX definition of grapheme clusters.

So newline+NSM is clearly defective and must be parsed as two distinct
combining sequences, the first one for the newline sequence, the second
one being defective as the combining character does not have a base
character to which it applies. The standard suggests using a dotted
circle to render it in editors, but suggests nothing for the rendering
of final documents, which could simply drop the defective sequence,
display it with a replacement base character, or use a dotted circle or
an invisible glyph. So the result in this case is implementation
dependent, and not interoperable.
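
A minimal sketch of how such defective sequences could be located, assuming Python's unicodedata module. The function name defective_starts is hypothetical, and the notion of "base" is simplified to any letter, number, punctuation, symbol, or space separator; format characters such as ZWJ are not given special treatment.

```python
import unicodedata

def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

def is_base(ch):
    """Simplified base test: a graphic character that is not a mark."""
    cat = unicodedata.category(ch)
    return cat[0] in "LNPS" or cat == "Zs"

def defective_starts(text):
    """Indices of marks that begin a sequence with no base before them."""
    found = []
    for i, ch in enumerate(text):
        if not is_mark(ch):
            continue
        prev = text[i - 1] if i > 0 else None
        # A mark after a base or after another mark continues a sequence;
        # anything else (start of text, newline, other controls) does not.
        if prev is None or not (is_base(prev) or is_mark(prev)):
            found.append(i)
    return found

# Newline followed by HEBREW POINT HOLAM and ALEF: the holam has no base.
print(defective_starts("a\n\u05B9\u05D0"))
```

Under this simplified test, SPACE followed by a mark is not reported, because SPACE counts as a base.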

For me the term "difficult" is inappropriate. In fact it is invalid for
interoperability (even though it is valid, not forbidden, for
ISO10646/Unicode, as a string fragment for intermediate processing),
and such a sequence should not occur in actual documents, outside of any
external processing context which defines its behavior.

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Kenneth Whistler

Ted Hopp asked:

 I believe that reasonable people might reasonably conclude from factoids 1
 and 2 that SPACE is indeed a format character.
 
 Reasonable, but evidently wrong. Explanation, please?

I provided the text deconstruction in my last email, but to
continue, the confusion arises from the strange nature of
SPACE in the history of character encoding.

SPACE, for a long time now in the history of character encodings,
has been classified as a *graphic* character. Certainly, in
the general SC2 character encoding context of ISO 2022,
SPACE always shows up in the G0 set, with other graphic
characters, instead of in the various control functions
encoded in C0 or C1 sets.

But looked at from the legacy of device control, SPACE
could just as well have been categorized as a control function:
MOVE PRINT HEAD ONE UNIT RIGHT, comparable to BACKSPACE.

And in the context of the Unicode Standard, people often
loosely talk about space characters as being format
characters, since they are a) more akin to punctuation than
normal letters, b) have no glyph associated with them,
and c) impact line-breaking and other aspects of the formatting
of characters in their vicinity.

But the *formal* categorization of Unicode characters,
defined by the UTC to help eliminate this kind of
ambiguity in talk about the character types, is spelled
out in Figure 2.5 of Unicode 4.0 now:

http://www.unicode.org/book/preview/ch02.pdf

and the *formal* meaning of format control character
(Basic type = Format) in Unicode is now any character 
with the General Category of {Cf, Zl, Zp}.

The space characters are all lumped in with graphic characters.

So while there are still some ambiguities to be worked out
in the definition of base character in the Unicode Standard,
neither the status of SPACE as a graphic character nor the
recommendation of the standard that non-spacing marks be
applied to SPACE as a means of showing them in isolation
is in question.
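
The categorization above can be approximated in a few lines, assuming Python's unicodedata module. The function basic_type and its return labels are illustrative; only the {Cf, Zl, Zp} grouping is taken from the message, and the "Control" and "Other" buckets are a simplification.

```python
import unicodedata

def basic_type(ch):
    """Rough basic type of a character from its General Category."""
    cat = unicodedata.category(ch)
    if cat in ("Cf", "Zl", "Zp"):
        return "Format"      # format control characters
    if cat == "Cc":
        return "Control"     # C0/C1 control codes
    if cat[0] in "LMNPS" or cat == "Zs":
        return "Graphic"     # letters, marks, numbers, punctuation, symbols, spaces
    return "Other"           # surrogates, private use, unassigned

for ch in (" ", "\u00A0", "\u2028", "\u200D", "\n"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), basic_type(ch))
```

SPACE and NO-BREAK SPACE both come out as Graphic, while LINE SEPARATOR and ZERO WIDTH JOINER come out as Format.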

--Ken

Re: Handwritten EURO sign (off topic?)

2003-08-14 Thread Michael Everson

At 00:52 +0100 2003-08-14, Anto'nio Martins-Tuva'lkin wrote:

  Using the cent sign is mostly US specific and the symbol is not
 recognized as such in most European countries, so the cent sign is
  bound directly to the dollar.

If the dollar sign can be used for currencies other than the USD, even
for some whose name is not even dollar, then I suppose there is a
theoretical possibility that it may be used as a symbol of euro cent
(though I personally prefer c*).
There is no reason that the noble ¢ cent sign 
should not be used for the European currency. 
Personally I always use it, because 2¢ looks 
like two cents and 2c looks like two cee.

In Ireland of course when we used pence we wrote 2p and said two pee.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Jim Allan

Ken Whistler posted:

Of course a standard which mandates space folding is also
within its rights to mandate, for example, the non-use of
nonspacing marks applied to SPACE characters. It can simply
rule out such sequences as valid for its context, in which
case the problem goes away. 
And for such standards or applications one can usually use U+00A0 
NO-BREAK SPACE to force multiple spacings.

One can also use this followed by a non-spacing combining character to 
call for rendering of that combining character in isolation.

My feeling is that because of the special qualities of regular SPACE 
using NBSP (U+00A0) should be the more robust way to go.
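
A small sketch of building such a sequence, assuming Python's unicodedata module. The helper isolated and its carrier parameter are hypothetical names; how the result is displayed still depends entirely on the font and renderer.

```python
import unicodedata

NBSP = "\u00A0"  # NO-BREAK SPACE

def isolated(mark, carrier=NBSP):
    """Return carrier + mark, for showing a nonspacing mark on its own."""
    if unicodedata.category(mark) != "Mn":
        raise ValueError("expected a single nonspacing mark (category Mn)")
    return carrier + mark

# NBSP followed by COMBINING DOT BELOW.
sequence = isolated("\u0323")
print([f"U+{ord(c):04X}" for c in sequence])

# Normalization does not fold the pair into anything else.
print(unicodedata.normalize("NFC", sequence) == sequence)
```

Passing carrier=" " gives the plain SPACE variant, which is the one that space-folding processes may disturb.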

Essentially, since the Unicode specifications say that a non-spacing 
diacritic can be applied to any base character, including the spaces, it 
is up to fonts and other presentation software to support this and to 
try to make the results look good according to orthographic and cultural 
expectations, just as it is with any text coded in Unicode.

Sometimes fonts don't do this. I would not at all be surprised to find 
for example that _g_ followed by U+0325 COMBINING RING BELOW would come 
out with the combining ring overlapping the tail of the _g_ unless I 
were using a font especially designed for linguistic use.

I would not be at all surprised that some fonts and display devices 
wouldn't justify NBSP + COMBINING DOT BELOW at the beginning of a line. 
But good typographical fonts should justify such combinations and should 
presumably change the width of NBSP when appropriate.

Such changes of width and shapes are what one finds with ligatures in 
fonts that support ligatures.

Jim Allan

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Kenneth Whistler

Peter Kirk wrote:

 I think this may be a Peter mistake. I meant to refer to spacing 
 diacritics. Sorry.
 
 It is certainly highly inappropriate for spacing diacritics to 
 be considered word boundaries.

Why? It is entirely dependent on the orthography and conventions
involved. There is probably as much (or more) bad ASCII usage
of spacing diacritics like `this', where a grave accent character
is being misapplied to make a directional quotation mark, as
there is actual, linguistically appropriate use of spacing
diacritics.

Also, everyone should consider carefully the status of UAX #29,
Text Boundaries.

<quote>
2 Conformance

This is informative material. There are many different ways to
divide text elements corresponding to grapheme clusters, words 
and sentences, and the Unicode Standard and this document do not
restrict the ways in which implementations can do this.

This specification is a <emphasis>default</emphasis> mechanism;
more sophisticated engines can and should tailor it for particular
locales or environments. ...
</quote>

The whole UAX is informative. It is a here's-how-you-can-approach-
the-problem implementation guide with some suggestions for
rules and classes.

*If* you are working with an orthography that uses one or more
spacing diacritics, and
*If* those spacing diacritics need to be represented by
SPACE, NSM sequences,

then you are in the situation where your implementation of
text boundaries should take SPACE, NSM sequences explicitly
into account, so as to result in expected behavior for that
orthography.
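
One possible tailoring, sketched in Python under the assumption that the orthography wants a SPACE, NSM pair kept inside the surrounding word. The function split_words is hypothetical and deliberately crude: it breaks only at whitespace and is not an implementation of the UAX #29 rules.

```python
import unicodedata

def split_words(text):
    """Split at whitespace, except a SPACE that carries a nonspacing mark."""
    words, current = [], []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        # SPACE followed by a nonspacing mark acts as a spacing diacritic here.
        carries_mark = ch == " " and nxt != "" and unicodedata.category(nxt) == "Mn"
        if ch.isspace() and not carries_mark:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words

# "ab", SPACE + COMBINING ACUTE ACCENT, "cd", then a real word break before "ef".
print(split_words("ab \u0301cd ef"))
```

A different orthography could just as reasonably want the opposite behavior, which is the point of leaving this to tailoring.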

Everyone has had experiences with their platform UI producing
bad results for text boundaries. The Solaris platform I am
writing this on right now, for example, implements a double-click
word selection that treats the string `this', above, including
the grave accent, the apostrophe, and the comma, as a word.
Is that right or wrong? Well, it depends on what you are trying
to do, I expect.

But even the most sophisticated platform implementers can only
do so much with processes like default word selection. It is
bound to be wrong for one purpose or another and for one
orthography or another. Ultimately you need to have tailored
processes that can be orthography-specific if you want to
get best results.

--Ken

Re: Conflicting principles

2003-08-14 Thread Peter Kirk

On 06/08/2003 14:04, John Jenkins wrote:

Speaking purely as an old fart, I'd say the former.  We already break 
the latter principle in Thai and Lao, and having to be prepared to scan 
either forward or backward from a base character in order to find its 
combining marks would add overhead to a lot of code, including 
existing code.

On Wednesday, August 6, 2003, at 2:16 PM, John Cowan wrote:

I would like to ask the old farts^W^Wrespected elders of the UTC
which principle they consider more important, abstractly speaking:
the principle that combining marks always follow their base characters
(a typographical principle), or that text is stored, with a few minor
exceptions, in phonetic order (a lexicographical principle).


John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/


This answer presupposes that there is a well-defined concept of which 
base character a combining mark belongs to. That is not always true. The 
particular combining mark which precipitated the debate may be situated 
above the gap between the (logically and phonetically) preceding and 
following characters, or may move on to the preceding or the following 
characters depending on the precise context and on the typographer's 
preference.

Anyway, John J, what code are we talking about that has to work from the 
positions of the combining marks back to the underlying representation? 
Are you talking about OCR?

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Valid encodings

2003-08-14 Thread Jony Rosenne

We need an official Unicode Lint.

Jony

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Philippe Verdy
 Sent: Thursday, August 07, 2003 4:28 PM
 To: [EMAIL PROTECTED]
 Subject: SPAM: Re: Questions on ZWNBS - for line initial 
 holam plus alef

 On Thursday, August 07, 2003 2:40 AM, Doug Ewell 
 [EMAIL PROTECTED] wrote:

  Kenneth Whistler kenw at sybase dot com wrote:

   But I challenge you to find anything in the standard that
   *prohibits* such sequences from occurring.

  I've learned that this question of illegal or invalid character 
  sequences is one of the main distinguishing factors between 
 those who 
  truly understand Unicode and those who are still on the Road to 
  Enlightenment.

  Very, very few sequences of Unicode characters are truly 
 invalid or 
  illegal.  Unpaired surrogates are a rare exception.

  In almost all cases, a given sequence might give unexpected results 
  (e.g. putting a combining diacritic before the base character) or 
  might be ineffectual (e.g. putting a variation selector before an 
  arbitrary character), but it is still perfectly legal to encode and 
  exchange such a sequence.
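
The distinction can be seen in a short sketch that uses Python's built-in UTF-8 encoder as a stand-in for "legal to encode and exchange". The function name can_exchange_as_utf8 is hypothetical, and the encoder's behavior is only one implementation's view of well-formedness.

```python
def can_exchange_as_utf8(text):
    """True if the string can be serialized as well-formed UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True

# Combining acute before its base: odd, but encodable.
print(can_exchange_as_utf8("\u0301a"))
# Variation selector before an arbitrary character: ineffectual, but encodable.
print(can_exchange_as_utf8("\uFE00a"))
# Unpaired high surrogate: rejected by the encoder.
print(can_exchange_as_utf8("\uD800"))
```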

 For Unicode itself this is true, but what users want is 
 interoperability of the encoded text with accurate rendering 
 rules. In practice, this means that any undefined or 
 unpredictable behavior will mean lack of interoperability and 
 should not be used.

 The standard should then highly promote what is a /valid/ 
 encoding for text with regard of interoperability for all 
 text processing algorithms including parsing combining 
 sequences, collation, and computing character properties from 
 those /valid/ encoded sequences.

 We don't have to care much if some encoded text considered 
 valid under Unicode/ISO-IEC10646 is rendered or processed 
 differently or unpredictably, provided that this does not 
 affect common text for actual languages.

 In fact the standard specifies that ALL sequences made of 
 code points in U+0000 to U+10FFFF (excluding U+xFFFE, U+xFFFF 
 and surrogates in U+D800 to U+DFFF) are valid under ISO/IEC 
 10646, but it does not attempt to assign properties or 
 behavior to ALL of these characters or encoded sequences, as 
 this is the job of Unicode to specify this behavior.

 If there's something to enhance in the Unicode standard (not 
 in the ISO/IEC 10646), it's exactly the specification of 
 interoperable encoded sequences. This certainly means that 
 concrete examples for actual languages must be documented. 
 Just assigning properties to individual ISO/IEC 10646 
 characters is not enough, and Unicode should concentrate more 
 efforts in the actual encoding of text and not only on 
 individual characters.

 So for me, the validity of text is a ISO/IEC 10646 concept 
 (shared now with Unicode versions for the assignment of 
 characters in the repertoire), related only to the legally 
 usable code points, and Unicode speaks about well-formed or 
 ill-formed sequences, or about normalized sequences and 
 transformations that preserve the actual text semantics.

 There is no ambiguity in ISO/IEC 10646 for the character 
 assignments. But composed sequences are the real problem, for 
 which Unicode must seek agreements: the W3C character model 
 is only based on the simplified combining sequences, but 
 Unicode should go further with much more precise rules for 
 the encoding of actual text, even before any attempt to 
 describe other transformation algorithms (only the NF* 
 transformations have for now a stability policy, but actual 
 text writers need also stability for the text composition 
 rules for actual languages).

 We certainly don't need more assigned code points for 
 existing scripts. But more rules for the actual 
 representation of text using these scripts, and how distinct 
 scripts can interact and be mixed. There are some rules already 
 specified for Combining jamos, or combining 
 Latin/Cyrillic/Greek alphabets, or for Hiragana/Katakana, but 
 we are still far from an agreement for Hebrew, and even for 
 some Han composed sequences, which still lack a specification 
 needed for interoperability.

 The current wording of Unicode validity is for me very 
 weak, and probably defective. What it designates is only a 
 ISO10646 validity for used code points, and the validity of 
 their UTF* transformations, based on individual code points. 
 The kind of validity rules users want with Unicode is a 
 conformance of the actually encoded scripts for actual 
 languages, for interoperability and data exchange.

 The fact that Unicode is born by trying to maximize the 
 roundtrip convertibility with legacy codepages or encoded 
 character sets has introduced many difficulties: first the 
 base+combining characters model was introduced as fundamental 
 for alphabetized scripts with separate letters for vowels. 
 Then there's the case of Brahmic scripts which complicates 
 things,

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Peter Kirk

On 08/08/2003 08:54, Philippe Verdy wrote:

... Could there be another codepoint assigned that has

these properties:

20CF;ZERO WIDTH SYMBOL;Sk;0;ON;<compat> 0020;;;;N;;;;;

i.e. being considered symbolic, not a whitespace, with
combining class 0 (not combining), and used as an
explicit base for an isolated spacing diacritic to never show
with a dotted circle? (note U+20CF is just a suggestion, as
it fits at end of the symbolic block used for currency symbols,
just before the extended combining characters block, and
because the U+02XX block where other Sk spacing
diacritics are defined is full).
The compatibility decomposition to a space is to make it
in sync with other compatibly decomposable spacing
diacritics.
The new character would make it possible to represent diacritics that
currently don't have a spacing counterpart, and to use them as if they
were letter-like. Let's look at a similar diacritic which currently has
an existing precombined spacing version:
00B4;ACUTE ACCENT;Sk;0;ON;<compat> 0020 0301;;;;N;SPACING ACUTE;;;;
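
For readers unfamiliar with the record layout, here is a rough sketch that splits a UnicodeData-style line into named fields and compares the existing U+00B4 entry with what Python's unicodedata module reports. The FIELDS labels and the parse_record helper are illustrative names, not official identifiers.

```python
import unicodedata

# Illustrative labels for the 15 semicolon-separated fields of a record.
FIELDS = ("code", "name", "general_category", "combining_class", "bidi_class",
          "decomposition", "decimal", "digit", "numeric", "mirrored",
          "unicode_1_name", "comment", "uppercase", "lowercase", "titlecase")

def parse_record(line):
    """Split one UnicodeData-style line into a dict keyed by FIELDS."""
    values = line.strip().split(";")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))

acute = parse_record(
    "00B4;ACUTE ACCENT;Sk;0;ON;<compat> 0020 0301;;;;N;SPACING ACUTE;;;;")

print(acute["general_category"], acute["decomposition"])
# The same compatibility decomposition as reported by the library:
print(unicodedata.decomposition("\u00B4"))
# NFKD applies it, yielding SPACE + COMBINING ACUTE ACCENT.
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKD", "\u00B4")])
```

A record for the proposed carrier character would parse the same way, with a decomposition field of "<compat> 0020".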



 

Philippe, this sounds like an excellent suggestion, at least in general 
terms. There is a missing function here, which has been provided (since 
Unicode 1.0) by overloading the characters space and NBSP with an 
inappropriate second function. Of course we can't make existing practice 
illegal, but we can recommend that in future versions of the standard 
your new ZERO WIDTH SYMBOL character should be used for display of 
isolated diacritics where there is no separate spacing form. We can also 
suggest that the width of the combination should be that of the 
diacritic only.

But I'm not sure that ZERO WIDTH SYMBOL is the best name, unless you are 
suggesting other uses in which it really has zero width. Well, it might 
have in a case like line initial holam which shifts on to a following 
silent alef, but that is a rather special case.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Which ancestral links

2003-08-14 Thread Raymond Mercier

Indeed, pardon my haste, that was a matter of an addition to the Syriac
script. For a comparison of the various scripts used for Sogdian,

http://iranianlanguages.com/midiranian/sogdian.htm#Alphabet

Raymond


- Original Message -
From: Michael Everson [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, August 08, 2003 5:43 PM
Subject: Re: Which ancestral links



 At 17:26 +0100 2003-08-08, Raymond Mercier wrote:
 John Clews writes:
 
   I've never seen a description of the Sogdian
   alphabet (i.e. I have never come across one): is there a good article
   or URL which illustrates such links?
 
 Here is a Unicode proposal for just that:
 
 http://wwwold.dkuug.dk/jtc1/sc2/wg2/docs/n2422.pdf

 That is not the Sogdian script.
 --
 Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Peter Kirk

On 08/08/2003 13:56, Thomas M. Widmann wrote:

Peter Kirk [EMAIL PROTECTED] writes:

 

On 08/08/2003 08:54, Philippe Verdy wrote:

   

... Could there be another codepoint assigned that has

these properties:

20CF;ZERO WIDTH SYMBOL;Sk;0;ON;<compat> 0020;;;;N;;;;;
[...]
 

But I'm not sure that ZERO WIDTH SYMBOL is the best name, unless you
are suggesting other uses in which it really has zero width. Well, it
might have in a case like line initial holam which shifts on to a
following silent alef, but that is a rather special case.
   

What would be a better name?  ACCENT CARRIER?

/Thomas
 

Perhaps CARRIER FOR COMBINING CHARACTERS - not COMBINING CHARACTER 
CARRIER, as that gives the wrong idea that this should itself be a 
combining character, which it should not be.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Roadmap---Mandaic, Early Aramaic, Samaritan

2003-08-14 Thread Kenneth Whistler

Elaine Keown responded to Michael:

  I really, really, really don't have time to debug your 
  dissatisfaction with the use of the word Aramaic in the Roadmaps. 
  This is NOT something anyone is working actively on right now. When a
 
 I'm not writing about nomenclature---not the point at all.
 I'm objecting to your endlessly fracturing such closely related
 scripts into completely different blocks, thus making Afroasiatic
 even harder to handle than it has to be.

Back to Michael's point: This is NOT something anyone is working
actively on right now. There are no active proposals for Nabatean,
Palmyran, Mandaic, whatever... Whether or not these end up encoded
in separate blocks is a matter of future debate, *when* an active
proposal or proposals are on the table stating the issues.
  
 You have no Semitists in your e-world, there is no one to fight you,
 no one except me and a few Hebraists care about the fate of
 electronic Afroasiatic.

I don't see how this is the case, given that you earlier scoped
Afroasiatic to include Ugaritic, Egyptian, Akkadian, and other
scripts.

  The borders we draw are based on the analyses of script experts.
 
 You've never had a Semitic script expert, that's the problem.

This is nonsense. We are beset with Semitic script experts.
What you might mean is that Michael doesn't have to hand an
expert on your range of early Aramaic scripts, in particular.
Or are you claiming that Hebrew and Arabic are not Semitic
scripts?
  
 If you continue at the rate you are going, you will continue to
 build codes that will torture me until I die.

If your strongly stated opposition to encoding Mandaic,
Samaritan, and early Aramaic (which you have subsequently
weakened by admitting, for example, that there is a separate
community of usage of Samaritan) means that you don't want
to represent some collection of early Aramaic scripts with
separate characters (but instead wish to display them as
font variants of Hebrew), then nobody is going to stop you.
As Michael indicated, you are perfectly free to represent them
all in Hebrew, if that is the best solution for your research.

And if the relations are as transparent as you indicate, then
conversion of other corpuses to match your own conventions
should be reasonably trivial, in any case.

--Ken
  
 
 This isn't an abstract and charming problem, like the conlangs, 
 these are real languages and real software will be built for them.  
 Maybe you have little interest in our small user community, but we are
 at least as large as the Samaritan one, although I admit they
 have far more interesting customs.
 
 Elaine

Re: Assume everything on this list is ignored (was Re: Newbie Question - what are all those duplicated characters FO R?)

2003-08-14 Thread John Cowan

Mark Davis scripsit:

 I repeat again. Nothing on this list has any guarantee that it will be
 seen by anyone in the UTC. If you want to submit a FAQ question that's
 great -- and I strongly encourage it. But please use:
 http://www.unicode.org/reporting.html to make sure it is tracked.

Hearing and obedience.

-- 
Work hard,  John Cowan
play hard,  [EMAIL PROTECTED]
die young,  http://www.reutershealth.com
rot quickly.  http://www.ccil.org/~cowan

Note about CGJ in current MS implementation

2003-08-14 Thread John Hudson

A note for those interested in how CGJ may be used in font lookups:

In the current MS implementation (Office 2002, Wordpad, etc.) if CGJ is 
inserted immediately after a space character it breaks RTL directionality. 
So for the time being at least, any use of CGJ to affect rendering in 
Biblical Hebrew (where it is really proving very useful in a variety of 
ways) requires that CGJ always be preceded by something other than space.
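A minimal sketch of how a text-preparation script might flag the problem sequence described above. It only locates a CGJ that directly follows U+0020; the function name is made up for illustration, and treating only U+0020 (not NBSP or other spaces) as "a space character" is an assumption.

```python
CGJ = "\u034f"  # COMBINING GRAPHEME JOINER


def cgj_after_space(text):
    """Return the index of every CGJ that directly follows U+0020 SPACE."""
    return [
        i
        for i in range(1, len(text))
        if text[i] == CGJ and text[i - 1] == " "
    ]


# A CGJ right after a space is reported; a CGJ after a letter is not.
print(cgj_after_space("a " + CGJ + "b"))
print(cgj_after_space("a" + CGJ + " b"))
```

Whether a flagged sequence actually misbehaves depends on the rendering engine; the check says nothing about that.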

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
The sight of James Cox from the BBC's World at One,
interviewing Robin Oakley, CNN's man in Europe,
surrounded by a scrum of furiously scribbling print
journalists will stand for some time as the apogee of
media cannibalism.
- Emma Brockes, at the EU summit

RE: Conflicting principles

2003-08-14 Thread ekeown

 Madison

Hi,

Only two people asked me what else exists
in the complete Hebrew character set, but
maybe others care.  

The significant points here are that there are 
other pointing systems to be combined with base
letters and that there are manuscripts that have 
TWO pointing systems marked on EACH consonant, 
sometimes two Hebrew ones, sometimes a Hebrew one 
AND an Arabic one.  And sometimes, in exotic Karaite 
manuscripts, there are Arabic letters with Tiberian 
pointing--there are some of these in England, 
Cambridge U, I think--Elaine
__

THE COMPLETE ARAMAIC / HEBREW CHARACTER SET
(PRELIMINARY--missing 11 Jewish dialects, 10 still
spoken)

SECTION A  ANCIENT OR COMMON SYMBOLS
                                 Net Count (subtracts overlap)
Original 22-letter alphabet        22
Epigraphic punctuation              4?
Epigraphic numbers                 11
Ezra's points                       2
Medial letters                      5
Tiberian pointing, etc             52
Other Hebrew ms symbols             7
TOTAL                             100?

SECTION B  VARIANT LETTERS FOR
           REGIONAL JEWISH LANGUAGES
Arabic (=Judeo-Arabic)              4
Berber (=Judeo-Berber)              0
Persian ()                          3
Tajik (=Bukhari)                    2
Tat                                 2
Krimchak                            1
Neo-Aramaic (=Kurdit)               1
Greek (written in Hebrew..)         1
French (written in Hebrew..)        3
Shuadit, Comtadin (Provencal
  written in Hebrew)                0
Italian                             1
Ladino                              2
Yiddish                             3
Net subset totals                  20

SECTION C  BABYLONIAN POINTING ETC
Babylonian                         35

SECTION D  PALESTINIAN POINTING ETC
Palestinian                        18

SECTION E  SAMARITAN POINTING ETC
Samaritan                          12

Net subtotals C,D,E                65

SECTION F  RARE OR UNIQUE SYMBOLS
Palmyrene dotted resh               1
Bodleian Hebrew e63                 1
Cairo Codex                         1

Total Aramaic / Hebrew to date    188 ?

I have the file with footnotes, but I don't know where--
packed somewhere

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Peter Kirk

On 05/08/2003 09:42, Jim Allan wrote:

Peter Kirk posted:

If I want to do this, should I explicitly encode a dotted circle, or
should I encode nothing and expect the font to generate the dotted
circle, as it often does? 


I think that the practice of a font or application automatically inserting 
a dotted circle under an orphaned combining character is dubiously 
compliant with Unicode specifications.

...


Thanks, Jim, for all this data, but now I am totally confused. Well, at 
least it seems clear that if I want a dotted circle I should explicitly 
encode it. But if I don't...

Suppose for example I want to write a sentence like "In this language 
the diacritic ^ may appear above the letters ...", but instead of ^ I 
want to use a combining character, a regularly positioned centred above 
the letter diacritic, which does not have a defined spacing variant. I 
don't want a dotted circle. And I want it to be spaced as here, i.e. 
with one space before the diacritic and one after it. It seems to me 
that at one place in the standard I am told to encode space - combining 
mark - space, for the combining mark will not combine with the space 
because the space is not a base character; and in another place I am 
implicitly told to encode space - space - combining mark - space, 
because the second space acts as a carrier for the combining mark.
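One way to check whether a given combining mark has "a defined spacing variant" is to look for characters whose compatibility decomposition is SPACE plus that mark. The following is only a sketch using Python's standard unicodedata module; the helper name is hypothetical, and restricting the scan to the BMP is an assumption made to keep it quick.

```python
import unicodedata


def spacing_clones(mark):
    """List BMP code points whose compatibility decomposition is SPACE + mark."""
    wanted = "<compat> 0020 %04X" % ord(mark)
    return [
        cp
        for cp in range(0x10000)
        if unicodedata.decomposition(chr(cp)) == wanted
    ]


# COMBINING DIAERESIS has a spacing clone (U+00A8 DIAERESIS) ...
print([hex(cp) for cp in spacing_clones("\u0308")])
# ... while COMBINING SEAGULL BELOW has none in the character database.
print(spacing_clones("\u033c"))
```

For a mark with no such clone, the choice is between the SPACE or NBSP carrier discussed in this thread and an explicit dotted circle; the database itself offers no ready-made spacing character.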

I hope that wanting to display this correctly is not another place where 
I have stepped over the boundaries of what is reasonable to expect 
plain text to convey, but that this too can be grist for the Unicode 
5.0 mill to grind very finely - both quotes from Ken Whistler earlier 
today. And I think that if this issue is clarified it will also become 
clear what should be done about string initial holam and alef etc.

Perhaps a simple way ahead would be to define a new character something 
like COMBINING MARK HOLDER with no glyph, which is defined specifically 
for this purpose, is a base character and not a format character, and is 
expected to be just as wide as is necessary to display the combining 
mark. Then we could say that a spacing accent is equivalent  (possibly 
even canonically if made a composition exclusion?) to COMBINING MARK 
HOLDER plus a non-spacing accent, and remove the misleading 
compatibility equivalences to SPACE plus a non-spacing accent.

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/

IETF, W3 ....?

2003-08-14 Thread ekeown

  Elaine Keown
  still in Madison

Dear John Cowan and Peter Kirk:

Could you possibly explain to me why these
other organizations---IETF and W3-- are
apparently concerned about character properties,
to the point where apparently they also have
a hand in deciding what will happen with
Hebrew?

For a long time, I thought that the
gatekeepers were the UTC and the people
in Tel Aviv, so there are these others?

Elaine

Aramaic scripts

2003-08-14 Thread Raymond Mercier





There are omissions in Michael 
Everson's chart in 
http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2311.pdf 


The chart was based on Semitic languages, although 
purporting to be about scripts. After all Greek and Latin also derive 
from the same family of scripts, as we all learn from page 1 of Greek grammars. 


There are less obvious omissions:

1. Kharoshthi, a RtoL script much used in North
West India, and regarded by everyone as a derivative from a form of the
Aramaic script used in that region. It is found on coins, Ashokan edicts,
various inscriptions and manuscripts. It was used to write mainly Prakrits,
although some Sanskrit text is known. See, for example, A.H.
Dani, Indian Palaeography, Oxford 1963.

2. Pahlavi, widely used to write Middle Persian. This
involved a troublesome mixture of Persian reading of Aramaic words, a
subject requiring more elaboration than is needed here.


Raymond Mercier

RE: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Jon Hanna

 3) In attribute values that have a declared type other than
 CDATA, multiple
spaces are compressed to a single space, and leading and
 trailing spaces
are removed.  After this is done, there can be no spaces in attributes
of type ID, IDREF, ENTITY, NMTOKEN, NOTATION, or enumerated types.
In the types IDREFS and ENTITIES, spaces are used to separate
individual tokens, none of which may begin with a combining character.
In the remaining type, NMTOKENS, individual characters may begin
with a combining character, so it is possible that such a token, if
not the first in the attribute, will be rendered in a peculiar way,
with the combining character placed over the separating space.
But that is a mere rendering glitch and in no way affects anything.
 
 
 Not just a rendering glitch, I suspect. If the combining character is
 combined with the separating space, the space loses many of its
 separating functions, and perhaps keeps a confusing subset of them with
 all sorts of possibilities of error. At best tokens beginning with
 combining characters will be unusable. At worst they will crash the
 implementation (and count on someone trying deliberately to do that!).
 The only safe thing to do is to specify that space followed by a
 combining mark is NEVER considered to be a space and this combination is
 NEVER generated.

No, the safe thing to do (and the thing that is done) is to treat the space
as a space, ignoring the fact that the NMTOKEN contains a combining
character. This is even safer than your suggestion, since it can't
mis-identify the combining properties of a character.

This effectively bans space+combining (and for that matter NBSP+combining
since NBSP isn't allowed in NMTOKENs) within an NMTOKEN and means that if
you attempt to begin an NMTOKEN with space+combining it will be treated as
beginning with the combining character.
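The behaviour described above can be sketched in a few lines. This is not a real XML parser, only an illustration of the normalization applied to an attribute of type NMTOKENS under the rules quoted earlier in the thread; the function name and sample value are made up.

```python
import re
import unicodedata


def split_nmtokens(value):
    """Normalize an NMTOKENS attribute value and split it into tokens.

    White space is turned into U+0020, runs are collapsed, and leading and
    trailing spaces are dropped. No character properties are consulted.
    """
    normalized = re.sub(r"[\t\n\r]", " ", value)
    return [tok for tok in normalized.split(" ") if tok]


# The second token begins with U+0301 COMBINING ACUTE ACCENT. The splitter
# never asks whether a character combines, so the space still separates.
tokens = split_nmtokens("alpha  \u0301beta\ngamma")
for tok in tokens:
    starts_with_mark = unicodedata.combining(tok[0]) != 0
    print(repr(tok), "starts with a combining mark:", starts_with_mark)
```

The design point is that tokenization depends only on code point identity (U+0020), so a later change to any character's combining properties cannot change where tokens begin and end.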

The resulting loss of expressive power in having this banned is negligible.
It means that you can't use what is quite a linguistic oddity
(space+combining is mainly used in meta-discussion of combining marks, as was
mentioned earlier) in a context where it is human-readable (hopefully) but
not fully general text. NMTOKENs should only be given raw to a user by
relatively low-level tools (i.e. general purpose XML tools for developers);
in other contexts they should be represented by a more user-friendly and
application-appropriate indicator (perhaps text, perhaps not), so the
inability to use space+combining won't apply at that level.

Re: IETF, W3 ....?

2003-08-14 Thread John Cowan

[EMAIL PROTECTED] scripsit:

 Could you possibly explain to me why these
 other organizations---IETF and W3-- are
 apparently concerned about character properties,
 to the point where apparently they also have
 a hand in deciding what will happen with
 Hebrew?
 
 For a long time, I thought that the
 gatekeepers were the UTC and the people
 in Tel Aviv, so there are these others?

The IETF and the W3C do not care in the least what properties are assigned
by the Unicode Consortium to any specific character, or what treatment
is given to any specific script.

They do care very much that the Unicode Consortium, having made certain
guarantees of stability (viz. that certain character properties would
not be changed), abides by those guarantees.

It's pretty well agreed by those who care that the combining classes of
Hebrew vowel signs were assigned badly.  Unfortunately, nobody pointed
out the problem (or not forcibly enough) during the period 1991-1999
when something could have been done about it.  It's too late to do
anything about it now without breaching those guarantees.  The Unicode
Consortium's word is its bond.
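To make the complaint about Hebrew combining classes concrete, here is a small sketch using Python's unicodedata module. The choice of lamed with patah followed by hiriq is an illustrative assumption (it is the combination usually cited for the spelling of Yerushalayim), not something stated in the message above.

```python
import unicodedata

LAMED, PATAH, HIRIQ = "\u05dc", "\u05b7", "\u05b4"


def canonical_order(text):
    """Return the NFD form, which applies canonical reordering of marks."""
    return unicodedata.normalize("NFD", text)


typed = LAMED + PATAH + HIRIQ   # the order the scribe or typist intends
stored = canonical_order(typed)  # the order after normalization

# Each vowel point has its own fixed-position combining class, so
# normalization sorts them by class number rather than keeping typed order.
print([unicodedata.combining(c) for c in typed])
print([unicodedata.name(c) for c in stored])
```

Because the two points carry different class values, normalization is free to swap them, and the stability guarantees mean those class values cannot now be changed.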

-- 
John Cowan  [EMAIL PROTECTED]  www.ccil.org/~cowan  www.reutershealth.com
I must confess that I have very little notion of what [s. 4 of the British
Trade Marks Act, 1938] is intended to convey, and particularly the sentence
of 253 words, as I make them, which constitutes sub-section 1.  I doubt if
the entire statute book could be successfully searched for a sentence of
equal length which is of more fuliginous obscurity. --MacKinnon LJ, 1940

RE: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Kent Karlsson


 The NFD decompositions of spacing marks are already defined as a SPACE
 plus a non-spacing combining character. 

Philippe, please!  Those are *compatibility* decompositions. The normal
form NFD only uses *canonical* decompositions. And there is no such
thing as NFD decompositions.
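The distinction can be seen directly with Python's unicodedata module. This is only an illustrative sketch; the helper name is made up.

```python
import unicodedata


def forms(ch):
    """Return the (NFD, NFKD) normalizations of a string."""
    nfd = unicodedata.normalize("NFD", ch)
    nfkd = unicodedata.normalize("NFKD", ch)
    return nfd, nfkd


# U+00A8 DIAERESIS has only a *compatibility* decomposition, so NFD leaves
# it alone and only NFKD rewrites it as SPACE + COMBINING DIAERESIS.
print([hex(ord(c)) for c in forms("\u00a8")[0]])
print([hex(ord(c)) for c in forms("\u00a8")[1]])

# U+00E5 (a with ring above) has a *canonical* decomposition, so both
# forms decompose it the same way.
print(forms("\u00e5")[0] == forms("\u00e5")[1])
```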

/kent k

AL32UTF8 Vs UTF8

2003-08-14 Thread Jay Chandru

Greetings,

We are using Oracle9i with application tier as 11i.

I wanted to know the differences between AL32UTF8 and UTF8. My database (Oracle) will be in AL32UTF8 format. Will the applications that require multibyte characters work as they do in UTF8 format?

It would be great if anybody could give me a comparison of AL32UTF8 and UTF8.

Also, please list any requirement for third-party software for code page conversions in the case of AL32UTF8.

Thanks in advance,
-Jay
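No answer appears in this part of the archive, so the following is offered only as a hedged sketch. As far as Oracle's globalization documentation describes it, AL32UTF8 is standard UTF-8 (up to four bytes per character), while the older UTF8 character set stores supplementary characters as two three-byte surrogate codes, which matches the scheme known as CESU-8. The code illustrates that byte-level difference in plain Python; it does not touch any Oracle API, and the function names are made up.

```python
def utf8(text):
    """Standard UTF-8: a supplementary character takes four bytes."""
    return text.encode("utf-8")


def cesu8(text):
    """CESU-8 style: encode each UTF-16 code unit as if it were a character.

    A supplementary character therefore becomes two 3-byte sequences.
    """
    out = bytearray()
    utf16 = text.encode("utf-16-be")
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


sample = "\U00010400"  # DESERET CAPITAL LETTER LONG I, outside the BMP
print(utf8(sample).hex(), len(utf8(sample)))
print(cesu8(sample).hex(), len(cesu8(sample)))
```

For text confined to the Basic Multilingual Plane the two functions produce identical bytes, so a difference would only show up for supplementary characters.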

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Peter Kirk

On 06/08/2003 03:38, Kent Karlsson wrote:

Kenneth Whistler wrote:

Kent Karlsson said:

I see no particular *technical* problem with using WJ, though.  In
contrast to the suggestion of using CGJ (re. another problem) anywhere
else but at the end of a combining sequence. CGJ has combining class 0,
despite being invisible and not (visually) interfering with any other
combining mark. Using CGJ at a non-final position in a combining
sequence puts in doubt the entire idea with combining classes and
normal forms.

Why?

See above (I DID write the motivation!). Combining classes are generally
assigned according to typographic placement. Combining characters
(except those that are really letters) that have the same placement,
and interfere typographically are assigned the same combining class,
while those that don't get different classes, ...

Not true, as we have seen for Hebrew. It's supposed to be true, but 
isn't, and the problems can't be fixed.

... and the relative order is
then considered unimportant (canonically equivalent). How is then,
e.g. a, ring above, cgj, dot below supposed to be different from
a, dot below, cgj, ring above (supposing all involved characters
are fully supported), when a, ring above, dot below is NOT
supposed to be much different from a, dot below, ring above
(them being canonically equivalent)? ...

There is no difference when the characters really do not interfere 
typographically. But when they do, there is a real and, in some 
languages, meaningful distinction.
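The two cases in Kent's question can be compared mechanically. This is a minimal sketch with Python's unicodedata module; the helper name is made up, and comparing NFD forms is used here as the test for canonical equivalence.

```python
import unicodedata

RING_ABOVE, DOT_BELOW, CGJ = "\u030a", "\u0323", "\u034f"


def canonically_equivalent(x, y):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y)


# Without CGJ the two mark orders normalize to the same sequence.
print(canonically_equivalent("a" + RING_ABOVE + DOT_BELOW,
                             "a" + DOT_BELOW + RING_ABOVE))

# CGJ has combining class 0, so it blocks reordering across it and the
# two sequences stay distinct after normalization.
print(unicodedata.combining(CGJ))
print(canonically_equivalent("a" + RING_ABOVE + CGJ + DOT_BELOW,
                             "a" + DOT_BELOW + CGJ + RING_ABOVE))
```

This shows only what normalization does with the sequences; whether the preserved order should mean anything to a renderer is the point under dispute.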

...

... the only ways out seem to be to either formally deprecate
CGJ, or at least confine it to very specific uses. Other occurrences
would not be ill-formed or illegal, but would then be non-conforming.
 

OK, let's confine it to those specific uses where it is really needed, 
e.g. to get round the problem of combining characters with different 
combining classes which actually do interact typographically, and 
perhaps there was another one being suggested. I have no problem with 
that - as long as the list of permitted uses is not set in stone, so 
that new uses can be approved when they are discovered. But there is no 
good reason to object to its use in those cases where it is needed, 
simply because in many other cases it is not needed.

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Philippe Verdy

On Saturday, August 09, 2003 12:49 AM, Michael Everson [EMAIL PROTECTED] wrote:

 At 14:22 -0700 2003-08-08, Kenneth Whistler wrote:
 
  Philippe, you are tilting at windmills, here. There is no chance
  that the UTC is going to consider such a character, in my
  assessment, let alone give it the properties you suggest.
 
 Nor WG2 either.

Why is that? Is it because I am suggesting something that others may
find useful to fill a large gap in Unicode for spacing diacritics, but
I am not trusted enough, due to my errors or confusions here, for the
suggestion to be endorsed by more serious UTC or WG2 members?

I admit that the properties of such a character can be discussed, and
that it is not necessarily an Sk symbol but possibly an Lo letter, in
which case the name INVISIBLE LETTER may be appropriate (where
it could also fill the gap for Hebrew Yerushala(y)im, but this is
possibly a distinct function for a missing letter in phonology).

Why do you think it is stupid to have a single carrier character that
would avoid adding new spacing diacritics, when the standard
combining diacritics could then be used, without quirks like
defective sequences, to produce the desired effect?

If you think that spacing diacritics are stupid, why then are they
given these properties and not deprecated (no longer recommended)
in the standard, in favor of the SPACE+diacritic sequences, which
are really not equivalent to spacing diacritics used as symbols
(sometimes also described as MODIFIER LETTER, which is
very misleading given their gc=Sk property) and as base
characters (to which other diacritics can be applied)?

-- 
Philippe.
Spam not tolerated: any unsolicited message will be
reported to your Internet service providers.

Re: Conflicting principles

2003-08-14 Thread Philippe Verdy

On Friday, August 08, 2003 9:16 PM, Peter Kirk [EMAIL PROTECTED] wrote:

 On 07/08/2003 13:57, John Cowan wrote:
 
  ... But an immediate problem comes to mind: what if there is a
  line break between the two base characters?
 
 What if there is a line break between the two characters joined by a
 double width combining character?
 
 Are arbitrary line breaks in the middle of words actually permitted
 anyway? Presumably any line breaking property of the first base
 character of the pair is cancelled anyway. That leaves a problem only
 if the second base character has a line break before possibility.
 Well, that could just be treated as one of the sequences we were
 discussing yesterday, not illegal Unicode but its rendering is
 undefined. 

Such a break in the middle of a multiple-width diacritic exists in some
notations, and is not considered horrible typography.
Just look at musical notation, where an upper horizontal parenthesis
is used to group some elements (sorry, I don't know exactly what it is
called in English or Italian) even when there is a measure break
in the middle, which may carry over to the next musical line: you end
up with two parts of the same diacritic broken across the lines.

-- 
Philippe.

Re: Handwritten EURO sign

2003-08-14 Thread Pim Blokland

Michael Everson schreef:

 More
 horrifying is the idiotic "euro is immune to grammar" error which
 continues to be broadcast daily by our television and radio stations,
 all because people with power lacked the moral courage to say "oops,
 yeah, that was the wrong interpretation of the Directive which was
 intended to ensure clean typography". Sigh.

I have absolutely no idea what you are talking about.

Pim Blokland

Re: Pigpen/Masonic/Poundex

2003-08-14 Thread Michael Everson

At 18:49 +0200 2003-08-08, Chris Jacobs wrote:

This seems to be a clear difference from colorful scripts, where I think
there is an agreement about which glyph represents which sound.
So I think the analogy between pigpen and colorful scripts does not hold.

Two gifs on two websites do not constitute actual use of a script, 
nor a need for real users to interchange it.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread John Hudson

At 05:27 PM 8/8/2003, Kenneth Whistler wrote:

Because the mechanism for doing so -- application to SPACE or
to NBSP -- has been specified by the standard for a decade now.

True enough, but I'm also a bit concerned about this mechanism because 
white space characters are another pesky thing that not all applications 
paint. TeX, perhaps most famously, uses its own 'glue' instead of the space 
glyph in the font. And what happens when word spacing is expanded or 
contracted in text? The diacritic mark ends up being shoved to the left or 
right of where it should be. Of course, if the space glyph is not painted 
you have to rely on blind offsets for mark positioning, because unpainted 
glyphs can't be found for smart positioning lookups. As someone who cares 
about typography, I don't like blind offsets because they don't offer 
precise enough control: I would much rather have a mechanism that I can 
reliably and precisely use with glyph positioning lookups. I'm not 
suggesting that the use of space/nbspace for this purpose should be 
deprecated, only that an alternate mechanism would be useful for those who 
want more control of how combining marks are rendered on a blank base.

A similar but not identical issue was raised by Peter Constable when we 
were talking about Qere vs Ketiv readings in Biblical Hebrew. There are 
cases in which vowels are applied to elided consonants, which in some 
texts results in marks applied to a blank base in mid-word. In this case, 
my concern about using space or nbspace is that these imply a word break 
where there is not, in fact, any break in the word: the blank base is part 
of the word.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]

RE: Conflicting principles

2003-08-14 Thread Michael Everson

Ken's point of course is that however bizarre the backing store for 
Sindarin and English Tengwar modes may be, combining characters per 
se must follow their base characters no matter what.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: Conflicting principles

2003-08-14 Thread Philippe Verdy

On Thursday, August 07, 2003 11:29 PM, Michael Everson [EMAIL PROTECTED] wrote:

 Ken's point of course is that however bizarre the backing store for
 Sindarin and English Tengwar modes may be, combining characters per
 se must follow their base characters no matter what.

Even if that breaks the logical analysis of text?
How does the Sindarin mode affect the line or word breaking rule for example:
suppose that the combining character is coded after the next logical base character, 
would it be valid to break at this base character and thus send the combining vowel to 
the next line, where in fact what is intended is to use a vowel carrier for the 
combining character logically attached to the previous base character?

I don't know Tengwar's Sindarin mode enough to see how word breaking can affect the 
interpretation of text. But preserving the logical ordering of letters seems much more 
important for actual text encoding than just being constrained by combining rules that 
were created taking into account only the first encoded scripts for Latin, Greek, 
Cyrillic, Hebrew, Arabic and Hiragana/Katakana scripts that use combining characters.

The answer to such a question relates to other still unencoded scripts; 
you quoted some of them which have similar difficulties, which are not extinct, 
and which have a huge amount of existing texts to represent, including many modern languages 
whose speakers are only partly literate and who would benefit from a written form 
modeled on similar languages spoken and written in a cultural region, notably in 
Africa, Central Asia, and Oceania (regions that have suffered for too long from the 
absence of an easy to adapt and learn writing system for minority languages).

Even in India, there is still no consensus for the use of the ISCII-based writing 
system for Brahmic scripts, and the current work on Tibetan or on Indo-Aryan languages 
show that the currently officially adopted system does not fit the cultural demand of 
minority users, because the official writing system does not fit very well their 
language.

There will certainly not be a huge revolution in writing systems (families of scripts 
with similar behaviors), but existing systems will still continue to be adapted to fit 
local cultural demands for minorities and specialized areas, that a too strict 
encoding model proposed now by Unicode cannot fit well. Some examples include text 
that use a non linear layout, where the layout carries important semantics (examples 
are numerous for hieroglyphic languages, one of which having modern use and not 
fitting well with Unicode which often fails to represent clusters with simple 
combining sequences assuming a base character and diacritics).

If one looks at Korean jamos, the problem has only been solved by actually *reducing* 
the number of layout combinations, and creating artificial letters (jamos) for some 
combinations that are logically perceived as multiple letters (for example the 
SSANGKIEOK jamo, which is really a pair of KIEOK letters), which are only partly 
decomposed and represented as their component letters, whose composition layout is 
greatly simplified but does not match correctly the historic Hangul clusters.

Probably the same thing can be said about Han ideographs, constantly updated to 
present new clusters, and even Hiragana/Katakana clusters currently represented as 
single codepoints when in fact they are really composed, and constantly enriched with 
new clusters notably in the scientific area. To allow users to create their own 
clusters, Unicode has added ideographic description characters which are controls used 
as prefixes for a combining sequence containing base letters. This is already a 
break in the axiomatic view of combining sequences made with a single base letter.

Other areas where combining sequences do not follow this model are of course the 
Hangul script, the CGJ character used between two base letters, double (width) 
diacritics, ... Really there already exist many exceptions to the axiomatic view of 
combining sequences, and I don't see why there could not exist a model allowing new 
classes of combining characters attached to a *following* base character, such as for 
Tengwar Sindarin vowels (if we suppose that Sindarin vowels are encoded separately 
from Quenya vowels, because of their distinct combining properties, and because the 
Tengwar script is really a family of related scripts, which contains many more 
differences than between the separate Latin, Greek and Cyrillic scripts).

So one cannot be satisfied by the currently limited model with a single base letter 
and combining modifiers, which would create an artificial hierarchy between letters, 
that does not fit the cultural semantics of the encoded language.

-- 
Philippe.

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Philippe Verdy

From: John Cowan [EMAIL PROTECTED]

 Peter Kirk scripsit:

  So far so good, but when I get to an accent with no predefined spacing
  variant, I have a problem!

 No you don't.  If you want to say Seagull is the diacritic used to
 represent linguolabial sounds in the IPA, then you just encode U+0020
 U+033C at the beginning of the next line.  If the seagull doesn't line
 up properly, you complain to the foundry or the implementor.

It's true that you can complain to a foundry about inappropriate glyph
positioning, but not to an implementor of other components dealing with
text boundaries. The inaccuracies we are speaking about are not in the
glyph representation but in text handling algorithms, the latter being
clearly part of the Unicode standard, unlike font problems.

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Philippe Verdy

On Wednesday, August 06, 2003 11:48 PM, Peter Kirk [EMAIL PROTECTED] wrote:

 OK, what kind of markup should I use, in any well-known markup
 language, to ensure that an isolated diacritic is centred in the
 space between the words before and after it?

In plain text, I think that this encoding:
...endOfWord1, SPACE, SPACE, diacritic, SPACE,
startOfWord2...
is what you need, as it creates the following combining sequences:
[...endOfWord1], [SPACE], [SPACE, diacritic], [SPACE],
[startOfWord2...]

If you don't want any space around the diacritic which must be displayed
isolated but in the middle of a word, the following would work:
...endOfWord1, SPACE, diacritic, startOfWord2...
Here the SPACE is not a break opportunity, but just the base character
for the diacritic inserted. What is missing in the standard is defining the
property of such SPACE+diacritic sequence: normally it inherits the
properties of the base character, and properties of diacritics are ignored.
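The two encodings above can be checked with a small sketch that groups each character with the marks that follow it. It assumes the reading under which a combining mark attaches to whatever character precedes it, SPACE included, which is exactly the point being debated in this thread; the function name is made up, and Python's general category M* is used as the test for a combining mark.

```python
import unicodedata

CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT


def combining_sequences(text):
    """Split text into units of one character plus any following marks."""
    seqs = []
    for ch in text:
        if seqs and unicodedata.category(ch).startswith("M"):
            seqs[-1] += ch  # attach the mark to the preceding unit
        else:
            seqs.append(ch)  # start a new unit
    return seqs


# Spaced form: word, SPACE, SPACE + diacritic, SPACE, word.
print(combining_sequences("ab  " + CIRCUMFLEX + " cd"))
# Unspaced form: word, SPACE + diacritic, word.
print(combining_sequences("ab " + CIRCUMFLEX + "cd"))
```

In the first case one plain SPACE survives on each side of the SPACE + diacritic unit; in the second the only SPACE is consumed as the carrier, which is why its line-breaking behaviour becomes the open question.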

But when using a SPACE or NBSP base character new properties may
be needed. If there's still a break opportunity on the base SPACE of a
combining sequence, it is not clear where the break occurs: before the
SPACE (i.e. before the combining sequence), or after the diacritic (i.e.
after the combining sequence)?

I think that the second option applies here, i.e. the base SPACE would
create a break opportunity at end of the whole combining sequence
made with a SPACE and the following combining characters (including
CGJ if needed to fix canonical ordering).

Another similar case would be the use of an isolated nukta (which
normally modifies a following base character): the sequence
nukta, SPACE is a single combining sequence with a break
opportunity. So a sequence like nukta, SPACE, acute accent
would be unbreakable but would include a break opportunity at its
end, unless it is followed by a NBSP.
And the sequence nukta, NBSP, acute accent would also be
unbreakable either in the middle or on both ends.

-- 
Philippe.

RE: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Jony Rosenne

I would like to point out that with all due respect, how particular fonts or rendering 
engines behave is only marginally relevant to the Unicode list. I think that we should 
deal only with the Unicode specification.

A particular implementation or many implementations may not behave as expected, and 
then may be either conformant or non-conformant, or may behave as expected and still 
be either conformant or non-conformant. Messages such as the attached help the 
discussion of the specification only as illustrations and as a basis for discussing 
conformity.

Jony

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Peter Kirk
 Sent: Wednesday, August 06, 2003 12:11 PM
 To: Curtis Clark
 Cc: Unicode List
 Subject: Re: Display of Isolated Nonspacing Marks (was Re: 
 Questions on ZWNBS...)
 
 
 On 05/08/2003 16:59, Curtis Clark wrote:
 
  on 2003-08-05 15:31 Peter Kirk wrote:
 
  Thank you, Mark. This helps to clarify things, but still doesn't
  explicitly answer my question of how to encode a sentence 
 like In 
  this language the diacritic ^ may appear above the letters 
 ..., but 
  instead of ^ I want to use a combining character  and want to 
  display exactly one space before the combining character - do I 
  encode two spaces or one?
 
 
  In this language the diacritic   may appear above the letters...
 
  Two spaces, at least in Thunderbird Mail.
 
 
 Thank you. Well, this sort of works. I looked in various 
 fonts. In some 
 of them the diacritic is centred in the space between the words 
 diacritic and may, but in others it is offset to the left or the 
 right. The problem is that the space is wider than the 
 diacritic, which 
 confuses things, and all the more so no doubt if it expands for 
 justification. NBSP would probably be a better choice in that 
 it is less 
 likely to expand. But what I am looking for is a diacritic 
 holder which 
 is defined to be only as wide as the diacritic. On the principle that 
 base characters expand to fit the width of the diacritic,  ZWSP or, 
 better, a real (rather than misnamed) zero width no break space would 
 seem to have the right properties for that.
 
 -- 
 Peter Kirk
 [EMAIL PROTECTED]
 http://web.onetel.net.uk/~peterkirk/

Re: Handwritten EURO sign

2003-08-14 Thread Michael Everson

At 08:55 -0700 2003-08-05, Doug Ewell wrote:

The original legislative attempt to dictate the exact proportions (and
even color) of the euro sign, regardless of the font in use, was just
silly.

That is very old history, as detailed on my website 
(http://www.evertype.com/standards/euro/euroglyph.html). More 
horrifying is the idiotic "euro is immune to grammar" error which 
continues to be broadcast daily by our television and radio stations, 
all because people with power lacked the moral courage to say "oops, 
yeah, that was the wrong interpretation of the Directive which was 
intended to ensure clean typography". Sigh.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

RE: Assume everything on this list is ignored

2003-08-14 Thread Jill . Ramonsky


Isn't the very notion of submit[ting] a FAQ question a contradiction in
terms? Surely, one merely ASKS a question. If enough people ask the same
question, we may then classify it as frequently asked.

It's like this. Newbies want to find things out. So they read books, and
look around on the web. Eventually, they'll encounter some point of
confusion they can't resolve by their own research (or don't have time to
thoroughly research), so they will then find some forum to join in the hope
of finding somebody there who will know the answer.

This forum -- indeed, ANY forum -- will have questions asked on it. Some of
them may be asked frequently. These are, by definition, Frequently Asked
Questions _of the forum_. Forum FAQs are generally put together by long-term
members of forums who are sick of having to answer the same question over
and over again to all these damn newbies, or by other long-term members who
simply wish to cut down the traffic on the list.

Now this is, in fact, rather curious. Because the web page
http://www.unicode.org/consortium/distlist.html implies that _this_ list
(described as the Unicode Public E-mail List) is _the_ place for the
public to go to pose questions to the community of Unicode users. In THE
SAME PARAGRAPH that web page says as a courtesy to others on the list,
please check the ... Frequently Asked Questions [at
http://www.unicode.org/faq/]. (Which I did).

Now, if it is true, as Mark Davis suggests, that the Frequently Asked
Questions list at http://www.unicode.org/faq/ is unrelated to this list,
then:

(1) This should be made clear on the consortium's web page
(http://www.unicode.org/consortium/distlist.html), which currently implies
that the stated FAQ is the FAQ _of this list_, and

(2) This list should have a FAQ of its own, independent of the consortium's
FAQ, and maintained by long-term members of this list (i.e. by those who are
in a position to know which questions are, in fact, frequently asked).

...and for what it's worth, the consortium's submission form at
http://www.unicode.org/reporting.html seems (a) difficult to find without
knowing the URL (I couldn't find it anyway, at least not by starting at
www.unicode.org and clicking on links from there), and (b) intimidating --
it is not worded to encourage the I don't understand feature XYZ type of
question from the public. I am therefore forced to wonder who actually
_asks_ these frequently asked questions of theirs.

Just my thoughts. Please don't take any of this too seriously.

Jill



-Original Message-
From: John Cowan [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 12, 2003 1:35 AM
To: Mark Davis
Cc: [EMAIL PROTECTED]
Subject: Re: Assume everything on this list is ignored (was Re: Newbie
Question - what are all those duplicated characters FOR?)


Mark Davis scripsit:

 If you want to submit a FAQ question

The relation between Unicode and ISO/IEC 10646

2003-08-14 Thread Jony Rosenne

As far as I know, there are many topics not covered by ISO, for example
bi-directional behavior.

Jony

 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of souravm
 Sent: Tuesday, August 12, 2003 8:40 AM
 To: unicode
 Subject: SPAM: The relation between Unicode and ISO/IEC 10646



 Hi All,

 As I know, historically ISO/IEC 10646 (UCS) is from ISO and
 Unicode was defined by a consortium of major American
 computer manufacturers. From version 1.1 on, Unicode is
 scrupulously kept compatible with ISO/IEC 10646 and its
 extensions. The latest fact I found that Unicode 4.0
 character repertoire corresponds to ISO/IEC 10646:2003.

 Also I understand that from Unicode 2.0 onwards Unicode
 covers all the code points of UCS-4. Now, my doubt is, in the
 current situation,
 - What is the need for continuing both of these two different
 coded character sets in parallel? Why can't they be merged?
 - Is there any additional issues/points taken care of by
 ISO/IEC 10646:2003 which are not there in Unicode 4.0 and vice versa ?


 Any funda on this will be really appreciated.

 Regards,
 Sourav

RE: Conflicting principles

2003-08-14 Thread Kent Karlsson


  Collation isn't really based on combining sequences (even though UTS
10
  specifies a certain spanning over non-blocking (combining)
 
 This is a very ignorant question:  where in your public documentation
 are these issues discussed? 
...
 I still don't understand even what happens with basic 
 collation in Hebrew, what 
 effect the shin / sin dots have.  

Ignored at level 1, considered at level 2. From the 14651 data file:

U05C1 IGNORE;SHINP;MIN;U05C1 % HEBREW POINT SHIN DOT
U05C2 IGNORE;SINPT;MIN;U05C2 % HEBREW POINT SIN DOT
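
The "ignored at level 1, considered at level 2" behaviour can be sketched in a few lines of Python. This is only an illustration of the idea, not the UTS 10 or ISO/IEC 14651 algorithm: the function name sort_key is hypothetical, and raw code point values stand in for the real weights from the data files.

```python
import unicodedata

def sort_key(text):
    """Toy two-level key. Level 1 sees only base characters; level 2
    records the nonspacing marks attached to each base character.
    Assumption: code point values are used in place of real weights."""
    level1, level2 = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch) == "Mn":
            if level2:
                level2[-1] += (ord(ch),)   # attach mark to preceding base
            else:
                level2.append((ord(ch),))  # mark with no base before it
        else:
            level1.append(ord(ch))
            level2.append(())
    return (tuple(level1), tuple(level2))

SHIN = "\u05E9"
with_shin_dot = SHIN + "\u05C1"   # HEBREW POINT SHIN DOT
with_sin_dot = SHIN + "\u05C2"    # HEBREW POINT SIN DOT

# All three are equal at level 1; the dots only break ties at level 2.
print(sorted([with_sin_dot, with_shin_dot, SHIN], key=sort_key)
      == [SHIN, with_shin_dot, with_sin_dot])
```

With a key like this, unpointed and pointed spellings of the same consonants sort next to each other, and the points decide the order only among otherwise identical strings.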

 And, of course, I don't 
 understand any of the 
 more complicated issues either, such as what will happen when 
 your database 
 sorts un-pointed Hebrew epigraphy (just the consonants) and 
 pointed medieval 
 Hebrew (all the jots and tittles added).  

Re. collation, see UTS 10, and associated data files, and if you're 
really interested, see ISO/IEC 14651 (sort of a parallel to UTS 10,
but different), and its data file.

/kent k

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Peter Kirk

On 13/08/2003 15:54, Jony Rosenne wrote:

Suggested but not accepted.

I am inherently suspicious when pressure is being exerted to decide complex
and difficult questions in a hurry.
Jony
 

Jony, I am not trying to hurry anything. I am putting a lot of time and 
effort into trying to reach proper decisions on these complex and 
difficult questions. What I am not prepared to do is to accept a quick 
answer that the lowest common denominator of printers don't bother to do 
X, therefore we need not bother to support X in Unicode although X is a 
definite requirement of a significant subset of Hebrew users.

If you have problems with this particular suggestion, let's discuss them 
on the Hebrew list.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Display of Isolated Nonspacing Marks (problems with UAX#29)

2003-08-14 Thread Doug Ewell

Philippe Verdy verdy_p at wanadoo dot fr wrote:

 Note that these two ZW and SP classes of characters are *normative*.
 Another proof that SPACE+diacritics is really a hack causing lots of
 problems in the Unicode main standard and its standard annexes.

Has it occurred to anyone yet that the very *concept* of spacing
diacritics is a hack?  Spacing diacritics are used to conduct a sort of
meta-discussion about characters, as in "A base character o is combined
with an acute accent ´ to create ó."  They are not part of the normal
writing systems of most natural languages.

It is as if I were describing the two typical glyphs used for lower-case
g, the one with one bowl and the one with two bowls, but actually
showing the separate, constituent pieces of the glyphs instead of using
words to describe them.  They are interesting things to talk about, but
not necessarily things that need to be encoded in plain text.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

RE: ADO, SQL-Server and VB6

2003-08-14 Thread Jon Hanna

I might be able to help. Two questions:

1. How firmly have you tracked down the point at which this conversion
happens?

2. What is the datatype in the database? (text BLOB?, ntext BLOB? varchar?)

RE: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Kent Karlsson


Michael wrote:
 The Name Police reject this utterly. ZERO WIDTH cannot have an 
 expanding dynamic width.

Then what about ZERO WIDTH SPACE, which, according to TUS3, p. 238,
can grow to have a visible width when justified? And it has the
NamesList comment:
* nominally zero width, but may expand in justification

(But U+0082, BREAK PERMITTED HERE, which otherwise is very similar
to ZWSP according to 6429, does apparently not allow such stretching...)

/kent k

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Peter Kirk

On 11/08/2003 16:06, Mark Davis wrote:

Some of this seems to be in reference to an earlier contention that
Text Boundaries (inc. Lines) break between the space and the
non-spacing mark. I think this was attributed to Philippe.
[This may not be true: I don't actually read his email, because the
information content per line falls below my email threshold; not to
say that there may not be information there, but I cannot afford to
take the time to find out -- sadly, one of my character flaws.]
All of the text boundaries preserve grapheme cluster boundaries, which
never separate a base character (including space and NBSP) from a
following NSM. In addition, each of the boundary types above grapheme
clusters make some statement about the behavior of the grapheme
cluster. For example, with line boundaries a SPACE + NSM has a special
behavior. With the others, the behavior is the same as the base
character.
As Ken points out, in any event these are default boundaries, and can
be tailored. That being said, if the normal behavior of the default
can be improved, and someone has a concrete proposal for doing so,
then it can be considered.
Mark
__
http://www.macchiato.com
  Eppur si muove 
 

I was aware that there should not be a line break or word break between 
the space and the NSM, although I suspect that many implementers will 
not be aware of this, or at least will not test for it properly and so 
treat any space as a word break and a line break opportunity. As I just 
wrote, this requirement to test all spaces for following NSMs is a 
significant inefficiency built into the standard.

But there is still a problem if there is considered by default to be a 
word break and a line break opportunity AFTER the NSM. I would suggest, 
as a candidate for a concrete proposal, that the default behaviour be 
adjusted so that there is no word break or line break opportunity here 
either.
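
To make the extra test concrete, here is a rough Python sketch of the difference between an implementation that breaks after every SPACE and one that first looks at the following character. It is not an implementation of UAX #14; both function names are hypothetical, and "nonspacing mark" is taken to mean general category Mn.

```python
import unicodedata

def naive_break_opportunities(text):
    """Offsets just after every SPACE: what a careless implementation does."""
    return [i + 1 for i, ch in enumerate(text) if ch == " "]

def checked_break_opportunities(text):
    """Same, but a SPACE followed by a nonspacing mark (category Mn) is
    treated as the base of that mark, so no break is reported there."""
    result = []
    for i, ch in enumerate(text):
        if ch != " ":
            continue
        following = text[i + 1] if i + 1 < len(text) else ""
        if following and unicodedata.category(following) == "Mn":
            continue  # would split the space from its combining mark
        result.append(i + 1)
    return result

sample = "the diacritic \u0302 may appear"
print(naive_break_opportunities(sample))    # includes the offset before U+0302
print(checked_break_opportunities(sample))  # omits it
```

The lookahead is the cost being complained about: every space has to be examined together with the character after it.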

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Unicode Technical Note added

2003-08-14 Thread Rick McGowan

A new Unicode Technical Note on Deterministic Sorting is now available:

http://www.unicode.org/notes/tn9/

Unicode Technical Notes provide for the publication of information that  
may be of interest to implementers or readers of the Unicode Standard, or  
to users of programs which implement the Standard.

The complete list of available notes is accessible here:

http://www.unicode.org/notes/


Regards,
Rick McGowan
Unicode, Inc.

Roadmap-Mandaic, Early Aram., Samarit Alternative Mel Gibson

2003-08-14 Thread ekeown


 Elaine Keown
 still in Madison WISC

Hello,

Responding again to the deep interest in Aramaic expressed
on the list, I am writing with a suggested preliminary 
Alternative or possibly Countercultural version of the 
Roadmap and a New, Improved Acronym for EUSAS (Egyptian, 
Akkadian, Ugaritic, Semitic Alphabetic and Syllabic)...

And, slightly OT, I imagine you all are also waiting 
breathlessly for the new Mel Gibson movie which is, of course, 
going to be in ARAMAIC with NO subtitles, not even Unicode-
conformant ones.  If Aramaic is trendy in LA, when will it
hit Mountain View?

Here is the beginning of an Alternative Roadmap.

_Suggested Afroasiatic Roadmap Blocks_
Egyptian Hieroglyphics---the Aramaic glyphs for Aramaic
   in hieroglyphics (from Wadi El-Hol) are included
Egyptian hieratic---the Aramaic ones (see wadi, above) 
   are included
Egyptian demotic--Aramaic demots are included
The Cuneiform Block --- the one Aramaic cuneiform is included
  (and also the Arabic in cuneiform)

_CEUSAS_
Instead of describing the not-yet-encoded Middle Eastern/
N and East African scripts as EUASAS, I suggest 
CEUSAS ---Cuneiform, Egyptian, Ugaritic, Semitic Alphabetic and
Syllabic.  Under cuneiform go Sumerian, Akkadian (old Babylonian
and Assyrian), Hittite, Elamite and whatever.  

Cuneiform had a long shelf life--3,400 B.C. to about 125 A.D.

Elaine

Roadmap-Mandaic, Early Aram., Samarit Alternative Mel Gibson

2003-08-14 Thread Michael Everson

I think we will keep the Roadmap as it is for the time being.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Philippe Verdy

From: Kenneth Whistler [EMAIL PROTECTED]
 It is perfectly reasonable, as I see it, to consider the
 SPACE in a SPACE, NSM sequence to be:
   a. significant
   b. part of the characters in a document that are not markup
  (at least in the cases we are talking about, since the
  problem is not about defining Nmtokens for markup in
  Biblical Hebrew, but rather the representation of the
  Biblical Hebrew document content itself)
  
 So I *still* don't see the problem you are on about, and even
 if there was one, the xml:space attribute could be used to
 require preservation of a particular space.

Maybe you are forgetting that in XML and HTML, attributes
(including special attributes like xml:space) can have default
values, and in fact they have such values set in DTDs or
schemas by normative XML applications like XHTML.
Authors are not supposed to modify normative schemas or DTDs,
and so use elements with their default attributes. This is the case
of XHTML as an application of XML, and HTML as an
application of SGML (neither HTML nor SGML parsers will
interpret the xml:space attribute, and XML parsers will handle it
only if they are validating documents with their DTD or schema).

Re: [A12n-Collab] Creating fonts for Akan language

2003-08-14 Thread John Hudson

At 12:27 AM 8/7/2003, [EMAIL PROTECTED] wrote:

My desire is to create (make) a set of fonts for the Akan 
language for Windows 2000 to begin with. I have been able to create a 
crude version for my own use but I know that the people of Ghana would be 
very happy to be able to install a standardized version for their own 
use. I would also want to eventually map it to a keyboard, probably with 
extra keys for the two Akan characters.

My problem is:
1. How do I set out to create such a font?
2. How do I use the existing character 0190/025B in such a font?
3. How do I create and get the 15th character accepted in the 
Unicode set?
1. See www.fontlab.com
2. Make a Unicode encoded font (TrueType or CFF OpenType). For use in 
Windows 2000 or XP or other Unicode text processing environments, you do 
not need to worry about 8-bit codepages: so long as the glyphs for these 
letters are mapped to the correct Unicode characters in the font cmap 
table, they will work. If you want to make your own keyboard layout driver 
for Akan, you can use Microsoft's new Keyboard Layout Creator: 
http://www.microsoft.com/globaldev/tools/msklc.mspx
3. The 'open o' character is already included in the Unicode Standard. The 
uppercase letter is U+0186 and the lowercase is U+0254.
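
For anyone who wants to check the code points, a short Python snippet using the standard unicodedata module prints the character names and confirms the case pairs. It covers both the open e (0190/025B) mentioned in the question and the open o given in the answer; the dictionary name AKAN_EXTRA is only a label chosen for this sketch.

```python
import unicodedata

# Upper/lower pairs for the two extra letters discussed in this thread.
AKAN_EXTRA = {
    "open e": ("\u0190", "\u025B"),
    "open o": ("\u0186", "\u0254"),
}

for label, (upper, lower) in AKAN_EXTRA.items():
    for ch in (upper, lower):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    # The standard case mappings connect each pair.
    print(label, "case pair ok:", lower.upper() == upper and upper.lower() == lower)
```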

A couple of additional comments:

Akan is a tonal language, yes? This likely means that although the Bureau 
of Ghana languages specifies an alphabet of 22 letters there are 
circumstances in which it is necessary to indicate tones to differentiate 
otherwise identical words. For educational and lexicographical texts it may 
also be desirable to indicate nasalisation. This means that simply 
providing glyphs for the 44 upper- and lowercase letters might not be 
sufficient: you may also need dynamic mark positioning.

Microsoft are apparently releasing a number of updates to their core font 
set with upcoming versions of Office and Windows that will include 
extensive African language support.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
The sight of James Cox from the BBC's World at One,
interviewing Robin Oakley, CNN's man in Europe,
surrounded by a scrum of furiously scribbling print
journalists will stand for some time as the apogee of
media cannibalism.
- Emma Brockes, at the EU summit

Which ancestral links

2003-08-14 Thread John Clews

In message [EMAIL PROTECTED] Michael Everson writes:
Re: Colourful scripts and Aramaic

This is nearly off topic, but I'd be glad of any clarifications, or
references that anybody has.

In message [EMAIL PROTECTED] Michael Everson
wrote in response to Peter Kirk, with a clarification I agree with
mainly:

 People. It [Aramaic] is the widespread offshoot used throughout the
 Middle East that spawned Brahmic and Uighur and other scripts. It
 isn't necessarily the thing you think is confined to three scraps of
 papyrus or whatever.

I'd always been under the impression that the Brahmic script family
and their offshoots, and the Phoenician script family and their
offshoots, developed independently of each other, and although links
between the two families had been suggested by some scholars, many
other scholars disagreed with this suggestion.

Are there some articles which show these links reasonably well, and
if so, which family predated the other?

Also Uighur script (as in Old Uighur, as in Sogdian) has, as a
cursive script, a superficial resemblance to Arabic script (an
offshoot from the Phoenician family) and I imagine that links are
more easy to show. I've never seen a description of the Sogdian
alphabet (i.e. I have never come across one): is there a good article
or URL which illustrates such links?

Best wishes

John

--
John Clews,
Keytempo Limited (Information Management),
8 Avenue Rd, Harrogate, HG2 7PG
Tel:+44 1423 888 432
mobile: +44 7766 711 395
Email:  [EMAIL PROTECTED]
Web:http://www.keytempo.com

Unicode 4.0 is online at last!

2003-08-14 Thread Kenneth Whistler

Well, I've been promising that good things would come
to those who wait. ;-)

At last, the Unicode website has been updated with the
online chapters for Unicode 4.0. See:

http://www.unicode.org/versions/Unicode4.0.0/

Or just go to the Unicode 4.0 link from the home page.

Enjoy.

--Ken

P.S. Just FYI, Peter K., now it is o.k. for everyone to come
back from their August Unicode vacations. Let the
textual criticism begin!

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Philippe Verdy

- Original Message - 
From: Peter Kirk [EMAIL PROTECTED]
To: Jon Hanna [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Wednesday, August 13, 2003 3:05 PM
Subject: Re: Questions on ZWNBS - for line initial holam plus alef

 On 13/08/2003 04:44, Jon Hanna wrote:

 No, the safe thing to do (and the thing that is done) is to treat the
space
 as a space ignoring the fact that the NMTOKEN contains a combining
 character, this is even safer than your suggestion since it can't
 mis-identify the combining properties of a character.

 OK, it's safe, but it is a misuse of Unicode. As space plus combining
 character is a unit in Unicode, it should be treated as a unit by
higher
 level protocols. If higher level protocols are allowed to do arbitrary
 things within Unicode units, there is no end to the possible
confusion.
 See for example, from Unicode 4.0 chapter 3:

 C7 A process shall interpret a coded character representation
according
 to the character
 semantics established by this standard, if that process does interpret
 that coded character
 representation.

OK, but XML inherits its behavior from SGML and you won't change it.
The only way to bypass this would be to use entity references to encode
the base space needed by the Unicode convention, so this is related to
what Unicode defines as a higher level protocol, needed here to bypass
the limitations of basic text. However it still creates a problem within
CDATA sections, which are not supposed to contain entity references.
One needs then to use the XML CDATA escaping mechanism with
another escaping system specific to CDATA sections (which are
formally anonymous text elements and equivalent to them).

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Peter Kirk

On 11/08/2003 06:59, Jon Hanna wrote:

There are only two theoretical problems that I can see here, the first is
that a whitespace character other than space gets converted to space by
attribute value normalisation, and that this changes the meaning of the text
in some way. This could only occur if the combining character were the first
character in a line of text, which is quite a nonsensical construct to begin
with.
 

Not at all! Imagine a tutorial on a language, which might well list the 
accents used, in a format like this:

` (grave accent) is used with a, e and o, and indicates more open 
pronunciation
^ (circumflex accent) is used with any vowel, and indicates lengthening

So far so good, but when I get to an accent with no predefined spacing 
variant, I have a problem!

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Peter Kirk

On 13/08/2003 04:44, Jon Hanna wrote:

No, the safe thing to do (and the thing that is done) is to treat the space
as a space ignoring the fact that the NMTOKEN contains a combining
character, this is even safer than your suggestion since it can't
mis-identify the combining properties of a character.
 

OK, it's safe, but it is a misuse of Unicode. As space plus combining 
character is a unit in Unicode, it should be treated as a unit by higher 
level protocols. If higher level protocols are allowed to do arbitrary 
things within Unicode units, there is no end to the possible confusion. 
See for example, from Unicode 4.0 chapter 3:

C7 A process shall interpret a coded character representation according 
to the character
semantics established by this standard, if that process does interpret 
that coded character
representation.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-14 Thread Jim Allan

Philippe Verdy posted:

Could ZWS+combining diacritic may be the best solution for
isolated diacritics in text? 
From http://www.unicode.org/book/ch04.pdf:

 * Such characters may be large enough to effect the placement of
their base character relative to preceding and succeeding base
characters. For example, a circumflex applied to an i may effect
spacing (î), as might the character U+20DD COMBINING ENCLOSED CIRCLE. 
Unless Unicode 4.0 has changed this the words may and might here 
would indicate that ZWSP is not *necessarily* the best solution.

There is no specification about what an application *must* do to be 
conforming in this circumstance, merely indication that an application 
that does expand spacing for the sake of appearance is not 
non-conforming. It is *probably* implied that this is the right way to go.

But I would guess that it would also be conforming for an application to 
not expand spacing at all on ZWSP so that coding of _o_ + ZWSP + 
COMBINING CIRCUMFLEX + _o_ would place the circumflex centered over _oo_ 
with its center point between the two letters.

Either result would be useful for different purposes.

It certainly makes sense that in the case of space characters that have 
a defined width that this width is innate to the definition of the 
character and in such a case should take precedence over the width of 
the normally non-spacing combining character.

I would welcome clear instructions by Unicode on this point where either 
result would be useful in order that applications may be expected to 
produce results that are consistent with each other. :-)

I would think it would be consistent with Unicode for an application to 
shrink the width of normal space followed by a diacritic such as a 
single overdot as exact formatting behavior is not defined in such cases.

Jim Allan

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Philippe Verdy

- Original Message - 
From: Doug Ewell [EMAIL PROTECTED]
To: Unicode Mailing List [EMAIL PROTECTED]
Cc: Peter Kirk [EMAIL PROTECTED]; Kenneth Whistler
[EMAIL PROTECTED]
Sent: Monday, August 11, 2003 5:39 PM
Subject: Re: Questions on ZWNBS - for line initial holam plus alef

 Peter Kirk peter dot r dot kirk at ntlworld dot com wrote:

  Thank you, Ken. Well, you make it sound as if the problems are
  minimal, and that version I can just about accept. But if Philippe is
  correct about what he says about UAX#29 and UAX#14, there are some
  more serious problems. It is certainly highly inappropriate for
  non-spacing diacritics to be considered word boundaries.

 Non-spacing diacritics had better not be word boundaries, otherwise a
 string like Québec (spelled with U+0301, as here) would be considered
 two words.  I don't have time right now to look up the relevant
 properties and UAX's, but I sincerely hope this is just another
 Philippe mistake and not a general misinterpretation that anyone might
 make.
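
The failure mode described in the quoted text is easy to reproduce. The Python sketch below is a toy word splitter, not UAX #29; the function name split_words and its marks_extend flag are hypothetical. With the flag off, any non-alphanumeric character ends a word, so a combining acute splits the word in two. With the flag on, a combining mark extends the word it follows.

```python
import unicodedata

def split_words(text, marks_extend=True):
    """Toy word splitter. A word is a run of alphanumeric characters;
    when marks_extend is true, combining marks (general categories
    Mn, Mc, Me) following a word character are kept inside the word."""
    words, current = [], ""
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if ch.isalnum() or (marks_extend and is_mark and current):
            current += ch
        elif current:
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words

quebec = "Que\u0301bec"  # e followed by U+0301 COMBINING ACUTE ACCENT
print(split_words(quebec, marks_extend=False))  # two pieces
print(split_words(quebec, marks_extend=True))   # one word
```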

Not a mistake from me, sorry. From you, yes: Peter Kirk probably wanted
to speak about *spacing* diacritics (when coded with SPACE+NSM).
There is no such *spacing* character in Québec.

Don't accuse me of something I did not say. And please be more tolerant
of what is an obvious typo in the message from Peter Kirk. Instead of
just flaming, could you read the message more carefully, accept errors and
correct them, rather than sending such unconstructive replies?

Thanks.

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Philippe Verdy

On Monday, August 11, 2003 12:27 AM, Kenneth Whistler [EMAIL PROTECTED] wrote:

 A point I keep trying to make, but which often gets overlooked
 by people trying to code Unicode mechanisms for dealing with
 edge cases, is that the design goal of the Unicode Standard is,
 and always has been, to represent *plain text content*. It
 cannot, and should not, IMO, deal with requirements for
 representing arbitrarily fine distinctions of typographical
 detail in all manuscripts and other documents in all writing
 systems of the world.

Spacing diacritics are not on the edge of the standard, when they
are already given a full block and handled there as symbols (not as
letters as suggested in some parts of UAX's), with their own identity
independent of their actual glyphic representation. I am not
discussing about the typesetting of these grapheme clusters but
really about the textual semantics of such combining sequences
with an invisible base character, affecting all their properties and
not fully described in the various standard annexes. Due to the
huge legacy use of SPACE+diacritics in legacy text, and the
already normative parts of some standard annexes, it will be hard
to correct the behavior or change the text of these annexes.
And it's where a new better base character than SPACE could
help solve cleanly the ambiguities.


-- 
Philippe.
Spams non tolérés: tout message non sollicité sera
rapporté à vos fournisseurs de services Internet.

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread John Cowan

Peter Kirk scripsit:
 On 13/08/2003 11:09, Philippe Verdy wrote:
 
 ... For this reason, defective
 combining sequences (combining characters without a leading base
 character) should be forbidden (invalid for XML).
  
 
 If there is even the remotest possibility of this happening, we need to 
 know quickly! 

As a member of the XML Core Working Group of the W3C, I can assure you that
there is not even the remotest possibility of it.

-- 
John Cowan  [EMAIL PROTECTED]http://www.ccil.org/~cowan
Is it not written, That which is written, is written?

RE: AL32UTF8 Vs UTF8

2003-08-14 Thread Carl W. Brown




Jay,

Oracle's UTF-8 is not really a valid encoding. It 
encodes surrogates as if they were characters. They kept the old Unicode 
2.x code that only supports BMP to provide sort key compatibility for clients 
who never upgraded to Unicode 3.0 support and are using 16 bit character 
encoding improperly. UTF8 sorts in the same way as the old 16 bit Unicode 
before surrogates. Do not use UTF8 because it is really not Unicode 
conformant with any Unicode standard. Instead use 
AL32UTF8.
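
To show what "encodes surrogates as if they were characters" means in bytes, here is a small Python sketch. It compares standard UTF-8 with a scheme that first splits a supplementary character into its UTF-16 surrogate pair and then encodes each surrogate as a three-byte sequence. That this is exactly what the Oracle UTF8 character set does is taken from the description above, not verified here, and the function names are hypothetical.

```python
def encode_utf8(ch):
    """Standard UTF-8: one sequence of up to four bytes per character."""
    return ch.encode("utf-8")

def encode_surrogates_separately(ch):
    """Surrogate-based scheme: characters above U+FFFF become two
    three-byte sequences, one per UTF-16 surrogate."""
    cp = ord(ch)
    if cp < 0x10000:
        return ch.encode("utf-8")
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)
    low = 0xDC00 + (cp & 0x3FF)
    # "surrogatepass" lets Python emit the otherwise invalid byte sequences.
    return b"".join(chr(s).encode("utf-8", "surrogatepass") for s in (high, low))

supplementary = "\U00010000"
bmp = "\uFFFD"
print(encode_utf8(supplementary).hex())                   # 4 bytes
print(encode_surrogates_separately(supplementary).hex())  # 6 bytes

# Binary order differs: real UTF-8 puts U+10000 after U+FFFD, while the
# surrogate-based form puts it before, matching 16-bit code unit order.
print(encode_utf8(bmp) < encode_utf8(supplementary))
print(encode_surrogates_separately(supplementary) < encode_surrogates_separately(bmp))
```

The last two comparisons illustrate the sort-key compatibility point: the surrogate-based bytes sort the way 16-bit code units did, which is also why they are not valid UTF-8.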

Carl


  -Original Message-
  From: [EMAIL PROTECTED]
  [mailto:[EMAIL PROTECTED] On Behalf Of Jay Chandru
  Sent: Sunday, August 10, 2003 8:58 AM
  To: [EMAIL PROTECTED]
  Subject: AL32UTF8 Vs UTF8
  Greetings,
  
  We are using Oracle9i with application tier as 11i.
  
  I wanted to know the differences between AL32UTF8 and UTF8. My database 
  (oracle) will be in AL32UTF8 format. Will the applications that require 
  multibyte characters work as they are functioning in UTF8 format?
  
  Would be great if anybody can gimme a comparison on AL32UTF8 and 
  UTF8
  
  Also pls list requirement of any 3rd party softwares for code page 
  conversions in case of AL32UTF8
  
  Thanks in advance,
  -Jay
  
  

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Peter Kirk

On 12/08/2003 20:28, John Cowan wrote:

Peter Kirk scripsit:

 

2) In attribute values, LF, CR, and TAB characters are normalized to 
spaces.   Not relevant here.
 

This would be relevant if it is legal for the character after LF, CR, 
and TAB to be a combining mark. Is this legal? In this case what was 
previously a defective (but legal) combining sequence would turn into a 
non-defective one, but the intended whitespace would be lost.
   

The point is that there is no such thing as an *intended* line break in
an attribute value; it will *always* be translated to a space before
the application sees it.  (More exactly, line-break characters can
be inserted into attribute values, but only with the use of a numeric
character reference such as &#xA;.)
 

Sorry, I'm confused. Are you saying that the input processing will 
translate line breaks into spaces within attribute values, unless 
inserted as &#xA;? Well, I suppose this is fair enough as it is up to 
the user not to enter garbage.
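
The behaviour can be checked with any conforming XML parser. The sketch below uses Python's standard xml.etree.ElementTree; the element name e, the attribute name v, and the helper attr_value are made up for the example. The attribute value contains a line break followed by U+0301, once as a literal newline and once as a numeric character reference.

```python
import xml.etree.ElementTree as ET

def attr_value(xml_text, name="v"):
    """Parse a one-element document and return the named attribute."""
    return ET.fromstring(xml_text).get(name)

# Literal line break inside the attribute value, followed by U+0301.
literal = attr_value('<e v="a\n\u0301b"/>')
# The same line break written as a numeric character reference.
escaped = attr_value('<e v="a&#xA;\u0301b"/>')

print(repr(literal))  # line break replaced by a space before U+0301
print(repr(escaped))  # line break preserved
```

In the first case the application receives SPACE followed by the combining mark, which is the situation being debated in this thread; in the second it receives the line break untouched.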

 

Not just a rendering glitch, I suspect. If the combining character is 
combined with the separating space, the space loses many of its 
separating functions, and perhaps keeps a confusing subset of them with 
all sorts of possibilities of error.
   

The space(s) will be used to separate individual tokens at processing
time.  No spacing diacritic (either single-character or space+combining)
is permitted in a NMTOKEN.
 

OK if this is clearly illegal, but this might restrict use of some 
languages in NMTOKEN. Would NBSP + combining be allowed?

 

At best tokens beginning with
combining characters will be unusable. At worst they will crash the 
implementation (and count on someone trying deliberately to do that!). 
   

In effect, the combining character will constitute a defective combining
sequence at the beginning of the individual token.
Stepping away from the letter of the standard for a moment, there is
no real reason to begin a NMTOKEN with a combining character.  It is
only allowed as a result of the miscegenation of SGML concepts with
Unicode ones.
In SGML's original design of tokens, they consisted of letters and digits
(and a few punctuation marks, which functioned as letters).  There were
four kinds: a NUMBER could contain only digits, a NAME could not begin
with a digit, a NUTOKEN had to begin with a digit, and a NMTOKEN had no
restrictions.  ID and IDREF had the same syntax as NAME with additional
semantics.  Later, the categories letter and digit were generalized,
by redefining the concrete syntax, to be whatever you wanted, and were
renamed name-start and name characters (technically, a name character
was a letter *or* a digit).
When SGML was simplified to produce XML, only NMTOKEN, the most general
type of token, was kept.  However, in order to keep the semantics of
letter and digit in the Unicode world, letter was extended to be any
letter and digit to be any digit *or* combining character.  That worked
well for ID and IDREF, since treating combining characters as part of
digit prevented them from appearing first, as was only sensible.
Unfortunately, NMTOKENs, since there were no restrictions, became able
to begin with a combining character, though that made no real sense.
To write in a restriction would make it impossible to specify XML's
concrete syntax in SGML terms, which did not allow for three different
classes of characters within tokens.  So we wound up with a basically
useless capability that if used will only cause trouble.
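
As a rough illustration of how an application might guard against this case, here is a sketch that splits a normalized NMTOKENS-style value on spaces and flags any token whose first character is a combining mark. The function names are hypothetical, and the test for "combining character" uses Unicode general categories starting with M from Python's unicodedata module, which is only an approximation of the CombiningChar production in XML 1.0.

```python
import unicodedata

def starts_with_combining_mark(token):
    """True if the first character has a Mark category (Mn, Mc or Me).

    Assumption: this approximates, but does not exactly match, the
    CombiningChar production of XML 1.0.
    """
    return bool(token) and unicodedata.category(token[0]).startswith("M")

def split_tokens(value):
    """Split an already-normalized attribute value on U+0020 only."""
    return [t for t in value.split(" ") if t]

# Middle token is HEBREW POINT HOLAM (U+05B9) followed by ALEF (U+05D0).
value = "abc \u05B9\u05D0 def"
tokens = split_tokens(value)
flags = [starts_with_combining_mark(t) for t in tokens]
print(flags)
```

A producer could use a check like this to refuse to emit such tokens, even though the grammar permits them.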
 

There is some potential for real trouble here, if one process outputs an 
NMTOKEN starting with a combining character preceded by a separating 
space, or something else which is changed into a space, and another 
process takes the new space plus combining character as a unit and so 
doesn't recognise the separation. Any hackers and virus programmers 
reading this will soon start flooding the Internet with tokens beginning 
with combining characters in the hope of crashing implementations or 
finding back doors. Of course this wouldn't have been a problem if 
Unicode had never defined space plus combining character as legal and 
meaningful. But this is not my problem!
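
To make the disagreement concrete, here is a small sketch of two hypothetical tokenizers looking at the same string. The first splits on every space, as an XML processor does. The second is an invented "cluster-aware" splitter that refuses to split at a space followed by a combining mark; it is not taken from any real implementation, and it uses the general category M as a stand-in for "combining character".

```python
import unicodedata

def split_on_spaces(value):
    """Treat every U+0020 as a separator, regardless of what follows."""
    return [t for t in value.split(" ") if t]

def split_cluster_aware(value):
    """Hypothetical splitter: a space followed by a combining mark is
    kept as part of the token instead of acting as a separator."""
    tokens = []
    current = ""
    for i, ch in enumerate(value):
        nxt = value[i + 1] if i + 1 < len(value) else ""
        glued = nxt != "" and unicodedata.category(nxt).startswith("M")
        if ch == " " and not glued:
            if current:
                tokens.append(current)
            current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens

# "abc", then space + HOLAM (U+05B9) + ALEF (U+05D0), then "def".
value = "abc \u05B9\u05D0 def"
plain = split_on_spaces(value)
clustered = split_cluster_aware(value)
print(len(plain), len(clustered))
```

The first splitter sees three tokens and the second sees two, so two processes that exchange this value would disagree about where one token ends and the next begins.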

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Pre-orders of The Unicode Standard, Version 4.0

2003-08-14 Thread Magda Danish $Unicode$

Dear Unicode and Unicore List Subscribers,

The release of the Unicode Standard, Version 4.0 is right around the
corner. There is still time to place your individual or group orders and
to get the book sent to you directly from the publisher, fresh off the
press.
Anyone placing bulk orders is highly encouraged to do so by August 20 as
this will substantially speed up the delivery time. Full members of the
Consortium receive 20% discount, Associate and Specialist members
receive 10% off the list price of $74.99.
To order, please use the book order form at
http://www.unicode.org/book/bookform.html


Regards, 
Magda Danish 
Administrative Director 
The Unicode Consortium 
650-693-3921
