Re: ISO 639 duplicate codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)

2003-07-14 Thread Mark Davis
First, you should check again, since a significant amount of work was
done in modularization in 2.6.

Second, the phrase "IBM forgot to modularize ICU" is misleading, at
the least. Unlike some people, who appear to have unbounded time and
energy for, say, writing emails, we have to carefully pick and choose
where we spend our time. Whether very fine-grained modularization is
important depends a great deal on the client's requirements, and must
be traded off against the many other things we could be doing with our
time.

Third, ICU4J is a source product. Saying that "it is impossible to
integrate the ICU's Normalize..." is also misleading, since one can
clearly modify source to remove dependencies on code one doesn't want
to include, if it is not core to the functionality. (Of course, the
amount of effort required may vary.) And transliterators are
not, in any event, required for Normalization.

Mark
__
http://www.macchiato.com
  Eppur si muove 

- Original Message - 
From: Philippe Verdy [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Monday, July 14, 2003 11:13
Subject: Re: ISO 639 duplicate codes (was: Re: Ligatures in Turkish
and Azeri, was: Accented ij ligatures)


 On Monday, July 14, 2003 5:34 AM, Mark Davis [EMAIL PROTECTED] wrote:

  ...
   Of course
   Java already includes some parts of ICU, but other things in
   ICU4J are difficult now to integrate in Java, simply because IBM
   forgot to modularize ICU so that it can be integrated slowly.
   Accepting ICU4J as part of the core is a big decision,
   because ICU4J is quite large, and there are certainly Java
   developers who would not accept having 1 additional MB of data
   and classes loaded in each JVM (particularly because the integration
   of ICU would affect a lot of core classes for the Java2 platform
   now also used for small devices).
  ...
   For example, it is impossible to integrate the ICU's Normalizer
   class in Java without also importing the UChar class and all its
   related services for UString, such as transliterators, and
  ...
 
  You are very misinformed about ICU4J.

 I have tried several times to do it. It does not work: you may
 effectively remove some tables you don't need, but trying
 to extract just the normalizer is a real nightmare. I tried it
 in the past and abandoned it as too tricky to maintain. I
 retried it recently (one month ago, from its CVS source) and
 this was even worse than the first time.

 I know that there was a recent announcement (less than 1
 month ago) of its modularization, but it's true that I did not
 check the new modularized sources. So I still use ICU4J
 only when I can accept the whole package,
 as maintaining a stripped-down customization is too tricky.

 But maybe this has changed; I just updated my ICU sources
 from CVS. I'll recheck it to see if an "ICU Light" version can be
 created (which would keep only the core features, without the
 support for tailoring rules compiled at run-time).

 -- 
 Philippe.
 Spam is not tolerated: any unsolicited message will be
 reported to your Internet service providers.







Re: ISO 639 duplicate codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)

2003-07-13 Thread Mark Davis
...
 Of course
 Java already includes some parts of ICU, but other things in
 ICU4J are difficult now to integrate in Java, simply because IBM
 forgot to modularize ICU so that it can be integrated slowly.
 Accepting ICU4J as part of the core is a big decision,
 because ICU4J is quite large, and there are certainly Java
 developers who would not accept having 1 additional MB of data and
 classes loaded in each JVM (particularly because the integration
 of ICU would affect a lot of core classes for the Java2 platform
 now also used for small devices).
...
 For example, it is impossible to integrate the ICU's Normalizer
 class in Java without also importing the UChar class and all its
 related services for UString, such as transliterators, and
...

You are very misinformed about ICU4J.

Mark
__
http://www.macchiato.com
  Eppur si muove 

- Original Message - 
From: Philippe Verdy [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Saturday, July 12, 2003 14:45
Subject: Re: ISO 639 duplicate codes (was: Re: Ligatures in Turkish
and Azeri, was: Accented ij ligatures)


 On Saturday, July 12, 2003 4:17 PM, Jony Rosenne [EMAIL PROTECTED] wrote:

  What has "iw" to do with Hebrew?
 
  I wasn't involved with the change, but I'm glad it was done. Java and
  other systems probably still use it because they never bothered to
  check the latest version of 639. I know for certain that this was the
  case with one of the major computer vendors.

 In the case of Java, I don't think so. Sun has certainly kept the
 language code simply to avoid breaking existing Hebrew localizations
 of software written in Java, probably while waiting for a better way
 to locate locales than the fixed locale path resolution algorithm
 that has been part of its core classes since the beginning.

 As long as the Java core classes do not use a locale resolver that
 allows tuning the resolution algorithm used to load resource bundles,
 while also maintaining compatibility with existing software that
 assumes that Hebrew resources are loaded with the "iw" language code,
 Sun will not change this code.

 In IBM ICU4J, there is such an extended resolver, but Sun takes a
 long time to approve such proposals, which must first be accepted,
 documented, balloted and voted on in its JCP program. Of course
 Java already includes some parts of ICU, but other things in
 ICU4J are difficult now to integrate in Java, simply because IBM
 forgot to modularize ICU so that it can be integrated slowly.
 Accepting ICU4J as part of the core is a big decision,
 because ICU4J is quite large, and there are certainly Java
 developers who would not accept having 1 additional MB of data and
 classes loaded in each JVM (particularly because the integration
 of ICU would affect a lot of core classes for the Java2 platform
 now also used for small devices).

 For example, it is impossible to integrate the ICU's Normalizer
 class in Java without also importing the UChar class and all its
 related services for UString, such as transliterators, and
 advanced features such as the UCA tailoring rules run-time
 compiler. Some ICU open-source contributors, as well as its users,
 now seem to think that the modularization of ICU is an important but
 complex project.

 -- 
 Philippe.
 Spam is not tolerated: any unsolicited message will be
 reported to your Internet service providers.







Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-12 Thread Peter_Constable
 Where does the fact of saying that a Grapheme Disjoiner...

The character you should be referring to is not a new character GDJ, but 
rather is the existing ZWNJ, the functions of which include prevention of 
a ligature.
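
As a small illustration of the point about ZWNJ (a sketch only; the helper name `without_ligature` is hypothetical and not taken from the thread), ZERO WIDTH NON-JOINER is U+200C and can be placed between two letters to request that no ligature be formed. Whether a renderer honours the request depends on the font and layout engine.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def without_ligature(left: str, right: str) -> str:
    """Join two strings with ZWNJ between them to request no ligature."""
    return left + ZWNJ + right


text = without_ligature("f", "i")
# Show the character names so the invisible ZWNJ can be seen in the output.
print([unicodedata.name(ch) for ch in text])
```

The encoded text stays three code points long; only the rendering request changes.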



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485




Re: ISO 639 duplicate codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)

2003-07-12 Thread Philippe Verdy
On Saturday, July 12, 2003 6:51 AM, Doug Ewell [EMAIL PROTECTED] wrote:

 Philippe Verdy verdy_p at wanadoo dot fr wrote:
 
  Good luck with ISO language codes, which do not even
  define them, and contain many duplicate codes even in
  the Alpha-2 space (he/iw, in/id), or imprecise codes
  sometimes matching very imprecise families of languages
  that overlap other language codes...
 
 The codes "iw" for Hebrew and "in" for Indonesian were deprecated
 FOURTEEN YEARS AGO.  It is not accurate or fair to refer to them as
 "duplicates" of "he" and "id".  The Registration Authority deprecates
 such codes, rather than deleting them, for backward compatibility with
 any data that might contain the old codes.

I was also sure that "iw" was not used today, until I found that it is
still used in Java on Windows, for legacy reasons... A resource
bundle created for Hebrew with the code "he" was simply... ignored. So I
had to rename it to "iw".

Unfortunately, on Linux or various Unixes the correct code to use for
locales varies, and it comes from the user-environment settings, which
are usually set up by a system profile... Users who want to benefit
from existing Hebrew locales will constantly need to switch
between "he" and "iw". The usual installation workaround is still
to create a file link between the "he" and "iw" resources, so that both
can be used.

I was really disappointed when I saw that these legacy language codes
could not be simplified the way we might think, by ignoring "iw" and
"in". Even today, Java does not offer a way to create links at runtime
to resolve locales with equivalent ids, without duplicating resources or
creating special rules such as: if (code.equals("he") || code.equals("iw"))
(don't forget that Java also has run-time resources with no files)...
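
One way to avoid scattering such special rules is to canonicalize language codes once, before any comparison. The following is a minimal sketch under stated assumptions: the table and the function names `canonical_language` and `same_language` are hypothetical, and this is not how Java or ICU4J resolves locales. The three pairs (iw/he, in/id, ji/yi) are the ones mentioned in this thread.

```python
# Deprecated ISO 639 two-letter codes mapped to their replacements.
# Only the pairs mentioned in this thread are listed.
DEPRECATED_TO_CURRENT = {"iw": "he", "in": "id", "ji": "yi"}


def canonical_language(code: str) -> str:
    """Return the current code for a possibly deprecated language code."""
    code = code.lower()
    return DEPRECATED_TO_CURRENT.get(code, code)


def same_language(a: str, b: str) -> bool:
    """Compare two language codes after canonicalization."""
    return canonical_language(a) == canonical_language(b)


print(same_language("he", "iw"))
```

With this in place, callers compare canonical codes instead of testing for each legacy spelling by hand.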




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-12 Thread Peter Kirk
On 11/07/2003 11:18, Philippe Verdy wrote:

# T: special case for uppercase I and dotted uppercase I
#- For non-Turkic languages, this mapping is normally not used.
#- For Turkic languages (tr, az), this mapping can be used instead of the normal mapping for these characters.
 

snip

Is that what is called a character subset for a scripted language family? Well I don't like the term "Turkic" to name it. I prefer the more common "Altaic Latin alphabet", seen as a standard subset of the Latin script, with additional properties.

Maybe Unicode should not try to use language codes for families of languages, but it could define representative subsets of characters which may contain characters from several scripts, but would be minimized according to the tradition of a family of languages. Such families seem evident from the current ISO-8859-* and Mac/Windows/DOS charsets.

-- Philippe.

 

Thank you, Philippe. Well, I am glad to read "not normally used" rather 
than "must not be used", as this allows mapping T to be used for other 
languages when appropriate.

I also don't like the word "Turkic" here. This is a linguistic term for a 
language family, see 
http://www.ethnologue.com/show_family.asp?subid=710. Turkish and Azeri 
are Turkic languages, but there are many Turkic languages which don't 
use this case mapping, either because they use other alphabets 
(Cyrillic, Arabic, occasionally Hebrew, perhaps even Greek) or because 
they use a Latin alphabet with the regular case mapping as in Uzbek and 
Turkmen. There are also some non-Turkic minority languages which need 
the T case mapping. "Altaic Latin alphabet" is a reasonable alternative, 
although again "Altaic" is a language family name, covering Turkic, 
Mongolian and Tungus, see 
http://www.ethnologue.com/show_family.asp?subid=709, and as far as I 
know mapping T is not needed for any Mongolian or Tungusic languages.

Does anyone know of a good resource on the web, or elsewhere, listing 
the alphabets used for different languages around the world? I know a 
project was attempted a few years ago at least for Europe. It would be 
useful to have this kind of data available somewhere even with no 
official status.

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-12 Thread Michael Everson
At 03:25 -0700 2003-07-12, Peter Kirk wrote:

Does anyone know of a good resource on the web, or elsewhere, 
listing the alphabets used for different languages around the world? 
I know a project was attempted a few years ago at least for Europe. 
It would be useful to have this kind of data available somewhere 
even with no official status.
http://www.evertype.com/alphabets
--
Michael Everson * * Everson Typography *  * http://www.evertype.com


Re: ISO 639 duplicate codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)

2003-07-12 Thread Patrick Andries


On Saturday, July 12, at 6:51, Doug Ewell [EMAIL PROTECTED] wrote:

 The codes "iw" for Hebrew and "in" for Indonesian were deprecated
 FOURTEEN YEARS AGO.  It is not accurate or fair to refer to them as
 "duplicates" of "he" and "id".  The Registration Authority deprecates
 such codes, rather than deleting them, for backward compatibility with
 any data that might contain the old codes.

Just out of curiosity, why was « iw » deprecated? Seems perfectly fine to
me. And why was « he » chosen (Herero, Hemba, Hellenic Greek)?

P.A.





Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-12 Thread Peter Kirk
On 12/07/2003 04:18, Michael Everson wrote:

At 03:25 -0700 2003-07-12, Peter Kirk wrote:

Does anyone know of a good resource on the web, or elsewhere, listing 
the alphabets used for different languages around the world? I know a 
project was attempted a few years ago at least for Europe. It would 
be useful to have this kind of data available somewhere even with no 
official status.


http://www.evertype.com/alphabets
Thank you, Michael. I knew you had this information, of course, as I 
helped to provide it, but I didn't know where it was now. This is of 
course restricted to Europe as you have defined it, and is not 
exhaustive for Turkey. Also it doesn't include recent Latin alphabets 
for minority languages of Azerbaijan, as used in schools to a rather 
limited extent, perhaps because I never sent you the data.

The link to http://www.evertype.com/alphabets/azerbaijan.pdf is broken; 
and in http://www.evertype.com/alphabets/turkish.pdf the dotted capital 
I is missing, as viewed in Acrobat Reader 5.1 on Windows 2000.

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/




RE: ISO 639 duplicate codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)

2003-07-12 Thread Jony Rosenne
What has "iw" to do with Hebrew?

I wasn't involved with the change, but I'm glad it was done. Java and other
systems probably still use it because they never bothered to check the
latest version of 639. I know for certain that this was the case with one of
the major computer vendors.

Jony

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Patrick Andries
 Sent: Saturday, July 12, 2003 2:12 PM
 To: Philippe Verdy; Doug Ewell
 Cc: [EMAIL PROTECTED]
 Subject: Re: ISO 639 duplicate codes (was: Re: Ligatures in 
 Turkish and Azeri, was: Accented ij ligatures)
 
 
 
 
 On Saturday, July 12, at 6:51, Doug Ewell [EMAIL PROTECTED] wrote:
 
  The codes "iw" for Hebrew and "in" for Indonesian were deprecated 
  FOURTEEN YEARS AGO.  It is not accurate or fair to refer to them as 
  "duplicates" of "he" and "id".  The Registration Authority deprecates 
  such codes, rather than deleting them, for backward compatibility with 
  any data that might contain the old codes.
 
 Just out of curiosity, why was « iw » deprecated? Seems 
 perfectly fine to me. And why was « he » chosen (Herero, 
 Hemba, Hellenic Greek)?
 
 P.A.
 
 
 
 
 




Re: ISO 639 duplicate codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)

2003-07-12 Thread Patrick Andries

Michael Everson [EMAIL PROTECTED] wrote:

 At 08:11 -0400 2003-07-12, Patrick Andries wrote:

 Just out of curiosity, why was « iw » deprecated? Seems perfectly fine
 to me. And why was « he » chosen (Herero, Hemba, Hellenic Greek)?

 "Iwrit" (iw), being a German transliteration of the name of the Hebrew
 language, and "Jiddisch" (ji) were both thought (by someone) to be less
 suitable than the English-based "he" and "yi" which replaced them.

This is also what I concluded, but « iv » for "ivrit" could have pleased
those who thought the transliteration must be English-based (what a
strange idea!).

P. A.






Re: ISO 639 duplicate codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)

2003-07-12 Thread Philippe Verdy
On Saturday, July 12, 2003 4:17 PM, Jony Rosenne [EMAIL PROTECTED] wrote:

 What has "iw" to do with Hebrew?
 
 I wasn't involved with the change, but I'm glad it was done. Java and
 other systems probably still use it because they never bothered to
 check the latest version of 639. I know for certain that this was the
 case with one of the major computer vendors.

In the case of Java, I don't think so. Sun has certainly kept the
language code simply to avoid breaking existing Hebrew localizations
of software written in Java, probably while waiting for a better way
to locate locales than the fixed locale path resolution algorithm
that has been part of its core classes since the beginning.

As long as the Java core classes do not use a locale resolver that
allows tuning the resolution algorithm used to load resource bundles,
while also maintaining compatibility with existing software that
assumes that Hebrew resources are loaded with the "iw" language code,
Sun will not change this code.
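
To make the idea of a tunable resolver concrete, here is a rough sketch of a bundle lookup whose candidate order is supplied by a replaceable function. Everything here is an assumption for illustration: the names `candidates` and `load_bundle`, the alias table, and the `Messages_xx` naming are hypothetical and do not describe the actual Java or ICU4J APIs.

```python
from typing import Callable, Dict, List, Optional

# Legacy spellings to try in addition to the current code (from this thread).
LEGACY_ALIASES: Dict[str, List[str]] = {"he": ["iw"], "id": ["in"], "yi": ["ji"]}


def candidates(language: str, region: str = "") -> List[str]:
    """Bundle-name suffixes to try, most specific first, root bundle last."""
    langs = [language] + LEGACY_ALIASES.get(language, [])
    names: List[str] = []
    if region:
        names += [f"{lang}_{region}" for lang in langs]
    names += langs
    names.append("")  # empty suffix stands for the root bundle
    return names


def load_bundle(
    base: str,
    language: str,
    region: str,
    bundles: Dict[str, dict],
    resolver: Callable[[str, str], List[str]] = candidates,
) -> Optional[dict]:
    """Return the first bundle found along the resolver's candidate chain."""
    for suffix in resolver(language, region):
        key = f"{base}_{suffix}" if suffix else base
        if key in bundles:
            return bundles[key]
    return None


# Existing software shipped its Hebrew resources under the legacy "iw" name.
bundles = {"Messages_iw": {"hello": "shalom"}, "Messages": {"hello": "hello"}}
print(load_bundle("Messages", "he", "IL", bundles))
```

Because the resolver is a parameter, an application could swap in a different candidate order without touching the lookup itself, which is the kind of tuning the paragraph above asks for.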

In IBM ICU4J, there is such an extended resolver, but Sun takes a
long time to approve such proposals, which must first be accepted,
documented, balloted and voted on in its JCP program. Of course
Java already includes some parts of ICU, but other things in
ICU4J are difficult now to integrate in Java, simply because IBM
forgot to modularize ICU so that it can be integrated slowly.
Accepting ICU4J as part of the core is a big decision,
because ICU4J is quite large, and there are certainly Java
developers who would not accept having 1 additional MB of data and
classes loaded in each JVM (particularly because the integration
of ICU would affect a lot of core classes for the Java2 platform
now also used for small devices).

For example, it is impossible to integrate the ICU's Normalizer
class in Java without also importing the UChar class and all its
related services for UString, such as transliterators, and
advanced features such as the UCA tailoring rules run-time
compiler. Some ICU open-source contributors, as well as its users,
now seem to think that the modularization of ICU is an important but
complex project.

-- 
Philippe.
Spam is not tolerated: any unsolicited message will be
reported to your Internet service providers.




RE: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-11 Thread Kent Karlsson

 Note also: the Soft_Dotted property was created and considered
 specially for Turkish and Azeri.

Adding to the long, and unfortunately getting longer, list of misleading
statements from Philippe!  No, the reason for the Soft_Dotted property
was/is to mark which characters (regardless of language) don't
display intrinsic dot(s) above subglyph(s) when (another) combining
character above is applied to them (and to then keep the dot(s), a
combining dot above or a combining diaeresis, as appropriate, must be
used explicitly).
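
The encoding side of this can be sketched as follows (an illustration only; the variable names are hypothetical, and Python's standard `unicodedata` module does not expose the Soft_Dotted property itself). A soft-dotted base such as "i" followed by a combining acute is expected to be shown without its dot, so a dot that must stay visible has to be encoded explicitly as U+0307 before the accent.

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT
DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE

# The dot of "i" is expected to disappear under the accent.
i_acute = "i" + ACUTE
# To keep a visible dot, it is encoded explicitly before the accent.
i_dot_acute = "i" + DOT_ABOVE + ACUTE


def code_points(s: str) -> str:
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


for s in (i_acute, i_dot_acute):
    print(code_points(s), "->", code_points(unicodedata.normalize("NFC", s)))
```

The first sequence composes to the single character U+00ED under NFC, while the second stays a three-code-point sequence, so the explicit dot survives normalization.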

 In this language context the ASCII i is always rendered with a dot,
 kept also for uppercases.

I hope you don't mean to use a dotted glyph for U+0069!

B.t.w.  It is perfectly legal to use a ligature (in the TECHNICAL sense,
perhaps not the typographic sense) for <f, i> also for Turkish and
related languages, especially if the f and i would otherwise overlap.
The point is that <f, i> and <f, dotless i> must be clearly
distinguishable for these languages, and that may mean that one has to
use a TECHNICAL ligature for <f, i> having a glyph where the dot on the
i is clearly visible (the horizontal bar of the f and the top serif of
the i may still merge). That may be done by whatever means is
better-looking for that particular font, e.g. moving the loop of the f
to the left, right, or up.
(Using ZWNJ should not do that, if correctly implemented, but can
instead, mistakenly, result in overlapping f and dot-of-i glyphs, since
not even a technical ligature, IIUC (correct me if I'm wrong), would be
allowed...)

/kent k




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-11 Thread Philippe Verdy
On Friday, July 11, 2003 1:12 PM, Kent Karlsson [EMAIL PROTECTED] wrote:

  Note also: the Soft_Dotted property was created and considered
  specially for Turkish and Azeri.
 
 Adding to the long, and unfortunately getting longer, list of
 misleading statements from Philippe!  No, the reason for the
 Soft_Dotted property was/is to mark which characters (regardless of
 language) don't display intrinsic dot(s) above subglyph(s)
 when (another) combining character above
 is applied to them (and to then keep the dot(s), a combining dot above
 or a combining diaeresis, as appropriate, must be used explicitly).

I don't know how I can say things, with my limited English, without
always being accused of making misleading statements.

Correct things if you think my words may cause confusion in
their interpretation, but please don't make a show of them. I don't know
how non-native English writers can participate here if every difference
of interpretation caused by a possibly inappropriate English
term is answered with flames. This is really frustrating...

The important words in my sentence are "considered specially",
where "specially" does not imply "only". It's just that Turkish and
Azeri are already given special treatment in Unicode, which already
includes language exceptions in its technical algorithms (notably
for character foldings).

And according to this treatment, the U+0069 character is already
intended to have the semantic value of a dotted i and not a dotless
i in languages where this creates a semantic difference, so the
question of the Soft_Dotted property is more glyphic than purely
semantic, and it has a semantic behavior (at the abstract text
level where Unicode is supposed to standardize things) mostly in
case folding operations where the actual encoding of the converted
abstract text is important.

The rest of the description of the Soft_Dotted property is mostly a
recommendation for authors of fonts and text renderers, so that
they *preserve this semantic difference* in the rendered text
between the abstract letters dotted and dotless i... And this does
not affect the encoding of the abstract text or any algorithmic
transformation of the encoded abstract text.

By saying *preserve this semantic difference*, I do not imply that
U+0069 must/should have a dot above: it remains a font design
problem, out of scope for Unicode. There are certainly many ways
to preserve the semantic difference in the rendered text when this
is really appropriate (for example in Turkish and Azeri, or with a
distinct and emphasized rendering of the Turkish dot, including
in possible ligatures with other letters).

FLAME-OFF
And please, do not flame me if this message contains new
terms that also create confusion. I reread as best I can,
and there are certainly better ways to say the same thing
in English without these unintentionally confusing interpretations,
and I apologize in advance if such confusion still persists.

Accept the fact that I'm not a Unicode member and Unicode
is only one of my interests, and I have a lot of other
terminologies to work with.

If you can't accept that approximate English may
be used by participants here, and refuse to understand the
real intent of users when they write here, then have this
group moderated, but don't say it is open to discussions
from anybody using Unicode.

For normative aspects, with all exact terms, Unicode has its
web site, its publications, its data files, its working draft
documents, its technical committees, its permanent members,
its chairmen, and even bug/comment report forms to
interact with users at the normative level.
And I am sure that permanent Unicode members do not even
need this newsgroup to exchange their work on normative
documents, which are sent directly to the working committee
bureaus, or via private email, phone calls, postal mail, or
their own web sites.
Please don't expect the same level of linguistic quality here.

Also don't complain if my messages are long: the constant
criticism about what I am supposed to imply gives me no
other choice than always explaining what I mean, and this is
particularly lengthy, and really boring in a newsgroup.
/FLAME-OFF

Thanks for your patience.

-- Philippe.




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-11 Thread Peter Kirk
On 11/07/2003 05:56, Philippe Verdy wrote:

Note also: the Soft_Dotted property was created and considered
specially for Turkish and Azeri.
 

Whatever it was that was specially created or adjusted for Turkish and 
Azeri, was it specifically restricted to these two languages? These are 
I think the only relatively major languages which use the special dotted 
and dotless i case mappings. But they are also used, at least in a small 
way, for minority languages of Turkey and Azerbaijan. (Use of these 
minority languages in Turkey is illegal, but that's another matter.) 
They were used in the 1930's for many Central Asian languages, and were 
at least proposed in the 1990's for newly introduced Latin alphabets. So 
I hope that what is fixed by Unicode is the name not of two languages 
but of an extensible family of scripts.

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-11 Thread Philippe Verdy
On Friday, July 11, 2003 3:50 PM, Peter Kirk [EMAIL PROTECTED] wrote:
 So I hope that what is fixed by Unicode is the name not
 of two languages but of an extensible family of scripts.

I think you mean a family of languages?

Good luck with ISO language codes, which do not even
define them, and contain many duplicate codes even in
the Alpha-2 space (he/iw, in/id), or imprecise codes
sometimes matching very imprecise families of languages
that overlap other language codes...

Until it is demonstrated that a language needs such a fix
in the Unicode support tables, it's best to just say that these
fixes are needed for some recognized language codes,
that applications are allowed to add their own fixes or
language tailorings, and that the existing language
tailorings in Unicode databases are just non-normative
samples.

-- Philippe.




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-11 Thread Peter Kirk
On 11/07/2003 08:51, Philippe Verdy wrote:

On Friday, July 11, 2003 3:50 PM, Peter Kirk [EMAIL PROTECTED] wrote:
 

So I hope that what is fixed by Unicode is the name not
of two languages but of an extensible family of scripts.
   

I think you speak about family of languages?

Not really. A set of languages, but they are not all related in any way, 
and many of them have more than one script or alphabet so this is not 
really a property of the languages. Perhaps "set of alphabets" would be 
a better way to put it.

Good luck with ISO language codes, which do not even
define them, and contain many duplicate codes even in
the Alpha-2 space (he/iw, in/id), or imprecise codes
sometimes matching very imprecise families of languages
that overlap other language codes...
Until it is demonstrated that a language needs such a fix
in the Unicode support tables, ...
If necessary I can collect some data to demonstrate this, at least for 
some languages.

... it's best to just say that these
fixes are needed for some recognized language codes and
that applications are allowed to add their own fixes or
language tailorings, and that the existing language
tailorings in Unicode databases are just non-normative
samples.
-- Philippe.



 

Agreed. But does Unicode actually treat them as non-normative samples?

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-11 Thread Philippe Verdy
On Friday, July 11, 2003 6:43 PM, Peter Kirk [EMAIL PROTECTED] wrote:

 Agreed. But does Unicode actually treat them as non-normative samples?

Not clear here: the reference documents say that these tables are
normative for applications that want to implement conforming
case folding. But UTR#30 (character foldings) still contains many
areas marked as "to be done", so it is not clear that all folding issues
have been solved. It seems reasonable, however, that the
non-language-specific elements in the CaseFolding table are normative,
as they are computed from the UCD...

I see this comment:
[quote]
# The entries in this file are in the following machine-readable format:
#
# <code>; <status>; <mapping>; # <name>
#
# The status field is:
# C: common case folding, common mappings shared by both simple and full mappings.
# F: full case folding, mappings that cause strings to grow in length. Multiple characters are separated by spaces.
# S: simple case folding, mappings to single characters where different from F.
# T: special case for uppercase I and dotted uppercase I
#- For non-Turkic languages, this mapping is normally not used.
#- For Turkic languages (tr, az), this mapping can be used instead of the normal mapping for these characters.
#  Note that the Turkic mappings do not maintain canonical equivalence without additional processing.
#  See the discussions of case mapping in the Unicode Standard for more information.
#
# Usage:
#  A. To do a simple case folding, use the mappings with status C + S.
#  B. To do a full case folding, use the mappings with status C + F.
#
#The mappings with status T can be used or omitted depending on the desired case-folding
#behavior. (The default option is to exclude them.)
#
[/quote]

Simple Case Mapping (C+S) is not marked as "to be done" in UTR#30, but the other special 
mappings with status T are off by default (so they depend on a specific tailoring, a 
non-normative behavior if I interpret it correctly, as applications are free to use them or 
not, under unspecified conditions, i.e. here the "desired behavior").
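
The status rules quoted above (C + S for simple folding, C + F for full folding, T optional and off by default) can be sketched as a small parser. This is a minimal illustration under stated assumptions: the function names `parse`, `build_table` and `fold` are hypothetical, and `SAMPLE` is a handful of lines in the CaseFolding format rather than the full data file.

```python
from typing import Dict, Iterator, List, Tuple

# A few lines in the "code; status; mapping; # name" format.
SAMPLE = """\
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
1F88; S; 1F80; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
"""


def parse(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (source char, status, folded string) for each data line."""
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        code, status, mapping = [f.strip() for f in line.split(";")[:3]]
        target = "".join(chr(int(cp, 16)) for cp in mapping.split())
        yield chr(int(code, 16)), status, target


def build_table(lines: List[str], full: bool = True,
                turkic: bool = False) -> Dict[str, str]:
    """Select C+F (full) or C+S (simple); T entries override when requested."""
    wanted = {"C", "F" if full else "S"}
    table = {src: dst for src, status, dst in parse(lines) if status in wanted}
    if turkic:
        table.update({src: dst for src, status, dst in parse(lines)
                      if status == "T"})
    return table


def fold(text: str, table: Dict[str, str]) -> str:
    """Apply the folding table character by character."""
    return "".join(table.get(ch, ch) for ch in text)


lines = SAMPLE.splitlines()
print(fold("I", build_table(lines)), fold("I", build_table(lines, turkic=True)))
```

Keeping the T entries behind an explicit flag mirrors the file's own note that the default option is to exclude them.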

This concerns many more characters than just those used by Turkish/Azeri, and there is some 
overlap with the informative and unfinished UTR#30 reference:

(1) Simple mappings (are they normative?):

1F88; S; 1F80; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
1F89; S; 1F81; # GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI
1F8A; S; 1F82; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA AND PROSGEGRAMMENI
1F8B; S; 1F83; # GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA AND PROSGEGRAMMENI
1F8C; S; 1F84; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI
1F8D; S; 1F85; # GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA AND PROSGEGRAMMENI
1F8E; S; 1F86; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
1F8F; S; 1F87; # GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI

1F98; S; 1F90; # GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI
1F99; S; 1F91; # GREEK CAPITAL LETTER ETA WITH DASIA AND PROSGEGRAMMENI
1F9A; S; 1F92; # GREEK CAPITAL LETTER ETA WITH PSILI AND VARIA AND PROSGEGRAMMENI
1F9B; S; 1F93; # GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI
1F9C; S; 1F94; # GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA AND PROSGEGRAMMENI
1F9D; S; 1F95; # GREEK CAPITAL LETTER ETA WITH DASIA AND OXIA AND PROSGEGRAMMENI
1F9E; S; 1F96; # GREEK CAPITAL LETTER ETA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
1F9F; S; 1F97; # GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI

1FA8; S; 1FA0; # GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI
1FA9; S; 1FA1; # GREEK CAPITAL LETTER OMEGA WITH DASIA AND PROSGEGRAMMENI
1FAA; S; 1FA2; # GREEK CAPITAL LETTER OMEGA WITH PSILI AND VARIA AND PROSGEGRAMMENI
1FAB; S; 1FA3; # GREEK CAPITAL LETTER OMEGA WITH DASIA AND VARIA AND PROSGEGRAMMENI
1FAC; S; 1FA4; # GREEK CAPITAL LETTER OMEGA WITH PSILI AND OXIA AND PROSGEGRAMMENI
1FAD; S; 1FA5; # GREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA AND PROSGEGRAMMENI
1FAE; S; 1FA6; # GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
1FAF; S; 1FA7; # GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI

1FBC; S; 1FB3; # GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI
1FCC; S; 1FC3; # GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI
1FFC; S; 1FF3; # GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI

(2) Full mappings (clearly optional):

00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0149; F; 02BC 006E; # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
01F0; F; 006A 030C; # LATIN SMALL LETTER J WITH CARON

0390; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
03B0; F; 03C5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS

0587; F; 0565 0582; # ARMENIAN SMALL LIGATURE ECH YIWN

1E96; F; 0068 0331; # LATIN SMALL LETTER H WITH LINE BELOW
1E97; 
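
As a rough illustration of the difference between the two groups listed above (the S and F status fields of CaseFolding.txt), here is a minimal Python sketch. It assumes that Python's built-in str.casefold() applies the full (F) foldings, and it uses str.lower() on U+1F88 only as a stand-in for that character's simple (S) target; the helper name full_fold_expands is hypothetical.

```python
def full_fold_expands(ch):
    """Return True if full case folding turns one code point into several."""
    return len(ch.casefold()) > len(ch)


# Code points taken from the "(2) Full mappings" list above.
FULL_EXAMPLES = ["\u00df", "\u0130", "\u0149", "\u01f0"]

for ch in FULL_EXAMPLES:
    folded = ch.casefold()
    # Print the code points of the folded form, e.g. "0073 0073" for U+00DF.
    print("%04X -> %s" % (ord(ch), " ".join("%04X" % ord(c) for c in folded)))

# U+1F88 appears in the "(1) Simple mappings" list with target U+1F80.
# Its lowercase mapping is that single code point, while the full
# folding expands to two code points.
simple_target = "\u1f88".lower()
full_target = "\u1f88".casefold()
```

Whether an application wants the one-to-one or the expanding form depends on whether it can tolerate string length changes during folding.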

ISO 639 duplicate codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)

2003-07-11 Thread Doug Ewell
Philippe Verdy verdy_p at wanadoo dot fr wrote:

 Good luck with ISO language codes, which do not even
 define them, and contain many duplicate codes even in
 the Alpha-2 space (he/iw, in/id), or imprecise codes
 sometimes matching very imprecise families of languages
 overlapping other language codes...

The codes "iw" for Hebrew and "in" for Indonesian were deprecated
FOURTEEN YEARS AGO.  It is not accurate or fair to refer to them as
"duplicates" of "he" and "id".  The Registration Authority deprecates
such codes, rather than deleting them, for backward compatibility with
any data that might contain the old codes.
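
A minimal sketch of how software might honour that backward compatibility: accept the deprecated codes on input and map them to the current ones. Only the two pairs mentioned above are included; the table and function names are hypothetical, not part of any standard library.

```python
# Deprecated ISO 639 two-letter codes mentioned above, mapped to
# their current replacements. Illustrative only, not exhaustive.
DEPRECATED_ISO639 = {
    "iw": "he",  # Hebrew
    "in": "id",  # Indonesian
}


def current_language_code(code):
    """Return the current code for a possibly deprecated two-letter code."""
    code = code.lower()
    return DEPRECATED_ISO639.get(code, code)
```

Old data keeps working, while anything written out afterwards uses only the current codes.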

The part about codes for language families overlapping other codes for
specific languages is, regrettably, true.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Philippe Verdy
On Thursday, July 10, 2003 12:08 PM, Peter Kirk [EMAIL PROTECTED] wrote:

 On 1st July Philippe Verdy wrote:
 
  If fonts still want to display dots on these characters, that's a
  rendering problem: there already exists a lot of fonts used for
  languages other than Turkish and Azeri, which do not display any
  dot on a lowercase ASCII i or j (dotted), and display a dot on their
  uppercase ASCII versions (normally not dotted with classic fonts)...
  
  The absence or presence of these dots is then seen as decorative
  even if these fonts are not suitable for Turkish and Azeri, but
  this is clearly not an encoding problem in the Unicode encoded text,
  and not a problem either for case conversions.
  
 
 Turkish and Azeri do not use the ij ligature. The sequences i - j and
 dotless i - j do occur (rarely, as j is a rare letter in both
 languages) but are treated as separate letters.

I know, and the quoted paragraph was not about the ij ligature
but about the separate dotted/dotless i/I letters. There are
decorative fonts in which the lowercase ASCII (dotted) i codepoint
uses a dotless glyph, or the uppercase ASCII (dotless) I codepoint
uses a dotted glyph (some fonts ligate the dot with decorative
curves). These fonts are effectively not suitable for Turkish and
Azeri.

 In Turkish and Azeri the sequences f - i and f - dotless i both occur,
 and are fairly frequent. So it is inappropriate in these languages to
 use fi ligatures in which the dot on the i is lost or invisible, at
 least where the second character is a dotted i. Has any thought been
 given to this issue? Is it possible to block such ligation on a
 language-dependent basis?

Isn't there a Grapheme Disjoiner format control character to force the
absence of a ligature like fi, i.e. f, GDJ, i?

 Also it is certainly possible that in dictionaries etc in these
 languages stress might be marked by an accent on the vowel - as
 certainly in the older Cyrillic Azeri just as in Bulgarian as just
 posted. In this case the dot should not be removed from the dotted i
 when the stress mark is added, so that the distinction from dotless i
 is not lost. Has that issue been addressed? (In my Latin script Azeri
 dictionary stress is marked by a spacing grave accent before the
 vowel, but this may have been done precisely to work around this
 problem.) 

This is part of the proposal for review: an explicit combining dot-above
diacritic can be inserted between the normal (soft-dotted) base letter
and the above diacritic (with class 230):
latin-small-i, dot-above, acute-accent
cyrillic-small-je, dot-above, grave-accent

-- Philippe.



Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Peter Kirk
On 10/07/2003 08:21, Philippe Verdy wrote:

In Turkish and Azeri the sequences f - i and f - dotless i both occur,
and are fairly frequent. So it is inappropriate in these languages to
use fi ligatures in which the dot on the i is lost or invisible, at
least where the second character is a dotted i. Has any thought been
given to this issue? Is it possible to block such ligation on a
language-dependent basis?
   

Isn't there a Grapheme Disjoiner format control character to force the
absence of a ligature like fi, i.e. f, GDJ, i?

Maybe, but it is hardly realistic to expect all existing Turkish and 
Azeri text to be recoded to insert a character in the middle of each f - 
i sequence.

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Philippe Verdy
On Thursday, July 10, 2003 5:41 PM, Peter Kirk [EMAIL PROTECTED] wrote:

  Isn't there a Grapheme Disjoiner format control character to
  force the absence of a ligature like fi, i.e. f, GDJ, i?
  
 Maybe, but it is hardly realistic to expect all existing Turkish and
 Azeri text to be recoded to insert a character in the middle of each
 f - i sequence.

Note also: the Soft_Dotted property was created and considered
specially for Turkish and Azeri.

In this language context the ASCII i is always rendered with a dot,
which is also kept in uppercase.

The other solution would be to use f, i, dot-above: the forced dot-above
diacritic avoids the ligature, and the sequence is rendered by two glyphs
for f and i, dot-above, i.e. the glyph for f, and the force-dotted
glyph for i.

Its uppercase conversion causes no problem:

F, I, dot-above
= F + I, dot-above
= F + I-dot-above

As well as additional stress diacritics:

f, i, dot-above, acute-accent
= f + i, dot-above, acute-accent
F, I, dot-above, acute-accent
= F + I, dot-above, acute-accent
= F + I-dot-above, acute-accent
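
The uppercase derivations above can be checked with a short Python sketch. It assumes the default (untailored, locale-independent) case mapping of str.upper() followed by NFC composition via the standard unicodedata module; the helper name upper_nfc is hypothetical.

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT


def upper_nfc(text):
    """Uppercase with default mappings, then compose to NFC."""
    return unicodedata.normalize("NFC", text.upper())


# f, i, dot-above
plain = upper_nfc("fi" + DOT_ABOVE)
# f, i, dot-above, acute accent
stressed = upper_nfc("fi" + DOT_ABOVE + ACUTE)

for label, value in (("plain", plain), ("stressed", stressed)):
    print(label, " ".join("U+%04X" % ord(c) for c in value))
```

In both cases I followed by dot-above composes to U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE, and the acute accent, if present, stays as a separate combining mark.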

-- Philippe.




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Peter Kirk
On 10/07/2003 09:34, Stefan Persson wrote:

Peter Kirk wrote:

 Maybe, but it is hardly realistic to expect all existing Turkish and 
Azeri text to be recoded to insert a character in the middle of each f 
- i sequence.

Isn't most Turkish and Azeri text coded as ISO-8859-9 and similar 
code pages?  In that case, it would be enough to add the proper 
disjoiners to the proper Unicode conversion tables.

Stefan


There is no existing code page covering Azeri Latin, so everything is in 
Unicode or in one of a huge variety of custom solutions. See 
http://www.azer.com/aiweb/categories/magazine/81_folder/81_articles/81_standardfonts.html, 
and the article "The Land of Azeri Fonts: It's a Jungle Out There" in 
the same magazine issue, unfortunately not online, which summarises 20 
or so custom encodings all in current use.

Anyway, I understood from the recent discussion of Hebrew that it is 
Unicode policy not to do anything which could theoretically invalidate 
existing text even if it could be proved that no such text existed.

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Stefan Persson
Peter Kirk wrote:

 Maybe, but it is hardly realistic to expect all existing Turkish and 
Azeri text to be recoded to insert a character in the middle of each f - 
i sequence.

Isn't most Turkish and Azeri text coded as ISO-8859-9 and similar code 
pages?  In that case, it would be enough to add the proper disjoiners to 
the proper Unicode conversion tables.

Stefan




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Philippe Verdy
On Thursday, July 10, 2003 6:42 PM, Peter Kirk [EMAIL PROTECTED] wrote:

 Anyway, I understood from the recent discussion of Hebrew that it is
 Unicode policy not to do anything which could theoretically invalidate
 existing text even if it could be proved that no such text existed.

How does saying that a Grapheme Disjoiner can be used in Turkish, to keep the f from 
collapsing the dot above a following lowercase i, invalidate existing text?

This does not change anything: existing texts can still produce ligatures in a 
renderer, unless explicitly said to not do so with a Grapheme Disjoiner, or the 
renderer is specially tuned to support the Turkish/Azeri languages. Existing texts do 
not need to be reencoded, if they are already correctly labelled with their language.

The absence of such a language specifier will never forbid a renderer from choosing a fi 
ligature if available, unless these renderers are made conforming by correctly 
interpreting the Grapheme Disjoiner to mean "break the grapheme cluster here": display 
the previous character(s), then the Grapheme Disjoiner itself, which can be rendered 
as a non-spacing empty glyph, then the rest of the string...

I'm still convinced that a ligature is still possible for a Turkish f, dotted-i 
sequence, using f, i, dot-above. The ligature would apply to the middle bar of the 
f joined with the top serif of the i, but the top-right loop of the f would simply 
be a small horizontal bar, disjoined from the dot still present on the i.

The same ligature could be used for the encoded sequence f, dotless-i, so an actual 
font would render the glyphs for f, i, dot-above as a base ligature glyph for f, 
dotless-i (with a top horizontal bar for the f part), and add separately the 
dot-above glyph kerned into the existing f-dotless-i ligature.

To force disable this last ligature, we would use the encoded sequence f, GDJ, 
dot-less-i

According to Unicode the sequence i, dot-above has always been valid, even though it 
apparently has the same dotted glyph for all languages. It differs only in the fact 
that the explicit dot-above removes the Soft_Dotted property of the previous i to 
make it dotless, followed by a forced diacritic.

So the encoded sequence i, dot-above is now made equivalent (for rendering 
purposes) to dotless-i, dot-above (although they are not canonically equivalent per 
UAX#15: NFC/D) and not equivalent to an isolated i (not followed by above 
diacritics)...

-- Philippe.



Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Kenneth Whistler
Peter Kirk asked:

  In Turkish and Azeri the sequences f - i and f - dotless i both occur,
  and are fairly frequent. So it is inappropriate in these languages to
  use fi ligatures in which the dot on the i is lost or invisible, at
  least where the second character is a dotted i. Has any thought been
  given to this issue? Is it possible to block such ligation on a
  language-dependent basis?
 

and Philippe Verdy responded with another question:

 Isn't there a Grapheme Disjoiner format control character to force the
 absence of a ligature like fi, i.e. f, GDJ, i?

The answer to Philippe's rejoinder question is no, there is not
a Grapheme Disjoiner format control character.

What Philippe has in mind, however, is covered in the standard
by the interaction of the joiner and non-joiner characters
with ligature control:

U+200C ZERO WIDTH NON-JOINER is intended to break both cursive
connections and ligatures in rendering.

ZWNJ requests that glyphs in the lowest available category
(for the given font) be used.

  -- Unicode 4.0, Section 15.2, Layout Controls

The categories referred to, from lowest to highest, are:

1. unconnected
2. cursively connected
3. ligated

As Peter pointed out, however, it is neither expected nor reasonable
to have to go back through and drop in ZWNJ's at every relevant
location in existing Turkish or Azeri text, simply to prevent
fi ligation. Such use of ZWNJ is intended to be exceptional,
to deal with special cases.
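
For concreteness, this is roughly what such a recoding pass would look like; it is shown only to make the cost visible, not as a recommendation. The function name is hypothetical, and the sketch naively treats every f, i pair as a ligature candidate.

```python
ZWNJ = "\u200c"  # U+200C ZERO WIDTH NON-JOINER


def break_fi_ligatures(text):
    """Insert ZWNJ between every 'f' and a following 'i'."""
    return text.replace("fi", "f" + ZWNJ + "i")


sample = "fikir"  # sample word containing an f, i sequence
recoded = break_fi_ligatures(sample)
# The recoded string is one code point longer per f, i pair, and no longer
# compares equal to the original, which affects searching and sorting.
```

Every such pair changes the stored text, which is one reason a font-level or style-level switch is the more practical general solution.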

The general solutions depend either on use of fonts (or more
generally, renderers) which block such ligation across the
board. It is my understanding that modern font technologies
allow the choice of ligation to essentially be a style selection
for the font. How well various applications take advantage
of that and make the choice available easily to end users may
be an open issue still, but the fundamental pieces to do this
correctly are available.

--Ken




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Philippe Verdy
On Thursday, July 10, 2003 8:37 PM, Kenneth Whistler [EMAIL PROTECTED] wrote:

 Peter Kirk asked:
 
   In Turkish and Azeri the sequences f - i and f - dotless i both
   occur, and are fairly frequent. So it is inappropriate in these
   languages to use fi ligatures in which the dot on the i is lost
   or invisible, at least where the second character is a dotted i.
   Has any thought been given to this issue? Is it possible to block
   such ligation on a language-dependent basis?
  
 
 and Philippe Verdy responded with another question:
 
  Isn't there a Grapheme Disjoiner format control character to
  force the absence of a ligature like fi, i.e. f, GDJ, i?
 
 The answer to Philippe's rejoinder question is no, there is not
 a Grapheme Disjoiner format control character.

I did not refer to a specific unicode character, I knew that there
is one already dedicated, but I did not want to comment about
this choice.

There's no contradiction. The Grapheme Disjoiner, for you is
ZWNJ. OK.

And I did not want to promote any change in any legally and
legacy encoded text, only to suggest ways to solve the
apparent rendering problem in Turkish, when the f, i
encoded character pair may be badly rendered. For the actual
rendering, selecting a fi ligature is not appropriate for
Turkish, and in fact the canonically decomposed character
has no linguistic ambiguity in Turkish.

So whatever the fi encoded codepoint designates, it is not
the fi ligature glyph but really two characters, whose ligation
may still be performed according to language context.

A font that would automatically select a fi ligature to represent
a sequence of f, i codepoints, from the fact that the fi
codepoint is canonically equivalent is probably defective and not
conforming. Such selection of ligature must be put under the
control of the renderer with additional markup, which can in fact
select among three ligatures in Turkish: the fi ligature glyph
where the f is ligated with the dot above i (normal ligature for
languages other than Turkish/Azeri), the f-dotted-i and
f-dotless-i ligatures for Turkish/Azeri.

Markup is necessary to select the appropriate glyph, or this
can be selected by using the Grapheme Disjoiner (ZWNJ)
or the Grapheme Joiner (ZWJ) in addition to the use of
an i or dotless-i codepoint possibly followed by the
dot-above diacritic. All this enrichment of text is assumed
to be under the control of the markup added to the original
text which does not need to specify whether ligatures should
or should not be used.



Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread John Cowan
Philippe Verdy scripsit:

 How does saying that a Grapheme Disjoiner can be used in Turkish,
 to keep the f from collapsing the dot above a following lowercase i,
 invalidate existing text?

It is settled that ZWNJ is the correct character to break ligatures.
ZWJ means make a ligature if you can; if not, shape characters to
joining forms if you can; if not that either, do nothing.  ZWNJ means
break ligatures, if any, and shape characters to non-joining forms,
if possible.

 I'm still convinced that a ligature is still possible for a Turkish f,
 dotted-i sequence, using f, i, dot-above. The ligature would apply
 to the middle bar of the f joined with the top serif of the i,
 but the top-right loop of the f would simply be a small horizontal bar,
 disjoined from the dot still present on the i.

Yes, theoretically.  Whether that is good Turkish typography is a different
question, which AFAIK prefers simply an f-glyph followed by an i-glyph with
no ligaturing.

IIRC, Portuguese traditional typography also avoids the fi-ligature, even though
the language has no dotless-i.

 The same ligature could be used for the encoded sequence f, dotless-i, 

I doubt that any font has a ligature for this combination at all.

 So the encoded sequence i, dot-above is now made equivalent
 (for rendering purposes) to dotless-i, dot-above (although they are
 not canonically equivalent per UAX#15: NFC/D) and not equivalent
 to an isolated i (not followed by above diacritics)...

There is no guarantee that the native i dot looks the same as the dot above
in a given font (it may have different vertical kerning or even a different
shape), nor is there any guarantee that the i with its dot removed looks
the same as the dotless-i.
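
The lack of canonical equivalence mentioned in the quoted text can be confirmed with a small Python sketch using the standard unicodedata module; the helper name canonically_equivalent is hypothetical. It says nothing about glyph shapes, which depend on the font.

```python
import unicodedata


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


DOT_ABOVE = "\u0307"   # COMBINING DOT ABOVE
DOTLESS_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I

# i, dot-above versus dotless-i, dot-above
same_lower = canonically_equivalent("i" + DOT_ABOVE, DOTLESS_I + DOT_ABOVE)
# For comparison: capital I with dot above versus I, dot-above
same_upper = canonically_equivalent("\u0130", "I" + DOT_ABOVE)
```

Only the uppercase pair is canonically equivalent, so any equivalence between the two lowercase sequences would have to be established at the rendering level.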

-- 
John Cowan  www.ccil.org/~cowan  www.reutershealth.com  [EMAIL PROTECTED]
'My young friend, if you do not now, immediately and instantly, pull
as hard as ever you can, it is my opinion that your acquaintance in the
large-pattern leather ulster' (and by this he meant the Crocodile) 'will
jerk you into yonder limpid stream before you can say Jack Robinson.'
--the Bi-Coloured-Python-Rock-Snake



Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Peter Kirk
On 10/07/2003 11:37, Kenneth Whistler wrote:

As Peter pointed out, however, it is neither expected nor reasonable
to have to go back through and drop in ZWNJ's at every relevant
location in existing Turkish or Azeri text, simply to prevent
fi ligation. Such use of ZWNJ is intended to be exceptional,
to deal with special cases.
The general solutions depend either on use of fonts (or more
generally, renderers) which block such ligation across the
board. It is my understanding that modern font technologies
allow the choice of ligation to essentially be a style selection
for the font. How well various applications take advantage
of that and make the choice available easily to end users may
be an open issue still, but the fundamental pieces to do this
correctly are available.
 

Thank you, Ken. I think you get my point. I am not so interested in 
character level mechanisms for disabling the ligature as in higher level 
features. But I guess I am really thinking in terms of markup, so 
outside the domain of Unicode, which might disable ligation.

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Laurentiu Iancu
See also
http://www.microsoft.com/typography/developers/opentype/detail.htm
which explains how ligatures can be turned off on a language-dependent basis.

Laurentiu


Peter Kirk asked:

 In Turkish and Azeri the sequences f - i and f - dotless i both occur,
 and are fairly frequent. So it is inappropriate in these languages to
 use fi ligatures in which the dot on the i is lost or invisible, at
 least where the second character is a dotted i. Has any thought been
 given to this issue? Is it possible to block such ligation on a
 language-dependent basis?




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Kenneth Whistler

  and Philippe Verdy responded with another question:
  
   Isn't there a Grapheme Disjoiner format control character to
   force the absence of a ligature like fi, i.e. f, GDJ, i?
  
  The answer to Philippe's rejoinder question is no, there is not
  a Grapheme Disjoiner format control character.
 
 I did not refer to a specific unicode character, I knew that there
 is one already dedicated, but I did not want to comment about
 this choice.
 
 There's no contradiction. The Grapheme Disjoiner, for you is
 ZWNJ. OK.

<ad hominem>

Every so often, Philippe, it would be refreshing if, when someone
points out an error in your claims about the Unicode Standard,
you would simply acknowledge the error and discontinue
making the claim, instead of coming back trying to claim that
the error was just another way of being right.

</ad hominem>

There is a separate character, U+034F COMBINING GRAPHEME JOINER,
which is the grapheme joiner, abbreviation CGJ in the
standard. That character has nothing to do with ligation
control. There has also been debate, on several occasions,
within the UTC, regarding the advisability of encoding
a grapheme non-joiner, as a pair with the grapheme joiner.
But again, such a grapheme non-joiner -- which has *not* been
encoded, by the way -- would have nothing to do with ligation
control.

So it is a disservice to the list, perpetuating confusion, to
invent the term Grapheme Disjoiner and use it in a series
of notes regarding ligation control, when the standard already
designates the ZWJ and the ZWNJ as the relevant controls
related to ligation control.

So it is not that for me the Grapheme Disjoiner is the ZWNJ;
rather, it is for the Unicode Standard that the ZWNJ is the
designated, standardized format control for ligation control
of the sort you are talking about. Please learn the terminology
and make correct use of it.

 A font that would automatically select a fi ligature to represent
 a sequence of f, i codepoints, from the fact that the fi
 codepoint is canonically equivalent

U+FB01 LATIN SMALL LIGATURE FI is not a *canonical* equivalent to
f, i; it is *compatibility* equivalent. That is an important
distinction.
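
The distinction can be seen directly with Python's standard unicodedata module. This is only a sketch; the two helper names are hypothetical.

```python
import unicodedata

FI_LIGATURE = "\ufb01"  # U+FB01 LATIN SMALL LIGATURE FI


def has_canonical_decomposition(ch):
    """True if NFD (canonical decomposition) changes the character."""
    return unicodedata.normalize("NFD", ch) != ch


def has_compatibility_decomposition(ch):
    """True if NFKD (compatibility decomposition) changes the character."""
    return unicodedata.normalize("NFKD", ch) != ch


# The raw decomposition field carries the <compat> tag for U+FB01.
raw_mapping = unicodedata.decomposition(FI_LIGATURE)
nfd_form = unicodedata.normalize("NFD", FI_LIGATURE)
nfkd_form = unicodedata.normalize("NFKD", FI_LIGATURE)
```

NFD leaves U+FB01 untouched, while NFKD replaces it with the two letters f and i, so only compatibility normalization relates the ligature code point to the plain sequence.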

 is probably  defective and not
 conforming. 

Wrong. There is nothing nonconformant about fonts automatically
ligating f, i (or any other sequence). Such automatic
ligation may not always be appropriate or the desired result
for an end user, but that has nothing to do with the conformance
requirements of the standard.

 Such selection of ligature must be put under the
 
 
Wrong. must --> may

 control of the renderer with additional markup, which can in fact
 select among three ligatures in Turkish: the fi ligature glyph
 where the f is ligated with the dot above i (normal ligature for
 languages other than Turkish/Azeri), the f-dotted-i and
 f-dotless-i ligatures for Turkish/Azeri.

It is unclear that any such ligatures are required or desirable
for Turkish/Azeri, in any case.

 Markup is necessary to select the appropriate glyph, or this
  ^^^
  
Wrong. A higher-level protocol is needed, and that may involve
markup. But the Turkish requirements can equally well be
met by simply setting "no ligature" style settings for
the relevant fonts.

 can be selected by using the Grapheme Disjoiner (ZWNJ)
   
   
Wrong term. See above.

 or the Grapheme Joiner (ZWJ) in addition to the use of
 ^
 
Wrong term. See above.

 an i or dotless-i codepoint possibly followed by the
 dot-above diacritic.

And in any case, it is inadvisable to be suggesting use of
ZWJ and ZWNJ in this way to solve the problem of assuring that
Turkish texts don't ligate inappropriately on rendering. 

 All this enrichment of text is assumed
 to be under the control of the markup added to the original
 text which does not need to specify whether ligatures should
 or should not be used.

This last clause I agree with. But the implication that
markup has to be added to Turkish text in order to get it
to render correctly regarding ligature usage is incorrect.
Adding markup to the text is adding to the original text
as surely as adding ZWNJ format controls would be. In any
case it is unnecessary, since alternatives exist which simply
specify suppression (or use) of ligatures stylistically in
the fonts.

--Ken




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread James H. Cloos Jr.
 Peter == Peter Kirk [EMAIL PROTECTED] writes:

Peter Maybe, but it is hardly realistic to expect all existing
Peter Turkish and Azeri text to be recoded to insert a character in
Peter the middle of each f - i sequence.

But a lot of it already does do that.  In TeX, Turkish uses f{}i to
block the (font's) ligation.  roff does something similar.  I'm
sure all of the other text-source publishing systems do as well.

Even the WYSI(NR)WYG tools must be doing something to accomplish that.

-JimC

 NR  Not Really