Re: Ligatures in Turkish and Azeri

2003-07-12 Thread António Martins-Tuválkin
On 2003.07.10, 20:34, John Cowan <[EMAIL PROTECTED]> wrote:

> IIRC, Portuguese traditional typography also avoids the fi-ligature,
> even though the language has no dotless-i.

Just browsed some old book with that in mind and I cannot really
corroborate. I've even seen some other more exotic ligatures, such as
"st" and "ct".

Maybe there was such a recommendation in some Portuguese typesetting
manual, but its result doesn't show...

--   .
António MARTINS-Tuválkin,   |  ()|
<[EMAIL PROTECTED]>   ||
R. Laureano de Oliveira, 64 r/c esq. |
PT-1885-050 MOSCAVIDE (LRS)  Não me invejo de quem tem   |
+351 934 821 700 carros, parelhas e montes   |
http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe   |
http://pagina.de/bandeiras/  a água em todas as fontes   |




Re: Ligatures in Turkish and Azeri

2003-07-15 Thread António Martins-Tuválkin
On 2003.07.12, 20:59, António Martins-Tuválkin
<[EMAIL PROTECTED]> wrote:

> Just browsed some old book with that in mind

I meant "books" here, plural. And I'll keep an eye out for this in
the future.





Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Peter Kirk
On 1st July Philippe Verdy wrote:

> If fonts still want to display dots on these characters, that's a
> rendering problem: there already exists a lot of fonts used for
> languages other than Turkish and Azeri, which do not display any
> dot on a lowercase ASCII i or j (dotted), and display a dot on their
> uppercase ASCII versions (normally not dotted with classic fonts)...
>
> The absence or presence of these dots is then seen as "decorative"
> even if these fonts are not suitable for Turkish and Azeri, but this is
> clearly not an encoding problem in the Unicode encoded text,
> and not a problem either for case conversions.

Turkish and Azeri do not use the ij ligature. The sequences i - j and
dotless i - j do occur (rarely, as j is a rare letter in both languages)
but are treated as separate letters.

In Turkish and Azeri the sequences f - i and f - dotless i both occur, 
and are fairly frequent. So it is inappropriate in these languages to 
use fi ligatures in which the dot on the i is lost or invisible, at 
least where the second character is a dotted i. Has any thought been 
given to this issue? Is it possible to block such ligation on a 
language-dependent basis?

Also it is certainly possible that in dictionaries etc in these 
languages stress might be marked by an accent on the vowel - as 
certainly in the older Cyrillic Azeri just as in Bulgarian as just 
posted. In this case the dot should not be removed from the dotted i 
when the stress mark is added, so that the distinction from dotless i is 
not lost. Has that issue been addressed? (In my Latin script Azeri 
dictionary stress is marked by a spacing grave accent before the vowel, 
but this may have been done precisely to work around this problem.)

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Philippe Verdy
On Thursday, July 10, 2003 12:08 PM, Peter Kirk <[EMAIL PROTECTED]> wrote:

> On 1st July Philippe Verdy wrote:
> 
> > If fonts still want to display dots on these characters, that's a
> > rendering problem: there already exists a lot of fonts used for
> > languages other than Turkish and Azeri, which do not display any
> > dot on a lowercase ASCII i or j (dotted), and display a dot on their
> > uppercase ASCII versions (normally not dotted with classic fonts)...
> > 
> > The absence or presence of these dots is then seen as "decorative"
> > even if these fonts are not suitable for Turkish and Azeri, but
> > this is clearly not an encoding problem in the Unicode encoded text,
> > and not a problem either for case conversions.
> > 
> 
> Turkish and Azeri do not use the ij ligature. The sequences i - j and
> dotless i - j do occur (rarely, as j is a rare letter in both
> languages) but are treated as separate letters.

I know, and the quoted paragraph was not about the ij ligature but
about the separate dotted/dotless i/I letters: there are "decorated"
fonts in which the lowercase ASCII (dotted) i codepoint uses a dotless
glyph, or the uppercase ASCII (dotless) I codepoint uses a dotted glyph
(some fonts ligate the dot with decorative curves). Such fonts are
indeed not suitable for Turkish and Azeri.

> In Turkish and Azeri the sequences f - i and f - dotless i both occur,
> and are fairly frequent. So it is inappropriate in these languages to
> use fi ligatures in which the dot on the i is lost or invisible, at
> least where the second character is a dotted i. Has any thought been
> given to this issue? Is it possible to block such ligation on a
> language-dependent basis?

Isn't there a "Grapheme Disjoiner" format control character to force the
absence of a ligature like , i.e. ?

> Also it is certainly possible that in dictionaries etc in these
> languages stress might be marked by an accent on the vowel - as
> certainly in the older Cyrillic Azeri just as in Bulgarian as just
> posted. In this case the dot should not be removed from the dotted i
> when the stress mark is added, so that the distinction from dotless i
> is not lost. Has that issue been addressed? (In my Latin script Azeri
> dictionary stress is marked by a spacing grave accent before the
> vowel, but this may have been done precisely to work around this
> problem.) 

This is part of the proposal for review: an explicit combining dot-above
diacritic can be inserted between the normal (soft-dotted) base letter
and the above diacritic (with class 230):



-- Philippe.



Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Peter Kirk
On 10/07/2003 08:21, Philippe Verdy wrote:

> > In Turkish and Azeri the sequences f - i and f - dotless i both occur,
> > and are fairly frequent. So it is inappropriate in these languages to
> > use fi ligatures in which the dot on the i is lost or invisible, at
> > least where the second character is a dotted i. Has any thought been
> > given to this issue? Is it possible to block such ligation on a
> > language-dependent basis?
>
> Isn't there a "Grapheme Disjoiner" format control character to force the
> absence of a ligature like , i.e. ?

Maybe, but it is hardly realistic to expect all existing Turkish and
Azeri text to be recoded to insert a character in the middle of each
f - i sequence.

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Philippe Verdy
On Thursday, July 10, 2003 5:41 PM, Peter Kirk <[EMAIL PROTECTED]> wrote:

> > Isn't there a "Grapheme Disjoiner" format control character to
> > force the absence of a ligature like , i.e. ?
> > 
> Maybe, but it is hardly realistic to expect all existing Turkish and
> Azeri text to be recoded to insert a character in the middle of each
> f - i sequence.

Note also: the Soft_Dotted property was created and considered
specially for Turkish and Azeri.

In this language context the ASCII i is always rendered with a dot,
kept also for uppercases.

The other solution would be to use : the forced dot-above
diacritic avoids the ligature, and the sequence is rendered by two glyphs
for  and , i.e. the glyph for , and the force-dotted
glyph for .

Its uppercase conversion causes no problem:


=  + 
=  + 

As well as additional stress diacritics:


=  + 

=  + 
=  + 

-- Philippe.




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Peter Kirk
On 10/07/2003 09:34, Stefan Persson wrote:

> Peter Kirk wrote:
>
> > Maybe, but it is hardly realistic to expect all existing Turkish and
> > Azeri text to be recoded to insert a character in the middle of each
> > f - i sequence.
>
> Isn't most Turkish and Azeri text coded as ISO-8859-9 and similar
> code pages?  In that case, it would be enough to add the proper
> disjoiners to the proper Unicode conversion tables.
>
> Stefan


There is no existing code page covering Azeri Latin, so everything is in 
Unicode or in one of a huge variety of custom solutions. See 
http://www.azer.com/aiweb/categories/magazine/81_folder/81_articles/81_standardfonts.html, 
and the article "The Land of Azeri Fonts: It's a Jungle Out There" in 
the same magazine issue, unfortunately not online, which summarises 20 
or so custom encodings all in current use.

Anyway, I understood from the recent discussion of Hebrew that it is 
Unicode policy not to do anything which could theoretically invalidate 
existing text even if it could be proved that no such text existed.

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Stefan Persson
Peter Kirk wrote:

> Maybe, but it is hardly realistic to expect all existing Turkish and
> Azeri text to be recoded to insert a character in the middle of each
> f - i sequence.

Isn't most Turkish and Azeri text coded as ISO-8859-9 and similar code
pages?  In that case, it would be enough to add the proper disjoiners to
the proper Unicode conversion tables.
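
As a rough illustration of what this suggestion would amount to, here is a minimal Python sketch that decodes ISO-8859-9 bytes and adds a break after every f that precedes a dotted i. The function name is made up for the example, and U+200C ZERO WIDTH NON-JOINER is assumed as the "disjoiner" (it is the character named later in this thread); no real conversion table is known to behave this way.

```python
ZWNJ = "\u200C"  # U+200C ZERO WIDTH NON-JOINER (assumed "disjoiner")


def decode_latin5_without_fi_ligature(raw: bytes) -> str:
    """Hypothetical converter: decode ISO-8859-9, then split every f + dotted i."""
    text = raw.decode("iso8859_9")
    # Only the dotted i (U+0069) is affected; dotless i (U+0131) is left alone.
    return text.replace("fi", "f" + ZWNJ + "i")


dotted = decode_latin5_without_fi_ligature(b"fikir")       # f + dotted i
dotless = decode_latin5_without_fi_ligature(b"f\xfdrsat")  # 0xFD is dotless i

print(dotted == "f\u200Cikir")
print(dotless == "f\u0131rsat")
```

Note that the converted text is no longer a one-to-one image of the legacy bytes, which is one reason to be cautious about this approach.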

Stefan




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Philippe Verdy
On Thursday, July 10, 2003 6:42 PM, Peter Kirk <[EMAIL PROTECTED]> wrote:

> Anyway, I understood from the recent discussion of Hebrew that it is
> Unicode policy not to do anything which could theoretically invalidate
> existing text even if it could be proved that no such text existed.

How does saying that a Grapheme Disjoiner can be used in Turkish, to keep the f from
swallowing the dot of a following lowercase i, invalidate any existing text?

This does not change anything: existing texts can still produce ligatures in a
renderer, unless the renderer is explicitly told not to with a Grapheme Disjoiner, or
is specially tuned to support the Turkish/Azeri languages. Existing texts do not need
to be re-encoded if they are already correctly labelled with their language.

The absence of such a language label will never forbid a renderer from choosing an fi
ligature if one is available, unless renderers are made conforming by correctly
interpreting the Grapheme Disjoiner to mean "break the grapheme cluster here, and
display the previous character(s)"; the Grapheme Disjoiner can then itself be rendered
as a non-spacing empty glyph, followed by the rest of the string...

I'm still convinced that a ligature is still possible for a turkish  
sequence, using . The ligature would apply to the middle bar of the 
 joined with the top serif of the , but the top-right loop of the f would simply 
be a small horizontal bar, disjoined from the dot still present on the i.

The same ligature could be used for the encoded sequence , so an actual 
font would render the glyphs for  as a base ligature glyph for  (with a top horizontal bar for the  part), and add separately the 
 glyph kerned into the existing  ligature.

To force disable this last ligature, we would use the encoded sequence 

According to unicode the sequence  has always been valid, despite it 
apparently has the same dotted glyph for all languages. It differs only in the fact 
that the explicit  removes the Soft_Dotted property of the previous  to 
make it dotless, followed by a forced diacritic.

So the encoded sequence  is now made "equivalent" (for rendering 
purpose) to  (despite they are not canonically equivalent per 
UAX#15: NFC/D) and not "equivalent" to an isolated  (not followed above 
diacritics)...

-- Philippe.



Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Kenneth Whistler
Peter Kirk asked:

> > In Turkish and Azeri the sequences f - i and f - dotless i both occur,
> > and are fairly frequent. So it is inappropriate in these languages to
> > use fi ligatures in which the dot on the i is lost or invisible, at
> > least where the second character is a dotted i. Has any thought been
> > given to this issue? Is it possible to block such ligation on a
> > language-dependent basis?
> 

and Philippe Verdy responded with another question:

> Isn't there a "Grapheme Disjoiner" format control character to force the
> absence of a ligature like , i.e. ?

The answer to Philippe's rejoinder question is no, there is not
a "Grapheme Disjoiner" format control character.

What Philippe has in mind, however, is covered in the standard
by the interaction of the joiner and non-joiner characters
with ligature control:

"U+200C ZERO WIDTH NON-JOINER is intended to break both cursive
connections and ligatures in rendering.

"ZWNJ requests that glyphs in the lowest available category
(for the given font) be used."

  -- Unicode 4.0, Section 15.2, Layout Controls

The categories referred to, from lowest to highest, are:

1. unconnected
2. cursively connected
3. ligated

As Peter pointed out, however, it is neither expected nor reasonable
to have to go back through and drop in ZWNJs at every relevant
location in existing Turkish or Azeri text, simply to prevent
fi ligation. Such use of ZWNJ is intended to be exceptional,
to deal with special cases.
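
To make the cost of such a retrofit concrete, here is a minimal Python sketch. The helper `add_zwnj` and the sample word are purely illustrative; nothing here is prescribed by the standard.

```python
ZWNJ = "\u200C"  # U+200C ZERO WIDTH NON-JOINER


def add_zwnj(text: str) -> str:
    """Hypothetical retrofit: put ZWNJ inside every f + dotted-i pair."""
    return text.replace("fi", "f" + ZWNJ + "i")


original = "fikir"            # Turkish for "idea"
patched = add_zwnj(original)

print(len(original), len(patched))            # the patched text is one code point longer
print(patched == original)                    # False: the strings no longer compare equal
print("fikir" in patched)                     # False: a plain substring search misses it
print(patched.replace(ZWNJ, "") == original)  # True: stripping ZWNJ restores the text
```

Every process that compares or searches the text would have to know about the inserted control, which is part of why this is better kept for special cases.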

The general solution depends on the use of fonts (or, more
generally, renderers) which block such ligation across the
board. It is my understanding that modern font technologies
allow the choice of ligation to essentially be a style selection
for the font. How well various applications take advantage
of that and make the choice available easily to end users may
be an open issue still, but the fundamental pieces to do this
correctly are available.

--Ken




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Philippe Verdy
On Thursday, July 10, 2003 8:37 PM, Kenneth Whistler <[EMAIL PROTECTED]> wrote:

> Peter Kirk asked:
> 
> > > In Turkish and Azeri the sequences f - i and f - dotless i both
> > > occur, and are fairly frequent. So it is inappropriate in these
> > > languages to use fi ligatures in which the dot on the i is lost
> > > or invisible, at least where the second character is a dotted i.
> > > Has any thought been given to this issue? Is it possible to block
> > > such ligation on a language-dependent basis?
> > 
> 
> and Philippe Verdy responded with another question:
> 
> > Isn't there a "Grapheme Disjoiner" format control character to
> > force the absence of a ligature like , i.e. ?
> 
> The answer to Philippe's rejoinder question is no, there is not
> a "Grapheme Disjoiner" format control character.

I did not refer to a specific unicode character, I knew that there
is one already dedicated, but I did not want to comment about
this choice.

There's no contradiction. The Grapheme Disjoiner, for you, is
ZWNJ. OK.

And I did not want to promote any change in any legally and
legacy encoded text, only to suggest ways to solve the
apparent rendering problem in Turkish, when the 
encoded character pair may be badly rendered. For the actual
rendering, selecting a  ligature is not appropriate for
Turkish, and in fact the canonically decomposed character
has no linguistic ambiguity in Turkish.

So what ever the  encoded codepoint designates, it is not
the  ligature glyph but really two characters, whose ligation
may still be performed according to language context.

A font that would automatically select a  ligature to represent
a sequence of  codepoints, from the fact that the 
codepoint is canonically equivalent is probably  defective and not
conforming. Such selection of ligature must be put under the
control of the renderer with additional markup, which can in fact
select among three ligatures in Turkish: the  ligature glyph
where the f is ligated with the dot above i (normal ligature for
languages other than Turkish/Azeri, the  and
 ligatures for Turkish/Azeri.

Markup is necessary to select the appropriate glyph, or this
can be selected by using the "Grapheme Disjoiner" (ZWNJ)
or the "Grapheme Joiner" (ZWJ) in addition to the use of
a  or  codepoint eventually followed by the
 diacritic. All this enrichment of text is assumed
to be under the control of the markup added to the original
text which does not need to specify whether ligatures should
or should not be used.



Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread John Cowan
Philippe Verdy scripsit:

> Where does the fact of saying that a Grapheme Disjoiner can be used
> in Turkish to avoid that the f collapses the dot above a next lowercase i?

It is settled that ZWNJ is the correct character to break ligatures.
ZWJ means "make a ligature if you can; if not, shape characters to
joining forms if you can; if not that either, do nothing."  ZWNJ means
"break ligatures, if any, and shape characters to non-joining forms,
if possible."

> I'm still convinced that a ligature is still possible for a turkish  dotted-i> sequence, using . The ligature would apply
> to the middle bar of the  joined with the top serif of the ,
> but the top-right loop of the f would simply be a small horizontal bar,
> disjoined from the dot still present on the i.

Yes, theoretically.  Whether that is good Turkish typography is a different
question, which AFAIK prefers simply an f-glyph followed by an i-glyph with
no ligaturing.

IIRC, Portuguese traditional typography also avoids the fi-ligature, even though
the language has no dotless-i.

> The same ligature could be used for the encoded sequence , 

I doubt that any font has a ligature for this combination at all.

> So the encoded sequence  is now made "equivalent"
> (for rendering purpose) to  (despite they are
> not canonically equivalent per UAX#15: NFC/D) and not "equivalent"
> to an isolated  (not followed above diacritics)...

There is no guarantee that the native i dot looks the same as the dot above
in a given font (it may have different vertical kerning or even a different
shape), nor is there any guarantee that the i with its dot removed looks
the same as the dotless-i.

-- 
John Cowan  www.ccil.org/~cowan  www.reutershealth.com  [EMAIL PROTECTED]
"'My young friend, if you do not now, immediately and instantly, pull
as hard as ever you can, it is my opinion that your acquaintance in the
large-pattern leather ulster' (and by this he meant the Crocodile) 'will
jerk you into yonder limpid stream before you can say Jack Robinson.'"
--the Bi-Coloured-Python-Rock-Snake



Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Peter Kirk
On 10/07/2003 11:37, Kenneth Whistler wrote:

> As Peter pointed out, however, it is neither expected nor reasonable
> to have to go back through and drop in ZWNJs at every relevant
> location in existing Turkish or Azeri text, simply to prevent
> fi ligation. Such use of ZWNJ is intended to be exceptional,
> to deal with special cases.
>
> The general solution depends on the use of fonts (or, more
> generally, renderers) which block such ligation across the
> board. It is my understanding that modern font technologies
> allow the choice of ligation to essentially be a style selection
> for the font. How well various applications take advantage
> of that and make the choice available easily to end users may
> be an open issue still, but the fundamental pieces to do this
> correctly are available.

Thank you, Ken. I think you get my point. I am not so interested in 
character level mechanisms for disabling the ligature as in higher level 
features. But I guess I am really thinking in terms of markup, so 
outside the domain of Unicode, which might disable ligation.

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Laurentiu Iancu
See also
http://www.microsoft.com/typography/developers/opentype/detail.htm
which explains how ligatures can be turned off on a language-dependent basis.

Laurentiu


Peter Kirk asked:

> In Turkish and Azeri the sequences f - i and f - dotless i both occur,
> and are fairly frequent. So it is inappropriate in these languages to
> use fi ligatures in which the dot on the i is lost or invisible, at
> least where the second character is a dotted i. Has any thought been
> given to this issue? Is it possible to block such ligation on a
> language-dependent basis?




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Kenneth Whistler

> > and Philippe Verdy responded with another question:
> > 
> > > Isn't there a "Grapheme Disjoiner" format control character to
> > > force the absence of a ligature like , i.e. ?
> > 
> > The answer to Philippe's rejoinder question is no, there is not
> > a "Grapheme Disjoiner" format control character.
> 
> I did not refer to a specific unicode character, I knew that there
> is one already dedicated, but I did not want to comment about
> this choice.
> 
> There's no contradiction. The Grapheme Disjoiner, for you, is
> ZWNJ. OK.



Every so often, Philippe, it would be refreshing if, when someone
points out an error in your claims about the Unicode Standard,
you would simply acknowledge the error and discontinue
making the claim, instead of coming back trying to claim that
the error was just another way of being right.



There is a separate character, U+034F COMBINING GRAPHEME JOINER,
which is the "grapheme joiner", abbreviation "CGJ" in the
standard. That character has nothing to do with ligation
control. There has also been debate, on several occasions,
within the UTC, regarding the advisability of encoding
a "grapheme non-joiner", as a pair with the "grapheme joiner".
But again, such a grapheme non-joiner -- which has *not* been
encoded, by the way -- would have nothing to do with ligation
control.

So it is a disservice to the list, perpetuating confusion, to
invent the term "Grapheme Disjoiner" and use it in a series
of notes regarding ligation control, when the standard already
designates the ZWJ and the ZWNJ as the relevant controls
related to ligation control.

So it is not that for me "the Grapheme Disjoiner is the ZWNJ";
rather, it is for the Unicode Standard that the ZWNJ is the
designated, standardized format control for ligation control
of the sort you are talking about. Please learn the terminology
and make correct use of it.

> A font that would automatically select a  ligature to represent
> a sequence of  codepoints, from the fact that the 
> codepoint is canonically equivalent

U+FB01 LATIN SMALL LIGATURE FI is not a *canonical* equivalent to
; it is *compatibility* equivalent. That is an important
distinction.
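
The distinction can be checked with Python's standard `unicodedata` module. This is only a small sketch: canonical normalization (NFD) leaves U+FB01 untouched, while compatibility normalization (NFKD) replaces it with the two letters.

```python
import unicodedata

FI = "\uFB01"  # U+FB01 LATIN SMALL LIGATURE FI

# The decomposition mapping carries the <compat> tag, so it is not canonical.
print(unicodedata.decomposition(FI))

# Canonical decomposition does not touch the ligature character...
print(unicodedata.normalize("NFD", FI) == FI)

# ...but compatibility decomposition yields plain f followed by i.
print(unicodedata.normalize("NFKD", FI) == "fi")
```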

> is probably  defective and not
> conforming. 

Wrong. There is nothing nonconformant about fonts automatically
ligating  (or any other sequence). Such automatic
ligation may not always be appropriate or the desired result
for an end user, but that has nothing to do with the conformance
requirements of the standard.

> Such selection of ligature must be put under the
 
 
Wrong. "must" --> "may"

> control of the renderer with additional markup, which can in fact
> select among three ligatures in Turkish: the  ligature glyph
> where the f is ligated with the dot above i (normal ligature for
> languages other than Turkish/Azeri, the  and
>  ligatures for Turkish/Azeri.

It is unclear that any such ligatures are required or desirable
for Turkish/Azeri, in any case.

> Markup is necessary to select the appropriate glyph, or this
  ^^^
  
Wrong. A higher-level protocol is needed, and that may involve
markup. But the Turkish requirements can equally well be
met by simply setting "no ligature" style settings for
the relevant fonts.

> can be selected by using the "Grapheme Disjoiner" (ZWNJ)
   
   
Wrong term. See above.

> or the "Grapheme Joiner" (ZWJ) in addition to the use of
 ^
 
Wrong term. See above.

> a  or  codepoint eventually followed by the
>  diacritic.

And in any case, it is inadvisable to be suggesting use of
ZWJ and ZWNJ in this way to solve the problem of assuring that
Turkish texts don't ligate inappropriately on rendering. 

> All this enrichment of text is assumed
> to be under the control of the markup added to the original
> text which does not need to specify whether ligatures should
> or should not be used.

This last clause I agree with. But the implication that
markup has to be added to Turkish text in order to get it
to render correctly regarding ligature usage is incorrect.
Adding markup to the text is "adding to the original text"
as surely as adding ZWNJ format controls would be. In any
case it is unnecessary, since alternatives exist which simply
specify suppression (or use) of ligatures stylistically in
the fonts.

--Ken




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread James H. Cloos Jr.
> "Peter" == Peter Kirk <[EMAIL PROTECTED]> writes:

Peter> Maybe, but it is hardly realistic to expect all existing
Peter> Turkish and Azeri text to be recoded to insert a character in
Peter> the middle of each f - i sequence.

But a lot of it already does do that.  In TeX Turkish uses f{}i to
block the (font’s) ligation.  ’roff does something similar.  I’m
sure all of the other text-source publishing systems do as well.

Even the WYSI(NR)WYG¹ systems must be doing something to accomplish that.

-JimC

¹ NR ≡ Not Really




RE: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-11 Thread Kent Karlsson

> Note also: the Soft_Dotted property was created and considered
> specially for Turkish and Azeri.

Adding to the long, and unfortunately getting longer, list of misleading
statements from Philippe!  No, the reason for the Soft_Dotted property
was/is to mark which characters (regardless of language) don't display
their intrinsic dot(s) above subglyph(s) when (another) combining
character above is applied to them (to keep the dot(s), a combining dot
above or a combining diaeresis, as appropriate, must then be used
explicitly).
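
A small Python sketch of the sequences involved may help. As far as I know, the standard `unicodedata` module does not expose Soft_Dotted itself, so this only shows the encoded sequences and their combining classes; what a renderer draws for them is outside its scope.

```python
import unicodedata

DOT_ABOVE = "\u0307"  # U+0307 COMBINING DOT ABOVE
ACUTE = "\u0301"      # U+0301 COMBINING ACUTE ACCENT

# Both marks sit above the base letter and share combining class 230.
print(unicodedata.combining(DOT_ABOVE), unicodedata.combining(ACUTE))

# i + acute: a renderer is expected to drop the soft dot; NFC composes it to U+00ED.
accented = "i" + ACUTE
print(unicodedata.normalize("NFC", accented) == "\u00ED")

# i + explicit dot above + acute: the dot is requested explicitly, and
# normalization keeps all three code points in this order.
dotted_accented = "i" + DOT_ABOVE + ACUTE
print(unicodedata.normalize("NFC", dotted_accented) == dotted_accented)
```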

> In this language context the ASCII i is always rendered with a dot,
> kept also for uppercases.

I hope you don't mean to use a dotted glyph for U+0069!

B.t.w.  It is perfectly legal to use a ligature (in the TECHNICAL sense,
perhaps not the typographic sense) for  also for Turkish and
related
languages, especially if the f and i would otherwise overlap.  The point
is that  and  must be clearly distinguishable for
these
languages, and that may mean that one has to use a TECHNICAL ligature
for  having a glyph where the dot on the i is clearly visible (the
horizontal bar of the f and the top serif of the i may still merge).
That may be done by whatever means looks best for that particular
font, e.g. moving the loop of the f to the left, right, or up.
(Using ZWNJ should not do that, if correctly implemented, but can
instead, mistakenly, result in overlapping f and dot-of-i glyphs, since
not even a technical ligature, IIUC (correct me if I'm wrong), would be
allowed...)

/kent k




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-11 Thread Philippe Verdy
On Friday, July 11, 2003 1:12 PM, Kent Karlsson <[EMAIL PROTECTED]> wrote:

> > Note also: the Soft_Dotted property was created and considered
> > specially for Turkish and Azeri.
> 
> Adding to the long, and unfortunately getting longer, list of
> misleading statements from Philippe!  No, the reason for the
> Soft_Dotted property was/is to mark which characters (regardless of
> language) that don't display intrinsic dot(s) above subglyph(s)
> when (another) combining character above
> is applied to it (and to then keep the dot(s) a combining dot above
> or a combining diaeresis, as appropriate, must be used explicitly).

I don't know how I can say, with my limited English, things without
being always accused of creating misleading things.

Correct things if you think my words create possible confusion in
their interpretation, but please don't over-exhibit them. I don't know
how non-English native writers can participate here if all differences
of interpretations caused by possible use of inappropriate English
terms are answered with flame. This is really frustrating...

The important words in my sentence are "considered specially",
where "specially" does not imply "only". It's just that Turkish and
Azeri are already given special treatment in Unicode, which already
includes language exceptions in its technical algorithms (notably
for character foldings).
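
As an illustration of these language exceptions, here is a minimal Python sketch. Python's built-in `str.upper()` and `str.lower()` apply only the default, language-independent mappings; the two `turkic_*` helpers are hypothetical names written for this example and cover only the i/I letters.

```python
# Default (untailored) case mappings.
print("i".upper() == "I")             # dotted i loses its dot in uppercase
print("\u0131".upper() == "I")        # dotless i maps to the same capital
print("\u0130".lower() == "i\u0307")  # capital dotted I becomes i + combining dot above


def turkic_upper(text: str) -> str:
    """Hypothetical tr/az tailoring: dotted i uppercases to U+0130."""
    return text.replace("i", "\u0130").upper()


def turkic_lower(text: str) -> str:
    """Hypothetical tr/az tailoring: I lowercases to dotless i, U+0130 to i."""
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()


print(turkic_upper("fikir") == "F\u0130K\u0130R")
print(turkic_lower("FIRIN") == "f\u0131r\u0131n")
```

With the default mappings, "FIRIN" would lowercase to "firin" with dotted letters, which is exactly the distinction that is lost without the language tailoring.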

And according to this treatment, the U+0069 character is already
intended to have a semantic value of a dotted  and not a dotless
 in languages where this creates a semantic difference, so the
question of the "Soft_Dotted" property is more glyphic than purely
semantic, and it has a semantic behavior (at the abstract text
level where Unicode is supposed to standardize things) mostly in
case folding operations where the actual encoding of the converted
abstract text is important.

The rest of the description of the Soft_Dotted property is mostly a
recommendation for authors of fonts and text renderers, so that
they should *preserve this semantic difference* in the rendered text
between abstract letters dotted and dotless 's... And this does
not affect the encoding of the abstract text or any algorithmic
transformation of the encoded abstract text.

By saying "preserve this semantic difference", I do not imply that
the U+0069 must/should have a dot above: it remains a font design
problem, out of scope of Unicode. There are certainly many ways
to preserve the semantic difference in the rendered text when this
is really appropriate (for example in Turkish and Azeri, or with a
distinct and emphasized rendering of the Turkish dot, including
in possible ligatures with other letters).


And please, do not flame me if this message contains new
terms that also create confusion. I reread as best I can,
and there are certainly better ways to say the same thing
in English without these unintentionally confusing interpretations,
and I am sorry in advance if such confusion still persists.

Accept the fact that I'm not a Unicode member and Unicode
is only one of my interests, and I have a lot of other
terminologies that I have to work with.

If you can't accept that approximate English may
be used by participants here, and refuse to understand the
real intent of users when they write here, then have this
group be moderated, but don't say it is open to discussions
from anybody using Unicode.

For normative aspects, with all exact terms, Unicode has its
web site, its publications, its data files, its working draft
documents, its technical committees, its permanent members,
its chairmen, and even bug and comment report forms to
interact with users at the normative level.
And I am sure that permanent Unicode members do not even
need this newsgroup to exchange their work on normative
documents that are directly sent to the working committee
bureaus, or via private email, phone calls, snail letters, or
their own web sites.
Please don't expect the same linguistic level quality here.

Also don't complain if my messages are long: the constant
criticism of what I am "supposed" to "imply" gives me no
other choice than always explaining what I mean, and this is
particularly lengthy, and really boring in a newsgroup.


Thanks for your patience.

-- Philippe.




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-11 Thread Peter Kirk
On 11/07/2003 05:56, Philippe Verdy wrote:

> Note also: the Soft_Dotted property was created and considered
> specially for Turkish and Azeri.
 

Whatever it was that was specially created or adjusted for Turkish and 
Azeri, was it specifically restricted to these two languages? These are 
I think the only relatively major languages which use the special dotted 
and dotless i case mappings. But they are also used, at least in a small 
way, for minority languages of Turkey and Azerbaijan. (Use of these 
minority languages in Turkey is illegal, but that's another matter.) 
They were used in the 1930s for many Central Asian languages, and were
at least proposed in the 1990s for newly introduced Latin alphabets. So
I hope that what is fixed by Unicode is the name not of two languages 
but of an extensible family of scripts.

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-11 Thread Philippe Verdy
On Friday, July 11, 2003 3:50 PM, Peter Kirk <[EMAIL PROTECTED]> wrote:
> So I hope that what is fixed by Unicode is the name not
> of two languages but of an extensible family of scripts.

I think you mean a family of languages?

Good luck with the ISO language codes, which do not even
define them, and contain many duplicate codes even in
the Alpha-2 space (he/iw, in/id), or imprecise codes
sometimes matching very imprecise families of languages
that overlap other language codes...

Until it is demonstrated that a language needs such a fix
in the Unicode support tables, it's best to just say that these
fixes are needed for some recognized language codes,
that applications are allowed to add their own "fixes" or
language tailorings, and that the existing language
tailorings in the Unicode databases are just non-normative
samples.

-- Philippe.




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-11 Thread Peter Kirk
On 11/07/2003 08:51, Philippe Verdy wrote:

> On Friday, July 11, 2003 3:50 PM, Peter Kirk <[EMAIL PROTECTED]> wrote:
>
> > So I hope that what is fixed by Unicode is the name not
> > of two languages but of an extensible family of scripts.
>
> I think you mean a family of languages?

Not really. A set of languages, but they are not all related in any way,
and many of them have more than one script or alphabet so this is not
really a property of the languages. Perhaps "set of alphabets" would be
a better way to put it.

> Good luck with the ISO language codes, which do not even
> define them, and contain many duplicate codes even in
> the Alpha-2 space (he/iw, in/id), or imprecise codes
> sometimes matching very imprecise families of languages
> that overlap other language codes...
> Until it is demonstrated that a language needs such a fix
> in the Unicode support tables, ...

If necessary I can collect some data to demonstrate this, at least for
some languages.

> ... it's best to just say that these
> fixes are needed for some recognized language codes,
> that applications are allowed to add their own "fixes" or
> language tailorings, and that the existing language
> tailorings in the Unicode databases are just non-normative
> samples.
>
> -- Philippe.

Agreed. But does Unicode actually treat them as non-normative samples?

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-11 Thread Philippe Verdy
On Friday, July 11, 2003 6:43 PM, Peter Kirk <[EMAIL PROTECTED]> wrote:

> Agreed. But does Unicode actually treat them as non-normative samples?

Not clear here: the reference documents say that these tables are
normative for applications that want to implement a conforming
case folding. But UTR#30 (Character Foldings) still contains many
areas marked as "to be done", so it is not clear that all folding issues
have been solved. It seems reasonable, however, that the elements of the
CaseFolding table that are not language-specific are normative, as they
are computed from the UCD...

I see this comment:
[quote]
# The entries in this file are in the following machine-readable format:
#
# <code>; <status>; <mapping>; # <name>
#
# The status field is:
# C: common case folding, common mappings shared by both simple and full mappings.
# F: full case folding, mappings that cause strings to grow in length. Multiple
#    characters are separated by spaces.
# S: simple case folding, mappings to single characters where different from F.
# T: special case for uppercase I and dotted uppercase I
#    - For non-Turkic languages, this mapping is normally not used.
#    - For Turkic languages (tr, az), this mapping can be used instead of the normal
#      mapping for these characters.
#      Note that the Turkic mappings do not maintain canonical equivalence without
#      additional processing.
#      See the discussions of case mapping in the Unicode Standard for more
#      information.
#
# Usage:
#  A. To do a simple case folding, use the mappings with status C + S.
#  B. To do a full case folding, use the mappings with status C + F.
#
#    The mappings with status T can be used or omitted depending on the desired
#    case-folding behavior. (The default option is to exclude them.)
#
[/quote]
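
The header's remark that the Turkic mappings do not maintain canonical equivalence can be illustrated with a small Python sketch. It assumes only the two T mappings for capital I and dotted capital I, hard-coded in a table; the names `T_MAP` and `t_fold` are invented for this illustration and are not part of any Unicode data file or library.

```python
import unicodedata

# Assumed T mappings: I -> dotless i, dotted capital I -> i.
T_MAP = {"I": "\u0131", "\u0130": "i"}

def t_fold(text):
    """Apply only the two T mappings, character by character."""
    return "".join(T_MAP.get(ch, ch) for ch in text)

# U+0130 and its canonical decomposition (I followed by COMBINING DOT ABOVE)
# are canonically equivalent before folding.
composed = "\u0130"
decomposed = unicodedata.normalize("NFD", composed)

# After folding, the two results are no longer equivalent, even once both
# are normalized to NFC.
folded_composed = unicodedata.normalize("NFC", t_fold(composed))
folded_decomposed = unicodedata.normalize("NFC", t_fold(decomposed))
```

Here the composed input folds to a plain "i", while the decomposed input folds to a dotless i followed by a combining dot, which is presumably the kind of "additional processing" case the header has in mind.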

Simple case folding (C+S) is not marked "to be done" in UTR#30, but the special
mappings with status T are off by default, so they depend on a specific tailoring:
a non-normative behavior, if I interpret it correctly, as applications are free to
use them or not under unspecified conditions (here, the "desired behavior").
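
As a rough sketch of how the status field drives the choice of mappings, the following Python reads a few entries in the format quoted above and folds a string with C+S or C+F, optionally overlaid with T. The `SAMPLE` table is a tiny hand-picked subset (the C and T entries for I are assumed from the published file, not quoted in this message), and the function names `parse` and `fold` are hypothetical.

```python
# A few entries in CaseFolding.txt format: <code>; <status>; <mapping>; # <name>
SAMPLE = """\
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
1F88; S; 1F80; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
"""

def parse(text):
    """Return {status: {source_char: folded_string}} from the entry lines."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not data:
            continue
        code, status, mapping = [f.strip() for f in data.split(";")[:3]]
        folded = "".join(chr(int(cp, 16)) for cp in mapping.split())
        table.setdefault(status, {})[chr(int(code, 16))] = folded
    return table

def fold(text, table, full=True, turkic=False):
    """Fold with C+F (full) or C+S (simple); T entries override if requested."""
    active = dict(table.get("C", {}))
    active.update(table.get("F" if full else "S", {}))
    if turkic:
        active.update(table.get("T", {}))
    return "".join(active.get(ch, ch) for ch in text)

table = parse(SAMPLE)
```

The `turkic` flag defaults to off, mirroring the header's statement that the default option is to exclude the T mappings.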

This concerns many more characters than just those used by Turkish and Azeri, and
there is some overlap with the informative and unfinished UTR#30 reference:

(1) Simple mappings (are they normative?):

1F88; S; 1F80; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
1F89; S; 1F81; # GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI
1F8A; S; 1F82; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA AND PROSGEGRAMMENI
1F8B; S; 1F83; # GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA AND PROSGEGRAMMENI
1F8C; S; 1F84; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI
1F8D; S; 1F85; # GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA AND PROSGEGRAMMENI
1F8E; S; 1F86; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
1F8F; S; 1F87; # GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI

1F98; S; 1F90; # GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI
1F99; S; 1F91; # GREEK CAPITAL LETTER ETA WITH DASIA AND PROSGEGRAMMENI
1F9A; S; 1F92; # GREEK CAPITAL LETTER ETA WITH PSILI AND VARIA AND PROSGEGRAMMENI
1F9B; S; 1F93; # GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI
1F9C; S; 1F94; # GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA AND PROSGEGRAMMENI
1F9D; S; 1F95; # GREEK CAPITAL LETTER ETA WITH DASIA AND OXIA AND PROSGEGRAMMENI
1F9E; S; 1F96; # GREEK CAPITAL LETTER ETA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
1F9F; S; 1F97; # GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI

1FA8; S; 1FA0; # GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI
1FA9; S; 1FA1; # GREEK CAPITAL LETTER OMEGA WITH DASIA AND PROSGEGRAMMENI
1FAA; S; 1FA2; # GREEK CAPITAL LETTER OMEGA WITH PSILI AND VARIA AND PROSGEGRAMMENI
1FAB; S; 1FA3; # GREEK CAPITAL LETTER OMEGA WITH DASIA AND VARIA AND PROSGEGRAMMENI
1FAC; S; 1FA4; # GREEK CAPITAL LETTER OMEGA WITH PSILI AND OXIA AND PROSGEGRAMMENI
1FAD; S; 1FA5; # GREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA AND PROSGEGRAMMENI
1FAE; S; 1FA6; # GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
1FAF; S; 1FA7; # GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI

1FBC; S; 1FB3; # GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI
1FCC; S; 1FC3; # GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI
1FFC; S; 1FF3; # GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI

(2) Full mappings (clearly optional):

00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0149; F; 02BC 006E; # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
01F0; F; 006A 030C; # LATIN SMALL LETTER J WITH CARON

0390; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
03B0; F; 03C5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS

0587; F; 0565 0582; # ARMENIAN SMALL LIGATURE ECH YIWN

1E96; F; 0068 0331; # LATIN SMALL LETTER H WITH LINE BELOW
1E97; F; 0074 0308;

Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-12 Thread Peter_Constable
> Where does the fact of saying that a Grapheme Disjoiner...

The character you should be referring to is not a new character GDJ, but 
rather is the existing ZWNJ, the functions of which include prevention of 
a ligature.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
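
For readers who want to see what using ZWNJ for this purpose looks like in data, here is a minimal Python sketch. The function name `block_fi_ligature` is invented for illustration, and it only handles the "fi" pair; whether a given font and renderer then actually suppress the ligature is outside the scope of the sketch.

```python
ZWNJ = "\u200c"  # U+200C ZERO WIDTH NON-JOINER

def block_fi_ligature(text):
    """Insert ZWNJ between 'f' and 'i' so that the pair is not ligated."""
    return text.replace("fi", "f" + ZWNJ + "i")

marked = block_fi_ligature("fiyat")
```

The stored text grows by one invisible character per "fi" pair, so comparisons and searches on such text may need to ignore ZWNJ.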




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-12 Thread Peter Kirk
On 11/07/2003 11:18, Philippe Verdy wrote:

> # T: special case for uppercase I and dotted uppercase I
> #    - For non-Turkic languages, this mapping is normally not used.
> #    - For Turkic languages (tr, az), this mapping can be used instead
> #      of the normal mapping for these characters.
>
> Is that what is called a "character subset" for a scripted language
> family? Well, I don't like the term "Turkic" to name it. I prefer the
> more common "Altaic Latin alphabet", seen as a standard subset of the
> Latin script, with additional properties.
>
> Maybe Unicode should not try to use language codes for families of
> languages, but it could define "representative subsets of characters"
> which may contain characters from several scripts, but would be
> minimized according to the tradition of a family of languages. Such
> families seem evident from the current ISO-8859-* and Mac/Windows/DOS
> charsets.
>
> -- Philippe.

 

Thank you, Philippe. Well, I am glad to read "not normally used" rather 
than "must not be used" as this allows mapping T to be used for other 
languages when appropriate.

I also don't like the word Turkic here. This is a linguistic term for a 
language family, see 
http://www.ethnologue.com/show_family.asp?subid=710. Turkish and Azeri 
are Turkic languages, but there are many Turkic languages which don't 
use this case mapping, either because they use other alphabets 
(Cyrillic, Arabic, occasionally Hebrew, perhaps even Greek) or because 
they use a Latin alphabet with the regular case mapping as in Uzbek and 
Turkmen. There are also some non-Turkic minority languages which need 
the T case mapping. "Altaic Latin alphabet" is a reasonable alternative, 
although again Altaic is a language family name, covering Turkic, 
Mongolian and Tungus, see 
http://www.ethnologue.com/show_family.asp?subid=709, and as far as I 
know mapping T is not needed for any Mongolian or Tungusic languages.
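
One way to keep the set of languages open-ended, as argued above, is to make it a parameter rather than a fixed test for "tr" and "az". The Python below is only a sketch under that assumption: the function names and the `dotless_i_langs` parameter are hypothetical, the tag "xx" is a placeholder for some other language that needs the mapping, and the decomposed sequence I + U+0307 is deliberately not handled.

```python
# Language tags that use the dotted/dotless i case pairs by default.
DEFAULT_DOTLESS_I_LANGS = frozenset({"tr", "az"})

def lower(text, lang, dotless_i_langs=DEFAULT_DOTLESS_I_LANGS):
    """Lowercase text, mapping I -> dotless i and dotted I -> i when needed."""
    if lang in dotless_i_langs:
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()

def upper(text, lang, dotless_i_langs=DEFAULT_DOTLESS_I_LANGS):
    """Uppercase text, mapping i -> dotted capital I when needed."""
    if lang in dotless_i_langs:
        text = text.replace("i", "\u0130")
    return text.upper()

# An application can extend the set without changing the functions.
extended = DEFAULT_DOTLESS_I_LANGS | {"xx"}
```

An application-level set like `extended` is the kind of tailoring the thread describes as allowed; nothing here depends on the languages being related to each other.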

Does anyone know of a good resource on the web, or elsewhere, listing 
the alphabets used for different languages around the world? I know a 
project was attempted a few years ago at least for Europe. It would be 
useful to have this kind of data available somewhere even with no 
official status.

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/




Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-12 Thread Michael Everson
At 03:25 -0700 2003-07-12, Peter Kirk wrote:

> Does anyone know of a good resource on the web, or elsewhere,
> listing the alphabets used for different languages around the world?
> I know a project was attempted a few years ago at least for Europe.
> It would be useful to have this kind of data available somewhere
> even with no official status.

http://www.evertype.com/alphabets
--
Michael Everson * * Everson Typography *  * http://www.evertype.com


Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-12 Thread Peter Kirk
On 12/07/2003 04:18, Michael Everson wrote:

> At 03:25 -0700 2003-07-12, Peter Kirk wrote:
>
> > Does anyone know of a good resource on the web, or elsewhere, listing
> > the alphabets used for different languages around the world? I know a
> > project was attempted a few years ago at least for Europe. It would
> > be useful to have this kind of data available somewhere even with no
> > official status.
>
> http://www.evertype.com/alphabets

Thank you, Michael. I knew you had this information, of course, as I 
helped to provide it, but I didn't know where it was now. This is of 
course restricted to Europe as you have defined it, and is not 
exhaustive for Turkey. Also it doesn't include recent Latin alphabets 
for minority languages of Azerbaijan, as used in schools to a rather 
limited extent, perhaps because I never sent you the data.

The link to http://www.evertype.com/alphabets/azerbaijan.pdf is broken; 
and in http://www.evertype.com/alphabets/turkish.pdf the dotted capital 
I is missing, as viewed in Acrobat Reader 5.1 on Windows 2000.

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/




ISO 639 "duplicate" codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)

2003-07-11 Thread Doug Ewell
Philippe Verdy  wrote:

> Good luck with the ISO language codes, which do not even
> define them, and contain many duplicate codes even in
> the Alpha-2 space (he/iw, in/id), or imprecise codes
> sometimes matching very imprecise families of languages
> that overlap other language codes...

The codes "iw" for Hebrew and "in" for Indonesian were deprecated
FOURTEEN YEARS AGO.  It is not accurate or fair to refer to them as
"duplicates" of "he" and "id".  The Registration Authority deprecates
such codes, rather than deleting them, for backward compatibility with
any data that might contain the old codes.

The part about codes for language families overlapping other codes for
specific languages is, regrettably, true.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Re: ISO 639 "duplicate" codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)

2003-07-12 Thread Philippe Verdy
On Saturday, July 12, 2003 6:51 AM, Doug Ewell <[EMAIL PROTECTED]> wrote:

> Philippe Verdy  wrote:
> 
> > Good luck with the ISO language codes, which do not even
> > define them, and contain many duplicate codes even in
> > the Alpha-2 space (he/iw, in/id), or imprecise codes
> > sometimes matching very imprecise families of languages
> > that overlap other language codes...
> 
> The codes "iw" for Hebrew and "in" for Indonesian were deprecated
> FOURTEEN YEARS AGO.  It is not accurate or fair to refer to them as
> "duplicates" of "he" and "id".  The Registration Authority deprecates
> such codes, rather than deleting them, for backward compatibility with
> any data that might contain the old codes.

I was also sure that "iw" was not used today, until I found that it is
still used in Java on Windows, for legacy reasons... A resource
bundle created for Hebrew with the code "he" was simply... ignored. So I had to
rename it to "iw".

Sadly, on Linux and the various Unixes the correct code to use for locales
varies, and it comes from the user-environment settings, usually set up
by a system profile... Users who want to
benefit from existing locales for Hebrew will constantly need to switch
between "he" and "iw". The "normal" installation solution is still today
to create a file link between the "he" and "iw" resources, so that both
can be used.

I was really disappointed when I saw that these legacy language codes
could not be simplified the way one would expect, by ignoring "iw" and "in",
and that still today Java does not offer a way to create "links" at runtime to
resolve locales with equivalent ids, without duplicating resources or creating
special rules such as: if (code.equals("he") || code.equals("iw"))
(don't forget that Java also has run-time resources with no files)...
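
The special-case rule mentioned above can be generalized into a small alias table, so that a lookup tries both the requested code and its equivalent. This is a Python sketch rather than Java, and everything in it is hypothetical: the names `ALIASES`, `candidate_codes` and `find_bundle`, and the `Base_code` naming of bundles, are assumptions made for illustration and do not describe how Java's resource loading works.

```python
# Old and new two-letter codes treated as equivalent, in both directions.
ALIASES = {"he": "iw", "iw": "he", "id": "in", "in": "id", "yi": "ji", "ji": "yi"}

def candidate_codes(code):
    """Return the requested code first, then its alias if it has one."""
    alias = ALIASES.get(code)
    return [code] if alias is None else [code, alias]

def find_bundle(base, code, available):
    """Return the first existing bundle name for the code or its alias."""
    for candidate in candidate_codes(code):
        name = base + "_" + candidate
        if name in available:
            return name
    return None
```

With such a table, a bundle shipped as `Messages_iw` would be found for a request for "he" and vice versa, without duplicating the resource.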




Re: ISO 639 "duplicate" codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)

2003-07-12 Thread Patrick Andries


On Saturday, 12 July, at 6:51, Doug Ewell <[EMAIL PROTECTED]> wrote:

> The codes "iw" for Hebrew and "in" for Indonesian were deprecated
> FOURTEEN YEARS AGO.  It is not accurate or fair to refer to them as
> "duplicates" of "he" and "id".  The Registration Authority deprecates
> such codes, rather than deleting them, for backward compatibility with
> any data that might contain the old codes.

Just out of curiosity, why was « iw » deprecated? It seems perfectly fine to
me.
And why was « he » chosen (Herero, Hemba, Hellenic Greek)?

P.A.





RE: ISO 639 "duplicate" codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)

2003-07-12 Thread Jony Rosenne
What has "iw" to do with Hebrew?

I wasn't involved with the change, but I'm glad it was done. Java and other
systems probably still use it because they never bothered to check the
latest version of 639. I know for certain that this was the case with one of
the major computer vendors.

Jony





Re: ISO 639 "duplicate" codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)

2003-07-12 Thread Patrick Andries

Michael Everson <[EMAIL PROTECTED]> wrote:

> At 08:11 -0400 2003-07-12, Patrick Andries wrote:
>
> > Just out of curiosity, why was « iw » deprecated ? Seems perfectly fine to
> > me. And why was « he » chosen (Herero, Hemba, Hellenic Greek) ?
>
> Iwrit (iw), being a German transliteration of the name of the Hebrew
> language, and Jiddisch (ji) were both thought (by someone) to be less
> suitable than the English-based "he" and "yi" which replaced them.

This is also what I concluded, but  «iv» for ivrit could have pleased those
who thought the transliteration must be English-based (what a strange
idea!).

P. A.






Re: ISO 639 "duplicate" codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)

2003-07-12 Thread Mark Davis
We did that deliberately. Faced with a situation where a registration
authority changes IDs on a whim -- with no regard to the issues of
stability in software and data -- the best policy is to always use the
old one, and map any new locales to the old one. That way when you
exchange IDs between old and new systems, it all continues to work.
(We did in fact know of the latest version of the standard at the
time.)

(In ICU, we did add a more general-purpose aliasing mechanism, both
for resource bundles and parts thereof.)

Mark
__
http://www.macchiato.com
►  “Eppur si muove” ◄
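
The policy described in this message, always using the old identifier and mapping any newer one onto it, can be sketched as a one-way canonicalization. This is a Python illustration only; the table and the name `stored_id` are hypothetical and are not taken from Java or ICU.

```python
# Newer ISO 639 codes mapped onto the older ones they replaced.
NEW_TO_OLD = {"he": "iw", "id": "in", "yi": "ji"}

def stored_id(code):
    """Return the identifier to store or exchange: always the old form."""
    return NEW_TO_OLD.get(code, code)
```

Because the mapping is idempotent, an ID that passes through an old system and a new one ends up the same either way, which is the stability property the message argues for.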





Re: ISO 639 "duplicate" codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)

2003-07-12 Thread Philippe Verdy
On Saturday, July 12, 2003 4:17 PM, Jony Rosenne <[EMAIL PROTECTED]> wrote:

> What has "iw" to do with Hebrew?
> 
> I wasn't involved with the change, but I'm glad it was done. Java and
> other systems probably still use it because they never bothered to
> check the latest version of 639. I know for certain that this was the
> case with one of the major computer vendors.

In the case of Java, I don't think so. Sun has certainly kept the
language code simply to avoid breaking existing Hebrew
localizations of software written in Java, probably while waiting for a
better way to locate locales than the fixed "locales path resolution
algorithm" which has been part of its core classes since the beginning.

As long as the Java core classes do not use a locale resolver that allows
tuning the resolution algorithm used to load resource bundles, while
also maintaining compatibility with existing software that
assumes that Hebrew resources are loaded with the "iw" language code,
Sun will not change this code.

In IBM ICU4J, there is such an extended resolver, but Sun takes a
long time to approve such proposals, which must first be accepted,
documented, balloted and voted on in its JCP program. Of course
Java already includes some parts of ICU, but other things in
ICU4J are now difficult to integrate in Java, simply because IBM
forgot to modularize ICU so that it can be integrated slowly.
Accepting ICU4J as part of the core is a big decision,
because ICU4J is quite large, and there are certainly Java developers
who would not accept having 1 additional MB of data and
classes loaded in each JVM (particularly because the integration
of ICU would affect a lot of core classes of the Java2 platform,
which is now also used for small devices).

For example, it is impossible to integrate the ICU's Normalizer
class in Java without also importing the UChar class and all its
related services for UString, such as transliterators, and
advanced features such as the UCA tailoring rules run-time
compiler. Some ICU open-source contributors, as well as its users, seem
to think now that the modularization of ICU is an important but
complex project.

-- 
Philippe.
Spam not tolerated: any unsolicited message will be
reported to your Internet service providers.




Re: ISO 639 "duplicate" codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)

2003-07-13 Thread Mark Davis
...
> Of course
> Java already includes some parts of ICU, but other things in
> ICU4J are now difficult to integrate in Java, simply because IBM
> forgot to modularize ICU so that it can be integrated slowly.
> Accepting ICU4J as part of the core is a big decision,
> because ICU4J is quite large, and there are certainly Java developers
> who would not accept having 1 additional MB of data and
> classes loaded in each JVM (particularly because the integration
> of ICU would affect a lot of core classes of the Java2 platform,
> which is now also used for small devices).
...
> For example, it is impossible to integrate the ICU's Normalizer
> class in Java without also importing the UChar class and all its
> related services for UString, such as transliterators, and
...

You are very misinformed about ICU4J.

Mark
__
http://www.macchiato.com
►  “Eppur si muove” ◄





Re: ISO 639 "duplicate" codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)

2003-07-14 Thread Philippe Verdy
On Monday, July 14, 2003 5:34 AM, Mark Davis <[EMAIL PROTECTED]> wrote:

> ...
> > Of course
> > Java already includes some parts of ICU, but other things in
> > ICU4J are now difficult to integrate in Java, simply because IBM
> > forgot to modularize ICU so that it can be integrated slowly.
> > Accepting ICU4J as part of the core is a big decision,
> > because ICU4J is quite large, and there are certainly Java developers
> > who would not accept having 1 additional MB of data and
> > classes loaded in each JVM (particularly because the integration
> > of ICU would affect a lot of core classes of the Java2 platform,
> > which is now also used for small devices).
> ...
> > For example, it is impossible to integrate the ICU's Normalizer
> > class in Java without also importing the UChar class and all its
> > related services for UString, such as transliterators, and
> ...
> 
> You are very misinformed about ICU4J.

I have tried several times to do it. It does not work: you can
indeed remove some tables you don't need, but trying
to extract just the normalizer is a real nightmare. I tried it
in the past and abandoned it as too tricky to maintain, and I
retried it recently (one month ago, from the CVS source) and
this was even worse than the first time.

I know that there was a recent announcement (less than 1
month ago) about its modularization, but it's true that I did not
check the new "modularized" sources. So I still use
ICU4J only when I can accept the whole package,
as maintaining a stripped-down customization is too tricky.

But maybe this has changed; I just updated my ICU sources
from CVS. I'll recheck to see whether an "ICU Light" version can be
created (which would keep only the core features, without the
support for tailoring rules compiled at run-time).

-- 
Philippe.
Spam not tolerated: any unsolicited message will be
reported to your Internet service providers.




Re: ISO 639 "duplicate" codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)

2003-07-14 Thread Mark Davis
First, you should check again, since a significant amount of work was
done in modularization in 2.6.

Second, the phrase "IBM forgot to modularize ICU" is misleading, at
the least. Unlike some people, who appear to have unbounded time and
energy for, say, writing emails, we have to carefully pick and choose
where we spend our time. Whether very fine-grained modularization is
important depends a great deal on the client's requirements, and must
be traded off against the many other things we could be doing with our
time.

Third, ICU4J is a source product. Saying that it is "impossible to
integrate the ICU's Normalize..." is also misleading, since one can
clearly modify source to remove dependencies on code one doesn't want
to include, if it is not core to the functionality. (Of course, the
amount of effort required may vary.) And transliterators are
not, in any event, required for Normalization.

Mark
__
http://www.macchiato.com
►  “Eppur si muove” ◄
