Re: Dutch IJ, again

2003-05-27 Thread Anto'nio Martins-Tuva'lkin
On 2003.05.25, 00:00, Philippe Verdy <[EMAIL PROTECTED]> wrote:

> even if the Dutch language considers it as a single letter, in a
> way similar to the Spanish "ch"

I see one major difference: When you apply extra wide inter-char
distance, you (should) get, f.i.:

K  o  r  t  r  ij  k and not K  o  r  t  r  i  j  k

but E  l  c  h  e and not E  l  ch  e

This is common practice in both spanish and dutch typography, ISTK.

I was told in this forum that the surest way to keep this working in
Unicode texts is to use "i<WJ>j" for Dutch and plain "ij" for other
languages.

-- 
António MARTINS-Tuválkin
<[EMAIL PROTECTED]>
R. Laureano de Oliveira, 64 r/c esq.
PT-1885-050 MOSCAVIDE (LRS)
+351 917 511 459
http://www.tuvalkin.web.pt/bandeira/
http://pagina.de/bandeiras/

Não me invejo de quem tem
carros, parelhas e montes
só me invejo de quem bebe
a água em todas as fontes




Re: Dutch IJ, again

2003-05-27 Thread Mark Davis
Well, I don't know who told you, but WORD JOINER only affects
linebreak behavior, not intercharacter spacing.

Mark
__
http://www.macchiato.com
►  “Eppur si muove” ◄





Re: Dutch IJ, again

2003-05-27 Thread Philippe Verdy
From: "Mark Davis" <[EMAIL PROTECTED]>
> From: "Anto'nio Martins-Tuva'lkin" <[EMAIL PROTECTED]>
> > On 2003.05.25, 00:00, Philippe Verdy <[EMAIL PROTECTED]> wrote:
> > > even if the Dutch language considers it as a single letter, in a
> > > way similar to the Spanish "ch"
> >
> > I see one major difference: When you apply extra wide inter-char
> > distance, you (should) get, f.i.:
> > K  o  r  t  r  ij  k and not K  o  r  t  r  i  j  k
> > but E  l  c  h  e and not E  l  ch  e
> > This is common practice in both spanish and dutch typography, ISTK.
> > I was told in this forum that the surest way to keep this working in
> > Unicode texts is to use "i<WJ>j" for Dutch and plain "ij" for other
> > languages.
> 
> Well, I don't know who told you, but WORD JOINER only affects
> linebreak behavior, not intercharacter spacing.

I think he meant <ZWJ> (the zero-width joiner), used as markup to create a ligated 
variant of a pair of characters in some languages that offer two very distinct forms 
(I am thinking of Brahmic scripts such as Devanagari)...
However, it seems that such a control character is only needed when it creates 
significantly different glyph variants that actually have a distinct interpretation and 
semantics in the corresponding language.




Re: Dutch IJ, again

2003-05-27 Thread Philippe Verdy
From: "Anto'nio Martins-Tuva'lkin" <[EMAIL PROTECTED]>
> On 2003.05.25, 00:00, Philippe Verdy <[EMAIL PROTECTED]> wrote:
> > even if the Dutch language considers it as a single letter, in a
> > way similar to the Spanish "ch"
> 
> I see one major difference: When you apply extra wide inter-char
> distance, you (should) get, f.i.:
> K  o  r  t  r  ij  k and not K  o  r  t  r  i  j  k
> but E  l  c  h  e and not E  l  ch  e
> This is common practice in both spanish and dutch typography, ISTK.
> I was told in this forum that the surest way to keep this working in
> Unicode texts is to use "i<WJ>j" for Dutch and plain "ij" for other
> languages.

My opinion about this is not related to the use or non-use of joiner and disjoiner 
controls.

I think it goes to the locale definition of breakers (I mean the set of breakers for 
sentences, lines, words, hyphenation):

Shouldn't that go to the definition of locale-specific ***character (or 
character-cluster) breakers***, going beyond what Unicode can provide in a single, 
unified character model that just tries to represent international text independently 
of the language?

After all, Unicode mostly defines only the abstract characters needed to 
encode a given script, outside of any typographical considerations of fonts and 
style effects, but does not really work on the representation of locale-specific needs 
for specific typographical uses such as line justification...

Once again, Unicode should not attempt to be a markup language. It only represents 
text as a linear stream of abstract characters encoded in strings that can be 
transmitted. Unicode is not specifying the typographic needs. This goes to other 
systems such as HTML, SGML, or XSLT and CSS, plus other internationalization standards 
such as transliteration rules, and domain specific conventions, or even the art of 
text translation...

Regarding your request to handle ij specially in Dutch, nothing forbids a locale-aware 
rendering application from remapping the i+j pair to a single ij character before rendering 
it, if the text is labelled as Dutch...

So with a few locale-specific character-cluster breaking rules you could get:
K  o  r  t  r  ij  k     and not   K  o  r  t  r  i  j  k
B  i  j  e  c  t  i  e   and not   B  ij  e  c  t  i  e
(simply because i+j is a single combined Dutch ij character only if it's not followed 
by a vowel)

For the same reason, a French text would render with strict typography:
B  oe  u  f   and not   B  o  e  u  f
(in this case it would render the oe ligature)

Such an approach is still much less complicated than what is actually needed for Brahmic 
scripts, and even worse for Thai! And it could handle the deficiencies of some 
conversions to legacy character sets, for example restoring the final form of a Greek 
sigma when appropriate.

So the only good question to ask is whether we can label the text with its language, 
using some markup system, or at least using the Unicode language tags needed as a 
possible interface for font renderers that cannot interpret a markup system...

I would not be shocked to see the ligated or combined forms not rendered in a text 
simply because the text is marked with the wrong language, or because such 
markup is simply not available. This exception is similar to the common approach 
of rendering the text as best we can with the tools we have, by using 
canonical or compatibility equivalences.

But I see nothing in Unicode that would require the text to be encoded only with the 
Unicode-preferred character merely because Unicode recommends it, when in practice 
other standards exist that mandate input methods or keyboards on which such composition 
is widely impractical. The strict typographic rules cannot be applied without some 
smart algorithm, but the reader will always make the correct interpretation of the text 
(it is the interpretation of text that Unicode standardizes, not its rendering).

-- Philippe.




Re: Dutch IJ, again

2003-05-27 Thread Kenneth Whistler
Philippe Verdy continued:

> From: "Mark Davis" <[EMAIL PROTECTED]>
> > From: "Anto'nio Martins-Tuva'lkin" <[EMAIL PROTECTED]>
> > > On 2003.05.25, 00:00, Philippe Verdy <[EMAIL PROTECTED]> wrote:
> > > > even if the Dutch language considers it as a single letter, in a
> > > > way similar to the Spanish "ch"
> > >
> > > I see one major difference: When you apply extra wide inter-char
> > > distance, you (should) get, f.i.:
> > > K  o  r  t  r  ij  k and not K  o  r  t  r  i  j  k
> > > but E  l  c  h  e and not E  l  ch  e
> > > This is common practice in both spanish and dutch typography, ISTK.
> > > I was told in this forum that the surest way to keep this working in
> > > Unicode texts is to use "i<WJ>j" for Dutch and plain "ij" for other
> > > languages.
> > 
> > Well, I don't know who told you, but WORD JOINER only affects
> > linebreak behavior, not intercharacter spacing.
> 
> I think he meant <ZWJ> (the zero-width joiner) used as markup to 
> create a ligated variant of a pair of characters in some languages 
> that offer two very distinct forms (I think about Brahmic scripts 
> such as Devanagari)...

No, not ZWJ, either.

U+2060 WORD JOINER (WJ) impacts line breaking behavior -- not the
 applicable concept here.
 
U+200D ZERO WIDTH JOINER (ZWJ) impacts cursive connection and/or
 ligation -- not the applicable concept here.
 
U+034F COMBINING GRAPHEME JOINER (CGJ) is the relevant character.
From Unicode 4.0:

  "U+034F COMBINING GRAPHEME JOINER is used to indicate that
   adjacent characters are to be treated as a unit for the
   purposes of language-sensitive collation and searching."
   
That function was deliberately limited by the UTC to the status
of such digraphs for searching and sorting, as that was the only
well-defined requirement for the character.
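
[Editorial sketch] The three joiners distinguished above, together with the U+0133 digraph character, can be looked up in the Unicode Character Database. This is a minimal sketch, assuming Python's standard unicodedata module; the variable names are illustrative and not part of the original message.

```python
import unicodedata

# Code points discussed in this thread: WJ, ZWJ, CGJ and the ij digraph.
CANDIDATES = ["\u2060", "\u200D", "\u034F", "\u0133"]

names = {f"U+{ord(ch):04X}": unicodedata.name(ch) for ch in CANDIDATES}
for code_point, name in names.items():
    print(code_point, name)

# U+0133 has only a compatibility decomposition to <i, j>, so NFKC
# folds it into two letters while NFC leaves it untouched.
nfkc_of_ij = unicodedata.normalize("NFKC", "\u0133")
nfc_of_ij = unicodedata.normalize("NFC", "\u0133")
print(repr(nfkc_of_ij), repr(nfc_of_ij))
```

The NFKC result is one reason a U+0133-based distinction can be lost when data passes through compatibility normalization.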

However, as this thread has hinted, there could, in principle,
be multilingual contexts where there would be other legitimate
reasons for treating a digraphic ij (as for Dutch) distinct from
a non-digraphic ij sequence (as for Spanish). That is the same
kind of argument which led to encoding of U+034F for collation.

One can imagine an implementation of automatic letterspacing,
such that a sequence marked explicitly as a digraph would not
expand, but that one not so marked would expand. But such
distinctions would only need to be made in the rather dubious
conditions of: A) Multilingual text that is also B) marked
explicitly for language and that also C) requires different
rules for letterspacing language-by-language. Under such
circumstances, you could indicate the differences for <i,j>
either by making use of the U+0133 ij digraph character for
one and <i,j> for the other, or you could indicate the
differences by <i,CGJ,j> versus <i,j>. The first approach
would likely work more easily with existing software, but
results in a problematical representation of Dutch data.
The second is a more generic Unicode approach, but would
likely be ignored by most software.

In any case, the much more likely situation would be software
that did letterspacing for fine typography based just on
Dutch rules. It would not *need* any markup of <i,j>
sequences, since it would be looking for and special-casing
the sequences, anyway.

--Ken
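
[Editorial sketch] The "much more likely situation" described above, software that letterspaces by Dutch rules and special-cases the sequence itself, might look roughly like this. It is a minimal sketch under stated assumptions: the function name `letterspace`, the language codes, and the tiny exception set `SPLIT_IJ_WORDS` are all hypothetical, and a real implementation would need a proper lexicon or hyphenation data rather than a whole-word lookup.

```python
import re

# Hypothetical exception list: words where i and j belong to different
# syllables and must not be kept together (examples from this thread).
SPLIT_IJ_WORDS = {"bijou", "bijectie"}

def letterspace(word: str, lang: str, gap: str = "  ") -> str:
    """Insert `gap` between letters, keeping the Dutch ij digraph together."""
    if lang == "nl" and word.lower() not in SPLIT_IJ_WORDS:
        # Treat "ij" / "IJ" as one unit, every other character on its own.
        units = re.findall(r"ij|IJ|.", word)
    else:
        units = list(word)
    return gap.join(units)

print(letterspace("Kortrijk", "nl"))  # K  o  r  t  r  ij  k
print(letterspace("Elche", "es"))     # E  l  c  h  e
print(letterspace("bijectie", "nl"))  # b  i  j  e  c  t  i  e
```

No markup of the sequence is needed here; the only inputs are the language label and the exception list.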





Re: Dutch IJ, again

2003-05-28 Thread Pim Blokland
Philippe Verdy wrote:

> i+j is a single combined Dutch ij character only if it's not
> followed by a vowel

This is not true; where did you get that idea?
It almost always IS a diphthong (cf. words like bijen, vrijaf, zijig)
except where the i and the j happen to be in separate syllables
(bijou, bijectie).

Pim Blokland
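
[Editorial sketch] The objection above can be checked mechanically. This is a minimal sketch, assuming a naive reading of the proposed rule ("ij is one letter unless a vowel follows"); the function name and the table of expected classifications (built from the examples in this message, plus Kortrijk from earlier in the thread) are illustrative only.

```python
VOWELS = set("aeiouy")

def ij_is_unit_by_vowel_rule(word: str) -> bool:
    """Proposed rule: ij counts as one letter unless a vowel follows it."""
    pos = word.find("ij")
    if pos == -1:
        return False
    following = word[pos + 2 : pos + 3]  # empty string at end of word
    return following not in VOWELS

# True = ij is a single unit, False = i and j are in separate syllables.
EXPECTED = {
    "kortrijk": True,
    "bijen": True,
    "vrijaf": True,
    "zijig": True,
    "bijou": False,
    "bijectie": False,
}

mismatches = [
    word for word, expected in EXPECTED.items()
    if ij_is_unit_by_vowel_rule(word) != expected
]
print(mismatches)  # ['bijen', 'vrijaf', 'zijig']
```

The rule gets the two split words right only because it splits every ij before a vowel, which is why it fails on the ordinary cases.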




Re: Dutch IJ, again

2003-05-28 Thread Philippe Verdy
From: "Pim Blokland" <[EMAIL PROTECTED]>
To: "Unicode mailing list" <[EMAIL PROTECTED]>
Sent: Wednesday, May 28, 2003 11:45 AM
Subject: Re: Dutch IJ, again


> Philippe Verdy wrote:
> 
> > i+j is a single combined Dutch ij character only if it's not
> > followed by a vowel
> 
> This is not true; where did you get that idea?
> It almost always IS a diphthong (cf. words like bijen, vrijaf, zijig)
> except where the i and the j happen to be in separate syllables
> (bijou, bijectie).

Do you mean that there is no possible inference rule? I did not want to be exhaustive 
there, because your sample words where ij is a diphthong can effectively be exceptions 
(or the two other words may be exceptions to the "normal" Dutch rules). I'm not enough of a 
Dutch expert to be affirmative; I just wanted to give an idea with an example of such 
a possible rule.

Well, it may appear that in general "ij" is always a single diphthong, unless there's 
a hyphenation candidate between the two syllables. In that case the problem becomes as 
complex as determining syllable breaks for hyphenation.

For now there does not seem to exist a clear definition of what could be a good 
localized breaker for grapheme clusters, as it also implies an analysis of syllables 
in Dutch or other languages (for now, only abjads and Asian scripts seem to have a 
normalized algorithm for the determination of such grapheme clusters, and there 
remains a lot of work to do with alphabetized languages, which seem to use letters in 
a way much more complex than expected).

Still I'm not convinced that the explicit "ij" diphthong is really different from an i 
+ j pair for Dutch, which uses a lexical-based approach (so the combined character 
"ij" may just be there only for compatibility with some legacy usages, as most 
rendering of Dutch text does not allow a reader to make a difference between a 
combined ij cluster and separated i+j letters; the separation does not come from 
letters themselves but from the lexical knowledge of the reader).

The special typographic case of inter-letter spacing for justification is not dramatic 
(because other typographic rules also require that no excessive spacing be used). 
The exception is artificially expanded text, where the 
typographic effect is used to emphasize a title or mark; this comes very close 
to logographic design, where form rather than semantics is considered more 
important (is such usage still text? Shouldn't it be excluded from Unicode 
standardization, since it requires markup outside the scope of Unicode, and be 
handled as a form of typographic *art*?).

It would be interesting to analyze the way UCA behaves for the collation of Dutch 
text...



Re: Dutch IJ, again

2003-05-29 Thread Peter_Constable

> I think he meant <ZWJ> (the zero-width joiner) used as markup to
> create a ligated variant of a pair of characters

Whatever happened to CGJ?



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485






Re: Dutch IJ, again

2003-05-29 Thread Pim Blokland
Peter Constable wrote:

> Whatever happened to CGJ?

Too new, probably.
People (and software applications) aren't used to this one yet.

Pim Blokland



RE: Dutch IJ, again

2003-05-29 Thread Kent Karlsson


Kenneth Whistler quoted and wrote:
> > > From: "Anto'nio Martins-Tuva'lkin" <[EMAIL PROTECTED]>
> > > > On 2003.05.25, 00:00, Philippe Verdy <[EMAIL PROTECTED]> wrote:
> > > > > even if the Dutch language considers it as a single letter, in a
> > > > > way similar to the Spanish "ch"
> > > >
> > > > I see one major difference: When you apply extra wide inter-char
> > > > distance, you (should) get, f.i.:
> > > > K  o  r  t  r  ij  k and not K  o  r  t  r  i  j  k
> > > > but E  l  c  h  e and not E  l  ch  e
> > > > This is common practice in both spanish and dutch typography, ISTK.
> > > > I was told in this forum that the surest way to keep this working in
> > > > Unicode texts is to use "i<WJ>j" for Dutch and plain "ij" for other
> > > > languages.
...
> One can imagine an implementation of automatic letterspacing,
> such that a sequence marked explicitly as a digraph would not
> expand, but that one not so marked would expand. But such
> distinctions would only need to be made in the rather dubious
> conditions of: A) Multilingual text that is also B) marked
> explicitly for language and that also C) requires different
> rules for letterspacing language-by-language. Under such
> circumstances, you could indicate the differences for <i,j>
> either by making use of the U+0133 ij digraph character for
> one and <i,j> for the other, or you could indicate the
> differences by <i,CGJ,j> versus <i,j>. The first approach
> would likely work more easily with existing software, but
> results in a problematical representation of Dutch data.
> The second is a more generic Unicode approach, but would
> likely be ignored by most software.

As implied by the quote before Ken's reply, CGJ should NOT affect
letterspacing.

The ij ligature character appears to have a status in-between the
"loathsome" Latin ligature characters in FBxx (I like such ligatures,
and more of them, but not those ligature *characters*, the ligatures
should be generated by the font) as well as the dz/lj/etc. digraphs on
one hand, and the orthographic ligatures, like the ae ligature and oe
ligature, on the other hand.  So I would conclude that **when a
clear distinction need be made** between "Dutch ij" and "other ij",
use the ij ligature character for the "Dutch ij" (and have it mapped
on the keyboard for that purpose), for letterspacing, linebreaking,
and vertical writing.  Though in most cases it should be ok to write
just ordinary ij also for the "Dutch ij".  Complexities like CGJ and ZWJ
seem to be overkill or misapplied here.

/kent k