
Re: U+0140

2004-05-06 Thread António Martins-Tuválkin

On 2004.04.15, 19:47, Kenneth Whistler <[EMAIL PROTECTED]> wrote:

> 0140;LATIN SMALL LETTER L WITH MIDDLE DOT;Ll;0;L;<compat> 006C 00B7;...
<...>
> The character *was* in ISO 6937 for Catalan.

And mistakenly so.

> Noting the Catalan association in the Unicode names list is
> different from any recommendation that U+0140 is the preferred
> character for the representation of l followed by a middle dot in
> Catalan text.

But it is surely an excellent way to contribute to the (false) idea
that Unicode doesn't serve the needs of minority languages. :-( Is it
so difficult to replace or remove that misleading indication?...
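
For readers who want to check the data line quoted above, a minimal Python sketch follows. It assumes only the standard `unicodedata` module; the sample word is an illustration and is not taken from the thread.

```python
import unicodedata

# U+0140 carries a compatibility (not canonical) decomposition to l + MIDDLE DOT.
decomp = unicodedata.decomposition("\u0140")
print(decomp)

word = "co\u0140legi"  # illustrative spelling that uses U+0140

# Canonical normalization (NFC) leaves U+0140 alone ...
nfc = unicodedata.normalize("NFC", word)
# ... while compatibility normalization (NFKC) replaces it with l + U+00B7.
nfkc = unicodedata.normalize("NFKC", word)

print(nfc == word)
print(nfkc == "col\u00b7legi")
```

Because the mapping is only a compatibility one, software that normalizes to NFC will keep the two spellings distinct.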

--.
António MARTINS-Tuválkin |  ()|
<[EMAIL PROTECTED]>||
PT-1XXX-XXX LISBOA   Não me invejo de quem tem|
+351 934 821 700 carros, parelhas e montes|
http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe|
http://pagina.de/bandeiras/  a água em todas as fontes|

Re: U+0140 Catalan middle-dot

2004-05-06 Thread António Martins-Tuválkin

On 2004.04.16, 09:40, Peter Kirk <[EMAIL PROTECTED]> wrote:

> can you describe to me EXACTLY how the shape and behaviour of the
> Catalan middle dot differs from the behaviour of U+2027 <...> This
> strongly suggests that U+2027 is the appropriate character for
> Catalan.

Apparently U+2027 is indeed suitable for spelling Catalan, provided
that it is not ignored by search and matching routines -- as, for
instance, a soft hyphen is.
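
The concern about ignorable characters can be illustrated with a short Python sketch. The `strip_format_chars` helper is hypothetical: it stands in for a search routine that drops format characters (general category Cf) before matching, which is an assumption about how such a routine might behave, not a description of any particular product.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"
HYPHENATION_POINT = "\u2027"
MIDDLE_DOT = "\u00b7"

def strip_format_chars(text: str) -> str:
    """Hypothetical search pre-processing: drop format (Cf) characters."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

categories = {
    name: unicodedata.category(ch)
    for name, ch in [("soft hyphen", SOFT_HYPHEN),
                     ("hyphenation point", HYPHENATION_POINT),
                     ("middle dot", MIDDLE_DOT)]
}
print(categories)

# The soft hyphen vanishes; U+2027 survives as ordinary punctuation.
with_shy = strip_format_chars("col" + SOFT_HYPHEN + "legi")
with_hp = strip_format_chars("col" + HYPHENATION_POINT + "legi")
print(with_shy, with_hp)
```

Under that assumption, a word spelled with U+2027 would still be distinguishable from "collegi" in a search, whereas one spelled with a soft hyphen would not.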


Re: U+0140

2004-04-20 Thread Elaine Keown

  Elaine Keown
  Tucson

Dear Asmus Freytag and Ken Whistler:

I would be pleased if you at Unicode 
would choose to further describe your 
existing middle dot collection.  

I have *no* interest in _more_ middle dots,
enough is enough.

Eventually I hope there *will* be helpful 
notes on which middle dot to use in 
Ancient Hebrew and/or Samaritan Hebrew 
and/or Dead Sea Scrolls Hebrew.

Elaine Keown





Re: U+0140

2004-04-20 Thread Antoine Leca

On Saturday, April 17, 2004 10:28 PM TU+1, António Martins-Tuválkin wrote:
>> As I wrote earlier, if you know the text under inspection is
>> Catalan, a very simple regular expression will deal with that. Any
>> half-decent Catalan word processor does it already, by the way.
>
> What about the odd Catalan phrase within a text in Guarani or
> Cherokee?

Then you do not know that the text under inspection is Catalan; the "if" is not
satisfied, so you are not supposed to act accordingly. That is, nobody will
blame you because a double click on col·legi does not select the whole word.
Any reader can test their own word processor: please double-click the
Catalan word above and see whether it is recognized as such, even though it is
surrounded by bad English instead of Guarani!
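
As an illustration of the "very simple regular expression" mentioned in the quoted text, here is one possible Python sketch. The pattern is an assumption about what such an expression could look like, not the expression any actual word processor uses, and the `words` helper is hypothetical.

```python
import re

# A "word" is a run of letters; U+00B7 is allowed inside it only when it
# sits between two l's (either case), as in Catalan "col·legi".
LETTERS = r"[^\W\d_]+"
CATALAN_WORD = re.compile(
    LETTERS + r"(?:(?<=[lL])\u00b7(?=[lL])" + LETTERS + r")*"
)

def words(text: str) -> list[str]:
    """Split text into words, keeping an l·l sequence inside one word."""
    return CATALAN_WORD.findall(text)

sample = "El col\u00b7legi, a\u00b7b"
print(words(sample))
```

The dot in "a·b" is not between two l's, so it still acts as a separator there; only the l·l context is treated as word-internal.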

> Unicode, do not forget, supposedly brings correctness to
> multilingual text...

And then?
Are you trying to say that selecting a word in multilingual text should always
do the "right thing"? You are merely dreaming, I believe, and you know it
perfectly well, having posted less than 2 minutes ago about the case of
apostrophes, which is just about impossible to sort out in the average
multilingual text. Furthermore, what "the right thing" is varies from person
to person, so achieving perfection here is a mere dream.

Or are you trying to make the point that inventing a new code point for · in
Catalan would bring any added correctness to multilingual texts?

It is certain that the compatibility encoding of U+0140 is not very welcome
in my eyes, since:
 - it is almost unused, but in case it is used, informaticians like me
have to check for it: so it is just a waste of my time, I would say :-(
 - someone who reads TUS and does not know Spanish usage in this respect might
think that col·legi should be written coŀlegi, "co\u0140legi", because the
former is not listed as a letter, and only the latter is annotated as
"Catalan", without mentioning the "right thing to do"
 - the only advantage I am able to see, namely that typographers will
design the mid dot raised in U+0140 relative to the position it has in
U+00B7, is not exploited in practice; we even see a lot of fonts where the
dot in U+0140 is not balanced between the l's, which clearly shows that the
majority of typographers have no idea about the use of this character, and
they probably merely build it as a compound of U+006C and U+00B7... Others use
a reduced size for the dot in U+0140 (which is unpleasing to my eyes). Only
a few fonts provide U+0140 with a reduced width for the dot, which might
be considered good typography.
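
The "have to check for it" chore in the first point could look like the following Python sketch. The `fold_l_middle_dot` name is hypothetical, and the choice to fold only these two characters, rather than apply full NFKC normalization, is an assumption made so that nothing else in the text is altered.

```python
# Map the two compatibility characters to their letter + MIDDLE DOT spellings.
_L_DOT_FOLD = {
    0x0140: "l\u00b7",  # LATIN SMALL LETTER L WITH MIDDLE DOT
    0x013F: "L\u00b7",  # LATIN CAPITAL LETTER L WITH MIDDLE DOT
}

def fold_l_middle_dot(text: str) -> str:
    """Hypothetical helper: rewrite U+0140/U+013F as l/L followed by U+00B7."""
    return text.translate(_L_DOT_FOLD)

folded_lower = fold_l_middle_dot("co\u0140legi")
folded_upper = fold_l_middle_dot("CO\u013fLEGI")
print(folded_lower, folded_upper)
```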

A further note about typography: I have compared, in some (widely available)
fonts, the layout of ŀl versus l·l and also the upper dot of the colon. I
found that almost nobody uses the upper dot of the colon. One of the few I
found, namely Linotype Palatino (I cite it since I generally consider it a
nice design), does use the upper dot of the colon for ŀ. And the result is
really ugly, because the dot is way too high (about 65% of l-height), thanks
to the modern habit of higher x-heights...

Antoine

Re: U+0140

2004-04-19 Thread Asmus Freytag

At 03:49 PM 4/19/2004, Kenneth Whistler wrote:
> The Unicode Standard is not prescriptive about rendering, beyond the
> basics required to simply ensure correct mapping of textual content
> into streams of characters. If one font vendor wants to have a raised
> glyph for the MIDDLE DOT and another wants to have a lowered glyph for
> the same character, it is not the Unicode Standard's business to put
> the two vendors in a room until one gives up and admits the other one
> is correct.

I'm sorry but that part of your answer is a bit disingenuous in the context
of the issue most recently discussed on this thread. That involved the case
of two characters 00B7 and 0387, which have been post-hoc unified via
canonical equivalence. We are discovering that the vast majority of
*multi-script* fonts make a distinction in glyph based on the character
code (ignoring the canonical equivalence). This therefore is not the simple
case of a Greek font using a higher dot for 00B7 as an ano teleia and a
Latin font using a lower one for the mid dot.

We clearly *do* see a variation of treatments of 00B7 across fonts, but in 
all cases that I've seen, these are intended as variations of the middle 
dot, not variations to accommodate the use of this character as ano teleia.

In other words, we have an issue that the equivalence of identity of these 
two characters asserted by Unicode is fundamentally not respected by the 
implementers. And apparently it's not the case of a small minority. I think 
that kind of situation *is* a problem for the standard.
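
The canonical equivalence at issue can be observed directly. This Python sketch assumes only the standard `unicodedata` module. It says nothing about glyphs, which is exactly the gap being described: the two code points normalize to one, while fonts may still draw them differently.

```python
import unicodedata

ANO_TELEIA = "\u0387"   # GREEK ANO TELEIA
MIDDLE_DOT = "\u00b7"   # MIDDLE DOT

# U+0387 has a singleton canonical decomposition to U+00B7, so every
# normalization form replaces it with MIDDLE DOT.
forms = {form: unicodedata.normalize(form, ANO_TELEIA)
         for form in ("NFC", "NFD", "NFKC", "NFKD")}

ano_teleia_decomp = unicodedata.decomposition(ANO_TELEIA)
all_fold_to_middle_dot = all(v == MIDDLE_DOT for v in forms.values())
print(ano_teleia_decomp, all_fold_to_middle_dot)
```

One practical consequence is that a font-level distinction between the two code points cannot survive any process that normalizes the text.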

A./

Re: U+0140

2004-04-19 Thread John Hudson

Peter Constable wrote:

>> And if... someone finds a well documented script
>> in which a true middle dot and an x-height dot are used contrastively,
>
> That would be a somewhat surprising and not-to-be-recommended design for
> a writing system. Not to be completely ruled out, though. But we can
> probably wait to cross that encoding bridge when we come to it.

We already have contrasted use of a baseline dot (period or full stop) and a mid-dot (word
separator or stylistic hyphen), so why would you be surprised by contrasted use of mid-dot 
and x-height dot? Vertical alignment is clearly sometimes a semantic feature. I've seen 
plenty of business cards in which the mid-dot is used as a stylistic division between 
parts of a telephone number instead of spaces, periods or hyphens. I don't like the style, 
but people do it. Presumably some Greek people do it also, in which case they are 
contrasting the mid-dot and the ano teleia.

John Hudson

--

Tiro Typeworks        www.tiro.com
Vancouver, BC        [EMAIL PROTECTED]
I often play against man, God says, but it is he who wants
  to lose, the idiot, and it is I who want him to win.
And I succeed sometimes
In making him win.
 - Charles Peguy

Re: U+0140

2004-04-19 Thread Kenneth Whistler

Peter Kirk continued this...

> On 19/04/2004 13:03, Kenneth Whistler wrote:
> 
> >... Those other middle dots give
> >people textual representation alternatives now, if they need to make
> >distinctions, and textual rendering alternatives, if they need to make
> >middle dots which display with slightly different heights, sizes, or
> >spacings, depending on the rendering requirements.
> >  
> >
> 
> Ken, does Unicode specify height, size and spacing distinctions between 
> the various middle dots which you listed? 

No.

> If I understand correctly, it 
> certainly doesn't do so exhaustively. 

Correct.

> So in effect what you are 
> suggesting here is that people make and use their own private 
> distinctions between characters which are not defined by Unicode. 

Not at all.

I am suggesting that people who use Unicode characters *will* use them
according to their identity. However, that doesn't mean that identification
of a character neatly solves all issues of their rendering, nor will it
automatically make things neat and tidy when people use characters in
different contexts which may have different rendering concerns.

The Unicode Standard is not prescriptive about rendering, beyond the
basics required to simply ensure correct mapping of textual content
into streams of characters. If one font vendor wants to have a raised
glyph for the MIDDLE DOT and another wants to have a lowered glyph for
the same character, it is not the Unicode Standard's business to put
the two vendors in a room until one gives up and admits the other one
is correct.

> This 
> sounds very like advising people to ignore Unicode character identities
> and properties and do their own thing. Rather strange advice from 
> someone in your position, surely?

I love the way you put positions in people's mouths.

By the way, I challenge you to point to the Unicode character properties
in the Unicode Character Database which define the relative position for
middle dots with respect to x-height of a font, or the spacing of
middle dots, for example.
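
A rough way to see what the character database does record for these dots is to list a few properties. This Python sketch uses the standard `unicodedata` module, which exposes only a subset of the Unicode Character Database; the selection of code points is an assumption made for illustration.

```python
import unicodedata

DOTS = ["\u00b7", "\u0387", "\u2022", "\u2027", "\u2219", "\u22c5"]

def describe(ch: str) -> tuple[str, str, str, str]:
    """Return (code point, name, general category, bidi class)."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch),
            unicodedata.category(ch), unicodedata.bidirectional(ch))

rows = [describe(ch) for ch in DOTS]
for row in rows:
    print(*row, sep="  ")
```

None of the fields printed here describes height relative to the x-height, dot size, or spacing.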

> 
> Surely, in the current situation and if further proliferation of middle 
> dots is considered undesirable, 

It is undesirable, yes.

> users should be advised to presume that 
> distinctions between middle dots are not a plain text matter 

No, they should not. Because the existence of multiple different
middle dots in the standard which are *not* canonical equivalents
of each other makes it a plain text matter.

> and so 
> should be handled by markup, including language selection.

In some cases, yes -- it depends on the effect which is intended,
and the context and application it occurs in.

> 
> And if (as I just suggested on the Hebrew list might be true of some 
> variant Hebrew pointing systems) someone finds a well documented script 
> in which a true middle dot and an x-height dot are used contrastively, 
> the correct approach would be either to accept, reluctantly, that at 
> least one new dot needs to be encoded; or else for Unicode to define 
> clearly which existing character should be used for which dot in this 
> script. 

Or: None of the Above

The users of characters for particular domains bear their own
responsibility to define their usage. It is not up to the Unicode
Consortium to go around defining everyone's spelling rules and
orthographic conventions for them.

If there are things unclear in the standard which are making its
use difficult for people in certain cases, then that is certainly
a concern of the Unicode Technical Committee. And if someone
brings in convincing evidence of the existence of a semantically
significant plain text distinction between two dots that cannot
plausibly be handled by *any* combination of the multitudinous dot
characters already present in the standard, then the UTC might
consider that sufficient justification to encode yet another
middle dot.

Given, however, the fact that there already are so many dot characters,
and given that their rendering often varies by font, the chance of
getting some additional pair of dot distinctions by height on the
line canonized with yet another dot encoding seems unlikely to me.

It is a will-o'-the-wisp to expect any and all multilingual
Unicode text to display "correctly" to any arbitrary n-th degree
of typographical rectitude with any and all Unicode-conformant
fonts. The use of specific fonts with specific designs is
*precisely* to enable plain text (or marked-up text, for that
matter) to be displayed as desired for particular contexts.

The criterion for Unicode plain text is basically *legible*
text. 

> The worst thing that could happen would be for different text 
> providers to make different and incompatible selections among the 
> existing characters, leading to total confusion. But that seems to be 
> the approach which you, Ken, are advocating.

I see. And thank you, Peter, for pointing that error out to me.

Text providers have their own responsibility to ensure that
they are using inte

RE: U+0140

2004-04-19 Thread Peter Constable

> And if... someone finds a well documented script
> in which a true middle dot and an x-height dot are used contrastively,

That would be a somewhat surprising and not-to-be-recommended design for
a writing system. Not to be completely ruled out, though. But we can
probably wait to cross that encoding bridge when we come to it.



Peter
 
Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division

Re: U+0140

2004-04-19 Thread Peter Kirk

On 19/04/2004 13:03, Kenneth Whistler wrote:

> ... Those other middle dots give
> people textual representation alternatives now, if they need to make
> distinctions, and textual rendering alternatives, if they need to make
> middle dots which display with slightly different heights, sizes, or
> spacings, depending on the rendering requirements.
 

Ken, does Unicode specify height, size and spacing distinctions between 
the various middle dots which you listed? If I understand correctly, it 
certainly doesn't do so exhaustively. So in effect what you are 
suggesting here is that people make and use their own private 
distinctions between characters which are not defined by Unicode. This 
sounds very like advising people to ignore Unicode character identities
and properties and do their own thing. Rather strange advice from 
someone in your position, surely?

Surely, in the current situation and if further proliferation of middle 
dots is considered undesirable, users should be advised to presume that 
distinctions between middle dots are not a plain text matter and so 
should be handled by markup, including language selection.

And if (as I just suggested on the Hebrew list might be true of some 
variant Hebrew pointing systems) someone finds a well documented script 
in which a true middle dot and an x-height dot are used contrastively, 
the correct approach would be either to accept, reluctantly, that at 
least one new dot needs to be encoded; or else for Unicode to define 
clearly which existing character should be used for which dot in this 
script. The worst thing that could happen would be for different text 
providers to make different and incompatible selections among the 
existing characters, leading to total confusion. But that seems to be 
the approach which you, Ken, are advocating.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: U+0140

2004-04-19 Thread Kenneth Whistler

John Hudson responded to Michael Everson:

> Michael Everson wrote:
> 
> >> This would make the mid-dot too high. The top dot of the colon usually 
> >> sits toward the top of the x-height; the *mid*-dot should sit lower, 

> > John, I just don't believe you. I don't believe that in all the history 
> > of Greek and Catalan typography this careful hairsplitting has *always* 
> > taken place; certainly in scientific transcription the HALF TRIANGULAR 
> > COLON is just the top dot in the TRIANGULAR COLON, and in Americanist 
> > transcription where the dot-colons are used instead of triangles I would 
> > say the same applies.
> 
> I never contested that the dots of a colon correspond to the triangles of the 
> linguistic 
> long vowel marker. They clearly do. What I contested was that the typographic 
> mid-point 
> (U+00B7) corresponded to the top dot of a colon. It clearly does not. It is called a 
> mid-point because it sits midway up the x-height. It is used in this position for a 
> variety of stylistic purposes, ...

I think we have two typographers here arguing somewhat at cross-purposes.
Clearly the typographic "mid-point" behaves as John has mentioned, and is
designed as such in many fine fonts (examples seen among the exhibits that
Asmus gathered).

But just as clearly, there is a long, long tradition in Americanist
orthographic practice (which is used widely for linguistic orthographies
outside of Native America as well) of using a "raised dot" for an indication
of vocalic (and occasionally consonantal) length. For 100 years, that
raised dot was mechanically generated by, among other means, filing the
lower dot off a colon key on a mechanical typewriter. (I have such a
typewriter sitting on my desk.) Linguists got used to this raised dot
height, coordinated with a colon in design (which then could be used, among
other things to indicate a prolonged length, when two degrees of length
were in question), and that preference made its way into print, at least
for many North American languages, where the raised dot could be printed
at x-height, rather than at midway up the x-height, which would be too
low for most of the linguistic usage.

Enter the electronic age. ASCII had no MIDDLE DOT. It was period (.), colon (:)
or the highway. Early linguistic material on computers made do with those,
because they had no choice. The IBM PC and the Macintosh introduced a
MIDDLE DOT (0xFA [= IBM CDRA SD63 "Middle Dot"] and 0xE1, respectively).
When ISO 8859-1 was defined, it also had a MIDDLE DOT (0xB7). *Everybody*
made use of that MIDDLE DOT for anything that was vaguely in the ballpark --
the typographical mid-point, the linguistic length mark, the mathematical
multiplication operator, the Greek ano teleia, the dictionary hyphenation
point, and, yes, the Catalan middle dot. The fact that each of those usages 
might have extremely fine typographical hairs to split regarding the rendering
was so much horsepucky as far as the character identity was concerned. You
used what you had available to represent your data.
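
The legacy code points mentioned here can be checked against Python's bundled codecs. The sketch assumes that `cp437`, `mac_roman`, and `latin_1` are fair stand-ins for the IBM PC, Macintosh, and ISO 8859-1 character sets named above.

```python
# Byte value of the middle dot in each legacy character set named above.
LEGACY_MIDDLE_DOTS = {
    "cp437": b"\xfa",      # IBM PC
    "mac_roman": b"\xe1",  # Macintosh
    "latin_1": b"\xb7",    # ISO 8859-1
}

decoded = {codec: raw.decode(codec) for codec, raw in LEGACY_MIDDLE_DOTS.items()}

for codec, ch in decoded.items():
    print(f"{codec}: U+{ord(ch):04X}")
```

All three bytes decode to the same Unicode character, whatever the original author meant by the dot.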

The Unicode Standard, for a variety of reasons -- some of which included
compatibility mapping concerns to other standards which had started to
proliferate middle dots -- added a collection of middle dots *besides*
U+00B7, *the* middle dot, to its repertoire. Those other middle dots give
people textual representation alternatives now, if they need to make
distinctions, and textual rendering alternatives, if they need to make
middle dots which display with slightly different heights, sizes, or
spacings, depending on the rendering requirements.

What is clear, however, is that it is utterly impossible to satisfy
everybody regarding middle dots. Typographical purists will always want
plain text to make more distinctions. Text processing requirements will
abhor the splitting of text representation into more and more difficult-to-
distinguish glyph representations without clear semantic differences.
And dot proliferation *always* poses difficulty for establishing
character properties.

Before people bluster on too much further on this thread, it would
be good for everyone to recall that the *reason* why U+00B7 has
problematical properties is that it was inherently ambiguous in
*preexisting* usage (that is, prior to Unicode altogether) as punctuation
versus length mark (and other things as well). This puts it in the
same grabbag of very difficult, ambiguous ASCII characters, such as
"~", "*", and "'" which also acquired conflicting usages during their
reign among the small set of available punctuation and symbols in
ASCII.
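
One present-day trace of this ambiguity is visible in Python itself. The sketch below relies on the behaviour of CPython 3, where U+00B7 is classed as punctuation yet is accepted inside identifiers after the first character; it is offered as an illustration of conflicting roles, not as a statement about the properties in force when this thread was written.

```python
import unicodedata

MIDDLE_DOT = "\u00b7"

category = unicodedata.category(MIDDLE_DOT)               # general category
is_letter = MIDDLE_DOT.isalpha()                          # letter test
in_identifier = ("l" + MIDDLE_DOT + "l").isidentifier()   # identifier test

print(category, is_letter, in_identifier)
```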

History has consequences. The history of a character's encoding also
has consequences for how the Unicode Standard is to be used and
interpreted.

--Ken

Re: U+0140

2004-04-19 Thread Adam Twardoch

From: "John Hudson" <[EMAIL PROTECTED]>
> 'Careful hairsplitting' always takes place when people care about
> typography.

How very true.

On the one hand, there are people who put a cedilla under "a" when typesetting
Polish; on the other hand, there are people who adjust the vertical position
of hyphens when typesetting all-caps. And there is a lot in between. But it is
important to realize that there _always_ were people who adjusted the hyphen
in all-caps settings. Gutenberg's own typesetting was careful hairsplitting.

This is a very typical and essential dilemma, which is one of the reasons
why there is no easy answer to the glyph vs. character question, or more
precisely, why the "character" definition in Unicode is so, well, vague.
Since the decision on what is a "character" and what is "merely" a "glyph
variant" is made somewhat arbitrarily (albeit in a committee process), there
are far too many exceptions to the rule for Unicode to be consistent and
easy-to-use. But since written human language never was consistent and
easy-to-use, I guess it's something very natural and we will all live with
that.

Adam

Re: U+0140

2004-04-18 Thread Mark Davis

>  From Unicode's perspective, the consistent difference in treatment of 00B7
> and 0387 is embarrassing, given the fact of their canonical equivalence.

There are, to be sure, features of Unicode that are "embarrassing", but I don't
think this is one of them. Take another case: even if consistent practice in
Poland is to have the grave accent in Ã at a different angle than what is
practice in France, that does not make it a mistake for us to have encoded both
as  Ã. These sorts of preferences can be taken into account in the tailoring of
fonts to particular practices, and this issue does not require that we let a
thousand middle dots bloom.

And canonical equivalence was the mechanism for saying that two variants of a
character really should never have been encoded separately (but we had to, for
compatibility reasons).

Mark
__
http://www.macchiato.com

- Original Message - 
From: "Asmus Freytag" <[EMAIL PROTECTED]>
To: "Michael Everson" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Sat, 2004 Apr 17 15:32
Subject: Re: U+0140


> At 01:54 PM 4/17/2004, Michael Everson wrote:
> >The samples Asmus sent suggest to me that a school of typographers made a
> >set of bad decisions, even if they were really famous and got paid lots of
> >money and their fonts are widely shipped!
>
> In all charity, Michael, your opinion seems to be mainly your personal
> point of view. I'd love to see any evidence of either mid-dot or ano teleia
> being consistently shown the way you claim it should be, but can't find it.
>
> I've attached a second set of samples.
>
> As you can see there are a few fonts, most designed for user interfaces,
> that give 00B7 and 0387 the same treatment. I've put them on the top. The
> rest, and it's a diverse lot, does not.
>
> Also, as to your view of the relation between mid-dot and colon, it's clear
> that this is not readily shared among typographers.
>
>  From Unicode's perspective, the consistent difference in treatment of 00B7
> and 0387 is embarrassing, given the fact of their canonical equivalence.
>
> A./
>
> PS: John had written:
>
> >>This would make the mid-dot too high. The top dot of the colon usually
> >>sits toward the top of the x-height; the *mid*-dot should sit lower,
> >>optically midway up the x-height (which means slightly higher than the
> >>actual halfway mark). The top dot of a colon is typically closer to the
> >>height of the Greek ano teleia, which aligns with the x-height (and which
> >>should align with the cap height in all-cap settings, and with the
> >>small-cap height in smallcap settings).
>
> which pretty much is the way most of the samples have it, but there are
> some interesting differences, esp. among the more decorative fonts.

Re: U+0140

2004-04-17 Thread John Hudson

Michael Everson wrote:

>> This would make the mid-dot too high. The top dot of the colon usually
>> sits toward the top of the x-height; the *mid*-dot should sit lower,
>> optically midway up the x-height (which means slightly higher than the
>> actual halfway mark). The top dot of a colon is typically closer to
>> the height of the Greek ano teleia, which aligns with the x-height
>> (and which should align with the cap height in all-cap settings, and
>> with the small-cap height in smallcap settings).
>
> John, I just don't believe you. I don't believe that in all the history
> of Greek and Catalan typography this careful hairsplitting has *always*
> taken place; certainly in scientific transcription the HALF TRIANGULAR
> COLON is just the top dot in the TRIANGULAR COLON, and in Americanist
> transcription where the dot-colons are used instead of triangles I would
> say the same applies.

I never contested that the dots of a colon correspond to the triangles of the linguistic
long vowel marker. They clearly do. What I contested was that the typographic mid-point 
(U+00B7) corresponded to the top dot of a colon. It clearly does not. It is called a 
mid-point because it sits midway up the x-height. It is used in this position for a 
variety of stylistic purposes, e.g. in place of hyphens in phone numbers in stationery, 
which is why most type designers put it at this height. I can assure you that the vast 
majority of type designers don't even know that Catalan uses a dot, let alone that it 
might use this dot.

The obvious solution to present usage is language system typographic tagging, in which a 
distinction can be made in the size, height and spacing of the dot for Catalan and 
non-Catalan use.
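
The tagging approach can be sketched as data plus a lookup. Everything named here is hypothetical: the `Run` type, the glyph names, and the rule table are assumptions invented for illustration and do not correspond to any real font or layout engine API.

```python
from dataclasses import dataclass

MIDDLE_DOT = "\u00b7"

@dataclass(frozen=True)
class Run:
    """A stretch of text with a language tag (hypothetical structure)."""
    text: str
    lang: str  # e.g. "ca" for Catalan, "en" for English

# Hypothetical glyph names: a default dot and a Catalan-specific variant.
DOT_GLYPH_BY_LANG = {"ca": "periodcentered.cat"}
DEFAULT_DOT_GLYPH = "periodcentered"

def dot_glyphs(runs: list[Run]) -> list[str]:
    """Pick a dot glyph for every U+00B7, based on the run's language tag."""
    return [
        DOT_GLYPH_BY_LANG.get(run.lang, DEFAULT_DOT_GLYPH)
        for run in runs
        for ch in run.text
        if ch == MIDDLE_DOT
    ]

runs = [Run("555\u00b71234 ", "en"), Run("col\u00b7legi", "ca")]
chosen = dot_glyphs(runs)
print(chosen)
```

The character stays U+00B7 in both runs; only the language tag decides which glyph variant is drawn, so the plain text remains searchable with a single code point.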

'Careful hairsplitting' always takes place when people care about typography.

John Hudson


Re: U+0140

2004-04-17 Thread Asmus Freytag

At 01:54 PM 4/17/2004, Michael Everson wrote:
> The samples Asmus sent suggest to me that a school of typographers made a
> set of bad decisions, even if they were really famous and got paid lots of
> money and their fonts are widely shipped!

In all charity, Michael, your opinion seems to be mainly your personal
point of view. I'd love to see any evidence of either mid-dot or ano teleia 
being consistently shown the way you claim it should be, but can't find it.

I've attached a second set of samples.

As you can see there are a few fonts, most designed for user interfaces, 
that give 00B7 and 0387 the same treatment. I've put them on the top. The 
rest, and it's a diverse lot, does not.

Also, as to your view of the relation between mid-dot and colon, it's clear 
that this is not readily shared among typographers.

From Unicode's perspective, the consistent difference in treatment of 00B7 
and 0387 is embarrassing, given the fact of their canonical equivalence.

A./

PS: John had written:

> This would make the mid-dot too high. The top dot of the colon usually
> sits toward the top of the x-height; the *mid*-dot should sit lower,
> optically midway up the x-height (which means slightly higher than the
> actual halfway mark). The top dot of a colon is typically closer to the
> height of the Greek ano teleia, which aligns with the x-height (and which
> should align with the cap height in all-cap settings, and with the
> small-cap height in smallcap settings).

which pretty much is the way most of the samples have it, but there are
some interesting differences, esp. among the more decorative fonts.

Re: U+0140

2004-04-17 Thread Peter Kirk

On 17/04/2004 13:57, Philippe Verdy wrote:

> ...
>
> Who's to blame there? Only software designers that have not offered better
> keyboards to enter a regular Ano Teleia on Greek keyboards, or accepted
> incorrectly to use the approximation between the middle-dot punctuation and the
> Greek Ano Teleia. Maybe the votes from Greek typographers were not heard at the
> ISO or UTC decision committees when such unification was incorrectly decided...

In my opinion, the ones to blame are the UTC, for freezing canonical 
equivalences like this, also combining classes, character names etc, 
when they have obviously not been checked in detail with the user community.


Re: U+0140

2004-04-17 Thread Michael Everson

At 09:03 -0700 2004-04-17, John Hudson wrote:
> Michael Everson wrote:
>
>> So for me, MIDDLE DOT is to COLON as MODIFIER
>> LETTER HALF TRIANGULAR COLON is to MODIFIER
>> LETTER TRIANGULAR COLON.
>
> This would make the mid-dot too high. The top
> dot of the colon usually sits toward the top of
> the x-height; the *mid*-dot should sit lower,
> optically midway up the x-height (which means
> slightly higher than the actual halfway mark).
> The top dot of a colon is typically closer to
> the height of the Greek ano teleia, which aligns
> with the x-height (and which should align with
> the cap height in all-cap settings, and with the
> small-cap height in smallcap settings).

John, I just don't believe you. I don't believe
that in all the history of Greek and Catalan 
typography this careful hairsplitting has 
*always* taken place; certainly in scientific 
transcription the HALF TRIANGULAR COLON is just 
the top dot in the TRIANGULAR COLON, and in 
Americanist transcription where the dot-colons 
are used instead of triangles I would say the 
same applies.

António said:

> Another nail in the coffin of "use U+00B7 : MIDDLE DOT for Catalan":
> Perhaps because it is exclusively used between "L"s (a "high" letter
> in both cases), Catalan middot is placed exactly as Michael has it:
> The top dot of a colon (careful Catalan typewriter users do/did just
> this, erasing or masking the bottom dot of a colon).

This evidence would suggest to me that my analysis is correct.

The samples Asmus sent suggest to me that a 
school of typographers made a set of bad 
decisions, even if they were really famous and 
got paid lots of money and their fonts are widely 
shipped!

But that's just my opinion.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: U+0140

2004-04-17 Thread Philippe Verdy

- Original Message - 
From: "John Hudson" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, April 17, 2004 6:03 PM
Subject: Re: U+0140

> Michael Everson wrote:
>
> > I have had suboptimal connectivity over the last while, and so have
> > missed some of this discussion. As a type designer I personally consider
> > the middle dot to be ordinary punctuation that should harmonize with
> > other punctuation marks. My solution to this is to treat it as the top
> > dot of a colon. So for me, MIDDLE DOT is to COLON as MODIFIER LETTER
> > HALF TRIANGULAR COLON is to MODIFIER LETTER TRIANGULAR COLON.
>
> This would make the mid-dot too high. The top dot of the colon usually sits
> toward the top of the x-height; the *mid*-dot should sit lower, optically
> midway up the x-height (which means slightly higher than the actual halfway
> mark). The top dot of a colon is typically closer to the height of the Greek
> ano teleia, which aligns with the x-height (and which should align with the
> cap height in all-cap settings, and with the small-cap height in smallcap
> settings).

So we can see three different vertical positions for this middle-dot, and two
are encoded:

(1) centered midway between the baseline and the x-height: this is the
mathematical middle-dot symbol, because most mathematical variables are
lowercase letters, making this position appropriate to note a multiplication.
There's some large horizontal gap between the two variables or numbers, and the
horizontal position is centered between the right edge of the previous character
and the left edge of the next character. This is basically the U+00B7 character,
which can also be used as a punctuation mark, notably in dictionary entries. Its
weight should be the same as the regular dot on the baseline for sentence
periods. Note that Unicode also defines a superfluous mathematical middle-dot
symbol (I wonder if this is caused by the fact that mathematical formulas often
happen to use Greek letters; this symbol at U+22C5 however is thicker, but still
thinner than the bullet operator U+2219, itself thinner than the bullet
punctuation U+2022 which sits on the baseline...)

(2) centered exactly at the x-height: this is the normal position for the
Catalan symbol and for the Greek Ano Teleia. The horizontal gap is minimal, just
enough to make the dot easily distinct, when reading, from the two surrounding
characters. So the horizontal spacing is smaller than with the middle dot in (1).
One bad thing is that Greek Ano Teleia was unified with the middle dot. If it
had not been so, the Catalan middle dot could have been unified with the Greek
Ano Teleia. It's significant that fonts actually do not respect the unification
of Greek Ano Teleia (2) and the middle-dot symbol or punctuation (1): it
demonstrates that these two should not have been unified with a canonical
equivalence...

(3) the upper dot of the colon or semi-colon is in fact a better position for
the Catalan middle-dot; we can see them as a middle-dot diacritic centered above
another character (a period or comma), but below the upper dot used on lowercase
letters or uppercase letters. For the Catalan middle-dot, the base character
should be the thinnest space (sixth of a cadratin) whose invisible height would
be the middle of the x-height, under which other baseline punctuation marks are
drawn (period, comma, connecting underscore). Michael may be right in saying that
this position should match the vertical position of the hyphen, in which case
the hyphenation point is probably the best character to use for rendering the
Catalan middle-dot: this dot or hyphen is not centered at the x-height but just
below it, so that the dot fits fully under that x-height with a tiny vertical
gap under it, approximately the weight of the dot or hyphen. A more exact
definition would be computed by using exactly the middle of the M-height.

Characters (2) and (3) are very close to each other, as they are both modifiers
for surrounding letters, and not a symbol or punctuation themselves.

But currently Unicode has unified the first 2 cases, by the canonical
equivalence for Ano Teleia and the middle-dot symbol/punctuation, which is
probably wrong, even if there's a legacy use of U+00B7 on keyboards that
generate ISO 8859 Greek text. The unification in fact comes from the mapping of
the ISO 8859 repertoire to Unicode, at the time when the hyphenation point did
not exist, or possibly even before with some legacy mappings between unrelated
ISO 8859 repertoires (notably between Basic-Latin/Greek and Basic-Latin/Latin1).
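
For reference, the encoded dots named in this message can be listed with their formal names and general categories. This is only a minimal sketch using Python's standard `unicodedata` module; the choice of code points in `DOTS` and the helper name `describe` are assumptions of the sketch, not part of the original message, and the sketch says nothing about glyph height or weight, which are font matters.

```python
import unicodedata

# Code points mentioned in the discussion; the selection is this sketch's own.
DOTS = [0x00B7, 0x0387, 0x2027, 0x22C5, 0x2219, 0x2022]

def describe(cp: int) -> str:
    """Return 'U+XXXX NAME [General_Category]' for one code point."""
    ch = chr(cp)
    return f"U+{cp:04X} {unicodedata.name(ch)} [{unicodedata.category(ch)}]"

for cp in DOTS:
    print(describe(cp))
```

The punctuation dots come out as category Po and the two operators as Sm, which is the formal side of the symbol versus punctuation distinction argued about here.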

Who's to blame there? Only software designers that have not offered better
keyboards to enter a regular Ano Teleia on Greek keyboards, or accepted
incorrectly to use the approximation between the middle-dot punctuation and the
Greek Ano Teleia. May be the votes from Greek typographers were n

Re: U+0140

2004-04-17 Thread Anto'nio Martins-Tuva'lkin

On 2004.04.16, 19:34, Antoine Leca <[EMAIL PROTECTED]> wrote:

> As I wrote earlier, if you know the text under inspection is
> Catalan, a very simple regular expression will deal with that. Any
> half-decent Catalan word processor does it already, by the way.

What about the odd Catalan phrase within a text in Guarani or Cherokee?
Unicode, do not forget, supposedly brings correctness to multilingual
text...


Re: U+0140

2004-04-17 Thread Anto'nio Martins-Tuva'lkin

On 2004.04.16, 14:26, Ernest Cline <[EMAIL PROTECTED]> wrote:

>> From: Antoine Leca <[EMAIL PROTECTED]>
>>
>> ... it is vastly more easy to keep the obvious unification, rather
>> than trying to distort it and trying to make a conditional mapping,
>> if Mathematics, · => U+00B7, if Catalan, · => U+2027, if NoSeQue, ·
>> => some_other_random_middle_dot, etc.
>
> I don't see that as being any worse than the set of HYPHEN_MINUS,
> HYPHEN, MINUS SIGN, etc.

Or -- to bring this back to widely used textual/orthographic kludges
in legacy data vs. typographical correctness "from now on" -- than
U+0027 : APOSTROPHE vs. U+02BC : MODIFIER LETTER APOSTROPHE...


Re: U+0140

2004-04-17 Thread Anto'nio Martins-Tuva'lkin

On 2004.04.17, 17:03, John Hudson <[EMAIL PROTECTED]> wrote:

> Michael Everson wrote:

>> My solution to this is to treat it as the top dot of a colon.
>
> This would make the mid-dot too high. The top dot of the colon
> usually sits toward the top of the x-height; the *mid*-dot should
> sit lower, optically midway up the x-height

Another nail in the coffin of "use U+00B7 : MIDDLE DOT for Catalan":
Perhaps because it is exclusively used between "L"s (a "high" letter
in both cases), Catalan middot is placed exactly as Michael has it:
The top dot of a colon (careful Catalan typewriter users do/did just
this, erasing or masking the bottom dot of a colon).


RE: U+0140

2004-04-17 Thread Peter Constable

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of
> Kenneth Whistler

> Thanks to Eric and Patrick for digging out my answer on this perennial
> question from a couple years back, and saving me the trouble of
> having to rummage around to find it. :-)
> 
> Also, it should be noted...

Last year, I started putting character stories online. I didn't know
when I started it that I was about to move, so I only got a couple
online and wasn't able to keep adding to it.

Anyway, I've captured this one and added it to that small but perhaps
growing collection:

http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=UnicodeCharacterStories



Peter Constable

Re: U+0140

2004-04-17 Thread John Hudson

Michael Everson wrote:

> I have had suboptimal connectivity over the last while, and so have
> missed some of this discussion. As a type designer I personally consider
> the middle dot to be ordinary punctuation that should harmonize with
> other punctuation marks. My solution to this is to treat it as the top
> dot of a colon. So for me, MIDDLE DOT is to COLON as MODIFIER LETTER
> HALF TRIANGULAR COLON is to MODIFIER LETTER TRIANGULAR COLON.

This would make the mid-dot too high. The top dot of the colon usually sits
toward the top of the x-height; the *mid*-dot should sit lower, optically
midway up the x-height (which means slightly higher than the actual halfway
mark). The top dot of a colon is typically closer to the height of the Greek
ano teleia, which aligns with the x-height (and which should align with the
cap height in all-cap settings, and with the small-cap height in smallcap
settings).

John Hudson

--

Tiro Typeworks        www.tiro.com
Vancouver, BC        [EMAIL PROTECTED]
I often play against man, God says, but it is he who wants
  to lose, the idiot, and it is I who want him to win.
And I succeed sometimes
In making him win.
 - Charles Peguy

Re: U+0140

2004-04-17 Thread Michael Everson

Sent the previous message before it was ready.

At 12:32 -0700 2004-04-15, Kenneth Whistler wrote:

> Note that while the particular combination <006C, 00B7, 006C> is a
> peculiarity of Catalan orthography, U+00B7 MIDDLE DOT (often called
> a 'raised period') is very widely used, indeed, in technical
> orthographies for many languages, particularly in the Americas, where
> it is used much more commonly than the IPA characters U+02D0 MODIFIER
> LETTER TRIANGULAR COLON or U+02D1 MODIFIER LETTER HALF TRIANGULAR COLON
> to indicate vocalic (or less commonly, consonantal) length.

In Cornish lexicography, the middle dot is used regularly to mark the
vowel of the stressed syllable when it is not penultimate (as it is
in most words).

I have had suboptimal connectivity over the last while, and so have 
missed some of this discussion. As a type designer I personally 
consider the middle dot to be ordinary punctuation that should 
harmonize with other punctuation marks. My solution to this is to 
treat it as the top dot of a colon. So for me, MIDDLE DOT is to COLON 
as MODIFIER LETTER HALF TRIANGULAR COLON is to MODIFIER LETTER 
TRIANGULAR COLON.

For HYPHENATION POINT I would place its height at whatever the height 
of a HYPHEN was and be done with it.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: U+0140

2004-04-17 Thread Michael Everson

At 12:32 -0700 2004-04-15, Kenneth Whistler wrote:

> Note that while the particular combination <006C, 00B7, 006C> is a
> peculiarity of Catalan orthography, U+00B7 MIDDLE DOT (often called
> a 'raised period') is very widely used, indeed, in technical
> orthographies for many languages, particularly in the Americas, where
> it is used much more commonly than the IPA characters U+02D0 MODIFIER
> LETTER TRIANGULAR COLON or U+02D1 MODIFIER LETTER HALF TRIANGULAR COLON
> to indicate vocalic (or less commonly, consonantal) length.

In Cornish lexicography, the middle dot is used regularly to mark the
vowel of the stressed syllable when it is not penultimate (as it is
in most words).

I have had suboptimal connectivity over the last while, but as a type 
designer I personally consider the middle dot to be ordinary 
punctuation that should harmonize with other punctuation marks. My 
solution to this is to treat it as the top dot of a colon. So for me, 
MIDDLE DOT is to COLON as MODIFIER LETTER HALF TRIANGULAR COLON is to 
MODIFIER LETTER TRIANGULAR COLON.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: U+0140 Catalan middle-dot

2004-04-16 Thread Asmus Freytag

At 06:16 PM 4/15/2004, Philippe Verdy wrote:
> The other reason is that the middle-dot, being a punctuation, would be
> likely to have extra spacing on both sides, which would make it
> inappropriate for rendering Catalan words. Also such punctuation would
> probably forbid kerning of the middle-dot within the open area of a
> uppercase L, something which would be acceptable for reading Catalan (as
> it was acceptable with U+2027 in Teletext/Videotex).

In the sample I just sent out, you'll see that 00B7 in some fonts has a
rather large space on the left of the dot.

A./

GREEK ANO TELEIA (was: Re: U+0140)

2004-04-16 Thread Alexandros Diamantidis

* Asmus Freytag <[EMAIL PROTECTED]> [2004-04-16 11:28]:
> weight. (See attached sample). If data is normalized, the appearance of ano 
> teleia will change (since 0387 will change into 00B7) and users will be 
> disappointed.

Yes, I know - I've seen professionally published magazines with the
wrong ano teleia glyph, bigger and lower than it should be. It was
probably not even caused by normalization - I think most Greek keyboards
produce 00B7 and not 0387.

Since language-dependent glyph selection isn't very widespread for now,
would it be too much to ask font designers to put a MIDDLE DOT glyph
appropriate for Greek in fonts capable of displaying Greek text?

That's just a wish, BTW - I don't expect designers to do what I say
just because I sent a message to a mailing list ;-)

-- 
Alexandros Diamantidis * [EMAIL PROTECTED]

Re: U+0140

2004-04-16 Thread Antoine Leca

On Friday, April 16, 2004 12:37 PM, Philippe Verdy va escriure:

> In some future, we could see U+013F and U+0140 used more often than L
> or l plus U+00B7...

I (personally) hope we would not.

> Notably in word processors that can detect these
> sequences in Catalan text and substitute them with the ligatures,
> which create a more acceptable letter form and allows easier text
> handling for (e.g.) word selection in user interfaces and dictionnary
> lookups.

As I wrote earlier, if you know the text under inspection is Catalan, a very
simple regular expression will deal with that. Any half-decent Catalan word
processor does it already, by the way.


> The fact that there's no such L-middle-dot on keyboards should not be
> a limit: word processors have more key bindings and more intelligence
> than the default keys found on keyboards.

Yes yes yes. Particularly when I want to insert afterwards a · between two
ll, when it appears I missed it on the first shot (yes, it happens). Or when
I want to remove a superfluous one that I typed by mistake (yes, it happens
too). With your "intelligence", this latter point will prove to be a
headache: on the first shot, a normal user will place the caret just after
the dot, and press Rubout. Slurp, the whole U+0140 is swallowed, but usually
the user will not notice it. So at the second look (perhaps a lot of time
later, perhaps after a useless additional printout), she will have to type
in the first l again.
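
The Rubout scenario can be sketched in a few lines. This is only an illustration under stated assumptions: `to_precomposed` stands in for the hypothetical word-processor substitution being discussed, and `rubout` models a naive backspace that deletes exactly one code point; real editors may delete by grapheme cluster or undo the substitution instead.

```python
MIDDLE_DOT = "\u00b7"

def to_precomposed(text: str) -> str:
    """Hypothetical auto-substitution: l/L + MIDDLE DOT becomes U+0140 / U+013F."""
    return (text.replace("l" + MIDDLE_DOT, "\u0140")
                .replace("L" + MIDDLE_DOT, "\u013f"))

def rubout(text: str) -> str:
    """Naive backspace: delete the last code point before the caret."""
    return text[:-1]

typed = "col" + MIDDLE_DOT              # the user has typed "col·" so far

print(rubout(typed))                    # the dot alone is removed
print(rubout(to_precomposed(typed)))    # the l is removed together with the dot
```

With the plain sequence the user is left with "col"; with the substituted form the result is "co", which is the swallowed l described above.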

Intelligent keyboards might be great. But to be so, they have to bring
*much* added value (like, obviously, being able to type in a language
impossible otherwise; or, more simply, avoiding typing Alt+0156 every five
minutes). If they bring only very little value, they are more annoying than
anything else, particularly when they are not permanent but rather operate
from time to time. This would be the case here: as a Catalan writer, I
sometimes type texts in the word processor, where I would be "helped". And
sometimes in the mail reader, or on the console, where I would not, for
example because I do not want to wait two full minutes for the whole set of
"helpers" to come in every time I have to type the name of the user of a
given process...


> When I see a Catalan word coded with  it looks very
> ugly (notably with monospaced fonts or in Teletext) and I'm sure that
> Catalan readers don't like the default presentation.

Yes it looks ugly. But this is in fact less ugly for me than seeing l.l or
l-l. Ugliness is in the eye of the beholder, of course. When you are in the
habit of seeing some rendering of l·l about every hour, you will not notice
it. And in fact, I notice more when someone uses the kerned version advocated
by Gabriel Valiente, because nowadays it is unusual. And I certainly would
not use the kerned version for some institutional version, because I do not
want to inconvenience my readers (this problem showed up about 20 days ago
between us; and there was no debate).


> They will much
> appreciate the support for the ligated 
> encodings.

What do you prefer?

  El col·legi Miguel Hernández de Riola?

  El co[]legi Miguel Hernández de Riola?

([] is ASCII art for a box, which is how many many people would see any use
of U+013F...)


> I don't think they can be considered "compatibility
> characters" just introduced for compatibility with a past ISO
> standard for Videotex and Telelext.

Sorry, you are fighting a lost battle: no one here uses them, so all
the corpus is already encoded without them.
The mills of Don Quixote are in Mota del Cuervo, it is only about 200 km
from here, but this is not the Catalan-speaking region ;-).


> The only safe way to change things would then be to have a middle-dot
> diacritic (combining but with combining class 0) to be used instead
> of U+00B7, even if there's no canonical equivalence with the U+013F
> and U+0140 ligatures... A Catalan keyboard would then return this new
> dot instead of U+00B7, and word processors or input method editors
> would easily find a way to represent it using the ligature when it
> follows a L.
[snip]

May I suggest U+1000B7 for this new character?


Antoine

Re: U+0140

2004-04-16 Thread Antoine Leca

On Friday, April 16, 2004 3:26 PM, Ernest Cline va escriure:

> I don't see that as being any worse than the set of HYPHEN_MINUS,
> HYPHEN, MINUS SIGN, etc.

Sorry, I did not make myself clear. I do not intend to say this is undoable,
nor that the · case is particularly complex. It is doable (as I showed with the
regular expressions), and it is NOT complex.

I was just saying this is presently not done, and it is IMHO not worth
doing.


> Given the nature of U+0140 (and U+013F) when hyphenated, might it
> not be a good idea to assign these two characters their own Line
> Break class for  the Line Breaking Algorithm of UAX #14?

I do not know if it is a good idea or not (I am not the one to argue
about this; furthermore these characters are very infrequent), but your
understanding of the behaviour is correct.


Antoine

Re: U+0140

2004-04-16 Thread Asmus Freytag

At 12:26 AM 4/16/2004, Alexandros Diamantidis wrote:
> * Philippe Verdy <[EMAIL PROTECTED]> [2004-04-16 01:22]:
> > > U+0387 GREEK ANO TELEIA
> > wrong form? it's a small square, and is the greek semicolon, and is then
> > separating words.
>
> U+0387 is canonically equivalent to U+00B7. About its shape, whether it's
> square or round depends on what the full stop looks like in that font -
> they should look exactly the same, only the "ano teleia" (upper dot)
> should be at x-height.

If two characters are canonically equivalent, they can't have a
consistently distinct appearance. Nevertheless, most fonts appear to give a
different glyph to 0387 than to 00B7, not only in height, but also in
weight. (See attached sample). If data is normalized, the appearance of ano
teleia will change (since 0387 will change into 00B7) and users will be
disappointed.

In any environment where data are normalized, getting the correct
appearance requires the use of OpenType with language dependent glyph
selection (and a layout engine that supports this), or the use of a
Greek-specific font.
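
The effect of normalization on the ano teleia can be checked directly. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `normalized_forms` is an assumption of the sketch. It shows only the code point change, not the glyph change, which depends on the font.

```python
import unicodedata

ANO_TELEIA = "\u0387"
MIDDLE_DOT = "\u00b7"

def normalized_forms(ch: str) -> dict:
    """Map each normalization form name to the normalized string for ch."""
    return {form: unicodedata.normalize(form, ch)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}

for form, result in normalized_forms(ANO_TELEIA).items():
    print(form, " ".join(f"U+{ord(c):04X}" for c in result))
```

Because U+0387 has a singleton canonical decomposition, every one of the four forms replaces it with U+00B7, so the distinction does not survive any normalizing process.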

A./

PS: it's water under the bridge by now, but in my opinion, this is another 
example of questionable unification of punctuation based on considering 
only the 'ink' and not the positioning of it. If one is considering only 
the roughest of plain text, having only a single code for a 'dot somewhere 
in the middle of the line' yields acceptable results, but it does make the 
use of such plain text as back-bone for typographically correct rendering 
unnecessarily difficult. The extreme form of such 'plain text only' 
approach is using ` and ' as stand-in for the single quotes.

However, for paleo punctuation, where there's no comparable established 
typographical tradition requiring consistent differentiation, the use of 
unified punctuation is preferable.

Re: U+0140

2004-04-16 Thread Elaine Keown

Elaine Keown
Tucson

Hi,

I kept the amazing list of middle dots listed this
week on the main Unicode list for future reference.  

Hebrew (Hebrew from 1200 B.C.E. - present) needs 
at least 1 middle dot.

Elaine





Re: U+0140 Catalan middle-dot

2004-04-16 Thread Peter Kirk

On 16/04/2004 03:11, Philippe Verdy wrote:

> ...
>
> Did you read this PDF seriously: ...

No, but I read what I needed to.

> ... it really discusses about a hack needed to
> reposition the middle-dot correctly so that the Catalan dot will:
> - not alter the interletter space
> - will be drawn on a higher position (approximately at the x-height) than
> middle-dot (in the middle of the x-height and baseline), with a horizontal
> position that centers it between the vertical stems of the two surrounding l or
> L (this makes a difference for the uppercase letter).

These are matters for the font. This kind of horizontal and vertical
kerning can be done easily with modern technologies.

> ... Most modern text renderers on
> computers display the 00B7 incorrectly for Catalan (notably in user interfaces
> and in web browsers).

This is a matter of fonts, not of renderers. Most modern text renderers
are capable of displaying either 00B7 or 2027 correctly if the font is
set up for that, e.g. to display them as ligatures, or to move the dot
depending on context.

> So, for a typographic point of view, the U+013F and U+0140 ligatures ...

If these are ligatures, they don't need their own Unicode code points,
and such code points should be treated as alphabetic presentation forms,
included only for compatibility reasons.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: U+0140

2004-04-16 Thread Bernard Desgraupes



Philippe Verdy wrote:

> - in "collège" where 'o' is often pronounced open, unlike in "colatéral" where
> "o" is always closed.

Hmm, "collatéral" is written with two l's in French (from Latin cum +
latus).

Bernard

Re: U+0140

2004-04-16 Thread Ernest Cline


> [Original Message]
> From: Antoine Leca <[EMAIL PROTECTED]>
>
> ... it is vastly more easy to keep the obvious unification, rather than
> trying to distort it and trying to make a conditional mapping, if
> Mathematics, · => U+00B7, if Catalan, · => U+2027, if NoSeQue, · =>
> some_other_random_middle_dot, etc. Unlike hyphenation rules (where the
> mapping might very well be · => U+2027, by the way), which are pretty easy
> to pinpoint, tagging Catalan in bulk text is clearly not a easy task. Even
> when considering the fairly restrictive rules for it to occur (requiring
> NFC):

I don't see that as being any worse than the set of HYPHEN_MINUS,
HYPHEN, MINUS SIGN, etc., which depending upon your taste in
such matters could be seen as an example of what to do or what
not to do. That said, let me switch the topic to something almost
completely different.

Given the nature of U+0140 (and U+013F) when hyphenated, might it
not be a good idea to assign these two characters their own Line
Break class for the Line Breaking Algorithm of UAX #14? These two
characters, if I understand the comments correctly, always provide
a line breaking opportunity after them, but if that line break opportunity
is taken, the dot must disappear, so an implementation that is not
prepared to remove the dot should ignore the opportunity.
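
The behaviour described here (a break opportunity after the first l, with the dot giving way to a hyphen when the break is taken) can be sketched as follows. This is an assumption-laden illustration, not an implementation of UAX #14: the function name `split_at_geminate` is hypothetical, only the first l·l in a word is considered, and the precomposed letters are simply treated as letter plus U+00B7.

```python
from typing import Optional, Tuple

MIDDLE_DOT = "\u00b7"

def split_at_geminate(word: str) -> Optional[Tuple[str, str]]:
    """Split a word at l·l / L·L, replacing the dot by a hyphen.

    Returns (line_end, next_line_start), or None if the word offers
    no such break opportunity.
    """
    # Treat U+0140 / U+013F as letter + MIDDLE DOT for this sketch.
    word = (word.replace("\u0140", "l" + MIDDLE_DOT)
                .replace("\u013f", "L" + MIDDLE_DOT))
    for pair in ("l" + MIDDLE_DOT + "l", "L" + MIDDLE_DOT + "L"):
        i = word.find(pair)
        if i != -1:
            # Keep the first l, drop the dot, add a hyphen at the line end.
            return word[: i + 1] + "-", word[i + 2:]
    return None

print(split_at_geminate("col" + MIDDLE_DOT + "legi"))
print(split_at_geminate("castell"))
```

An implementation that cannot rewrite the text at the break, as the paragraph above notes, would have to skip the opportunity, which here corresponds to ignoring the returned pair.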

Re: U+0140

2004-04-16 Thread Antoine Leca

On Friday, April 16, 2004 12:31 AM, Peter Kirk va escriure:

>> Peter Kirk a écrit :
>>
>>> What is U+2027 intended for? The name suggests that it might be what
>>> is needed for Catalan.
>>
>> Hyphenation point is primarily used to visibly indicate
>> syllabification of words. Syllable breaks are potential line breaking
>> opportunities in the middle of words. The hyphenation point is
>> mainly used in dictionaries and similar works. When an actual line
>> break falls inside a word containing hyphenation point characters,
>> the hyphenation point is rendered as a regular hyphen at the end of
>> the line.
>
> Well, this sounds just like the required behaviour for Catalan, as
> described by Anto'nio Martins-Tuva'lkin on 28th March. He wrote:
>
>> Something happens when the "L·L" coincides with a soft line end. I'm
>> no expert in Catalan typesetting but IIRC the dot becomes a hyphen,
>> while regular "LL"s cannot be broken.

António is correct.
But this is not the main point of ·. The main point of · is to disambiguate
spellings. The hyphenation behaviour is only a secondary role.

Besides, it is vastly easier to keep the obvious unification, rather than
trying to distort it and trying to make a conditional mapping, if
Mathematics, · => U+00B7, if Catalan, · => U+2027, if NoSeQue, · =>
some_other_random_middle_dot, etc. Unlike hyphenation rules (where the
mapping might very well be · => U+2027, by the way), which are pretty easy
to pinpoint, tagging Catalan in bulk text is clearly not an easy task. Even
when considering the fairly restrictive rules for it to occur (requiring
NFC):
/[aAàÀeEéÉèÈiIíÍïÏoOóÓòÒuUúÚ]l·l[aàeéèiíoóòuú]/
/[AÀEÉÈIÍÏOÓÒUÚ]L·L[AÀEÉÈIÍOÓÒUÚ]/
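
As an illustration, the two patterns above can be tried out with Python's standard `re` module. This is a minimal sketch, assuming the dot in the patterns is U+00B7 and that input is first brought to NFC as the message requires; the function name `find_ela_geminada` is hypothetical.

```python
import re
import unicodedata

# The two patterns from the message, copied character for character.
LOWER = re.compile(r"[aAàÀeEéÉèÈiIíÍïÏoOóÓòÒuUúÚ]l·l[aàeéèiíoóòuú]")
UPPER = re.compile(r"[AÀEÉÈIÍÏOÓÒUÚ]L·L[AÀEÉÈIÍOÓÒUÚ]")

def find_ela_geminada(text: str) -> list:
    """Return every vowel + l·l + vowel match found in NFC-normalized text."""
    text = unicodedata.normalize("NFC", text)
    return [m.group(0) for pattern in (LOWER, UPPER)
            for m in pattern.finditer(text)]

print(find_ela_geminada("El col·legi Miguel Hernández de Riola"))
print(find_ela_geminada("a·b = c"))
```

The patterns find candidate sequences only; they do not tell whether the surrounding text is Catalan, which is the tagging problem the message points out.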

Antoine

Re: U+0140

2004-04-16 Thread Philippe Verdy

From: "Antoine Leca" <[EMAIL PROTECTED]>
> And yes, similarly to Catalan, the emphatic/prolongated l sound is not
> usualy marked.

In French, the emphatic/prolongated l (written with a double l) is usually
marked by altering the phonetic of the preceding vowel, such as

- in "collège" where 'o' is often pronounced open, unlike in "colatéral" where
"o" is always closed.
- if the preceding vowel is a 'e' it is clearly and always pronounced like a 'è'
in "désceller" instead of the neutral 'e' in "déceler".
- If the preceding vowel is a 'i' with another previous vowel the non-final
sequence 'ill' notes a 'y' half-vowel sound like in "maille"; if there's no
vowel before that i, the i is a plain vowel, and the double l is generally non
emphatic like in "ville" (or "village" or the imported English term "grill")
with a long i that shortens the l sound, to compare with "vile" (the feminine
form of the adjective "vil") or "vilénie" where the i is short and the l
emphatic...
- There are known exceptions when i is not preceded by another vowel; between
"mille" (long i, emphatic l) and "grille" (long i, half-vowel 'y')
- With a preceding 'u', "mûle" or "mûlet" or "tubulure" use a short 'ü' sound
and a l which may be emphatic/long if terminal, unlike "bulle" with a long 'u'
sound and a non emphatic short l...

Historically, Catalan and French had the same writing system.

Re: U+0140

2004-04-16 Thread Philippe Verdy

From: "Antoine Leca" <[EMAIL PROTECTED]>
> On Thursday, April 15, 2004 8:16 PM, Philippe Verdy va escriure:
> > I thought it was already answered in this list by a Catalan speaking
> > contributor: the sequence L+middle-dot in Catalan is NOT a combining
> > sequence.
>
> No? Then what is it?  Looks like very much one, to me.

It is more exactly a ligature, not a combining sequence. But the second
character of the ligature works more like a diacritic, and not as a separate
punctuation or symbol.

In some future, we could see U+013F and U+0140 used more often than L or l plus
U+00B7... Notably in word processors that can detect these sequences in Catalan
text and substitute them with the ligatures, which create a more acceptable
letter form and allows easier text handling for (e.g.) word selection in user
interfaces and dictionary lookups.

The fact that there's no such L-middle-dot on keyboards should not be a limit:
word processors have more key bindings and more intelligence than the default
keys found on keyboards.

When I see a Catalan word coded with  it looks very ugly (notably
with monospaced fonts or in Teletext) and I'm sure that Catalan readers don't
like the default presentation. They will much appreciate the support for the
ligated  encodings. I don't think they can be considered
"compatibility characters" just introduced for compatibility with a past ISO
standard for Videotex and Teletext.

The compatibility decompositions in the UCD are bad suggestions (only fallbacks)
which create problems that did not exist in the Videotex standard (they already
create a problem for internationalized domain names). But now that the
decompositions are normative, there's no way to change them in Unicode.
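
The decompositions in question can be inspected with Python's standard `unicodedata` module. This is a minimal sketch; the helper name `profile` is an assumption of the sketch, and the values shown are whatever the Unicode Character Database bundled with the running Python reports.

```python
import unicodedata

def profile(ch: str) -> dict:
    """Collect the UCD properties relevant to the decomposition argument."""
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),
        "decomposition": unicodedata.decomposition(ch),
        "NFC": unicodedata.normalize("NFC", ch),
        "NFKC": unicodedata.normalize("NFKC", ch),
    }

for ch in ("\u013f", "\u0140"):
    info = profile(ch)
    print(f"U+{ord(ch):04X}", info["name"], info["category"], info["decomposition"])
    print("  NFKC:", " ".join(f"U+{ord(c):04X}" for c in info["NFKC"]))
```

The decomposition is tagged `<compat>`, so NFC and NFD leave U+013F and U+0140 alone while NFKC and NFKD replace them with the letter followed by U+00B7.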

The only safe way to change things would then be to have a middle-dot diacritic
(combining but with combining class 0) to be used instead of U+00B7, even if
there's no canonical equivalence with the U+013F and U+0140 ligatures... A
Catalan keyboard would then return this new dot instead of U+00B7, and word
processors or input method editors would easily find a way to represent it using
the ligature when it follows a L. If such character was added, I would give it
the general category "Mn", a combining class 0, to match linguistic
expectations, and it would work with IRI and IDN as well, and would immediately
work with all basic Unicode text processing without needing an exception for
Catalan. This new character could have a compatibility decomposition into U+00B7
only as a fallback; and the existing ligatures U+013F and U+0140 could be
commented by providing a better decomposition with this new character, than the
compatibility decompositions with U+00B7.

Re: U+0140 Catalan middle-dot

2004-04-16 Thread Philippe Verdy

From: "Peter Kirk" <[EMAIL PROTECTED]>
> On 15/04/2004 18:16, Philippe Verdy wrote:
> >So U+2027 (as well as the U+013F middle-dot found in ISO-8859-1/15) is
> >not the exact character to represent this middle dot in all usages, ...
>
> Philippe, before jumping to this conclusion, please can you describe to
> me EXACTLY how the shape and behaviour of the Catalan middle dot differs
> from the behaviour of U+2027 defined in Unicode Standard Annex #14,
> http://www.unicode.org/unicode/standard/reports/tr14/tr14-15.html:
>
> > 2027
> > HYPHENATION POINT
> > A hyphenation point is a raised dot, which is used primarily to
> > visibly indicate syllabification of words. Syllable breaks are
> > potential line break opportunities in the middle of words. It is
> > mainly used in dictionaries and similar works. When an actual line
> > break falls inside a word containing hyphenation point characters, the
> > hyphenation point is rendered as a regular hyphen at the end of the line.
> >
>
>  From the descriptions which you and Anto'nio have provided and from
> http://www.tug.org/TUGboat/Articles/tb16-3/tb48vali.pdf, it seems to me
> that the Catalan behaviour is exactly as described for U+2027 in USA
> #14, perhaps because the Catalan usage has been borrowed from dictionary
> usage or vice versa. This strongly suggests that U+2027 is the
> appropriate character for Catalan.

Did you read this PDF seriously? It really discusses a hack needed to
reposition the middle-dot correctly so that the Catalan dot will:
- not alter the interletter space
- be drawn at a higher position (approximately at the x-height) than the
middle-dot (midway between the baseline and the x-height), with a horizontal
position that centers it between the vertical stems of the two surrounding l or
L (this makes a difference for the uppercase letter).

So the encoded l-with-middle-dot and L-with-middle-dot, if properly created for
Catalan using these guidelines, will render much better than 'L' or 'l' followed
by U+00B7 and even better than U+2027.

If rendering is not important for you (it matters when one wants to create a
renderer), consider the case of collation, and text analysis. My view about the
precombined ligatures L-with-middle-dot is that their "letter" general category
makes things easier for writers and readers, even if both agree that there's no
such dotted-L letter in Catalan, but clearly a single L with an additional but
separate phonetic mark.

Another point: the middle dot in Catalan seems to be used only between a pair of
L letters. Typographers consider the double L with a middle-dot as a ligature,
and Catalan phonetic uses a dotted pair to change the phonetic (and even the
meaning) of a double-L from the "L mouillé" (where it is pronounced like y
between vowels), to a consonantal palatal L.

Last note: Catalan words starting with a double L exist, but they apparently
never take a middle dot (because such an orthography always designates a
consonantal palatal L, sometimes pronounced with some stress or with an audible
palato-lingual click or some prenasalisation; this pronunciation depends on the
4 local dialects spoken)

The phonetic distinction of medial double-L did not exist in medieval Catalan
texts where this mark was not written (like in French). The Catalan middle-dot
was then introduced later with a clear intent to not alter the number of letters
and their relative positions in the typography. Most modern text renderers on
computers display the 00B7 incorrectly for Catalan (notably in user interfaces
and in web browsers).

So, from a typographic point of view, the U+013F and U+0140 ligatures are much
better than their compatibility decompositions. I don't think they should be
described as compatibility characters. The ISO 6937 standard for Videotex was
right to define this ligature so as to respect the normal typography, but the
compatibility decompositions using U+00B7 in Unicode are certainly not the best
ones (they are widely used today simply because the ligatures were missing from
ISO-8859-1 and Windows 1252, and there was no alternative to using
U+00B7 for that function).

Re: U+0140

2004-04-16 Thread Antoine Leca

On Thursday, April 15, 2004 8:16 PM, Philippe Verdy va escriure:
> I thought it was already answered in this list by a Catalan speaking
> contributor: the sequence L+middle-dot in Catalan is NOT a combining
> sequence.

No? Then what is it? It looks very much like one to me.

> The middle dot in Catalan plays a role similar to an hyphen
> between syllables, to mark a distinction with words where, for
> example a double-L would create an alternate reading.

Yes (although I am not sure we can write "similar to hyphens", since I do
not know the history of the hyphen).

> The dot indicates that each L must be read distinctly (or read
> with a long or emphatic L).

Ought to. I.e., it would be a precious pronunciation, at least in the
Barcelona way of speaking. In other places, the prolonged pronunciation
may be the default for literate speech, too (this is the case here in
Valencia). Colloquial speech definitely makes no difference between l·l
and l.

The very reason for the dot is to disambiguate between two identical
orthographies inherited from the past, without actually changing the
orthographies (i.e., dropping one l, or adopting the standard but bulky "tl"
digraph).
So, "ll" now unambiguously designates palatal l (whose IPA symbol I am
presently unable to find in Unicode; it is a turned y), coming from
colloquial words, while "l·l" unambiguously designates the possibly prolonged [l]
coming directly from Latin. Before the reform (~100 years ago), both were
written identically, which led to problems.


> In French for example we have words like "maille" to be read as
> /maj/, and the same "-ill-" written diphtongs after another vowel
> occur in Catalan.

It is written -i- (not ï nor í), occurring after some vowel, as in "mai"
(never), which sounds the same as "maille" in Parisian French.

> But French will not write "-ill-" if it occurs
> between two vowels where the two L must have the sound L (if this
> occurs in french, only 1 L is written, and the emphatic/long sound is
> not marked).

Of course not "-ill-" (why on earth would anyone introduce an -i- where
there is no reason for it?), but rather "-ll-", as in "collège" or
"parallèle". TWO L's ;-). These mirror the two most used words in Catalan
that have the ·, namely "col·legi" and "paral·lel".

And yes, similarly to Catalan, the emphatic/prolonged l sound is not
usually marked.


> Catalan has this orthograph, and writes the
> emphatic/long L distinctly. So it needs a symbol for that. The
> middle-dot is then considered in Catalan as a letter,

This is not a letter, any more than the apostrophe, which hardly anyone would
consider a letter in Romance languages (or in English, for that matter).
Note that I am _not_ saying · is like an apostrophe in Catalan (the latter
is a punctuation symbol, which separates words). But it is not a letter.
Neither are ´ or ¸.

> that will occur in the middle of words.

Specifically between L (either lower or upper-case, but not a mixture).
There are other rules, too, such as IIRC the letters surrounding the l
should be vowels (Not 100% sure here, and did not care to check).


> I don't know if the middle-dot can be used in Catalan as a candidate
> position for a line break with hyphenation:

It is.

> if yes, is it kept before
> the hyphen, or is the middle-dot used alone, or is the middle-dot
> replaced by a regular hyphen?

The latter.

> I don't know. But if the middle-dot
> must be replaced by a hyphen, then it is a punctuation (similar to
> hyphens used in compound-words).

What is the first k in a hyphenated "dicke" in German? (It becomes
"dik-ke".) At any rate, I would not tag it as "punctuation"!
Here we have a similar case: when l·l is hyphenated, the former "diglyph",
i.e. "l·", is transformed to "l". The obvious reason is that there is no
longer any need to disambiguate, since a palatalized "ll" will never be hyphenated
in Catalan (nor in Castilian, nor will "lh" in Portuguese or Occitan, nor
will "gli" in Italian).
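
To make the rule concrete, here is a small Python sketch of the break described above: at a line end the "l·" loses its dot and takes a regular hyphen, and the second l starts the next line. The function name break_at_geminate is invented for this example, and it assumes the dot is encoded as U+00B7 and that both L's share the same case.

```python
MIDDLE_DOT = "\u00B7"


def break_at_geminate(word):
    """Split a word at its first l·l (or L·L), dropping the dot.

    Returns (end_of_line, start_of_next_line), or None when the word
    contains no dotted pair. Plain "ll" is never treated as a break point.
    """
    for pair in ("l" + MIDDLE_DOT + "l", "L" + MIDDLE_DOT + "L"):
        i = word.find(pair)
        if i != -1:
            # Keep the first l, replace the dot by a hyphen, and let the
            # second l open the next line.
            return word[: i + 1] + "-", word[i + 2:]
    return None


if __name__ == "__main__":
    print(break_at_geminate("col\u00b7legi"))
    print(break_at_geminate("paral\u00b7lel"))
    print(break_at_geminate("castell"))
```

This only covers the dotted pair; a real hyphenator would of course need the other Catalan syllable-break rules as well.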


> But in Catalan, the middle dot should not be kerned into the
> preceding uppercase L, like it would appear if it was considered
> equivalent to .

Sorry, but who are you to dictate laws about kerning in Catalan?
Kerning is essentially an optional feature related to fonts, and I do not
see any reason to avoid kerning an L and a · (which would be in a title,
moreover) if the unkerned result is aesthetically unpleasant, perhaps because
the font designer did not consider the case.


> If there's something really missing for Catalan, it's a middle-dot
> letter with general category "Lo", and combining class 0 (i.e. NOT
> combining). It's unfortunate that almost all legacy Catalan text
> transcoded to Unicode are based on the middle-dot symbol (the one
> mapped in ISO-8859-1 and ISO-8859-15) which is not seen by Unicode as
> a letter (Lo) but as a symbol only.

Considered that the · is present on any Spanish keyboard these days (shift
3), and that on the other hand almost no keyboard except ancient typewriters
do h

Re: U+0140 Catalan middle-dot

2004-04-16 Thread Peter Kirk

On 15/04/2004 18:16, Philippe Verdy wrote:

...

> The Catalan middle-dot is a plain orthographic letter and should be treated as
> such, and not by borrowing a punctuation sign or symbol which may have other
> conflicting uses. What I suggested is that the general category, despite its
> weak definition, is still a good indicator of which character to use.
> So U+2027 (as well as the U+00B7 middle-dot found in ISO-8859-1/15) is not the
> exact character to represent this middle dot in all usages, ...

Philippe, before jumping to this conclusion, please can you describe to 
me EXACTLY how the shape and behaviour of the Catalan middle dot differs 
from the behaviour of U+2027 defined in Unicode Standard Annex #14, 
http://www.unicode.org/unicode/standard/reports/tr14/tr14-15.html:

> 2027  HYPHENATION POINT
>
> A hyphenation point is a raised dot, which is used primarily to
> visibly indicate syllabification of words. Syllable breaks are
> potential line break opportunities in the middle of words. It is
> mainly used in dictionaries and similar works. When an actual line
> break falls inside a word containing hyphenation point characters, the
> hyphenation point is rendered as a regular hyphen at the end of the line.

Please don't waste our time with further discussion of how various 
dictionaries indicate syllable breaks, especially when they don't use 
U+2027 at all, but rather a vertical line i.e. a quite different character.

From the descriptions which you and Anto'nio have provided and from 
http://www.tug.org/TUGboat/Articles/tb16-3/tb48vali.pdf, it seems to me 
that the Catalan behaviour is exactly as described for U+2027 in UAX 
#14, perhaps because the Catalan usage has been borrowed from dictionary 
usage or vice versa. This strongly suggests that U+2027 is the 
appropriate character for Catalan.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: U+0140

2004-04-16 Thread Peter Kirk

On 15/04/2004 16:22, Philippe Verdy wrote:

...

> > There are also, including combining middle dots (most of these listed at
> > U+00B7):
> > U+0387 GREEK ANO TELEIA
>
> wrong form? it's a small square, and is the greek semicolon, and is then
> separating words.

This should not be a small square; it should be identical to U+00B7 to 
which it is canonically equivalent.

> > U+05BC HEBREW POINT DAGESH OR MAPIQ
>
> where would you position it according to the Catalan L letter which has a
> distinct directionality, and should not inherit of the complexity of the Hebrew
> script?
> Why isn't there even U+0307 COMBINING DOT ABOVE or U+0323 COMBINING DOT BELOW in
> your list?

Surely U+05BC doesn't have inherent directionality? I thought that 
combining characters took the directionality of their base characters.

I was only including middle height dots. The list of dots in other 
positions is much longer - at least four others just in Hebrew.

I wasn't seriously suggesting any of these as suitable for Catalan, 
except possibly for HYPHENATION POINT.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: U+0140

2004-04-16 Thread Alexandros Diamantidis

* Philippe Verdy <[EMAIL PROTECTED]> [2004-04-16 01:22]:
> > U+0387 GREEK ANO TELEIA
> wrong form? it's a small square, and is the greek semicolon, and is then
> separating words.

U+0387 is canonically equivalent to U+00B7. About its shape, whether it's
square or round depends on what the full stop looks like in that font -
they should look exactly the same, only the "ano teleia" (upper dot)
should be at x-height.
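
A quick way to check the canonical equivalence is to compare normalized forms. The following is only a sketch using Python's standard unicodedata module; the helper name canonically_equivalent is invented for this example.

```python
import unicodedata

ANO_TELEIA = "\u0387"        # GREEK ANO TELEIA
MIDDLE_DOT = "\u00B7"        # MIDDLE DOT
HYPHENATION_POINT = "\u2027"  # HYPHENATION POINT


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    print(canonically_equivalent(ANO_TELEIA, MIDDLE_DOT))
    print(canonically_equivalent(HYPHENATION_POINT, MIDDLE_DOT))
    # Normalization replaces U+0387 by U+00B7 outright.
    print(unicodedata.normalize("NFC", ANO_TELEIA) == MIDDLE_DOT)
```

One practical consequence is that any normalized text loses the distinction, so the glyph difference has to come from the font or the language context rather than from the code point.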

-- 
Αλέξανδρος Διαμαντίδης * [EMAIL PROTECTED]

Re: U+0140

2004-04-15 Thread Mark E. Shoulson

Kenneth Whistler wrote:

> > > 00B7;MIDDLE DOT;Po;0;ON;N;
> > > 10101;AEGEAN WORD SEPARATOR DOT;Po;0;ON;N;
> > > 16EB;RUNIC SINGLE PUNCTUATION;Po;0;L;N;
> >
> > I was meaning to ask about this.  I'm all over not encoding Yet Another
> > middle dot, but I was wondering.  In my research on Samaritan, I've
> > found that they frequently write (you guessed it) a middle dot to
> > separate words (they like to use space to enable them to do this cool
> > columnar writing thing).  I was assuming that this could be conflated
> > with someone else's middle-dot-word-separator; would that be U+10101?
>
> As far as I am concerned, U+00B7 should be sufficient for that.

I wasn't sure if character properties or whatever made a difference, 
since this is supposed to be a word separator.  Whatever; I'm 
sufficiently confident that THIS dot, at least, won't have to be encoded.

> Note that as part of the ongoing work to cover Greek paleographic
> needs, a large number of multiple dot punctuation characters are
> currently under ballot for addition to 10646 (and Unicode). See
> 2056, 2058..205E at:
> http://www.unicode.org/alloc/Pipeline.html
>
> These are (proposed to be) encoded in the General Punctuation block to
> ensure that *everyone* is clear that their intended use is general, so we
> don't have to keep cloning more and more such dot combinations
> to handle the dot punctuation for each different paleographic
> tradition.

Yeah, everyone uses dots.  Samaritan cantillation has various colons and 
two-dot-leader looking things, and small circles... but also 
combinations, like colon-line, colon-angle, stuff like that.

~mark

Re: U+0140 Catalan middle-dot

2004-04-15 Thread Philippe Verdy

From: "Patrick Andries" <[EMAIL PROTECTED]>
> Philippe Verdy a écrit :
> >From: "Patrick Andries" <[EMAIL PROTECTED]>
> >>Peter Kirk a écrit :
> >>>What is U+2027 intended for? The name suggests that it might be what
> >>>is needed for Catalan.
> >>>[PA] Isn't this the one that should be used in dictionaries ?
> >>>
> >>See http://www.unicode.org/unicode/standard/reports/tr14/tr14-6.html
> >>2027
> >>HYPHENATION POINT
> >>Hyphenation point is primarily used to visibly indicate syllabification
> >>of words. Syllable breaks are potential line breaking opportunities in
> >>the middle of words. The hyphenation point It is mainly used in
> >>dictionaries and similar works. When an actual line break falls inside a
> >>word containing hyphenation point characters, the hyphenation point is
> >>rendered as a regular hyphen at the end of the line.
> >
> >This last sentence is wrong, at least in my Larousse dictionnaries:
> >
> I believe it simply describes certain practices (Anglo-Saxon, American
> ?), maybe this should be clearer.

This just demonstrates that the "one dot character fits all" strategy is too
simplistic. There are actual usages, in such serious publications as very common
dictionaries, of multiple dots which have their own semantics and rendering
particularities.

The Catalan middle-dot is a plain orthographic letter and should be treated as
such, and not by borrowing a punctuation sign or symbol which may have other
conflicting uses. What I suggested is that the general category, despite its
weak definition, is still a good indicator of which character to use.

So U+2027 (as well as the U+00B7 middle-dot found in ISO-8859-1/15) is not the
exact character to represent this middle dot in all usages, even if there's an
important legacy history of using the ISO-8859-1 middle-dot in Catalan (or a
legacy use of L-middle-dot in ISO 6937, which was defined just for convenience
with older technologies that could not acceptably display the L plus middle-dot
sequence in Catalan due to the excessive space. So a ligature was probably
preferable in the Videotex context.) My opinion is that U+0140 already meant in
Teletext or Videotex two abstract characters even for Catalan readers (and this
can explain why there's a compatibility decomposition, as a legacy acceptable
but poor fallback).

The other reason is that the middle-dot, being a punctuation mark, would be likely
to have extra spacing on both sides, which would make it inappropriate for
rendering Catalan words. Also, such punctuation would probably forbid kerning
the middle-dot into the open area of an uppercase L, something which would be
acceptable for reading Catalan (as it was acceptable with U+013F in
Teletext/Videotex).

I looked for handwritten forms of two lowercase l with an intermediate middle
dot, and they clearly show that Catalans write them without extra spacing: the dot
fits well within the open area between the connecting baseline and the two
ascending loops (and sometimes it appears as a horizontal or slanted medial
stroke that connects the two loops, or as a ligature of the two lowercase l
letters, or the dot is put within the ascending loop of the first l). I don't
know which form Catalan children learn at school to write the
three characters correctly, or whether they are taught that this dot is a
diacritic or a special hyphen...
My readings only show that there's no such L-with-middle-dot in the Catalan
alphabet, and that it is most often not considered a letter even though it
represents a distinctive sound.

An interesting article about Catalan typesetting with TeX is on:
http://www.tug.org/TUGboat/Articles/tb16-3/tb48vali.pdf

* It notes that the usual middle dot (which normally appears halfway between the
baseline and the x-height) is not exactly what is needed for Catalan (where it
should be placed halfway between the height of the current middle-dot and the
ascender height).
* Another feature is that the dot should be equidistant from the two vertical
stems of the lowercase or uppercase L, which keep the normal distance they would
have in the absence of this dot.
* So the dot is naturally kerned into the first uppercase L, but usually not
between lowercase letters, where it takes its space within the inter-letter
spacing.
* It also discusses the allowed hyphenations and their correct rendering...

Re: U+0140

2004-04-15 Thread Kenneth Whistler


> >00B7;MIDDLE DOT;Po;0;ON;N;
> >10101;AEGEAN WORD SEPARATOR DOT;Po;0;ON;N;
> >16EB;RUNIC SINGLE PUNCTUATION;Po;0;L;N;

> I was meaning to ask about this.  I'm all over not encoding Yet Another 
> middle dot, but I was wondering.  In my research on Samaritan, I've 
> found that they frequently write (you guessed it) a middle dot to 
> separate words (they like to use space to enable them to do this cool 
> columnar writing thing).  I was assuming that this could be conflated 
> with someone else's middle-dot-word-separator; would that be U+10101?

As far as I am concerned, U+00B7 should be sufficient for that.

But if you were looking for a punctuation mark distinguished from
U+00B7, specifically for archaic textual practice, my choice
would be U+16EB (and the Runic double dot, U+16EC) as an
alternative. Scripts.txt treats these as common punctuation:

16EB..16ED; Common # Po   [3] RUNIC SINGLE PUNCTUATION..RUNIC CROSS PUNCTUATION

Unfortunately, software may be making over-aggressive assumptions
about script identity in some cases, which can throw off
implementations that pick up punctuation out of another script
block.

Note that as part of the ongoing work to cover Greek paleographic
needs, a large number of multiple dot punctuation characters are
currently under ballot for addition to 10646 (and Unicode). See
2056, 2058..205E at:

http://www.unicode.org/alloc/Pipeline.html

These are (proposed to be) encoded in the General Punctuation block to 
ensure that *everyone* is clear that their intended use is general, so we
don't have to keep cloning more and more such dot combinations
to handle the dot punctuation for each different paleographic
tradition.

--Ken

Re: U+0140

2004-04-15 Thread Asmus Freytag

At 03:31 PM 4/15/2004, Peter Kirk wrote:
> [PA] Isn't this the one that should be used in dictionaries ?
>
> See http://www.unicode.org/unicode/standard/reports/tr14/tr14-6.html

Why are you guys citing the 1999 (!) version of this TR?

It's 2004, Unicode 4.0.1 has been published and we are up to 
http://www.unicode.org/unicode/standard/reports/tr14/tr14-15.html.

While the text is not vastly different, there's at least one textual fix 
in the section cited.

A./

Re: U+0140

2004-04-15 Thread Patrick Andries

Philippe Verdy a écrit :

> From: "Patrick Andries" <[EMAIL PROTECTED]>
>
> > Peter Kirk a écrit :
> >
> > > What is U+2027 intended for? The name suggests that it might be what
> > > is needed for Catalan.
> > > [PA] Isn't this the one that should be used in dictionaries ?
> >
> > See http://www.unicode.org/unicode/standard/reports/tr14/tr14-6.html
> > 2027
> > HYPHENATION POINT
> > Hyphenation point is primarily used to visibly indicate syllabification
> > of words. Syllable breaks are potential line breaking opportunities in
> > the middle of words. The hyphenation point It is mainly used in
> > dictionaries and similar works. When an actual line break falls inside a
> > word containing hyphenation point characters, the hyphenation point is
> > rendered as a regular hyphen at the end of the line.
>
> This last sentence is wrong, at least in my Larousse dictionnaries:

I believe it simply describes certain practices (Anglo-Saxon, American 
?), maybe this should be clearer.

P. A.

Re: U+0140

2004-04-15 Thread Philippe Verdy

From: "Patrick Andries" <[EMAIL PROTECTED]>


> Peter Kirk a écrit :
>
> > What is U+2027 intended for? The name suggests that it might be what
> > is needed for Catalan.
> > [PA] Isn't this the one that should be used in dictionaries ?
>
> See http://www.unicode.org/unicode/standard/reports/tr14/tr14-6.html
> 2027
> HYPHENATION POINT
> Hyphenation point is primarily used to visibly indicate syllabification
> of words. Syllable breaks are potential line breaking opportunities in
> the middle of words. The hyphenation point It is mainly used in
> dictionaries and similar works. When an actual line break falls inside a
> word containing hyphenation point characters, the hyphenation point is
> rendered as a regular hyphen at the end of the line.

This last sentence is wrong, at least in my Larousse dictionaries:

For example, look at the entry for "Blattfeder".
The entry is in fact "Blatt|feder" with a thin vertical line delimiting
radicals.

This entry has a subitem for "Blattlauskäfer", noted:
...   ||   °~laus*-
käfer ...
where the '*' above is in fact the hyphenation point, and the '-' is a regular
hyphen added because there's a line break (additionally, the degree symbol '°'
indicates that the radical symbolized by the long tilde '~' must have a capital
initial letter). There is then no mutation of the hyphenation point into a
regular hyphen when there's a line break. Clearly, the hyphenation point is a
notation that is not part of the normal orthography, unlike the regular hyphen
at the end of lines, which would also appear in normal text outside the
dictionary entries, so when line breaks occur, both symbols are used together.

This hyphenation point, used in German dictionaries for verbs with particles or
for nouns and adjectives with prefixes, is thicker than a sentence-ending dot or
period, and drawn above the baseline, but it is not a middle-dot, as its position
is at the x-height. It is too thick and too high to be the Catalan middle-dot...

Re: U+0140

2004-04-15 Thread Mark E. Shoulson

Kenneth Whistler wrote:

Philippe opined:

 

If there's something really missing for Catalan, it's a middle-dot letter with
general category "Lo", and combining class 0 (i.e. NOT combining). 
   

The one thing for sure is that the Unicode Standard does not need
to encode more middle dots:
00B7;MIDDLE DOT;Po;0;ON;N;
0701;SYRIAC SUPRALINEAR FULL STOP;Po;0;AL;N;
1427;CANADIAN SYLLABICS FINAL MIDDLE DOT;Lo;0;L;N;
22C5;DOT OPERATOR;Sm;0;ON;N;
2F02;KANGXI RADICAL DOT;So;0;ON; 4E36N;
302E;HANGUL SINGLE DOT TONE MARK;Mn;224;NSM;N;
30FB;KATAKANA MIDDLE DOT;Pc;0;ON;N;
FE45;SESAME DOT;Po;0;ON;N;
FF65;HALFWIDTH KATAKANA MIDDLE DOT;Pc;0;ON; 30FBN;
10101;AEGEAN WORD SEPARATOR DOT;Po;0;ON;N;
1D16D;MUSICAL SYMBOL COMBINING AUGMENTATION DOT;Mc;226;L;N;
2027;HYPHENATION POINT;Po;0;ON;N;
16EB;RUNIC SINGLE PUNCTUATION;Po;0;L;N;
1802;MONGOLIAN COMMA;Po;0;ON;N;
318D;HANGUL LETTER ARAEA;Lo;0;L; 119EN;HANGUL LETTER ALAE A
1D01B;BYZANTINE MUSICAL SYMBOL KENTIMA ARCHAION;So;0;L;N;
I was meaning to ask about this.  I'm all over not encoding Yet Another 
middle dot, but I was wondering.  In my research on Samaritan, I've 
found that they frequently write (you guessed it) a middle dot to 
separate words (they like to use space to enable them to do this cool 
columnar writing thing).  I was assuming that this could be conflated 
with someone else's middle-dot-word-separator; would that be U+10101?

~mark

Re: U+0140

2004-04-15 Thread Philippe Verdy


- Original Message - 
From: "Peter Kirk" <[EMAIL PROTECTED]>
To: "Kenneth Whistler" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Friday, April 16, 2004 12:03 AM
Subject: Re: U+0140


> On 15/04/2004 12:32, Kenneth Whistler wrote:
>
> >Philippe opined:
> >
> >
> >
> >>If there's something really missing for Catalan, it's a middle-dot letter with
> >>general category "Lo", and combining class 0 (i.e. NOT combining).
> >>
> >>
> >
> >The one thing for sure is that the Unicode Standard does not need
> >to encode more middle dots:
> >
> >00B7;MIDDLE DOT;Po;0;ON;N;
> >0701;SYRIAC SUPRALINEAR FULL STOP;Po;0;AL;N;
> >1427;CANADIAN SYLLABICS FINAL MIDDLE DOT;Lo;0;L;N;
> >22C5;DOT OPERATOR;Sm;0;ON;N;
> >2F02;KANGXI RADICAL DOT;So;0;ON; 4E36N;
> >302E;HANGUL SINGLE DOT TONE MARK;Mn;224;NSM;N;
> >30FB;KATAKANA MIDDLE DOT;Pc;0;ON;N;
> >FE45;SESAME DOT;Po;0;ON;N;
> >FF65;HALFWIDTH KATAKANA MIDDLE DOT;Pc;0;ON; 30FBN;
> >10101;AEGEAN WORD SEPARATOR DOT;Po;0;ON;N;
> >1D16D;MUSICAL SYMBOL COMBINING AUGMENTATION DOT;Mc;226;L;N;
> >2027;HYPHENATION POINT;Po;0;ON;N;
> >16EB;RUNIC SINGLE PUNCTUATION;Po;0;L;N;
> >1802;MONGOLIAN COMMA;Po;0;ON;N;
> >318D;HANGUL LETTER ARAEA;Lo;0;L; 119EN;HANGUL LETTER ALAE A
> >1D01B;BYZANTINE MUSICAL SYMBOL KENTIMA ARCHAION;So;0;L;N;
> >
> >(and that's not considering the lowered dots "FULL STOP" and the raised
> >dots)
> >
> >
> >
> There are also, including combining middle dots (most of these listed at
> U+00B7):
>
> U+0387 GREEK ANO TELEIA
wrong form? it's a small square, and is the greek semicolon, and is then
separating words.

> U+05BC HEBREW POINT DAGESH OR MAPIQ
where would you position it according to the Catalan L letter which has a
distinct directionality, and should not inherit of the complexity of the Hebrew
script?
Why isn't there even U+0307 COMBINING DOT ABOVE or U+0323 COMBINING DOT BELOW in
your list?

> U+2022 BULLET
too thick, and it is a word-breaking symbol with a candidate line break on
either side. Most often it is a bullet at the beginning of a sub-paragraph, but
it can also be used to separate multiple titles (think of the track titles on an
audio CD), or in dictionaries and many other publications as a symbol
that serves as a source anchor for a note.

> U+2024 ONE DOT LEADER
this is a spacing character, mostly a punctuation, and clearly word-breaking...

> U+2219 BULLET OPERATOR
this is a symbol with an evident word break on either side (think about
mathematical formulas)

> U+2027 HYPHENATION POINT
a good suggestion if this was not a punctuation... What is the exact status of
this character? When I look into the UCD properties I see that:
French name: POINT DE COUPURE DE MOT
GC=Po: punctuation, other [not even a "connecting" Pc like the ASCII
underscore], so a separator of words
CC=0: not combining [OK]
BD=ON: order neutral [OK]
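
For reference, the same three properties can be read back with Python's standard unicodedata module. This is just a sketch: the helper name ucd_summary is made up here, and the values come from whatever UCD version the interpreter bundles rather than from the 4.0.1 files.

```python
import unicodedata


def ucd_summary(ch):
    """Collect the properties discussed above for one character."""
    return {
        "name": unicodedata.name(ch),
        "gc": unicodedata.category(ch),        # general category
        "ccc": unicodedata.combining(ch),      # canonical combining class
        "bidi": unicodedata.bidirectional(ch),  # bidirectional class
    }


if __name__ == "__main__":
    for ch in ("\u2027", "\u00B7"):
        print("U+%04X" % ord(ch), ucd_summary(ch))
```

On these properties alone U+2027 and U+00B7 look identical, so the properties read here do not separate the two.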

> What is U+2027 intended for? The name suggests that it might be what is
> needed for Catalan.
I think that this is better seen as an annotation used in dictionaries to mark
visually the position of candidate syllable breaks (unlike the soft hyphen,
which is normally not rendered except where the candidate line break is
realized).

Many dictionaries prefer a thin vertical line which extends from the descender
to the ascender (and in fact there are fonts where this character is drawn like
this), which is not the same as the ASCII vertical line, which is shorter and
often thicker. This notation symbol could be used in addition to and
immediately after the Catalan middle-dot...
My Larousse Catalan-French pocket dictionary uses a very thin vertical line to
mark word terminations and prefixes/suffixes, in combination with an orthographic
middle-dot in the Catalan word, which is always noted.

Question here: is that vertical line used in Larousse really the same as U+007C?
In the same context I note that the ASCII TILDE (a large version aligned on the
baseline) is used to note the common radical indicated by the vertical line
symbol that separates prefixes and suffixes from the radical of the entry word...

In the same dictionary, the vertical line is also used, singly or in a pair,
and surrounded by em spaces, as a separator between definition items, to
group them by semantic proximity; but in that case the vertical line is thicker
and does not extend below the baseline, so this separator looks more like a true
U+007C, i.e. a regular punctuation mark, with candidate line breaks occurring both
before and after it (in fact at the position of the surrounding c

Re: U+0140

2004-04-15 Thread Peter Kirk

On 15/04/2004 15:13, Patrick Andries wrote:

> Peter Kirk a écrit :
>
> > What is U+2027 intended for? The name suggests that it might be what
> > is needed for Catalan.
>
> [PA] Isn't this the one that should be used in dictionaries ?
>
> See http://www.unicode.org/unicode/standard/reports/tr14/tr14-6.html
>
> 2027  HYPHENATION POINT
>
> Hyphenation point is primarily used to visibly indicate
> syllabification of words. Syllable breaks are potential line breaking
> opportunities in the middle of words. The hyphenation point It is
> mainly used in dictionaries and similar works. When an actual line
> break falls inside a word containing hyphenation point characters, the
> hyphenation point is rendered as a regular hyphen at the end of the line.

Well, this sounds just like the required behaviour for Catalan, as 
described by Anto'nio Martins-Tuva'lkin on 28th March. He wrote:

> Something happens when the "L·L" coincides with a soft line end. I'm
> no expert in Catalan typesetting but IIRC the dot becomes a hyphen,
> while regular "LL"s cannot be broken.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: U+0140

2004-04-15 Thread Patrick Andries

Peter Kirk a écrit :

> What is U+2027 intended for? The name suggests that it might be what
> is needed for Catalan.

[PA] Isn't this the one that should be used in dictionaries ?

See http://www.unicode.org/unicode/standard/reports/tr14/tr14-6.html

2027  HYPHENATION POINT

Hyphenation point is primarily used to visibly indicate syllabification 
of words. Syllable breaks are potential line breaking opportunities in 
the middle of words. The hyphenation point It is mainly used in 
dictionaries and similar works. When an actual line break falls inside a 
word containing hyphenation point characters, the hyphenation point is 
rendered as a regular hyphen at the end of the line.

Re: U+0140

2004-04-15 Thread Peter Kirk

On 15/04/2004 12:32, Kenneth Whistler wrote:

> Philippe opined:
>
> > If there's something really missing for Catalan, it's a middle-dot letter with
> > general category "Lo", and combining class 0 (i.e. NOT combining).
>
> The one thing for sure is that the Unicode Standard does not need
> to encode more middle dots:
>
> 00B7;MIDDLE DOT;Po;0;ON;N;
> 0701;SYRIAC SUPRALINEAR FULL STOP;Po;0;AL;N;
> 1427;CANADIAN SYLLABICS FINAL MIDDLE DOT;Lo;0;L;N;
> 22C5;DOT OPERATOR;Sm;0;ON;N;
> 2F02;KANGXI RADICAL DOT;So;0;ON; 4E36N;
> 302E;HANGUL SINGLE DOT TONE MARK;Mn;224;NSM;N;
> 30FB;KATAKANA MIDDLE DOT;Pc;0;ON;N;
> FE45;SESAME DOT;Po;0;ON;N;
> FF65;HALFWIDTH KATAKANA MIDDLE DOT;Pc;0;ON; 30FBN;
> 10101;AEGEAN WORD SEPARATOR DOT;Po;0;ON;N;
> 1D16D;MUSICAL SYMBOL COMBINING AUGMENTATION DOT;Mc;226;L;N;
> 2027;HYPHENATION POINT;Po;0;ON;N;
> 16EB;RUNIC SINGLE PUNCTUATION;Po;0;L;N;
> 1802;MONGOLIAN COMMA;Po;0;ON;N;
> 318D;HANGUL LETTER ARAEA;Lo;0;L; 119EN;HANGUL LETTER ALAE A
> 1D01B;BYZANTINE MUSICAL SYMBOL KENTIMA ARCHAION;So;0;L;N;
>
> (and that's not considering the lowered dots "FULL STOP" and the raised
> dots)

There are also, including combining middle dots (most of these listed at 
U+00B7):

U+0387 GREEK ANO TELEIA
U+05BC HEBREW POINT DAGESH OR MAPIQ
U+2022 BULLET
U+2024 ONE DOT LEADER
U+2027 HYPHENATION POINT
U+2219 BULLET OPERATOR

What is U+2027 intended for? The name suggests that it might be what is 
needed for Catalan.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: U+0140

2004-04-15 Thread Philippe Verdy

From: "Kenneth Whistler" <[EMAIL PROTECTED]>
> Philippe opined:
>
> > If there's something really missing for Catalan, it's a middle-dot letter with
> > general category "Lo", and combining class 0 (i.e. NOT combining).
>
> The one thing for sure is that the Unicode Standard does not need
> to encode more middle dots:
>
> 00B7;MIDDLE DOT;Po;0;ON;N;
> 0701;SYRIAC SUPRALINEAR FULL STOP;Po;0;AL;N;
> 1427;CANADIAN SYLLABICS FINAL MIDDLE DOT;Lo;0;L;N;
> 22C5;DOT OPERATOR;Sm;0;ON;N;
> 2F02;KANGXI RADICAL DOT;So;0;ON; 4E36N;
> 302E;HANGUL SINGLE DOT TONE MARK;Mn;224;NSM;N;
> 30FB;KATAKANA MIDDLE DOT;Pc;0;ON;N;
> FE45;SESAME DOT;Po;0;ON;N;
> FF65;HALFWIDTH KATAKANA MIDDLE DOT;Pc;0;ON; 30FBN;
> 10101;AEGEAN WORD SEPARATOR DOT;Po;0;ON;N;
> 1D16D;MUSICAL SYMBOL COMBINING AUGMENTATION DOT;Mc;226;L;N;
> 2027;HYPHENATION POINT;Po;0;ON;N;
> 16EB;RUNIC SINGLE PUNCTUATION;Po;0;L;N;
> 1802;MONGOLIAN COMMA;Po;0;ON;N;
> 318D;HANGUL LETTER ARAEA;Lo;0;L; 119EN;HANGUL LETTER ALAE A
> 1D01B;BYZANTINE MUSICAL SYMBOL KENTIMA ARCHAION;So;0;L;N;
>
> (and that's not considering the lowered dots "FULL STOP" and the raised
> dots)

In that set there's only one "letter" (1427; Canadian Syllabics Final Middle
Dot), which has the wrong script, although it is an appropriate "Lo" that would
find a very unusual application for Catalan.

I set aside the rest (including 2027, the hyphenation point, which regrettably is
a punctuation mark, not a letter, and not explicitly "middle", meaning that it
would render inappropriately for Catalan, although it still represents the
Catalan function of this character).

Re: U+0140

2004-04-15 Thread Kenneth Whistler

Philippe opined:

> If there's something really missing for Catalan, it's a middle-dot letter with
> general category "Lo", and combining class 0 (i.e. NOT combining). 

The one thing for sure is that the Unicode Standard does not need
to encode more middle dots:

00B7;MIDDLE DOT;Po;0;ON;;;;;N;;;;;
0701;SYRIAC SUPRALINEAR FULL STOP;Po;0;AL;;;;;N;;;;;
1427;CANADIAN SYLLABICS FINAL MIDDLE DOT;Lo;0;L;;;;;N;;;;;
22C5;DOT OPERATOR;Sm;0;ON;;;;;N;;;;;
2F02;KANGXI RADICAL DOT;So;0;ON;<compat> 4E36;;;;N;;;;;
302E;HANGUL SINGLE DOT TONE MARK;Mn;224;NSM;;;;;N;;;;;
30FB;KATAKANA MIDDLE DOT;Pc;0;ON;;;;;N;;;;;
FE45;SESAME DOT;Po;0;ON;;;;;N;;;;;
FF65;HALFWIDTH KATAKANA MIDDLE DOT;Pc;0;ON;<narrow> 30FB;;;;N;;;;;
10101;AEGEAN WORD SEPARATOR DOT;Po;0;ON;;;;;N;;;;;
1D16D;MUSICAL SYMBOL COMBINING AUGMENTATION DOT;Mc;226;L;;;;;N;;;;;
2027;HYPHENATION POINT;Po;0;ON;;;;;N;;;;;
16EB;RUNIC SINGLE PUNCTUATION;Po;0;L;;;;;N;;;;;
1802;MONGOLIAN COMMA;Po;0;ON;;;;;N;;;;;
318D;HANGUL LETTER ARAEA;Lo;0;L;<compat> 119E;;;;N;HANGUL LETTER ALAE A;;;;
1D01B;BYZANTINE MUSICAL SYMBOL KENTIMA ARCHAION;So;0;L;;;;;N;;;;;

(and that's not considering the lowered dots "FULL STOP" and the raised
dots)
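
As a rough illustration of the properties shown in the listing, the general category and combining class of a few of these dots can be read from Python's standard `unicodedata` module. This is only a sketch: the selection of characters and the helper name `describe` are assumptions made for the example, and the values come from whichever Unicode version the Python build ships, so they may not match the 2004 data quoted above in every case.

```python
import unicodedata

# A small, illustrative subset of the dots listed above.
DOTS = ["\u00B7", "\u1427", "\u2027", "\u22C5"]


def describe(ch):
    """Return (name, general category, canonical combining class) for one character."""
    return (unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch))


for ch in DOTS:
    name, gc, ccc = describe(ch)
    print(f"U+{ord(ch):04X} {name}: gc={gc}, ccc={ccc}")
```

Of the four characters sampled here, only U+1427 is reported as a letter ("Lo"); U+00B7 and U+2027 are both punctuation ("Po").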

> It's unfortunate that almost all legacy Catalan text transcoded to
> Unicode is based on the middle-dot symbol (the one mapped in
> ISO-8859-1 and ISO-8859-15), which is not seen by Unicode as a
> letter (Lo) but as a symbol only.

Actually, that is *fortunate*, not unfortunate, since it is the
correct conversion from 8859-1 (and Windows 1252) data.
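
A minimal sketch of that conversion, assuming the Catalan word "paral·lel" stored as legacy bytes (the sample word and the helper name `to_unicode` are chosen here for illustration only): byte 0xB7 decodes to U+00B7 under both ISO-8859-1 and Windows 1252, so U+0140 never appears in the converted text.

```python
# "paral·lel" as it would be stored in ISO-8859-1 or Windows 1252:
# the middle dot is the single byte 0xB7.
LEGACY_BYTES = b"paral\xb7lel"


def to_unicode(data, encoding="iso-8859-1"):
    """Decode legacy bytes to a Unicode string using the given codec."""
    return data.decode(encoding)


text = to_unicode(LEGACY_BYTES)
# Show the code points of the l, middle dot, l sequence.
print([f"U+{ord(c):04X}" for c in text[4:7]])
```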

How U+00B7 behaves in Catalan data is then a matter of local
*adaptation* of software for the correct handling of the Catalan
language.

Note that while the particular combination <006C, 00B7, 006C> is
a peculiarity of Catalan orthography, U+00B7 MIDDLE DOT (often
called a 'raised period') is very widely used, indeed, in technical
orthographies for many languages, particularly in the Americas,
where it is used much more commonly than the IPA characters
U+02D0 MODIFIER LETTER TRIANGULAR COLON or U+02D1 MODIFIER LETTER
HALF TRIANGULAR COLON to indicate vocalic (or less commonly,
consonantal) length.

Obsessing about the behavior of U+00B7 in Catalan data while
ignoring its use as a vowel length indicator in many, many
other orthographies is rather pointless, it seems to me.

--Ken

Re: U+0140

2004-04-15 Thread Kenneth Whistler


> Did you get an answer on this ? Why is there no decomposition associated 
> to this character ?

Thanks to Eric and Patrick for digging out my answer on this perennial
question from a couple years back, and saving me the trouble of
having to rummage around to find it. :-)

Also, it should be noted that there *is* a decomposition for
U+0140 in the Unicode Character Database, to wit:

0140;LATIN SMALL LETTER L WITH MIDDLE DOT;Ll;0;L;<compat> 006C 00B7;...
                                                 ^^^^^^^^
 
It is a compatibility decomposition for two reasons: the decomposition
into the sequence <006C, 00B7> may result in rendering differences
(both because of potentially different decisions about where to
render the dot and because the introduction of the U+00B7 MIDDLE DOT
might impact line break decisions, depending on the implementation);
secondly, the properties of the characters in the sequence
<006C, 00B7> are distinct from those for <0140> by itself, and
may impact things such as identifier parsing, again, depending on
an implementation. And, as I indicated before, U+0140 is itself
basically a compatibility character, introduced for mapping to
ISO 6937, a preexisting standard that was among the list of
character encoding standards intended to be covered by the initial
Unicode repertoire.

The character *was* in ISO 6937 for Catalan. Noting the Catalan
association in the Unicode names list is different from any
recommendation that U+0140 is the preferred character for the
representation of l followed by a middle dot in Catalan text.
Most existing Catalan data (8859-1, Windows 1252, primarily)
would not use it, of course. Converted to Unicode, that data would
also not use it, but be represented as the sequence <006C, 00B7>.
And there is every expectation that new data created in Unicode
would continue to use such a sequence for Catalan.

--Ken

Re: U+0140

2004-04-15 Thread Patrick Andries

Kenneth Whistler wrote:

> > Did you get an answer on this? Why is there no decomposition associated
> > to this character?
>
> Thanks to Eric and Patrick for digging out my answer on this perennial
> question from a couple years back, and saving me the trouble of
> having to rummage around to find it. :-)
>
> Also, it should be noted that there *is* a decomposition for
> U+0140 in the Unicode Character Database, to wit:
>
> 0140;LATIN SMALL LETTER L WITH MIDDLE DOT;Ll;0;L;<compat> 006C 00B7;...
>                                                  ^^^^^^^^
 

Oops. Looked at the wrong place in BabelMap.

Sorry (blushing).

Patrick

Re: U+0140

2004-04-15 Thread Philippe Verdy

From: "Patrick Andries" <[EMAIL PROTECTED]>
> Anto'nio Martins-Tuva'lkin wrote:
> >>However I advise removal of the note "Catalan" under U+0140 and
> >>U+013F, and perhaps replacement of the whole note with «for Catalan
> >>use U+006C U+00B7» (resp. U+004C).
> >>
> Did you get an answer on this ? Why is there no decomposition associated
> to this character ?
>
> Also did someone mention why U+0140 is even in Unicode since it could
> be considered (by ignorami like me) as a precomposed character (l +
> middle dot) ? Is it due to the polysemy of the middle dot ?

I thought it was already answered in this list by a Catalan speaking
contributor: the sequence L+middle-dot in Catalan is NOT a combining sequence.
The middle dot in Catalan plays a role similar to a hyphen between syllables,
to mark a distinction from words where, for example, a double L would create an
alternate reading. The dot indicates that each L must be read distinctly (or
read with a long or emphatic L).

In French, for example, we have words like "maille" to be read as /maj/, and the
same written "-ill-" diphthongs after another vowel occur in Catalan. But French
will not write "-ill-" if it occurs between two vowels where the two L must have
the sound L (if this occurs in French, only one L is written, and the
emphatic/long sound is not marked). Catalan has this orthography, and writes the
emphatic/long L distinctly. So it needs a symbol for that. The middle-dot is
then considered in Catalan as a letter that will occur in the middle of words.

I don't know if the middle-dot can be used in Catalan as a candidate position for
a line break with hyphenation: if yes, is it kept before the hyphen, or is the
middle-dot used alone, or is the middle-dot replaced by a regular hyphen? I
don't know. But if the middle-dot must be replaced by a hyphen, then it is
punctuation (similar to hyphens used in compound words).

But in Catalan, the middle dot should not be kerned into the preceding uppercase
L, like it would appear if it was considered equivalent to .
Catalan has no use of such decomposition, and if such decomposition had existed,
it would have been into L + combining left-middle-dot, and not the same
character.

If there's something really missing for Catalan, it's a middle-dot letter with
general category "Lo", and combining class 0 (i.e. NOT combining). It's
unfortunate that almost all legacy Catalan text transcoded to Unicode is based
on the middle-dot symbol (the one mapped in ISO-8859-1 and ISO-8859-15) which is
not seen by Unicode as a letter (Lo) but as a symbol only.

Re: U+0140

2004-04-15 Thread Patrick Andries

Philippe Verdy wrote:

> From: "Patrick Andries" <[EMAIL PROTECTED]>
> > Anto'nio Martins-Tuva'lkin wrote:
> > >>However I advise removal of the note "Catalan" under U+0140 and
> > >>U+013F, and perhaps replacement of the whole note with «for Catalan
> > >>use U+006C U+00B7» (resp. U+004C).
> > >>
> > Did you get an answer on this? Why is there no decomposition associated
> > to this character?
> >
> > Also did someone mention why U+0140 is even in Unicode since it could
> > be considered (by ignorami like me) as a precomposed character (l +
> > middle dot)? Is it due to the polysemy of the middle dot?
>
> I thought it was already answered in this list by a Catalan speaking
> contributor: the sequence L+middle-dot in Catalan is NOT a combining sequence.

Are you referring to the person I quoted? Why doesn't U+0140 have a
decomposition in Unicode?

P. A.

Re: U+0140

2004-04-15 Thread Ernest Cline




> [Original Message]
> From: Patrick Andries <[EMAIL PROTECTED]>
>
> did someone mention why U+0140 is even in Unicode since it could 
> be considered (by ignorami like me) as a precomposed character
> (l + middle dot) ? Is it due to the polysemy of the middle dot ?

More likely it is due to this character being found in legacy
character encodings.  While I don't believe it is in any of the
ISO 8859 character sets, this character (and U+013F) is found
in the ISO 6937 Videotex standard, and probably others as well.

Re: U+0140

2004-04-15 Thread Patrick Andries

Patrick Andries wrote:

> Anto'nio Martins-Tuva'lkin wrote:
>
> > However I advise removal of the note "Catalan" under U+0140 and
> > U+013F, and perhaps replacement of the whole note with «for Catalan
> > use U+006C U+00B7» (resp. U+004C).
>
> Did you get an answer on this? Why is there no decomposition
> associated to this character?
>
> Also did someone mention why U+0140 is even in Unicode since it could
> be considered (by ignorami like me) as a precomposed character (l +
> middle dot)? Is it due to the polysemy of the middle dot?

[PA] In the meantime Eric Muller forwarded some answers (dating back
to 6/8/2002) in which Ken explains all this. Thank you, Eric.

«
There is no particular reason to use the
l· as a single character, when all the 8859-based and Windows 1252
implementations would be using U+00B7 for the middle dot.
Consider U+0140 as effectively a compatibility character for
ISO 6937. It is mapped to 0xF7 in that standard. It is also
mapped to 0xA9A8 in Code Page 949 (Korean) -- which probably got
it from ISO 6937 in the first place.

> Is U+0140 used in other languages?

Not that I know of.

--Ken
»

Patrick

Re: U+0140

2004-04-15 Thread Patrick Andries

Anto'nio Martins-Tuva'lkin wrote:

> However I advise removal of the note "Catalan" under U+0140 and
> U+013F, and perhaps replacement of the whole note with «for Catalan
> use U+006C U+00B7» (resp. U+004C).

Did you get an answer on this? Why is there no decomposition associated 
to this character?

Also did someone mention why U+0140 is even in Unicode since it could 
be considered (by ignorami like me) as a precomposed character (l + 
middle dot)? Is it due to the polysemy of the middle dot?

P. A.

Re: U+0140

2004-03-29 Thread Anto'nio Martins-Tuva'lkin

On 2004.03.28, 22:25, Philippe Verdy <[EMAIL PROTECTED]> wrote:

>> More like a letter, from a typography point of view.
>
> Not really, if it can be freely changed into a regular hyphen at
> line breaks; now your comments interestingly make me think about an
> explicit and visible syllable break.

"If" indeed -- something of which I am not sure; for I wrote:

>> I'm no expert in Catalan typesetting but IIRC the dot becomes a
>> hyphen, while regular "LL"s cannot be broken. I could ask about
>> this in Catalonia

"IIRC" means "if I recall correctly". So, do speculate at will, but
please do not misquote me!

The substance of "Catalan middle dot vs. hyphenation" is interesting
but not relevant to the requested editing of the comments under U+0140 in
the standard.

OTOH, I maintain that Catalan middle dot is indeed to be treated like
a letter, from a typography point of view -- namely for word counting
and selecting purposes.

> I suppose that in Catalan, one could use the middle dot to mark this
> syllable break in words like "kilo.octet".

OK, now you are genuinely joking, aren't you?! If you haven't yet
grasped it earlier, Catalan middle dot is to be found only between
"L"s, for the already explained reasons. I see that one must indeed
take your statements in matters unknown with a very liberal pinch of
salt. :-(


Re: U+0140

2004-03-28 Thread Philippe Verdy


- Original Message - 
From: "Anto'nio Martins-Tuva'lkin" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Sunday, March 28, 2004 7:02 PM
Subject: Re: U+0140


> On 2004.03.27, 11:12, Philippe Verdy <[EMAIL PROTECTED]> wrote:
>
> >>> This becomes evident when composing with extra-space between
> >>> letters: there is no "tie" between the first "L" and the dot.
> >
> > Interesting comment, because I had always thought that this
> > middle-dot was a modifier of the previous L,
>
> That was apparently the whole idea behind the first implementation of
> this character. (Where does it come from? MacWestern? No ISO 8859 part
> covers it, AFAIK.)
>
> > and I didn't think about syllabic hyphenation.
>
> You're not supposed to. But people creating encodings should have done
> more than just grab glyphs from assorted text. (Too bad that the few
> people who can do it seriously are not rewarded for it...)
>
> >>> Using this character for Catalan texts additionally causes
> >>> hyphenation problems.
> >
> > So what would be the "hyphenation problems"?
>
> Something happens when the "L·L" coincides with a soft line end. I'm
> no expert in Catalan typesetting but IIRC the dot becomes a hyphen,
> while regular "LL"s cannot be broken.
>
> I could ask about this in Catalonia, as could many of us, but it falls
> outside the scope of Unicode.
>
> > Also what is the normal placement of the middle-dot after a
> > uppercase L letter, doesn't it kern into the space above the
> > horizontal bar?
>
> Kerning is kerning, right. What is the normal placement of a "V" after
> an "A", or a "º" after a "."?... They are separate characters, and
> kerning is not a matter for Unicode.
>
> > If I understand what you say here, that it's not a diacritic that
> > modifies that first L,
>
> Yes, it is not.
>
> > so that this middle-dot is effectively a orthographic hyphen similar
> > in essence to other orthographic hyphens that are used to create
> > compound words, or to mark the inversion of the verb and pronominal
> > subject
>
> More or less, yes. But while this kind of hyphens and apostrophes
> separate two "words", the Catalan middle dot between two "L"s does not.
>
> > But in that case, is that middle-dot to be considered as a regular
> > punctuation mark in Catalan?
>
> More like a letter, from a typography point of view.

Not really, if it can be freely changed into a regular hyphen at line breaks;
now your comments interestingly make me think about an explicit and visible
syllable break.

That is not too far from the hyphen used between two parts of
a compound word (which interestingly tends to disappear in modern
spellings of lots of compound words, such as "presse-papier" in French,
where the hyphen is needed between what is originally a verb and a noun
to build a single noun, and which some now write as a single word
"pressepapier" as it simplifies the rule for plural marks, or for neologisms
like "kilo-octet", now more often written "kilooctet" even though it causes
problems for the separate pronunciation of the double vowel "oo").
I suppose that in Catalan, one could use the middle dot to mark this
syllable break in words like "kilo.octet".

But the question of word-breaks is highly context-sensitive and language-
dependent. It's hard to tell from a hyphen such as the one in the previous
line, if it's a word-break hyphen or a compound-word composing hyphen.
Just look at this paragraph and you'll see several hyphens whose meaning
differs even in English here. ;-)

Re: U+0140

2004-03-28 Thread Anto'nio Martins-Tuva'lkin

On 2004.03.27, 11:12, Philippe Verdy <[EMAIL PROTECTED]> wrote:

>>> This becomes evident when composing with extra-space between
>>> letters: there is no "tie" between the first "L" and the dot.
>
> Interesting comment, because I had always thought that this
> middle-dot was a modifier of the previous L,

That was apparently the whole idea behind the first implementation of
this character. (Where does it come from? MacWestern? No ISO 8859 part
covers it, AFAIK.)

> and I didn't think about syllabic hyphenation.

You're not supposed to. But people creating encodings should have done
more than just grab glyphs from assorted text. (Too bad that the few
people who can do it seriously are not rewarded for it...)

>>> Using this character for Catalan texts additionally causes
>>> hyphenation problems.
>
> So what would be the "hyphenation problems"?

Something happens when the "L·L" coincides with a soft line end. I'm
no expert in Catalan typesetting but IIRC the dot becomes a hyphen,
while regular "LL"s cannot be broken.

I could ask about this in Catalonia, as could many of us, but it falls
outside the scope of Unicode.

> Also what is the normal placement of the middle-dot after a
> uppercase L letter, doesn't it kern into the space above the
> horizontal bar?

Kerning is kerning, right. What is the normal placement of a "V" after
an "A", or a "º" after a "."?... They are separate characters, and
kerning is not a matter for Unicode.

> If I understand what you say here, that it's not a diacritic that
> modifies that first L,

Yes, it is not.

> so that this middle-dot is effectively a orthographic hyphen similar
> in essence to other orthographic hyphens that are used to create
> compound words, or to mark the inversion of the verb and pronominal
> subject

More or less, yes. But while this kind of hyphens and apostrophes
separate two "words", the Catalan middle dot between two "L"s does not.

> But in that case, is that middle-dot to be considered as a regular
> punctuation mark in Catalan?

More like a letter, from a typography point of view.

> Which category would you use to describe this character,
> independently of the current assignment of U+00B7?

Something that does not count "Paral·lel" as two words (while
"j'aime" or "it's" may be two words), nor uses the middle dot as a
cursor stop point when moving with Ctrl+arrow, etc.
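
The word-counting concern can be sketched with Python's `re` module. A generic `\w+` tokenizer splits at U+00B7 because the character is punctuation, whereas a rule tailored for Catalan keeps a middle dot that sits between two "L"s inside the word. The pattern and the function names below are assumptions made for illustration, not taken from any particular product.

```python
import re

# Generic rule: a word is a run of alphanumeric characters.
NAIVE_WORD = re.compile(r"\w+")

# Tailored rule (hypothetical): additionally allow a middle dot inside
# a word, but only when it stands between two letters "l" or "L".
CATALAN_WORD = re.compile(r"\w+(?:(?<=[lL])·(?=[lL])\w+)*")


def naive_words(text):
    """Tokenize with the generic rule."""
    return NAIVE_WORD.findall(text)


def catalan_words(text):
    """Tokenize with the Catalan-aware rule."""
    return CATALAN_WORD.findall(text)


print(naive_words("Paral·lel"))
print(catalan_words("Paral·lel"))
```

With the tailored rule "Paral·lel" counts as one word, while a middle dot in any other position still separates words.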


Re: U+0140 (was: "Re: Public Review Issue Update")

2004-03-27 Thread Philippe Verdy

> On 2004.03.26, 23:37, Rick McGowan <[EMAIL PROTECTED]> wrote:
> > The Unicode Technical Committee has posted new issues for public
> > review and comment. Details are on the following web page:
>
> I just added the following to the On-Line Report Form:
>
> > U+0140 : LATIN SMALL LETTER L WITH MIDDLE DOT, approx. similar to
> > U+006C U+00B7, is said to be used for Catalan. That is not correct.
> > Catalan usual orthography uses a regular middle dot to separate two
> > "L"s in those cases where they are pronounced as a single one,
> > doubled only for etymological reasons.
> >
> > This dot is not connected to the previous "L" in any way, as if it
> > were some kind of diacritical. It is a standalone character -- akin
> > to the hyphen in French or Portuguese.
> >
> > This becomes evident when composing with extra-space between
> > letters: there is no "tie" between the first "L" and the dot.

Interesting comment, because I had always thought that this middle-dot was a
modifier of the previous L, and I didn't think about syllabic hyphenation.

> > Using this character for Catalan texts additionally causes
> > hyphenation problems.

So what would be the "hyphenation problems"?

Do you mean that when there's a line break opportunity between 
and , no additional hyphen mark should be inserted because the middle-dot is
already the appropriate hyphen to mark that the word is not terminated at the
line break?

Also what is the normal placement of the middle-dot after an uppercase L letter,
doesn't it kern into the space above the horizontal bar?

If I understand what you say here, it's not a diacritic that modifies that
first L, so this middle-dot is effectively an orthographic hyphen similar in
essence to other orthographic hyphens that are used to create compound words, or
to mark the inversion of the verb and pronominal subject in French questions
(sometimes with an added phonetic "t" as in "pense-t-il?"), or to the apostrophe
used to mark an elision of some final letters in many languages ("j'aime", "je
t'aime" in French, similar examples in Italian) or leading letters ("it's") or
even some medial letters ("they aren't" in English).

But in that case, is that middle-dot to be considered as a regular punctuation
mark in Catalan? Which category would you use to describe this character,
independently of the current assignment of U+00B7?
