Re: Biblical Hebrew

2003-06-26 Thread Kenneth Whistler
John Hudson wrote:

> At 03:52 PM 6/26/2003, Rick McGowan wrote:
> 
> >I'll weigh in to agree with Ken here. The solution of cloning a whole set
> >of these things just to fix combining behavior is, to understate, not quite
> >nice.
> 
> No, but would be far from the not nicest thing in Unicode, and there's a 
> really good reason for it. I was originally intrigued by Ken's ZWJ idea -- 
> or by a variant of it using some new re-ordering inhibiting character, to 
> avoid overloading ZWJ any further --, but the more I think about it, the 
> more not nice I think it is to force Biblical scholars to carry the can for 
> errors in the Unicode combining classes.

One of the reasons I keep poking around for alternatives that might
work in a different way is that cloning sets of characters this
way has a way of just displacing the problem. You don't want to
force Biblical scholars to "carry the can" for the errors in
the current combining classes...

But who then does end up carrying the can eventually, if we go
the cloning route? Cloning 14 characters creates a *new*
normalization problem, and forces non-Biblical-scholar users of
pointed Hebrew text to carry *that* particular can.

How does a user of pointed Hebrew text know whether they are
dealing with the legacy points, which people will have gone
on using, outside the context of the group of cognoscenti who
switch their applications and fonts over to the corrected set
of points? What happens if they edit text represented in one
scheme with a tool meant for the other? What about searches
on data with pointed Hebrew -- should it normalize the two
sets of points or not? (And here I am talking about normalization
by an ad hoc, custom folding, rather than generic Unicode
normalization.) Who carries the can for writing the conversion
routines from data in one scheme or the other? How about
conversion from legacy character sets for bibliographic
data -- does that need to be upgraded? How about database
implementations -- do they need custom extensions to do this
folding as part of their query optimizations? And if the
problem with the existing set of points is that their
use in a normalized context eliminates distinctions that
should be maintained, how do I write any conversion routines
in such a way as to not corrupt or otherwise contaminate data
using the new scheme? Who do I blame if my Hebrew font works
with one set of points but not the other, and I'm getting
intermittently trashed display as a result? ... and so on...

I think if you really sit down and think about this in the
larger context of users of Unicode Hebrew generically, instead
of merely the Biblical Hebrew community that you are trying
to find a solution for, you may realize that displacing the
pain to *other* users may not be the best solution, either.

While the solution I am suggesting is not without its
conversion problems, I think they are significantly more
tractable than those posed by cloning code points. The
folding issue is much more straightforward, since it would
consist entirely of ignoring the CGJ and applying standard
normalization (or not). The new scheme would essentially be transparent
to systems that don't bother inserting CGJ between points,
as long as their fonts could handle the combinations.
Loss of ordering distinctions in data which is exported
from the new systems, and then reimported, would be much
less of an issue, since normalization could not destroy
the distinctions without further intervention.
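
(As a sketch of what that folding amounts to -- in Python, say; the
normalization form is whichever one the system already uses, and the
function name is just illustrative:)

    import unicodedata

    CGJ = "\u034F"

    def fold_for_search(s: str) -> str:
        # Ad hoc folding: ignore the CGJ, then apply standard
        # normalization (or not, as the process requires).
        return unicodedata.normalize("NFD", s.replace(CGJ, ""))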

> I believe the aim in fixing this 
> problem in Unicode should be to provide Biblical scholars with a good text 
> processing experience, not with awkward kludges,

Yes, but I believe that is the responsibility of the systems and
applications designers, given the tools and constraints we have
to hand. 

> even if that means making 
> the Unicode Hebrew block look weird with duplicated marks. 

I really believe there be dragons there, and the end result will
be to make it *more* difficult for the systems and applications
designers to provide a "good text processing experience" to
all users of pointed Hebrew text.


--Ken




Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread Kenneth Whistler
John,

> At 03:36 PM 6/26/2003, Kenneth Whistler wrote:
> 
> >Why is making use of the existing behavior of existing characters
> >a "groanable kludge", if it has the desired effect and makes
> >the required distinctions in text? If there is not some
> >rendering system or font lookup showstopper here, I'm inclined
> >to think it's a rather elegant way out of the problem.
> 
> I think assumptions about not breaking combining mark sequences may, in 
> fact, be a showstopper. If <consonant, mark, mark> becomes 
> <consonant, mark, control, mark>, it is reasonable to think that this will not 
> only inhibit mark re-ordering but also mark combining and mark 
> interaction. Unfortunately, this seems to be the case with every control 
> character I have been able to test, using two different rendering engines 
> (Uniscribe and InDesign ME -- although the latter already has some problems 
> with double marks in Biblical Hebrew). Perhaps we should have a specific 
> COMBINING MARK SEQUENCE CONTROL character?

Actually, in casting around for the solution to the problem of
introduction of format controls creating defective combining
character sequences, it finally occurred to me that:

U+034F COMBINING GRAPHEME JOINER

has the requisite properties.

It is non-visible, does not affect the display of neighboring
characters (except incidentally, if processes choose to recognize
sequences containing it and process them distinctly), *AND*
it is a *combining mark*, not a format control.

Hence, the sequence:

  <lamed, patah, CGJ, hiriq>
     0     17     0     14

is *not* a defective combining character sequence, by the
definitions in the standard. The entire sequence of three
combining marks would have to "apply" to the lamed, but
the fact that CGJ has (cc=0) prevents the hiriq (cc=14) from
reordering before the patah (cc=17) under normalization.

Could this finally be the missing "killer app" for the CGJ?
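
(For anyone who wants to check the mechanics, here is a quick sketch
in Python -- any Unicode-aware runtime with normalization support
would do; the code points and combining classes are the ones above:)

    import unicodedata

    LAMED, PATAH, HIRIQ, CGJ = "\u05DC", "\u05B7", "\u05B4", "\u034F"

    # CGJ is a combining mark (gc=Mn) with combining class 0,
    # not a format control.
    assert unicodedata.category(CGJ) == "Mn"
    assert unicodedata.combining(CGJ) == 0

    plain   = LAMED + PATAH + HIRIQ         # cc: 0, 17, 14
    guarded = LAMED + PATAH + CGJ + HIRIQ   # cc: 0, 17, 0, 14

    # Canonical reordering moves hiriq (cc=14) before patah (cc=17)...
    assert unicodedata.normalize("NFD", plain) == LAMED + HIRIQ + PATAH
    # ...but the cc=0 CGJ closes the reorderable run, so order survives.
    assert unicodedata.normalize("NFD", guarded) == guarded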

> 
> All that said, I disagree with Ken that this is anything like an elegant 
> way out of the problem. Forcing awkward, textually illogical and easily 
> forgettable control character usage onto *users* in order to solve a problem 
> in the Unicode Standard is not elegant, and it is unlikely to do much for 
> the reputation of the standard.

I don't understand this contention. There is no reason, in principle,
why this has to be surfaced to end users of Biblical Hebrew, any
more than the messy details of embedding override controls have to be surfaced
to end users in order to make an interface which will support end user
control over direction in bidirectional text.

If CGJ is the one, then the only *real* implementation requirement would
be that CGJ be consistently inserted (for Biblical Hebrew) between
any pair of points applied to the same consonant. Depending on the
particular application, this could either be hidden behind the
input method/keyboard and be actively managed by the software, or
it could be applied as a filter to an export format, when exporting
to contexts that might neutralize intended contrasts or result in
the wrong display by the application of normalization.
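
(To make that concrete, here is a minimal sketch of such an export
filter, in Python; the point set below is deliberately simplified to
the basic vowel points U+05B0..U+05BB, and a real filter would refine
it to the exact set settled on for Biblical Hebrew:)

    CGJ = "\u034F"

    def is_point(ch: str) -> bool:
        # Simplification: only the basic vowel points.
        return "\u05B0" <= ch <= "\u05BB"

    def guard_points(text: str) -> str:
        out = []
        for ch in text:
            if out and is_point(out[-1]) and is_point(ch):
                out.append(CGJ)  # keep each pair of points in separate runs
            out.append(ch)
        return "".join(out)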

> 
> Q: 'Why do I have to insert this control character between these points?'
> A: 'To prevent them from being re-ordered.'
> Q: 'But why would they be re-ordered anyway? Why wouldn't they just stay in 
> the order I put them in?'
> A: 'Because Unicode normalisation will automatically re-order the points.'
> Q: 'But why? Points shouldn't be re-ordered: it breaks the text.'
> A: 'Yes, but the people who decided how normalisation should work for 
> Hebrew didn't know that.'
> Q: 'Well can't they fix it?'
> A: 'They have: they've told you that you have to insert this control 
> character...'

And that whole dialogue should be limited to the *programmers* only,
whose job it is then to hide the details of how they get the
magic to work from people who would find those details just confusing.

> Q: 'But *I* didn't make the mistake. Why should I have to be the one to 
> mess around with this annoying control character?'
> 
> ... and so on.
> 
> Much as the duplication of Hebrew mark encoding may be distasteful, and 
> even considering the work that will need to be done to update layout 
> engines, fonts and documents to work with the new mark characters, I agree 
> with Peter Constable that this is by far the best long term solution, 
> especially from a *user* perspective. 

I have to disagree. It should be largely irrelevant to the user perspective.
In this case (as in others) the users are the experts about what their
expected requirements are for text behavior, and in particular, what
distinctions need to be maintained. But they should not be expected
to define the technical means for fulfilling those requirements, nor
lean over the shoulders of the engineers to tell them how to write
the software to accomplish it.

> Over the past two months I have been 
> over this problem in great detail with the Society of Biblical Literature 
> and their partners in the SBL Font Foundation. They understand the problems 
> with the current normalisation...

Re: Biblical Hebrew

2003-06-26 Thread Mark Davis
1. I agree with Ken about the current lack of precedent for Cfs before
combining marks. Interestingly, we do have a proposal to do just
that, in

http://www.unicode.org/review/pr-9.pdf

However, note that the whole purpose of putting the Cf after the Ra is
to separate it from the halant, so that the halant will ligate with
the following character rather than the preceding. So in that sense,
PR#9 is entirely consistent with breaking a combining character
sequence into two parts.

2. Because Cfs do break combining sequences, I would be very leery of
using any of them to solve the Biblical Hebrew issue. One possibility
is to use a combining mark instead. That is, something with (α) no
visible glyph, (β) combining class = 0, and (γ) general category = Mn.
Unlike the Cfs, this would *not* break a combining sequence. There
would be two possibilities.

a. define a new character with these characteristics.
b. use a variation selector character.

Now, we decided that VS characters would not apply to any but base
characters, but one of the primary reasons for that was so that they
wouldn't disturb canonical order. So easing this restriction in this
case might be reasonable, since that is exactly the point!
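
(For what it's worth, the existing variation selectors already have
properties (β) and (γ) -- easy to confirm, e.g. in Python:)

    import unicodedata

    vs1 = "\uFE00"  # VARIATION SELECTOR-1
    print(unicodedata.category(vs1))   # Mn -- a combining mark, not Cf
    print(unicodedata.combining(vs1))  # 0  -- so it does not reorder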

Of course, such a change would need to be sanctioned by the UTC, and
it might take a while before fonts supported it, but it may be a way
out, one that doesn't require waiting for the assignment of a new
character. So this is in the spirit of Ken's original proposal.

Mark
__
http://www.macchiato.com
►  “Eppur si muove” ◄

- Original Message - 
From: "Kenneth Whistler" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Thursday, June 26, 2003 17:48
Subject: Re: Biblical Hebrew


> Rick wrote:
>
> > > I now like better the suggestions of RLM or WJ for this.
> >
> > I'll have to disagree with Ken. I'm not so sure about either of these. I
> > don't think anyone has, in the past, considered what conforming or
> > non-conforming behavior would be for a RLM or WJ between two combining
> > marks. This needs a bunch more study to determine what on earth it would
> > break in existing implementations.
>
> Point taken.
>
> >
> > On the other hand, ZWJ between two combining marks has at least been
> > discussed, and in the case of Indic anyway, it has known, documented
> > effects.
>
> This, however, has the same problem. The specification of the use
> of ZWJ and ZWNJ in Indic scripts is not *between* two combining
> marks, but following a combining mark (halant), preceding another
> base character, usually a consonant. So we don't really know what
> the implications of trying to put it between two combining marks
> would be -- there aren't any specifications for doing so (yet).
>
> --Ken
>
>
>




Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread Kenneth Whistler
Michael wrote:

> At 15:36 -0700 2003-06-26, Kenneth Whistler wrote:
> 
> >I now like better the suggestions of RLM or WJ for this.
> 
> ZZZT. Thank you for playing.
> 
> RLM is for forcing the right behaviour for stops and parentheses and 
> question marks and so on. Introducing it between two combining 
> characters in Hebrew text would break all kinds of things,

True, apparently, but not for the reasons you surmise.

RLM does not "force behavior" on things. It is a strong
right-to-left context that can change the resolved directionality
of neutrals or weak types next to it. In between two
characters that are already R, the presence or absence of an
RLM is basically a no-op for bidi.

Just considering the bidi algorithm, a sequence:

  <lamed, patah, RLM, hiriq>
     R    NSM     R    NSM

would have the resolved directions: <R, R, R, R>, effectively no
different than the resolved directions: <R, R, R> of the sequence
without the RLM.
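
(The bidi categories are easy to confirm, e.g. in Python:)

    import unicodedata

    seq = "\u05DC\u05B7\u200F\u05B4"  # lamed, patah, RLM, hiriq
    print([unicodedata.bidirectional(c) for c in seq])
    # ['R', 'NSM', 'R', 'NSM'] -- everything resolves to R,
    # with or without the RLM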

The problem arises when you go to consider the graphic application
of the combining mark to its base form, and for that, the issue
is apparently the same for the WJ, ZWJ, or any other format
control in such a position. So this is nothing to do with the
bidi function of RLM. 

> and would 
> be horrible, horrible, horrible. Invent a new control character for 
> this weird property-killer, if you must, but don't use an ordering 
> mark for it

If you invent a "new control character" for this "weird property-killer"
(which it wouldn't be, since in any case, I'm just talking about
inserting a (cc=0) character in between two other characters, not
changing or killing any properties), you still end up with exactly
the same problem of graphic application, because the
presence of any format control creates a defective combining
character sequence which applications (apparently) won't display.

--Ken





Re: Biblical Hebrew

2003-06-26 Thread Kenneth Whistler
Rick wrote:

> > I now like better the suggestions of RLM or WJ for this.
> 
> I'll have to disagree with Ken. I'm not so sure about either of these. I  
> don't think anyone has, in the past, considered what conforming or  
> non-conforming behavior would be for a RLM or WJ between two combining  
> marks. This needs a bunch more study to determine what on earth it would  
> break in existing implementations.

Point taken.

> 
> On the other hand, ZWJ between two combining marks has at least been  
> discussed, and in the case of Indic anyway, it has known, documented  
> effects.

This, however, has the same problem. The specification of the use
of ZWJ and ZWNJ in Indic scripts is not *between* two combining
marks, but following a combining mark (halant), preceding another
base character, usually a consonant. So we don't really know what
the implications of trying to put it between two combining marks
would be -- there aren't any specifications for doing so (yet).

--Ken




Re: Biblical Hebrew

2003-06-26 Thread John Hudson
At 03:52 PM 6/26/2003, Rick McGowan wrote:

> I'll weigh in to agree with Ken here. The solution of cloning a whole set
> of these things just to fix combining behavior is, to understate, not quite
> nice.
No, but would be far from the not nicest thing in Unicode, and there's a 
really good reason for it. I was originally intrigued by Ken's ZWJ idea -- 
or by a variant of it using some new re-ordering inhibiting character, to 
avoid overloading ZWJ any further --, but the more I think about it, the 
more not nice I think it is to force Biblical scholars to carry the can for 
errors in the Unicode combining classes.

Control characters, usually ZWJ and ZWNJ, seem to get proposed as solutions 
to all sorts of text processing complexities. Some of these are perfectly 
legitimate and reflect the need of users to be able to control the 
display of text in different ways, e.g. by forcing half-forms in Indic 
scripts. But I don't think control characters should be used as fixes for 
mistakes, especially not when the distinction is not between two different 
but equally valid ways of displaying the same text, e.g. as a conjunct 
ligature or with half-forms, but between displaying text correctly or 
incorrectly. How many English users would accept a text processing model in 
which the distinction between 'goal' and 'gaol' relied on insertion of a 
control character between the vowels? I believe the aim in fixing this 
problem in Unicode should be to provide Biblical scholars with a good text 
processing experience, not with awkward kludges, even if that means making 
the Unicode Hebrew block look weird with duplicated marks. The standard 
should serve the users, not the aesthetic and organisational sensitivities 
of the people who design the standard.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
If you browse in the shelves that, in American bookstores,
are labeled New Age, you can find there even Saint Augustine,
who, as far as I know, was not a fascist. But combining Saint
Augustine and Stonehenge -- that is a symptom of Ur-Fascism.
- Umberto Eco



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread John Hudson
At 03:36 PM 6/26/2003, Kenneth Whistler wrote:

> Why is making use of the existing behavior of existing characters
> a "groanable kludge", if it has the desired effect and makes
> the required distinctions in text? If there is not some
> rendering system or font lookup showstopper here, I'm inclined
> to think it's a rather elegant way out of the problem.
I think assumptions about not breaking combining mark sequences may, in 
fact, be a showstopper. If <consonant, mark, mark> becomes 
<consonant, mark, control, mark>, it is reasonable to think that this will not 
only inhibit mark re-ordering but also mark combining and mark 
interaction. Unfortunately, this seems to be the case with every control 
character I have been able to test, using two different rendering engines 
(Uniscribe and InDesign ME -- although the latter already has some problems 
with double marks in Biblical Hebrew). Perhaps we should have a specific 
COMBINING MARK SEQUENCE CONTROL character?

All that said, I disagree with Ken that this is anything like an elegant 
way out of the problem. Forcing awkward, textually illogical and easily 
forgettable control character usage onto *users* in order to solve a problem 
in the Unicode Standard is not elegant, and it is unlikely to do much for 
the reputation of the standard.

Q: 'Why do I have to insert this control character between these points?'
A: 'To prevent them from being re-ordered.'
Q: 'But why would they be re-ordered anyway? Why wouldn't they just stay in 
the order I put them in?'
A: 'Because Unicode normalisation will automatically re-order the points.'
Q: 'But why? Points shouldn't be re-ordered: it breaks the text.'
A: 'Yes, but the people who decided how normalisation should work for 
Hebrew didn't know that.'
Q: 'Well can't they fix it?'
A: 'They have: they've told you that you have to insert this control 
character...'
Q: 'But *I* didn't make the mistake. Why should I have to be the one to 
mess around with this annoying control character?'

... and so on.

Much as the duplication of Hebrew mark encoding may be distasteful, and 
even considering the work that will need to be done to update layout 
engines, fonts and documents to work with the new mark characters, I agree 
with Peter Constable that this is by far the best long term solution, 
especially from a *user* perspective. Over the past two months I have been 
over this problem in great detail with the Society of Biblical Literature 
and their partners in the SBL Font Foundation. They understand the problems 
with the current normalisation, and they understand that any solution is 
going to require document and font revisions; they're resigned to this, and 
they've worked hard to come up with combining class assignments that would 
actually work for all consonant + mark(s) sequences encountered in Biblical 
Hebrew. This work forms the basis of the proposal submitted by Peter 
Constable. Encoding of new Biblical Hebrew mark characters provides a 
relatively simple update path for both documents and fonts, since it 
largely involves one-to-one mappings from old characters to new.

Conversely, insisting on using control characters to manage mark ordering 
in texts will require analysis to identify those sequences that will be 
subject to re-ordering during normalisation, and individual insertion of 
control characters. The fact that these control characters are invisible 
and not obvious to users transcribing text, puts an additional burden on 
application and font support, and adds another level of complexity to using 
what are already some of the most complicated fonts in existence (how many 
fonts do you know that come with 18 page user manuals?). I think it is 
unreasonable to expect Biblical scholars to understand Unicode canonical 
ordering to such a deep level that they are able to know where to insert 
control characters to prevent a re-ordering that shouldn't be happening in 
the first place.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
If you browse in the shelves that, in American bookstores,
are labeled New Age, you can find there even Saint Augustine,
who, as far as I know, was not a fascist. But combining Saint
Augustine and Stonehenge -- that is a symptom of Ur-Fascism.
- Umberto Eco



Re: Biblical Hebrew

2003-06-26 Thread Rick McGowan
Ken wrote...

> I now like better the suggestions of RLM or WJ for this.

I'll have to disagree with Ken. I'm not so sure about either of these. I  
don't think anyone has, in the past, considered what conforming or  
non-conforming behavior would be for a RLM or WJ between two combining  
marks. This needs a bunch more study to determine what on earth it would  
break in existing implementations.

On the other hand, ZWJ between two combining marks has at least been  
discussed, and in the case of Indic anyway, it has known, documented  
effects.

> > At least with
> > having distinct vowel characters for Biblical Hebrew, we'd come to a point
> > we could forget about it, and wouldn't be wincing every time we considered
> > it.
>
> Au contraire. We'll be wincing forever for this one. There's
> no way of getting around the fact that this is merely a cloning
> of a the whole set of points in order to have candidates for
> a reassigned set of combining classes.

I'll weigh in to agree with Ken here. The solution of cloning a whole set  
of these things just to fix combining behavior is, to understate, not quite  
nice.

The *best* thing to do, in my personal opinion and I know it'll get shot  
down so don't bother telling me so, is to fix the combining classes of the  
Hebrew points.

Since the combining classes can't be fixed because we have the  
normalization-stability albatross firmly down our gullets and will forever  
be choking on that, the next best thing is to use a ZWJ. Problem solved.  
Just document it.

Rick



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread Michael Everson
At 15:36 -0700 2003-06-26, Kenneth Whistler wrote:

> I now like better the suggestions of RLM or WJ for this.
ZZZT. Thank you for playing.

RLM is for forcing the right behaviour for stops and parentheses and 
question marks and so on. Introducing it between two combining 
characters in Hebrew text would break all kinds of things, and would 
be horrible, horrible, horrible. Invent a new control character for 
this weird property-killer, if you must, but don't use an ordering 
mark for it.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: Yerushala(y)im - or Biblical Hebrew (was Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread John Hudson
At 03:04 PM 6/26/2003, Kenneth Whistler wrote:

> How about RLM?

> This already belongs, naturally, in the context of the Hebrew
> text handling, which is going to have to handle bidi controls.
Ouch. RLM is not expected to fall between combining marks. Not only does 
this not render correctly, Uniscribe treats it as an illegal sequence and 
inserts a dotted circle before the second mark.

> Another possibility to consider is U+2060 WORD JOINER, the
> version of the zero width non-breaking space unfreighted with
> the BOM confusion of U+FEFF.
I can't test this at the moment, because none of the fonts I have support it.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
If you browse in the shelves that, in American bookstores,
are labeled New Age, you can find there even Saint Augustine,
who, as far as I know, was not a fascist. But combining Saint
Augustine and Stonehenge -- that is a symptom of Ur-Fascism.
- Umberto Eco



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread John Hudson
At 02:45 PM 6/26/2003, Mark Davis wrote:

> Another consequence is that it separates the sequence into two
> combining sequences, not one. Don't know if this is a serious problem,
> especially since we are concerned with a limited domain with
> non-modern usage, but I wanted to mention it.
It is a serious problem if separate combining sequences means, as it seems 
to in all the current apps I have tested, that marks separated by one of 
these control characters cannot be correctly positioned relative to a 
preceding consonant. Insertion of any zero-width control character between 
two marks applied to the same Hebrew consonant results in a loss of 
interaction between the marks (i.e. the first mark is not repositioned to 
accommodate the second) and the second mark loses all positioning 
intelligence and falls between the consonant and the next one. My guess is 
that the layout engine (Uniscribe in this case) makes the reasonable 
assumption that the two combining sequences do not interact.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
If you browse in the shelves that, in American bookstores,
are labeled New Age, you can find there even Saint Augustine,
who, as far as I know, was not a fascist. But combining Saint
Augustine and Stonehenge -- that is a symptom of Ur-Fascism.
- Umberto Eco



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread Kenneth Whistler
Peter responded:

> Ken Whistler wrote on 06/25/2003 06:57:56 PM:
> 
> > People could consider, for example, representation
> > of the required sequence:
> > 
> >   <lamed, qamets, hiriq>
> > 
> > as:
> > 
> >   <lamed, qamets, ZWJ, hiriq>
> 
> So, we want to introduce yet *another* distinct semantic for ZWJ?

Actually, no, I don't. That was just the first candidate that
came to mind.
 
> We've 
> got one for Indic, another for Arabic, another for ligatures (similar to 
> that for Arabic, but slightly different). Now another that is "don't 
> affect any visual change, just be there to inhibit reordering under 
> canonical ordering / normalization"?

As I pointed out in a separate response, just putting the ZWJ
there would *already* interrupt the reordering of the sequence.
There is nothing new about that. The problem is that you might
not be able to count on it not effecting a visual change,
because the generic meaning of ZWJ is now intended to be
ligation requesting, which does have visual consequences.

I now like better the suggestions of RLM or WJ for this. Both
of those format controls, by *definition*, should have no
impact on visual display in this context, the RLM because it
would be inserted between two NSM's that pick up strong
R-to-L directionality from the consonant, and the WJ
because it would be inserted at a position where there already
is no word/line break opportunity. But either of them,
by their current definition and properties, would break the
sequences for canonical reordering. So they already have
the semantics of the putative new control in question: no
effect on visual display, while inhibiting of the canonical
reordering of the point sequence.

> > The presence of a ZWJ (cc=0) in the sequence would block
> > the canonical reordering of the sequence to hiriq before
> > qamets. If that is the essence of the problem needing to
> > be addressed, then this is a much simpler solution which would
> > impact neither the stability of normalization nor require
> > mass cloning of vowels in order to give them new combining
> > classes.
> 
> Yes, it would accomplish all that; and is groanable kludge. 

Why is making use of the existing behavior of existing characters
a "groanable kludge", if it has the desired effect and makes
the required distinctions in text? If there is not some
rendering system or font lookup showstopper here, I'm inclined
to think it's a rather elegant way out of the problem.

> At least with 
> having distinct vowel characters for Biblical Hebrew, we'd come to a point 
> we could forget about it, and wouldn't be wincing every time we considered 
> it.

Au contraire. We'll be wincing forever for this one. There's
no way of getting around the fact that this is merely a cloning
of the whole set of points in order to have candidates for
a reassigned set of combining classes.

You're stuck between a rock and a hard place on this one.

The UTC cannot entertain merely fixing the existing combining
class assignments, because it breaks the normalization stability
guarantee. We've all come to acknowledge and most to accept that,
even though it still elicits groans.

But in the 10646 WG2 context, coming in with a duplicate set
of Hebrew points is not going to make any sense, because, as
someone (John Cowan?) has already pointed out, 10646 doesn't
assign combining classes, and so trying to justify character
cloning on the basis of distinct combining class assignments
isn't going to make any sense there. You can always come in
with the proposal to encode BIBLICAL HEBREW POINT PATAH and
say, even though the glyph is identical, see, the name is
different, so the character is different. But this is a pretty
thin disguise, and is vulnerable to simple questioning:
What is it for? Well, to point Biblical Hebrew texts. But
what was U+05B7 HEBREW POINT PATAH for? Well, to point Biblical
Hebrew texts (or any Hebrew text, for that matter...). Well,
then, what is the difference? Uh, the combining classes for
the two are different. What is a combining class?  ... and
so on.

I'm trying to find a way, using existing characters and a
simple set of text representational conventions, to make
the distinctions and preserve the order relations that you
need for decent font lookup, without the whole enterprise
washing up on either of those two rocks.

--Ken




Re: Yerushala(y)im - or Biblical Hebrew (was Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread Kenneth Whistler
Jony took the words right out of my mouth:

> How about RLM?
> 
> Jony

This already belongs, naturally, in the context of the Hebrew
text handling, which is going to have to handle bidi controls.

Another possibility to consider is U+2060 WORD JOINER, the
version of the zero width non-breaking space unfreighted with
the BOM confusion of U+FEFF.

WJ is also (gc=Cf, cc=0), so would block canonical reordering
of a sequence it was inserted into. Unlike ZWJ, it should have no 
potentially conflicting semantics regarding ligation or anything
else for display. It is *defined* only as specifying no break
opportunity at its position:

  "...inserting a word joiner between two characters has no
  effect on their ligating and cursive joining behavior. The
  word joiner should be ignored in contexts other than word
  or line breaking."
  
Well, as before, we already know that the position between the
two points in <lamed, qamets, hiriq> is not a word or line break
opportunity, so inserting a WJ
there should have no effect. And by definition, it should also
have no effect on any glyph ligation (or any other aspect of
the display). But it *would* break up the sequence that
gets canonically reordered for normalization, thus enabling
a textual distinction to be preserved.
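
(Again mechanically checkable -- a sketch in Python:)

    import unicodedata

    LAMED, QAMATS, HIRIQ, WJ = "\u05DC", "\u05B8", "\u05B4", "\u2060"

    print(unicodedata.category(WJ), unicodedata.combining(WJ))  # Cf 0

    # Without WJ, normalization reorders hiriq (cc=14) before
    # qamats (cc=18); with WJ in between, the text comes back intact.
    print(unicodedata.normalize("NFD", LAMED + QAMATS + HIRIQ))
    print(unicodedata.normalize("NFD", LAMED + QAMATS + WJ + HIRIQ))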

One might even want to suggest that if RichEdit or some other
text control causes a display problem when WJ is inserted between
two Hebrew points, that should be considered a bug in the
implementation of the WORD JOINER for that text control.

Of course, I'm not privy to the internals of such implementations
and don't understand the font lookup issues in the kind of
detail that John clearly does, but if WORD JOINER cannot
be implemented as the standard says it should be, then we've
got a more serious problem on our hands than just the
Biblical Hebrew vocalization issue.

--Ken

> > 
> > At 04:26 AM 6/26/2003, Jony Rosenne wrote:
> > 
> > >I don't think we need any new characters, ZERO WIDTH SPACE 
> > would do and 
> > >it requires no new semantics.
> > 
> > ZERO WIDTH SPACE would screw up search and sort algorithms, I think, 
> > because it is not a control character per se and may not be 
> > ignored as desired.
> > 
> > I've made some tests using Ken's ZWJ suggestion and, as 
> > feared, it messes 
> > with the glyph positioning lookups. The results varied 
> > slightly between MS 
> > RichText clients and InDesign ME, but both displayed marks 
> > incorrectly when 
> > ZWJ was inserted. I strongly suspect that this is not 
> > something that can 
> > easily be resolved in the glyph shaping model.
> > 
> > John Hudson




Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread Mark Davis
Another consequence is that it separates the sequence into two
combining sequences, not one. Don't know if this is a serious problem,
especially since we are concerned with a limited domain with
non-modern usage, but I wanted to mention it.

Mark
__
http://www.macchiato.com
►  “Eppur si muove” ◄

- Original Message - 
From: "Kenneth Whistler" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Thursday, June 26, 2003 13:41
Subject: Re: Biblical Hebrew (Was: Major Defect in Combining Classes
of Tibetan Vowels)


> Peter replied to Karljürgen:
>
> > Karljürgen Feuerherm wrote on 06/25/2003 08:31:41 PM:
> >
> > > I was going to suggest something very similar, a ZW-pseudo-consonant of
> > > some kind, which would force each vowel to be associated with one
> > > consonant.
> >
> > An invisible *consonant* doesn't make sense because the problem involves
> > more than just multiple written vowels on one consonant;
>
> I agree that we don't want to go inventing invisible consonants for
> this.
>
> BTW, there's already an invisible vowel (in fact a pair of them)
> that is unwanted by the stakeholders of the script it was
> originally invented for:
>
> U+17B4 KHMER VOWEL INHERENT AQ
>
> This is also (cc=0), so would serve to block canonical reordering
> if placed between two Hebrew vowel points. But I'm sure that if
> Peter thought the suggestion of the ZWJ for this was a "groanable
> kludge", Biblical Hebraicists would probably not take lightly
> to the importation of an invisible Khmer character into their
> text representations. ;-)
>
> > in fact, that is
> > a small portion of the general problem. If we want such a character, it
> > would notionally be a zero-width-canonical-ordering-inhibiter, and nothing
> > more.
>
> The fact is that any of the zero-width format controls has the
> side-effect of inhibiting (or rather interrupting) canonical reordering
> if inserted in the middle of a target sequence, because of their
> own class (cc=0).
>
> I'm not particularly campaigning for ZWJ, by the way. ZWNJ or even
> U+FEFF ZWNBSP would accomplish the same. I just suggested ZWJ because
> it seemed in the ballpark. ZWNBSP would likely have fewer possible
> other consequences, since notionally it means just "don't break here",
> which you wouldn't do in the middle of a Hebrew combining character
> sequence, anyway.
>
> > And I don't particularly want to think about what happens when people start
> > sticking this thing into sequences other than Biblical Hebrew ("in
> > unicode, any sequence is legal").
>
> But don't forget that these cc=0 zero width format controls already
> can be stuck into sequences other than Biblical Hebrew. In some
> instances they have defined semantics there (as for Arabic and
> Indic scripts), but in all cases they would *already* have the
> effect of interrupting canonical reordering of combining character
> sequences if inserted there.
>
> --Ken
>
>
>
>




RE: Major Defect in Combining Classes of Tibetan Vowels (Hebrew)

2003-06-26 Thread Jony Rosenne
That may be what you see. Myself, every time I look at it, I see an orphaned
Hiriq without a consonant. It is normally placed in between the Lamed and
the Mem, to make certain the point isn't missed (a pun). 

Jony

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of 
> [EMAIL PROTECTED]
> Sent: Thursday, June 26, 2003 7:09 PM
> To: [EMAIL PROTECTED]
> Subject: RE: Major Defect in Combining Classes of Tibetan 
> Vowels (Hebrew)
> 
> 
> Jony Rosenne wrote on 06/26/2003 06:26:02 AM:
> 
> > It may look silly, but it is correct. What you see are letters according
> > to the writing tradition, which does not include a Yod, and vowels
> > according to the reading tradition which does.
> 
> I understand that. My point was, you were talking about 
> phonology, but in 
> terms of the text, it was not correct: there *are* multiple 
> vowels on a 
> single consonant.
> 
> 
> > There are in the Bible other, more extreme
> > cases.
> 
> I'd be interested on whatever info you can provide in that regard.
> 
> 
>  
> > I don't think we need any new characters, ZERO WIDTH SPACE would do and
> > it requires no new semantics.
> 
> No, that's a terrible solution: a space creates unwanted word 
> boundaries.
> 
> 
> > Moreover, everybody who knows his Hebrew Bible
> > knows the Yod is there although it isn't written.
> 
> But the point is, how do people encode the text? The yod is not there in
> the text. How does a publisher encode text in the typesetting process? How
> do researchers encode the text they want to analyze? Saying, "everybody
> knows there's a yod there" doesn't provide a solution, particularly given
> that the researchers know in point of fact that the consonantal text
> explicitly does not include a yod.
> 
> 
>  
> > The Meteg is a completely different issue. There is a small number of
> > places where the Meteg is placed differently. Since it does not behave
> > the same as the regular Meteg, and is thus visually distinguishable, it
> > should be possible to add a character, as long as it is clearly named.
> 
> That is a potential solution, though it would have to be *two* additional
> metegs.
> 
> 
> 
> - Peter
> 
> 
> --
-
> Peter Constable
> 
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> 
> 
> 
> 




Yerushala(y)im - or Biblical Hebrew (was Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread Jony Rosenne
How about RLM?

Jony

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of John Hudson
> Sent: Thursday, June 26, 2003 6:36 PM
> To: Jony Rosenne
> Cc: [EMAIL PROTECTED]
> Subject: SPAM: RE: Major Defect in Combining Classes of 
> Tibetan Vowels (Hebrew)
> 
> 
> At 04:26 AM 6/26/2003, Jony Rosenne wrote:
> 
> >I don't think we need any new characters, ZERO WIDTH SPACE 
> would do and 
> >it requires no new semantics.
> 
> ZERO WIDTH SPACE would screw up search and sort algorithms, I think, 
> because it is not a control character per se and may not be 
> ignored as desired.
> 
> I've made some tests using Ken's ZWJ suggestion and, as 
> feared, it messes 
> with the glyph positioning lookups. The results varied 
> slightly between MS 
> RichText clients and InDesign ME, but both displayed marks 
> incorrectly when 
> ZWJ was inserted. I strongly suspect that this is not 
> something that can 
> easily be resolved in the glyph shaping model.
> 
> John Hudson
> 
> Tiro Typeworks  www.tiro.com
> Vancouver, BC   [EMAIL PROTECTED]
> 
> If you browse in the shelves that, in American bookstores,
> are labeled New Age, you can find there even Saint Augustine, 
> who, as far as I know, was not a fascist. But combining Saint 
> Augustine and Stonehenge -- that is a symptom of Ur-Fascism.
>  
> - Umberto Eco
> 
> 
> 
> 




Re: Revised N2586R

2003-06-26 Thread Michael Everson
At 13:23 -0700 2003-06-26, Kenneth Whistler wrote:

> Not only is the name likely to change (based on all the issues
> already discussed), but it is conceivable that WG2 could decide to
> approve it at some other code position instead.
Indeed I will probably propose to move the character on general 
principles. ;-) No cheating! ;-)

> It is even conceivable that WG2 could *refuse* to encode the character.
(I shouldn't think so.)

> There have been precedents, where a UTC approved character met
> opposition in WG2, and the UTC later decided to rescind its approval
> in favor of maintaining synchronization of the standards when published.
And vice versa.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com


Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread Kenneth Whistler
Peter replied to Karljürgen:

> Karljürgen Feuerherm wrote on 06/25/2003 08:31:41 PM:
> 
> > I was going to suggest something very similar, a ZW-pseudo-consonant of
> > some kind, which would force each vowel to be associated with one consonant.
> 
> An invisible *consonant* doesn't make sense because the problem involves 
> more than just multiple written vowels on one consonant; 

I agree that we don't want to go inventing invisible consonants for
this.

BTW, there's already an invisible vowel (in fact a pair of them)
that is unwanted by the stakeholders of the script it was
originally invented for:

U+17B4 KHMER VOWEL INHERENT AQ

This is also (cc=0), so would serve to block canonical reordering
if placed between two Hebrew vowel points. But I'm sure that if
Peter thought the suggestion of the ZWJ for this was a "groanable
kludge", Biblical Hebraicists would probably not take lightly
to the importation of an invisible Khmer character into their
text representations. ;-)

> in fact, that is 
> a small portion of the general problem. If we want such a character, it 
> would notionally be a zero-width-canonical-ordering-inhibiter, and nothing 
> more.

The fact is that any of the zero-width format controls has the
side-effect of inhibiting (or rather interrupting) canonical reordering
if inserted in the middle of a target sequence, because of their
own class (cc=0).

I'm not particularly campaigning for ZWJ, by the way. ZWNJ or even
U+FEFF ZWNBSP would accomplish the same. I just suggested ZWJ because
it seemed in the ballpark. ZWNBSP would likely have fewer possible
other consequences, since notionally it means just "don't break here",
which you wouldn't do in the middle of a Hebrew combining character
sequence, anyway.

> And I don't particularly want to think about what happens when people start 
> sticking this thing into sequences other than Biblical Hebrew ("in 
> unicode, any sequence is legal").

But don't forget that these cc=0 zero width format controls already
can be stuck into sequences other than Biblical Hebrew. In some
instances they have defined semantics there (as for Arabic and
Indic scripts), but in all cases they would *already* have the
effect of interrupting canonical reordering of combining character
sequences if inserted there.

--Ken





Re: Nightmares

2003-06-26 Thread Michael Everson
At 14:32 -0400 2003-06-26, John Cowan wrote:

> If you are going to discriminate (invidiously) using a computerized
> database, using H for Handicapped (or G for Gimp) will do just as well.
> Are you going to complain about the various symbols of religion already
> encoded on the same grounds?
I am preparing additional religious symbols to help fill the gaps.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com


Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread John Hudson
At 10:09 AM 6/26/2003, [EMAIL PROTECTED] wrote:

> The Meteg is a completely different issue. There is a small number of places
> where the Meteg is placed differently. Since it does not behave the same as
> the regular Meteg, and is thus visually distinguishable, it should be
> possible to add a character, as long as it is clearly named.

That is a potential solution, though it would have to be *two* additional
metegs.
Can you explain your thinking here, Peter? I agree that if the intention is 
to encode new Biblical Hebrew marks with revised combining classes, then 
two new metegs would be necessary if we want one left and one right. But if 
one were to accept the text encoding hack of a ZERO-WIDTH CANONICAL 
ORDERING INHIBITOR -- which seems less and less like a good idea, and more 
and more like a long term embarrassment and, like ZWJ and ZWNJ, a pain in 
the neck for users who have every right to expect a sensible encoding that 
doesn't require such gymnastics --, then I think one would only need a new 
HEBREW POINT RIGHT METEG character, and let it be assumed that the existing 
meteg character is the left position form (its current combining class 
puts it after all vowels, I believe).

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
If you browse in the shelves that, in American bookstores,
are labeled New Age, you can find there even Saint Augustine,
who, as far as I know, was not a fascist. But combining Saint
Augustine and Stonehenge -- that is a symptom of Ur-Fascism.
- Umberto Eco



Re: Revised N2586R

2003-06-26 Thread Kenneth Whistler
Doug, Peter, and Michael already provided good responses to
this suggestion by William O, but here is a little further
clarification.

> Well, certainly authority would be needed, yet I am suggesting that where a
> few characters added into an established block are accepted, which is what
> is claimed for these characters, there should be a faster route than having
> to wait for bulk release in Unicode 4.1.  If these characters have been
> accepted, why not formally warrant their use now by having Unicode 4.001
> and then having Unicode 4.002 when a few more are accepted? 

Approvals aren't *finished* until both the UTC and ISO JTC1/SC2/WG2 have
completed their work. The JTC1 balloting and approval process is
a lengthy and deliberate one, and there are many precedents where a
proposed character, perhaps one already approved by the UTC, has
been moved in a subsequent balloting in response to a national
body comment. Only when both committees have completed all approvals
and have verified they are finally in synch with each other, do they proceed
with formal publication of the *standardized* encodings for the
new characters.

The reasons the UTC "approves" characters and posts them in the
Pipeline page at www.unicode.org in advance of the actual final
standardization are:

  A. To avoid the chicken and the egg problem for the two
 committees. Someone has to go first on an approval, since
 the committees do not meet jointly. Sometimes the UTC
 goes first, and sometimes WG2 goes first.
 
  B. To give notice to people regarding what is in process and
 what stage of approval it is at. This helps in precluding
 duplicate submissions and also helps in assigning code points
 for new characters when we are dealing with large numbers
 of new submissions.

> These minor
> additions to the Standard could be produced as characters are accepted and
> publicised in the Unicode Consortium's webspace.  

The UTC can and does give notification regarding what characters have
reached "approved" status. The Pipeline page at www.unicode.org is,
for example, about to be updated with the 215 new character approvals
from the recent UTC meeting.

> If the characters have not
> been accepted then they cannot be considered ready to be used, yet if they
> have been accepted, what is the problem in releasing them so that people who
> want to get on with using them can do so? 

See above. Standardization bodies must move deliberately and
carefully, since if they publish mistakes, everybody is saddled
with them essentially forever. In the case of encoding large
numbers of additional characters, because the UTC has plenty of
experience at the kind of shuffling around that may occur while
balloting is still under consideration, it would be irresponsible
to publish small revisions and encourage people to start using
characters that we know have not yet completed all steps of
the standardization process.

> Why is it that it is regarded by the Unicode Consortium
> as reasonable that it takes years to get a character through the committees
> and into use?  

Because with the experience of four major revisions of the Unicode
Standard (and numerous minor revisions) and the experience of
three major revisions of ISO/IEC 10646 (and numerous individual
amendments) under our belt, we know that is how long it takes in
actual practice.

> The idea of having to use the
> Private Use Area for a period after the characters have been accepted is
> just a nonsense.

Please take a look at:

http://www.unicode.org/alloc/Caution.html

which has long been posted to help explain why character approval
is not just an instantaneous process.

The further along a particular character happens to be in
the ISO JTC1 approval process, the less likely it is that it will
actually move before the standard is published.
Implementers can, of course, choose whatever level of risk
they can handle when doing early implementation of provisionally approved
characters which have not yet been formally published in
the standards. But if they guess wrong and implement a
character (in a font or in anything else) that is moved at
some point in the balloting, then that was just the risk they
took, and they can't expect to come back to the committees
bearing complaints and grievances about it.

If you, for example, want to put U+267F HANDICAPPED SIGN
in a font now, nobody will stop you, but bear in mind that
this character is only at Stage 1 of the ISO process -- it
has not yet been considered or even provisionally approved
by WG2. Not only is the name likely to change (based on
all the issues already discussed), but it is conceivable
that WG2 could decide to approve it at some other code position
instead. It is even conceivable that WG2 could *refuse* to
encode the character. There have been precedents, where a
UTC approved character met opposition in WG2, and the UTC
later decided to rescind its approval in favor of maintaining
synchronization of the standards when published.

Re: Question about Unicode Ranges in TrueType fonts

2003-06-26 Thread Philippe Verdy
On Thursday, June 26, 2003 8:16 PM, Elisha Berns <[EMAIL PROTECTED]> wrote:
> It would appear from your answer that even after implementing the
> algorithm to search the Unicode block coverage of a font, the actual
> comparison "data", that is which blocks to compare and how many code
> points, is totally undefined.  Is there any kind of standard for
> defining what codepoints are required to write a given language?  This
> seems like the issue that fontconfig gets around by using all those
> .orth files which define the codepoints for a given language.  But is
> there any standardized set of language required codepoint definitions
> that could be used?
> 
> Anyways, where is the up-to-date list of Unicode blocks to be found?

On the Unicode.org website or its published book.

> It's odd to think that the old way of using Charset identifiers in
> fonts worked a lot more cleanly for finding fonts matching a
> language/language group.  I would think this kind of core issue would
> be addressed more cleanly by the font standard.

The ICU datafiles contain such a list of codes needed to cover almost completely each
combination of language+script.
Now these datafiles are shared across multiple implementations with the I18n
initiative project, which tries to define a common source of locale data for multiple
vendors (previously this project was in li18nux.org, now extended to cover other open
systems than Linux, such as most BSD and Unix variants, with a joint effort with the
GNU project and other Unix and Java solution providers)...

Of course, nothing forbids a particular text from using other characters than those 
strictly needed for a particular language...




RE: Question about Unicode Ranges in TrueType fonts

2003-06-26 Thread Kenneth Whistler
Elisha Berns asked:

> It would appear from your answer that even after implementing the
> algorithm to search the Unicode block coverage of a font, the actual
> comparison "data", that is which blocks to compare and how many code
> points, is totally undefined.  Is there any kind of standard for
> defining what codepoints are required to write a given language?  This
> seems like the issue that fontconfig gets around by using all those
> .orth files which define the codepoints for a given language.  But is
> there any standardized set of language required codepoint definitions
> that could be used?

Not a standard that I know of, but there are a number of compilations
of what *characters* are required for the alphabets of various
languages. See, for example:

http://www.evertype.com/alphabets/index.html

for European languages. From each list of characters it is fairly
straightforward to derive what Unicode encoded characters would
be required to support that list.

http://www.eki.ee/itstandard/ladina/

is another source. This goes a little further afield into languages
using Cyrillic characters, and also provides information about
Unicode encodings directly.

Note that for any such listing, you still need to take into account
what punctuation or other characters might also be needed for
the language's conventional orthography/ies, since the typical
listing you will find is only for the alphabetic characters used
by the language.
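
(Deriving the codepoints from such a listing is then trivial. A sketch
in Python, where the alphabet string is a hypothetical paste from one
of those compilations -- Icelandic, in this case:)

    alphabet = "aábdðeéfghiíjklmnoóprstuúvxyýþæö"

    # Include the uppercase counterparts, then list the codepoints.
    needed = sorted({ord(c) for c in alphabet} |
                    {ord(c.upper()) for c in alphabet})
    print(", ".join(f"U+{cp:04X}" for cp in needed))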

> 
> Anyways, where is the up-to-date list of Unicode blocks to be found?

http://www.unicode.org/Public/UNIDATA/Blocks.txt

> 
> It's odd to think that the old way of using Charset identifiers in fonts
> worked a lot more cleanly for finding fonts matching a language/language
> group.  I would think this kind of core issue would be addressed more
> cleanly by the font standard.

Which font standard?

And this is an area where implementation strategies still seem to
be in ferment. At some point this may settle down and be the
subject of standardization, but premature standardization can
also be a problem if the wrong choices get codified too soon.

--Ken




Re: Question about Unicode Ranges in TrueType fonts

2003-06-26 Thread John Cowan
Elisha Berns scripsit:

> It's odd to think that the old way of using Charset identifiers in fonts
> worked a lot more cleanly for finding fonts matching a language/language
> group.  I would think this kind of core issue would be addressed more
> cleanly by the font standard.

Actually it worked by dumb luck (or market forces if you prefer).  There
was never any guarantee that, because a font was encoded by Latin-1,
it contained glyphs for all the Latin-1 characters.

-- 
All Gaul is divided into three parts: the part  John Cowan
that cooks with lard and goose fat, the partwww.ccil.org/~cowan
that cooks with olive oil, and the part thatwww.reutershealth.com
cooks with butter. -- David Chessler[EMAIL PROTECTED]



Re: WHEELCHAIR (was Revised N2586R)

2003-06-26 Thread Karljürgen Feuerherm
WHEELCHAIR SYMBOL at least has the virtue of being descriptive of the symbol
rather than of the use and thus potentially more neutral all the way around.

K
- Original Message -
From: "Michael Everson" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, June 26, 2003 2:13 PM
Subject: Re: Revised N2586R


> At 12:09 -0500 2003-06-26, [EMAIL PROTECTED] wrote:
>
> >The only meaning that the Standard implies is that the character encoded
> >at codepoint x represents they symbol of a wheelchair. It does not imply
> >*anything* about how its usage in juxtaposition with the name of a person
> >should be interpreted.
>
> Indeed William's argument that "HANDICAPPED" is somehow inappropriate
> just doesn't wash. In Europe at least, many handicapped people
> consider it far more polite to be called handicapped or behindert or
> what have you than to be subject to such politically "correct"
> monstrosities as "differently abled".
>
> Which is not to say that the Name Police won't prefer WHEELCHAIR
> SYMBOL. Time will tell.
> --
> Michael Everson * * Everson Typography *  * http://www.evertype.com
>
>





Re: Nightmares

2003-06-26 Thread John Cowan
William Overington scripsit:

> This issue has arisen because of my concern that a particular symbol has
> been labelled as HANDICAPPED SIGN.  I hope that the name will be changed to
> WHEELCHAIR SYMBOL.

If you are going to discriminate (invidiously) using a computerized
database, using H for Handicapped (or G for Gimp) will do just as well.
Are you going to complain about the various symbols of religion already
encoded on the same grounds?

-- 
All Norstrilians knew what laughter was:        John Cowan
it was "pleasurable corrigible malfunction".    http://www.reutershealth.com
        --Cordwainer Smith, _Norstrilia_        [EMAIL PROTECTED]



Re: Question about Unicode Ranges in TrueType fonts

2003-06-26 Thread Philippe Verdy
On Thursday, June 26, 2003 4:13 PM, Andrew C. West <[EMAIL PROTECTED]> wrote:

> On Thu, 26 Jun 2003 14:26:13 +0200, "Philippe Verdy" wrote:
> 
> > Isn't there a work-around with the following function (quote from
> > Microsoft MSDN):
> > (with the caveat that you first need to allocate and fill a Unicode
> > string for the
> > codepoints you want to test, and this can be lengthy if one wants
> > to retrieve the full list of supported codepoints).
> > However, this is still the best function to use to know if a string
> > can effectively
> > be rendered before drawing it...
> > 
> > _*GetGlyphIndices*_
> > 
> 
> GetGlyphIndices() or Uniscribe's ScriptGetCMap() would be OK for
> checking coverage for small Unicode blocks such as Gothic (27
> codepoints) or even Mathematical Alphanumeric Symbols (992
> codepoints), but I suspect your application would freeze if you tried
> to use it to work out exact codepoint coverage of CJK-B (42,711
> codepoints) and PUA-A and PUA-B (65,534 codepoints each).

That's why I added the comment. For a real application, however, this is
a great way to check whether a given text will actually be displayed.

If not, one can use other Uniscribe functions to perform additional
mappings, and if this fails, one can add another TrueType font to a
logical font, selecting among those that have the relevant script bit set
in their descriptors. The application may let users choose a preferred
order for all fonts that have that script bit set.

Then the application will create a logical font for that script using
this preference order. But if there's no font in the collection that
contains the glyph, there will be no other choice than to display the
substitution glyph of the first font (such as a rectangular bullet),
normally bound to U+FFFD, unless the font descriptor specifies a
specific glyph.

Another strategy is for the application to create one logical font
per language, if the text to render is labelled (out-of-band) with a
language indicator. This gives more coherent results than creating
a logical font per supported script, notably for Latin-based
languages with many characters, such as Vietnamese...

So if a markup language specifies a font family, the font stack
will include this family on top of the stack, followed by the fonts
for the language+script combination, followed by the fonts
for a particular script, then by all preferred fonts
for any script, and finally by all other fonts.
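
As a sketch of that stacking order (all names here are illustrative,
not any real API):

#include <string>
#include <vector>

// Illustrative only: a "logical font" as an ordered list of family names,
// consulted top-down until one covers the glyph being rendered.
std::vector<std::string> BuildFontStack(
    const std::string&              markupFamily,   // from the markup, if any
    const std::vector<std::string>& languageFonts,  // language+script fonts
    const std::vector<std::string>& scriptFonts,    // per-script fonts
    const std::vector<std::string>& preferredFonts, // user-preferred, any script
    const std::vector<std::string>& allOtherFonts)
{
    std::vector<std::string> stack;
    stack.push_back(markupFamily);
    stack.insert(stack.end(), languageFonts.begin(),  languageFonts.end());
    stack.insert(stack.end(), scriptFonts.begin(),    scriptFonts.end());
    stack.insert(stack.end(), preferredFonts.begin(), preferredFonts.end());
    stack.insert(stack.end(), allOtherFonts.begin(),  allOtherFonts.end());
    return stack;
}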

-- Philippe.



RE: Question about Unicode Ranges in TrueType fonts

2003-06-26 Thread Elisha Berns
Andrew West wrote:

> By looping through the "ranges" array it is possible to determine
> exactly which characters in which Unicode blocks a given font covers
> (as long as your software has an array of Unicode blocks and their
> codepoint ranges).


> As long as your software has an up-to-date list of the Unicode blocks
> and their constituent codepoints for the latest version of Unicode,
> you will always be able to get up to date information about Unicode
> coverage of a font.
> 
> If you want to determine language coverage for a particular font,
> then all you need to do is define a minimum set of codepoints that
> must be covered for a particular block or set of blocks to be
> considered as supporting that language. (Just the little matter of
> deciding what the minimum set of codepoints would be for every
> language that is supported by Unicode ...)

Thanks so much for the detailed reply.

It would appear from your answer that even after implementing the
algorithm to search the Unicode block coverage of a font, the actual
comparison "data", that is which blocks to compare and how many code
points, is totally undefined.  Is there any kind of standard for
defining what codepoints are required to write a given language?  This
seems like the issue that fontconfig gets around by using all those
.orth files which define the codepoints for a given language.  But is
there any standardized set of language required codepoint definitions
that could be used?

Anyways, where is the up-to-date list of Unicode blocks to be found?

It's odd to think that the old way of using Charset identifiers in fonts
worked a lot more cleanly for finding fonts matching a language/language
group.  I would think this kind of core issue would be addressed more
cleanly by the font standard.

Thanks for any help.

Yours truly,

Elisha Berns




Re: Revised N2586R

2003-06-26 Thread Michael Everson
At 12:09 -0500 2003-06-26, [EMAIL PROTECTED] wrote:

> The only meaning that the Standard implies is that the character encoded
> at codepoint x represents the symbol of a wheelchair. It does not imply
> *anything* about how its usage in juxtaposition with the name of a person
> should be interpreted.

Indeed William's argument that "HANDICAPPED" is somehow inappropriate 
just doesn't wash. In Europe at least, many handicapped people 
consider it far more polite to be called handicapped or behindert or 
what have you than to be subject to such politically "correct" 
monstrosities as "differently abled".

Which is not to say that the Name Police won't prefer WHEELCHAIR 
SYMBOL. Time will tell.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: Revised N2586R

2003-06-26 Thread Michael Everson
At 13:03 +0100 2003-06-26, William Overington wrote:

> Well, certainly authority would be needed, yet I am suggesting that where a
> few characters added into an established block are accepted, which is what
> is claimed for these characters, there should be a faster route than having
> to wait for bulk release in Unicode 4.1.

No, there shouldn't. The process will not be changed. Unicode and
ISO/IEC 10646 are synchronized, and JTC1 balloting processes are
what they are. No further discussion is necessary, as it is pointless.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Major Defects in Subject Lines!

2003-06-26 Thread Rick McGowan
Wow... How on earth did the subject line "Major Defect in Combining  
Classes of Tibetan Vowels" turn into a discussion of Biblical Hebrew? At  
least, people, if you're going to transmogrify the discussion, please use a  
subject line such as "Biblical Hebrew" which someone already was wise  
enough to start using on some pieces of this thread.

Thanks,

Rick
(All my own opinions, of course)



Re: Revised N2586R

2003-06-26 Thread Peter_Constable
William Overington wrote on 06/26/2003 07:03:12 AM:

> yet I am suggesting that where a few characters added into an
> established block are accepted, which is what is claimed for these
> characters, there should be a faster route than having to wait for
> bulk release in Unicode 4.1.

Once both UTC and WG2 have approved the assignment of characters to 
particular codepoints, I might risk making fonts using those codepoints 
for those characters, as it's not very likely the codepoints will be 
changed at that point. There's no guarantee that would not happen, 
however, so I certainly wouldn't distribute such fonts if I were a 
commercial foundry -- too much at stake. If an amendment to ISO 10646 
gets published prior to a new version of Unicode, though, that would 
constitute a guarantee that the codepoints will not change.



> If these characters have been
> accepted, why not formally warrant their use now by having Unicode 4.001
> and then having Unicode 4.002 when a few more are accepted?

That is not how versioning is done with the standard. Please read 
http://www.unicode.org/standard/versions/



> Some fontmakers can react to new releases more quickly than can some
> other fontmakers, so why should progress be slowed down for the
> benefit of those who cannot add new glyphs into fonts quickly?

Fontmakers don't need to wait until a new version is published before they 
start preparing fonts.


 
> For example, symbols for audio description, subtitles and signing are
> needed for broadcasting.  Will that need to have years of waiting and
> using the Private Use Area when it could be a fairly swift process and
> the characters could be implemented into read-only memories in
> interactive television sets that much sooner?

Well, if the characters haven't even been proposed for addition to the 
standard, then yes, it will take years of PUA usage.


> Why is it that it is regarded by the Unicode Consortium as reasonable
> that it takes years to get a character through the committees and into
> use?

Because there is a process that takes time. International standards aren't 
created by a few people working out of their garage. Some international 
standards take far longer than do updates to Unicode.



> Surely where a few characters are needed the Unicode Consortium and
> ISO need to take a twenty-first century attitude to getting the job
> done

It might be a good idea to become more familiar with the actual process 
and work on international standards in general before criticizing the 
people doing the work. There are a number of people working quite hard on 
this stuff, their time volunteered by the organizations and companies 
they represent, or given out of their own personal time.


- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485




Re: Revised N2586R

2003-06-26 Thread Peter_Constable
William Overington wrote on 06/26/2003 06:24:44 AM:

> >  the name is simply a unique identifier within the std.
> 
> Well, the Standard is the authority for what is the meaning of the
> symbol when found in a file of plain text.  So if the symbol is in a
> plain text file before or after the name of a person then the Standard
> implies a meaning to the plain text file.

The only meaning that the Standard implies is that the character encoded 
at codepoint x represents the symbol of a wheelchair. It does not imply 
*anything* about how its usage in juxtaposition with the name of a person 
should be interpreted.


 
- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485




RE: Major Defect in Combining Classes of Tibetan Vowels (Hebrew)

2003-06-26 Thread Peter_Constable
Jony Rosenne wrote on 06/26/2003 06:26:02 AM:

> It may look silly, but it is correct. What you see are letters
> according to the writing tradition, which does not include a Yod, and
> vowels according to the reading tradition, which does.

I understand that. My point was, you were talking about phonology, but in 
terms of the text, it was not correct: there *are* multiple vowels on a 
single consonant.


> There are in the Bible other, more extreme
> cases. 

I'd be interested in whatever info you can provide in that regard.


 
> I don't think we need any new characters, ZERO WIDTH SPACE would do
> and it requires no new semantics.

No, that's a terrible solution: a space creates unwanted word boundaries.


> Moreover, everybody who knows his Hebrew Bible
> knows the Yod is there although it isn't written.

But the point is, how do people encode the text? The yod is not there in 
the text. How does a publisher encode text in the typesetting process? How 
do researchers encode the text they want to analyze? Saying, "everybody 
knows there's a yod there" doesn't provide a solution, particularly given 
that the researchers know in point of fact that the consonantal text 
explicitly does not include a yod.


 
> The Meteg is a completely different issue. There is a small number of
> places where the Meteg is placed differently. Since it does not behave
> the same as the regular Meteg, and is thus visually distinguishable,
> it should be possible to add a character, as long as it is clearly
> named.

That is a potential solution, though it would have to be *two* additional 
metegs.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485




RE: Major Defect in Combining Classes of Tibetan Vowels (Hebrew)

2003-06-26 Thread John Hudson
At 04:26 AM 6/26/2003, Jony Rosenne wrote:

> I don't think we need any new characters, ZERO WIDTH SPACE would do and it
> requires no new semantics.

ZERO WIDTH SPACE would screw up search and sort algorithms, I think, 
because it is not a control character per se and may not be ignored as desired.

I've made some tests using Ken's ZWJ suggestion and, as feared, it messes 
with the glyph positioning lookups. The results varied slightly between MS 
RichText clients and InDesign ME, but both displayed marks incorrectly when 
ZWJ was inserted. I strongly suspect that this is not something that can 
easily be resolved in the glyph shaping model.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
If you browse in the shelves that, in American bookstores,
are labeled New Age, you can find there even Saint Augustine,
who, as far as I know, was not a fascist. But combining Saint
Augustine and Stonehenge -- that is a symptom of Ur-Fascism.
- Umberto Eco



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread John Hudson
At 12:43 AM 6/26/2003, [EMAIL PROTECTED] wrote:

> > The problem of combinations of vowels with meteg could be
> > amenable to a similar approach. OR, one could propose just
> > one additional meteq/silluq character, to make it possible
> > to distinguish (in plain text) instances of left-side and
> > right-side meteq placement, for example.
> 
> And the third position of meteg with hataf vowels? Introduce *two*
> additional meteg/silluq characters?

No, that's a glyph ligation matter however you look at it. It could be made 
to work with either just a left meteg or also with a new right meteg, and 
can be inhibited with ZWNJ. This is not to say that I think encoding a 
distinct right meteg character is the best solution, only that it doesn't 
affect the medial meteg shaping.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
If you browse in the shelves that, in American bookstores,
are labeled New Age, you can find there even Saint Augustine,
who, as far as I know, was not a fascist. But combining Saint
Augustine and Stonehenge -- that is a symptom of Ur-Fascism.
- Umberto Eco



Re: Revised N2586R

2003-06-26 Thread Doug Ewell
William Overington wrote:

> Well, certainly authority would be needed, yet I am suggesting that
> where a few characters added into an established block are accepted,
> which is what is claimed for these characters, there should be a
> faster route than having to wait for bulk release in Unicode 4.1.
> If these characters have been accepted, why not formally warrant their
> use now by having Unicode 4.001 and then having Unicode 4.002 when a
> few more are accepted?  These minor additions to the Standard could be
> produced as characters are accepted and publicised in the Unicode
> Consortium's webspace.  If the characters have not been accepted then
> they cannot be considered ready to be used, yet if they have been
> accepted, what is the problem in releasing them so that people who
> want to get on with using them can do so?  Some fontmakers can react
> to new releases more quickly than can some other fontmakers, so why
> should progress be slowed down for the benefit of those who cannot
> add new glyphs into fonts quickly?

That's just the way standards work.  You have to wait until final, FINAL
approval and official release before you can do newly approved things
conformantly.  There has to be a chance for the authority at the very
end of the process to say, "Wait a minute, I see a problem, this can't
go out like this."  Dealing with a problem that slipped through because
the process was "fast-tracked" or sidestepped is much more expensive
than waiting for the process to run its course.  This is not "a
nonsense," it makes a lot of sense for anyone who's seen what can happen
when process is ignored.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Re: Question about Unicode Ranges in TrueType fonts

2003-06-26 Thread Andrew C. West
On Thu, 26 Jun 2003 14:26:13 +0200, "Philippe Verdy" wrote:

> Isn't there a work-around with the following function (quote from Microsoft
> MSDN):
> (with the caveat that you first need to allocate and fill a Unicode string
> for the codepoints you want to test, and this can be lengthy if one wants
> to retrieve the full list of supported codepoints).
> However, this is still the best function to use to know if a string can
> effectively be rendered before drawing it...
> 
> _*GetGlyphIndices*_
> 

GetGlyphIndices() or Uniscribe's ScriptGetCMap() would be OK for checking
coverage for small Unicode blocks such as Gothic (27 codepoints) or even
Mathematical Alphanumeric Symbols (992 codepoints), but I suspect your
application would freeze if you tried to use it to work out exact codepoint
coverage of CJK-B (42,711 codepoints) and PUA-A and PUA-B (65,534 codepoints
each).

Andrew



Re: Question about Unicode Ranges in TrueType fonts

2003-06-26 Thread Philippe Verdy
On Thursday, June 26, 2003 2:26 PM, Philippe Verdy <[EMAIL PROTECTED]> wrote:

I also forgot the probably better function from the Uniscribe library, which 
processes strings through a language-dependent shaping algorithm, and can 
determine appropriate glyph substitutions, or use custom composite fonts to 
process character clusters into grapheme clusters with 1-to-1, 1-to-N, 
N-to-1, or N-to-M substitutions, using either the "cmap" table of classic 
TrueType fonts (which does not support characters out of the BMP), or the 
new tables added in OpenType fonts.

-- Philippe.

source: Microsoft MSDN:

*ScriptGetCMap*

The *ScriptGetCMap* function takes a string and returns the glyph indices of the 
Unicode characters according to the TrueType cmap table or the standard cmap table 
implemented for old style fonts.

HRESULT WINAPI ScriptGetCMap(
  HDC hdc, 
  SCRIPT_CACHE *psc, 
  const WCHAR *pwcInChars, 
  int cChars, 
  DWORD dwFlags, 
  WORD *pwOutGlyphs 
);

*Parameters*

/hdc/ [in] Handle to the device context. This parameter is optional.

/psc/ [in/out] Pointer to a SCRIPT_CACHE structure.

/pwcInChars/ [in] Pointer to a string of Unicode characters. 

/cChars/ [in] Number of Unicode characters in pwcInChars.

/dwFlags/ [in] Flag that specifies any special handling of the glyphs. By default, the 
glyphs of the buffer are given in logical order with no special handling. This 
parameter can be the following value.

- Value - Meaning
- SGCM_RTL - Indicates the glyph array pwOutGlyphs should contain mirrored glyphs for 
those glyphs that have a mirrored equivalent.

/pwOutGlyphs/ [out] Pointer to an array that receives the glyph indexes. 

*Return Values*

If all Unicode code points are present in the font, the return value is S_OK.
If the function fails, it may return one of the following nonzero values.

- Return value - Meaning
- E_HANDLE - The font or the system does not support glyph indices.
- S_FALSE - Some of the Unicode code points were mapped to the default glyph.

If any other unrecoverable error is encountered, it is returned as an HRESULT. 

*Remarks*

ScriptGetCMap may be used to determine which characters in a run are supported by the 
selected font. The caller may scan the returned glyph buffer looking for the default 
glyph to determine which characters are not available. The default glyph index for the 
selected font should be determined by calling ScriptGetFontProperties.

The return value indicates the presence of any missing glyphs.

Note that some code points can be rendered by a combination of glyphs as well as by a 
single glyph -- for example, 00C9; LATIN CAPITAL LETTER E WITH ACUTE. In this case, if 
the font supports the capital E glyph and the acute glyph but not a single glyph for 
00C9, ScriptGetCMap will show 00C9 is unsupported. To determine the font support for a 
string that contains these kinds of code points, call ScriptShape. If it returns S_OK, 
check the output for missing glyphs.

*Requirements*

- Windows NT/2000/XP: Included in Windows 2000 and later.
- Redistributable: Requires Internet Explorer 5 or later on Windows 95/98/Me.
- Header: Declared in Usp10.h.
- Library: Use Usp10.lib.

*See Also*

Uniscribe Overview, Uniscribe Functions, ScriptGetFontProperties, ScriptShape, 
SCRIPT_CACHE
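
As a usage sketch of the pattern the Remarks describe (the helper name
and the fixed 256-glyph buffer are assumptions of the sketch; link with
Usp10.lib):

#include <windows.h>
#include <usp10.h>

// Returns true only if every character maps to a real glyph (S_OK);
// S_FALSE means at least one character fell back to the default glyph.
bool FontCoversString(HDC hdc, const WCHAR* text, int len)
{
    if (len > 256) return false;        // sketch assumption: short strings
    SCRIPT_CACHE cache = NULL;
    WORD glyphs[256];
    HRESULT hr = ScriptGetCMap(hdc, &cache, text, len, 0, glyphs);
    ScriptFreeCache(&cache);
    return hr == S_OK;
}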




Re: Question about Unicode Ranges in TrueType fonts

2003-06-26 Thread Philippe Verdy
On Thursday, June 26, 2003 11:50 AM, Andrew C. West <[EMAIL PROTECTED]> wrote:

> On Wed, 25 Jun 2003 21:58:28 -0700, "Elisha Berns" wrote:
> 
> > Some weeks back there were a number of postings about software for
> > viewing Unicode Ranges in TrueType fonts and I had a few questions
> > about that. Most viewers listed seemed to only check the Unicode
> > Range bits of the fonts which can be misleading in certain cases.
>
> Now the caveat. The USB sets a Surrogates bit to indicate that the
> font contains at least one codepoint beyond the Basic Multilingual
> Plane (BMP). Unfortunately the "ranges" array of the GLYPHSET
> structure only lists contiguous clumps of Unicode codepoints within
> the BMP (wcLow is a 16 bit value), and does not list surrogate
> coverage. Therefore you cannot determine supra-BMP codepoint coverage
> from the GLYPHSET structure. If anyone does know an easy way to do
> this under Windows, please let me know. 

Isn't there a work-around with the following function (quote from Microsoft MSDN):
(with the caveat that you first need to allocate and fill a Unicode string for the
codepoints you want to test, and this can be lengthy if one wants to retrieve the
full list of supported codepoints).
However, this is still the best function to use to know if a string can effectively
be rendered before drawing it...

-- Philippe.

_*GetGlyphIndices*_

The *GetGlyphIndices* function translates a string into an array of glyph indices. The 
function can be used to determine whether a glyph exists in a font.

DWORD GetGlyphIndices(
  HDC hdc,   // handle to DC
  LPCTSTR lpstr, // string to convert
  int c, // number of characters in string
  LPWORD pgi,// array of glyph indices
  DWORD fl   // glyph options
);

_Parameters_

/hdc / [in] Handle to the device context. 

/lpstr/ [in] Pointer to the string to be converted. 

/c/ [in] Length of the string pointed to by lpstr. For the ANSI function it is a BYTE count and for 
the Unicode function it is a WORD count. Note that for the ANSI function, characters 
in SBCS code pages take one byte each, while most characters in DBCS code pages take 
two bytes; for the Unicode function, most currently defined Unicode characters (those 
in the Basic Multilingual Plane (BMP)) are one WORD while Unicode surrogates are two 
WORDs. 

/pgi/ [out] Array of glyph indices corresponding to the characters in the string. 

/fl/ [in] Specifies how glyphs should be handled if they are not supported. This 
parameter can be the following value.

Value - Meaning
GGI_MARK_NONEXISTING_GLYPHS - Marks unsupported glyphs with the hexadecimal value 
0xFFFF.

_Return Values_

If the function succeeds, it returns the number of bytes (for the ANSI function) or 
WORDs (for the Unicode function) converted.
If the function fails, the return value is GDI_ERROR. 

*Windows NT/2000/XP*: To get extended error information, call *GetLastError*.

_Requirements_

- Windows NT/2000/XP: Included in Windows 2000 and later.
- Windows 95/98/Me: Unsupported.
- Header: Declared in Wingdi.h; include Windows.h.
- Library: Use Gdi32.lib.
- Unicode: Implemented as Unicode and ANSI versions.
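
A short sketch of the work-around itself, checking a BMP-only string
against the font currently selected into a device context (the helper
name is illustrative):

#include <windows.h>
#include <vector>

// True if every UTF-16 code unit in text has a glyph in the selected
// font. BMP only: surrogate pairs are not handled here.
bool AllGlyphsPresent(HDC hdc, const WCHAR* text, int len)
{
    std::vector<WORD> indices(len);
    if (GetGlyphIndicesW(hdc, text, len, indices.data(),
                         GGI_MARK_NONEXISTING_GLYPHS) == GDI_ERROR)
        return false;
    for (WORD g : indices)
        if (g == 0xFFFF)                // marker for "no glyph in this font"
            return false;
    return true;
}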




Re: Nightmares

2003-06-26 Thread William Overington
Tom Gewecke wrote as follows.

> My personal idea of an Orwellian nightmare would be to have a committee
> of "vigilant freedom protectors" evaluating the "political and social
> implications of encoding symbols" and passing judgement on whether
> particular characters should be encoded and what their names should not
> be.

Yes, I agree that would be terrible.

The difference between your personal idea of an Orwellian nightmare and
what I am suggesting should take place is great.  I am suggesting that
everybody, as part of their activity in character encoding, be vigilant
that what is encoded does not provide an infrastructure for an Orwellian
nightmare to take place with computing systems such as databases.  The
difference is like that between a country having a special "riot police"
force and one having regular police who wear riot gear when the need
arises.  This distinction was stressed when police in riot gear were
first seen on the streets in England, as the television news began by
using the term "riot police".  So I am not suggesting such a committee,
just ordinary regular people who encode characters being vigilant about
the political and social implications of what they are doing, lest, by
not concerning themselves with such an important aspect of their work,
namely the potential for causing misery, the opportunity for such misery
to occur be unthinkingly provided, or not be prevented when it easily
could be.

Hopefully this will clarify my thinking to you and hopefully be of interest
to people involved in character encoding discussions.

One of the great issues of the last century was whether scientists
should consider the political and social implications of their work or
just work as if somehow separate from society, leaving the application of
the things which they discovered and developed to politicians and
business people.

This issue has arisen because of my concern that a particular symbol has
been labelled as HANDICAPPED SIGN.  I hope that the name will be changed to
WHEELCHAIR SYMBOL.

Yet what if my concerns over the need for vigilance were now dismissed?
What characters might be encoded in the future with what names?  After all,
if no one is willing to be vigilant because that very vigilance is regarded
as an Orwellian nightmare, there would then be no constraints.

I am very much someone who believes in the need for checks and balances.  I
feel that we need checks and balances in what is encoded and what names are
applied to symbols.  I also feel that we need checks and balances as to how
those checks and balances are carried out.

William Overington

26 June 2003




Re: Revised N2586R

2003-06-26 Thread William Overington
Peter Constable wrote as follows.

>  the name is simply a unique identifier within the std.

Well, the Standard is the authority for what is the meaning of the symbol
when found in a file of plain text.  So if the symbol is in a plain text
file before or after the name of a person then the Standard implies a
meaning to the plain text file.

> A name may be somewhat indicative of its function, but is not
> necessarily so.

Well, that could ultimately be an issue before the courts in a libel case if
someone publishes a text with a symbol next to someone's name.  A key issue
might well be what the defined meaning of the symbol in the Standard is.
Certainly, the issue of what a reasonable person seeing that symbol next to
someone's name might conclude is being published about the person might well
also be important, even if that meaning is not in the Standard.

> You could call it WHEELCHAIR SYMBOL, but that engineering of the standard
> is not also social engineering, and people may still use it to label
> individuals in a way that may be violating human rights -- we cannot stop
> that. No matter what we call it, end users are not very likely going to be
> aware of the name in the standard; they're just going to look for the
> shape, and if they find it, they'll use it for whatever purpose they chose
> to.

Certainly.  Yet a plain text interchangeable file would not have the meaning
built into it by the Standard.  I agree though that there may well still be
great problems.

William Overington

26 June 2003




Re: Revised N2586R

2003-06-26 Thread William Overington
Michael Everson wrote as follows.

> At 08:44 -0700 2003-06-25, Doug Ewell wrote:
> 
> > If it's true that either the UTC or WG2 has formally approved the
> > character, for a future version of Unicode or a future amendment to
> > 10646, then I don't see any reason why font makers can't PRODUCE a
> > font with a glyph for the proposed character at the proposed code
> > point.
> 
> > They just can't DISTRIBUTE the font until the appropriate standard is
> > released.
> 
> That's correct.

Well, certainly authority would be needed, yet I am suggesting that where a
few characters added into an established block are accepted, which is what
is claimed for these characters, there should be a faster route than having
to wait for bulk release in Unicode 4.1.  If these characters have been
accepted, why not formally warrant their use now by having Unicode 4.001
and then having Unicode 4.002 when a few more are accepted?  These minor
additions to the Standard could be produced as characters are accepted and
publicised in the Unicode Consortium's webspace.  If the characters have not
been accepted then they cannot be considered ready to be used, yet if they
have been accepted, what is the problem in releasing them so that people who
want to get on with using them can do so?  Some fontmakers can react to new
releases more quickly than can some other fontmakers, so why should progress
be slowed down for the benefit of those who cannot add new glyphs into fonts
quickly?

For example, symbols for audio description, subtitles and signing are needed
for broadcasting.  Will that need to have years of waiting and using the
Private Use Area when it could be a fairly swift process and the characters
could be implemented into read-only memories in interactive television sets
that much sooner?  Why is it that it is regarded by the Unicode Consortium
as reasonable that it takes years to get a character through the committees
and into use?  Surely where a few characters are needed the Unicode
Consortium and ISO need to take a twenty-first century attitude to getting
the job done for people's needs rather than having the sort of delays which
might have been acceptable in days gone by.  The idea of having to use the
Private Use Area for a period after the characters have been accepted is
just a nonsense.

William Overington

26 June 2003




RE: Major Defect in Combining Classes of Tibetan Vowels (Hebrew)

2003-06-26 Thread Jony Rosenne
It may look silly, but it is correct. What you see are letters according to
the writing tradition, which does not include a Yod, and vowels according to
the reading tradition, which does. There are in the Bible other, more extreme
cases. 

I don't think we need any new characters, ZERO WIDTH SPACE would do and it
requires no new semantics. Moreover, everybody who knows his Hebrew Bible
knows the Yod is there although it isn't written.

The Meteg is a completely different issue. There is a small number of places
where the Meteg is placed differently. Since it does not behave the same as
the regular Meteg, and is thus visually distinguishable, it should be
possible to add a character, as long as it is clearly named.

Jony

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of 
> [EMAIL PROTECTED]
> Sent: Thursday, June 26, 2003 9:43 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Major Defect in Combining Classes of Tibetan 
> Vowels (Hebrew)
> 
> 
> Jony Rosenne wrote on 06/26/2003 12:16:22 AM:
> 
> > When, in the Bible, one sees two vowels on a given consonant, it
> > isn't so.
> 
> That's silly. When one sees two vowels on a given consonant 
> in the Bible, 
> it *is* so: the two vowels are written there. It may not 
> correspond to 
> actual phonology, ie what is spoken, but as has been made 
> clear on many 
> occasions, Unicode is not encoding phonology, it is encoding 
> text. And in 
> relation to text, your statement is simply wrong.
> 
> 
> > There is one vowel for the consonant one sees, and another vowel for
> > an invisible consonant. The proper way to encode it is to use some
> > code to represent the invisible consonant. Then the problem mentioned
> > below does not arise.
> 
> The idea of an invisible consonant would amount to encoding a 
> phonological 
> entity, which is the kind of thing that was at one time 
> approved for Khmer 
> (invisible characters representing inherent vowels), but 
> later turned into 
> an albatross, and when I proposed the same thing (invisible inherent 
> vowel) for Syloti Nagri, it was made very clear to me that it 
> would not go 
> down well with UTC.
> 
> Also, the proposed solution of an invisible consonant would leave 
> unresolved the problem of meteg-vowel ordering distinctions, 
> while the 
> alternate proposal of having meteg and vowels all with a class of 230 
> solves both problems at once. Two ad hoc solutions (one for 
> multi-vowel 
> ordering, and another for meteg-vowel ordering) must 
> certainly be far less 
> preferred for one motivated solution (having characters with 
> canonical 
> combining classes that are appropriate for the writing behaviours 
> exhibited).
> 
> I invite people to review the discussions from the unicoRe list from
> last December, at which time everyone (including you, Jony) was
> concluding that the solution which I proposed in L2/03-195 was the
> best solution to pursue.
> 
> 
> - Peter
> 
> 
> ---
> Peter Constable
> 
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> 
> 
> 




Re: Question about Unicode Ranges in TrueType fonts

2003-06-26 Thread Andrew C. West
On Wed, 25 Jun 2003 21:58:28 -0700, "Elisha Berns" wrote:

> Some weeks back there were a number of postings about software for
> viewing Unicode Ranges in TrueType fonts and I had a few questions about
> that. Most viewers listed seemed to only check the Unicode Range bits of
> the fonts which can be misleading in certain cases.

For W2K and XP only, Microsoft provides an API for determining exactly which
Unicode codepoints a font covers.

GetFontUnicodeRanges() in the Platform SDK fills a GLYPHSET structure with
Unicode coverage information for the currently selected font in a given device
context.

The GLYPHSET structure has these members :

cGlyphsSupported - Total number of Unicode code points supported in the font
cRanges - Total number of Unicode ranges in ranges
ranges - Array of Unicode ranges that are supported in the font

Note that "cRanges" is not the number of Unicode blocks supported, and "ranges"
is not an array of Unicode blocks. Rather "ranges" is an array of WCRANGE
structures that specify contiguous clumps of Unicode codepoints, and "cRanges"
is the number of contiguous clumps of Unicode codepoints. The WCRANGE structure
has the following members :

wcLow - Low Unicode code point in the range of supported Unicode code points
cGlyphs - Number of supported Unicode code points in this range

By looping through the "ranges" array it is possible to determine exactly which
characters in which Unicode blocks a given font covers (as long as your software
has an array of Unicode blocks and their codepoint ranges).
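
For illustration, a minimal Win32 sketch of that loop, assuming a device
context with the font of interest already selected (error handling kept
deliberately short):

#include <windows.h>
#include <cstdio>
#include <vector>

void DumpFontCoverage(HDC hdc)
{
    // A first call with NULL just reports the buffer size needed.
    DWORD size = GetFontUnicodeRanges(hdc, NULL);
    if (size == 0) return;

    std::vector<BYTE> buffer(size);
    GLYPHSET* gs = reinterpret_cast<GLYPHSET*>(buffer.data());
    gs->cbThis = size;
    if (GetFontUnicodeRanges(hdc, gs) == 0) return;

    std::printf("%lu code points in %lu contiguous ranges\n",
                gs->cGlyphsSupported, gs->cRanges);

    // Each WCRANGE is one contiguous clump of supported BMP code points;
    // intersecting the clumps with your own table of Unicode blocks gives
    // the per-block counts.
    for (DWORD i = 0; i < gs->cRanges; ++i) {
        unsigned lo = gs->ranges[i].wcLow;
        unsigned n  = gs->ranges[i].cGlyphs;
        std::printf("U+%04X..U+%04X (%u code points)\n", lo, lo + n - 1, n);
    }
}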

Note that unlike the Unicode Subset Bitfield (USB) that is part of the
FONTSIGNATURE structure that is filled by GetTextCharsetInfo() etc.
(available to W9X and NT as well as 2K/XP), which is limited to a
particular version of Unicode (3.0 ?) and returns supersets of Unicode
blocks, the GLYPHSET structure is version-independent. As long as your
software has an up-to-date list of the Unicode blocks and their
constituent codepoints for the latest version of Unicode, you will always
be able to get up-to-date information about the Unicode coverage of a font.

This is the method used in my BabelMap utility, and you will note that it is
therefore able to not only list what Unicode 4.0 blocks are covered by a
particular font, but also give the exact number of codepoints that are covered
in that block. If you want to determine language coverage for a particular font,
then all you need to do is define a minimum set of codepoints that must be
covered for a particular block or set of blocks to be considered as supporting
that language. (Just the little matter of deciding what the minimum set of
codepoints would be for every language that is supported by Unicode ...)
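
A sketch of that language test, with the caller supplying the required
codepoints (the role fontconfig's .orth files play; no standard source
for such lists is assumed):

#include <windows.h>
#include <vector>

// True if every code point in 'required' falls inside one of the
// GLYPHSET's contiguous clumps. Linear scan, for clarity only.
bool CoversLanguage(const GLYPHSET* gs, const std::vector<WCHAR>& required)
{
    for (WCHAR cp : required) {
        bool found = false;
        for (DWORD i = 0; i < gs->cRanges && !found; ++i) {
            unsigned lo = gs->ranges[i].wcLow;
            found = cp >= lo && cp < lo + gs->ranges[i].cGlyphs;
        }
        if (!found) return false;       // one missing code point disqualifies
    }
    return true;
}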

Now the caveat. The USB sets a Surrogates bit to indicate that the font contains
at least one codepoint beyond the Basic Multilingual Plane (BMP). Unfortunately
the "ranges" array of the GLYPHSET structure only lists contiguous clumps of
Unicode codepoints within the BMP (wcLow is a 16 bit value), and does not list
surrogate coverage. Therefore you cannot determine supra-BMP codepoint coverage
from the GLYPHSET structure. If anyone does know an easy way to do this under
Windows, please let me know.
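
The USB check itself is at least cheap, for the little it tells you -- a
sketch, reading bit 57 ("Surrogates", later "Non-Plane 0", in the
OpenType OS/2 range bits) out of the FONTSIGNATURE:

#include <windows.h>

// True if the selected font's signature claims at least one code point
// beyond the BMP. fsUsb is four 32-bit words; bit 57 lives in word 1.
bool FontClaimsSupraBMP(HDC hdc)
{
    FONTSIGNATURE fs = {};
    GetTextCharsetInfo(hdc, &fs, 0);
    return (fs.fsUsb[1] & (1u << (57 - 32))) != 0;
}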

Regards,

Andrew



IUC23 Unicode conference exhibitors' panel report

2003-06-26 Thread Tex Texin
Hi,

For those of you that couldn't attend and were interested in the exhibitor's
panel at the last Unicode conference, a brief summary is now online at:

http://www.unicode.org/iuc/iuc23/showcase-report.html

If you have any comments or feedback on the page, I would be glad to receive
it off-list.

tex



Re: Equivalence of some Japanese characters in Unicode

2003-06-26 Thread Tex Texin
Sourav,

Hi, your question is ambiguous to me. 
You seem to be referring to the fullwidth space and other "wide" or
"fullwidth" characters.
For the fullwidth space look at u+3000 ideographic space.
Unicode has other fullwidth characters encoded. Look at the code charts...

hth
tex

souravm wrote:
> 
> Hi All,
> 
> I have a doubt regarding existence of certain Japanese characters in
> Unicode.
> 
> The characters I'm referring are those like "Double byte space" which
> one can get from old NEC machines or can be entered thru Japanese
> keyboard only.
> 
> Can anyone please throw some light on this ?
> 
> Regards,
> Sourav

-- 
-
Tex Texin   cell: +1 781 789 1898   mailto:[EMAIL PROTECTED]
Xen Master  http://www.i18nGuy.com
 
XenCrafthttp://www.XenCraft.com
Making e-Business Work Around the World
-



24th Unicode Conference - Atlanta, GA - September 3-5, 2003

2003-06-26 Thread Tex Texin

Twenty-fourth Internationalization and Unicode Conference (IUC24)
 Unicode, Internationalization, the Web: Powering Global Business

 http://www.unicode.org/iuc/iuc24
September 3-5, 2003
Atlanta, GA

NEWS
 
 > Visit the Conference Web site ( http://www.unicode.org/iuc/iuc24 )
   to check the updated Conference program and register.  To help you
   choose Conference sessions, we've included abstracts of talks and
   speakers' biographies.

 > Hotel guest room group rate valid to August 12.

 > Early bird registration rates valid to August 12.

 > To find out about, and register for the TILP Breakfast Meeting and
   Roundtable, organized by The Institute of Localisation Professionals,
   and taking place at the same venue on September 4, 7:00 a.m.-9:00 a.m.,
   See: http://www.tilponline.org/events/diary.shtml 
   or
   http://www.unicode.org/iuc/iuc24


 Are you falling behind?  Version 4.0 of the Unicode Standard is here!
 Software and Web applications can now support more languages with
 greater efficiency and lower cost.  Do you need to find out how? Do
 you need to be more competitive around the globe?  Is your software
 upward-compatible with version 4.0?  Does your staff need
 internationalization training?

 Learn about software and Web internationalization and the new Unicode
 Standard, including its latest features and requirements.  This is
 the only event endorsed by the Unicode Consortium.  The conference
 will be held September 3-5, 2003 in Atlanta, Georgia and is
 completely updated.

 KEYNOTES: Keynote speakers for IUC24 are well-known authors in the
 Internationalization and Localization industries:

 Donald De Palma, President, Common Sense Advisory, Inc., and author
 of "Business Without Borders: A Strategic Guide to Global Marketing",
 and Richard Gillam, author of "Unicode Demystified: A Practical
 Programmer's Guide to the Encoding Standard" and a former columnist
 for "C++ Report".

 TUTORIALS:  This redeveloped and enhanced Unicode 4.0 Tutorial is
 taught by Dr. Asmus Freytag, one of the major contributors to the
 standard, and extensively experienced in implementing real-world
 Unicode applications.  Structured into 3 independent modules, you
 can attend just the overview, or only the most advanced material.
 Tutorials in Web Internationalization, non-Latin scripts, and more,
 are offered in parallel and taught by recognized industry experts.

 CONFERENCE TRACKS:  Gain the competitive edge! Conference sessions
 provide the most up-to-date technical information on standards, best
 practices, and recent advances in the globalization of software and
 the Internet.  Panel discussions and the friendly atmosphere allow
 you to exchange ideas and ask questions of key players in the 
 internationalization industry.

 WHO SHOULD ATTEND?:  If you have a limited training budget, this is
 the one Internationalization conference you need.  Send staff that
 are involved in either Unicode-enabling software, or internationalization
 of software and the Internet, including: managers, software engineers,
 systems analysts, font designers, graphic designers, content developers,
 Web designers, Web administrators, technical writers, and product
 marketing personnel.

CONFERENCE WEB SITE, PROGRAM and REGISTRATION

   The Conference Program and Registration form are available at the
   Conference Web site:
  http://www.unicode.org/iuc/iuc24

CONFERENCE SPONSORS

   Agfa Monotype Corporation
   Basis Technology Corporation
   ClientSide News L.L.C.
   Oracle Corporation
   World Wide Web Consortium (W3C)
   XenCraft

GLOBAL COMPUTING SHOWCASE

   Visit the Showcase to find out more about products supporting the
   Unicode Standard, and products and services that can help you
   globalize/localize your software, documentation and Internet content.

   Sign up for the Exhibitors' track as part of the Conference.
   For more information, please see:
   http://www.unicode.org/iuc/iuc24/showcase.html

CONFERENCE VENUE

The Conference will take place at:

  DoubleTree Hotel Atlanta Buckhead
  3342 Peachtree Road
  Atlanta, GA 30326

  Tel: +1-404-231-1234
  Fax: +1-404-231-3112

CONFERENCE MANAGEMENT

   Global Meeting Services Inc.
   8949 Lombard Place, #416
   San Diego, CA 92122, USA

   Tel: +1 858 638 0206 (voice)
+1 858 638 0504 (fax)

   Email: [EMAIL PROTECTED]
  or: [EMAIL PROTECTED]

THE UNICODE CONSORTIUM

 The Unicode Consortium was founded as a non-profit organization in 1991.
 It is dedicated to the development, maintenance and promotion of The
 Unicode Standard, a worldwide character encoding. The Unicode Standard
 encodes

Re: Biblical Hebrew (Was: Major Defect in Combining Classes of TibetanVowels)

2003-06-26 Thread Peter_Constable
Karljürgen Feuerherm wrote on 06/25/2003 08:31:41 PM:

> I was going to suggest something very similar, a ZW-pseudo-consonant of
> some kind, which would force each vowel to be associated with one
> consonant.

An invisible *consonant* doesn't make sense because the problem involves 
more than just multiple written vowels on one consonant; in fact, that is 
a small portion of the general problem. If we want such a character, it 
would notionally be a zero-width-canonical-ordering-inhibiter, and nothing 
more.

And I don't particularly want to think about what happens when people start 
sticking this thing into sequences other than Biblical Hebrew ("in 
unicode, any sequence is legal").



> General question: when does canonical reordering take place? At input
> time, at rendering time, at another time?

For the purpose for which canonical ordering was intended, it occurs when 
comparing two strings for "equality" or ordering. In practice, it can 
occur at *any* time, including transmission (when it is no longer under 
the control of the author). Some protocols, and notably W3C protocols, 
require that data be canonically ordered, and recommend that this happen 
at the earliest point possible, e.g. at input time.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485




Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-26 Thread Peter_Constable
John Hudson wrote on 06/25/2003 06:47:44 PM:

> > This is not. The Unicode Standard makes no assumptions or claims
> > about what the phonological or meaning equivalence of <hiriq, patah>
> > or <patah, hiriq> is for Biblical Hebrew.
> 
> But it does make assumptions about the canonical equivalence of the
> mark orders <patah, hiriq> and <hiriq, patah>, unless my understanding
> of the purpose of combining classes is completely mistaken.

Your understanding on this point is correct.


> My understanding 
> is that any ordering of two marks with different combining classes is 
> canonically equivalent; 

Yes.


> further, I understand that some normalisation forms will re-order
> marks to move marks with lower combining class values closer to the
> base character.

*Every* Unicode normalization form will apply canonical reordering.



> * Meteg re-ordering is in some respects even more problematic than 
> multi-vowel re-ordering

And it is because of meteg-vowel ordering distinctions that the ordering 
of things like patah + hiriq should not be solved in any way other than 
the two having the same canonical combining class, because that is exactly 
what will be needed to deal with meteg-vowel ordering distinctions.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485




Re: Biblical Hebrew (Was: Major Defect in Combining Classes of TibetanVowels)

2003-06-26 Thread Peter_Constable
Ken Whistler wrote on 06/25/2003 06:57:56 PM:

> People could consider, for example, representation
> of the required sequence:
> 
>   <qamets, hiriq>
> 
> as:
> 
>   <qamets, ZWJ, hiriq>

So, we want to introduce yet *another* distinct semantic for ZWJ? We've 
got one for Indic, another for Arabic, another for ligatures (similar to 
that for Arabic, but slightly different). Now another that is "don't 
effect any visual change, just be there to inhibit reordering under 
canonical ordering / normalization"?



> The presence of a ZWJ (cc=0) in the sequence would block
> the canonical reordering of the sequence to hiriq before
> qamets. If that is the essence of the problem needing to
> be addressed, then this is a much simpler solution which would
> impact neither the stability of normalization nor require
> mass cloning of vowels in order to give them new combining
> classes.

Yes, it would accomplish all that; and it is a groanable kludge. At least 
with having distinct vowel characters for Biblical Hebrew, we'd come to a 
point where we could forget about it, and wouldn't be wincing every time 
we considered it.


 
> The problem of combinations of vowels with meteg could be
> amenable to a similar approach. OR, one could propose just
> one additional meteq/silluq character, to make it possible
> to distinguish (in plain text) instances of left-side and
> right-side meteq placement, for example.

And the third position of meteg with hataf vowels? Introduce *two* 
additional meteg/silluq characters?



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485




Re: Major Defect in Combining Classes of Tibetan Vowels (Hebrew)

2003-06-26 Thread Peter_Constable
Jony Rosenne wrote on 06/26/2003 12:16:22 AM:

> When, in the Bible, one sees two vowels on a given consonant, it isn't
> so.

That's silly. When one sees two vowels on a given consonant in the Bible, 
it *is* so: the two vowels are written there. It may not correspond to 
actual phonology, ie what is spoken, but as has been made clear on many 
occasions, Unicode is not encoding phonology, it is encoding text. And in 
relation to text, your statement is simply wrong.


> There is one vowel for the consonant one sees, and another vowel for an
> invisible consonant. The proper way to encode it is to use some code to
> represent the invisible consonant. Then the problem mentioned below does
> not arise.

The idea of an invisible consonant would amount to encoding a phonological 
entity, which is the kind of thing that was at one time approved for Khmer 
(invisible characters representing inherent vowels), but later turned into 
an albatross, and when I proposed the same thing (invisible inherent 
vowel) for Syloti Nagri, it was made very clear to me that it would not go 
down well with UTC.

Also, the proposed solution of an invisible consonant would leave 
unresolved the problem of meteg-vowel ordering distinctions, while the 
alternate proposal of having meteg and vowels all with a class of 230 
solves both problems at once. Two ad hoc solutions (one for multi-vowel 
ordering, and another for meteg-vowel ordering) must certainly be far less 
preferred for one motivated solution (having characters with canonical 
combining classes that are appropriate for the writing behaviours 
exhibited).

I invite people to review the discussions from the unicoRe list from last 
December, at which time everyone (including you, Jony) was concluding 
that the solution which I proposed in L2/03-195 was the best solution to 
pursue.


- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485




Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-26 Thread Peter_Constable
Michael Everson wrote on 06/25/2003 04:36:20 PM:

[ re Biblical Hebrew ]

> Write it up with glyphs and minimal pairs and people will see the 
> problem, if any. Or propose some solution. (That isn't "add duplicate 
> characters".)

The only solution that UTC is willing to consider I have already submitted 
in a proposal (L2/03-195).



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485




Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-26 Thread Peter_Constable
John Cowan wrote on 06/25/2003 03:15:21 PM:

> I don't understand how the current implementation "breaks BH text".
> At worst, normalization may put various combining marks in a
> non-traditional order, but all alternative orders are canonically
> equivalent anyway, and
> no (ordinary) Unicode process should depend on any specific order.

No, John, there are distinctions in Biblical Hebrew related to ordering, 
but due to the canonical combining classes these distinctions are all 
neutralized under canonical ordering / normalization. The alternate orders 
are canonically equivalent, but should not have been so.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485




Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-26 Thread Peter_Constable
Ken Whistler wrote on 06/25/2003 05:29:59 PM:

> > The point is that hiriq before patah is *not* 
> > canonically equivalent to patah before hiriq,
> 
> This is true. 
> 
> > except in the erroneous assumption of the Unicode Standard: the
> > order of vowels makes words sound different and mean different
> > things.
> 
> This is not.

Ken, I think you're reading John differently than he intended: the Unicode 
character sequences < hiriq, patah > and < patah, hiriq > *are* 
canonically equivalent, but the requirements for Biblical Hebrew are that 
alternate visual orders would correspond to different vocalizations, and 
thus the visual ordering of these does matter semantically, and therefore 
the encoded orders should *not* be canonically equivalent.


> The current situation is not optimal for implementations, nor
> does canonically ordered text follow traditional preferences
> for spelling order -- that we can agree on. But I think the
> claims of inadequacy for the representation or rendering
> of Biblical Hebrew text are overblown.

The serious problem is that the writing distinctions that matter cannot 
currently be reliably represented, as they are not preserved under 
canonical ordering / normalization. This is all just a rehash of 
discussions we had on this list back in December, at which time it was 
acknowledged that this was the case, and that this was a problem.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485