date:20040331

From: Doug Ewell [EMAIL PROTECTED]
To: Unicode Mailing List [EMAIL PROTECTED]
Cc: Kenneth Whistler [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Wednesday, March 31, 2004 8:38 AM
Subject: PUA properties, default or otherwise (was: Re: What is the principle?)

 This discussion has focused pretty tightly on the *default* properties
 of PUA code points, without really addressing the issue of specifying
 new properties to override those defaults, and I think that's a mistake.

Exactly what I was saying. But you had more arguments for my remark.

 But Ken and Rick
 are absolutely right that very few companies are going to see a business
 opportunity in this.  Even SC UniPad, which has implemented many
 comparatively arcane features of Unicode, has never done anything with
 the PUA, though it has been on their future versions list for 6 years
 now.

One of the main reasons may be precisely that they are limited by the lack of
accurate properties for PUAs.
But I see no reason why there could not exist an interoperable format to send
these properties.
I proposed to include that information in fonts (notably OpenType), but it may
also be sent separately (in a font without the glyphs?)

Of course we can argue that some of the missing features may in some cases be
encoded directly within the main text (for example by using RLO/PDF controls in
the plain text to override the BiDi properties).
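
As a rough illustration of that in-band approach, here is a minimal sketch using Python's standard unicodedata module. The helper name force_rtl and the three sample PUA code points are hypothetical; nothing here comes from an existing PUA convention.

```python
import unicodedata

RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

def force_rtl(text: str) -> str:
    """Wrap text so that a bidi-aware renderer lays it out right to left."""
    return f"{RLO}{text}{PDF}"

sample = "\uE000\uE001\uE002"  # hypothetical PUA characters, default bidi class L
wrapped = force_rtl(sample)

# Show the bidi class of each code point in the wrapped string.
print([unicodedata.bidirectional(ch) for ch in wrapped])
```

The override only changes how the run is laid out; the PUA characters themselves keep their default properties, which is why this covers only some of the missing features.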

I also don't think that such an application is only for idiosyncratic characters.
There are LOTS of scripts on earth that will probably never come under the scrutiny
of Unicode, but that users may wish to start studying in an interoperable way
with common reusable technical solutions to create the documents they need. You
may think that using some rich text format (Word DOC, Acrobat PDF, HTML+SVG...)
would palliate the lack of standardization. But I do think that there is still
some place for plain text.

Re: What is the principle?

From: Kenneth Whistler [EMAIL PROTECTED]
 Consider another example. The normalization algorithm has to work
 for *all* Unicode code points, assigned or not, because it guarantees
 stability into the future when characters are encoded at code points
 which were previously unencoded. It also, then, obviously has to
 work for PUA characters, as well. That implies that two additional
 properties *MUST* have some default values set for PUA characters.
 One of those is decomposition, which is defaulted to the null string
 (no decomposition) for all PUA characters. The other is canonical
 combining class, which is defaulted to ccc=0 for all PUA characters.
 Doing anything else would have just been stupid. But again,
 None of the Above was not an option.

All these arguments are in favor of defining a set of default properties with
reasonable values that match the most common (?) needs. Still, this should not
prohibit the use and interchange of other properties: these defaults are
then not mandatory, and are overridable.
Even in the case of an API that requires being able to do something like:
Character(E000).getProperty(), that API may be pre-fed with a table of
property overrides for PUAs.
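
To make the idea concrete, here is a minimal sketch of such a pre-fed lookup, assuming Python's standard unicodedata module as the source of the defaults. The table PUA_OVERRIDES, its contents and the function get_property are hypothetical; no existing API works this way as far as I know.

```python
import unicodedata

# Hypothetical override table keyed by code point. In practice it might be
# loaded from a font table or a side file exchanged by private agreement.
PUA_OVERRIDES = {
    0xE000: {"category": "Lo", "bidi": "R", "ccc": 0},
    0xE001: {"category": "Mn", "bidi": "NSM", "ccc": 230},
}

# Fallbacks: the default values shipped in the Unicode Character Database.
UCD_DEFAULTS = {
    "category": unicodedata.category,
    "bidi": unicodedata.bidirectional,
    "ccc": unicodedata.combining,
}

def get_property(ch: str, prop: str):
    """Return the overridden value if one is registered, else the UCD default."""
    override = PUA_OVERRIDES.get(ord(ch), {})
    if prop in override:
        return override[prop]
    return UCD_DEFAULTS[prop](ch)

# U+E000 is overridden, U+E002 falls back to the default.
print(get_property("\uE000", "bidi"), get_property("\uE002", "bidi"))
# The defaults Ken describes: no decomposition and ccc=0 for PUA characters.
print(repr(unicodedata.decomposition("\uE000")), unicodedata.combining("\uE000"))
```

The lookup itself is the easy part; the hard part discussed in this thread is getting rendering and bidi engines to consult such a table at all.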

RE: Unicode 4.0.1 Released

2004-03-31 Thread Marco Cimarosti

Rick McGowan wrote:
 Unicode 4.0.1 has been released! [...]
 The main new features in Unicode 4.0.1 are the following:
 [...]
 3. Unicode Character Database:
 [...]
   * Changed: general category of U+200B ZERO WIDTH SPACE
   * Changed: bidi class of several characters

(If I am asking a FAQ, I apologize in advance...)

So far, my understanding was that the normative properties of existing code
points were carved in stone.

Won't these fixes break applications out there? I.e., won't they turn
previously conformant applications into non conformant ones?
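
For anyone who wants to see the kind of change in question, here is a small sketch, assuming a Python installation whose unicodedata module still ships the frozen Unicode 3.2 snapshot (unicodedata.ucd_3_2_0) next to its current data. It only compares the two values of one property and says nothing about conformance.

```python
import unicodedata

ZWSP = "\u200B"  # ZERO WIDTH SPACE

# General category under the frozen Unicode 3.2 data.
old_category = unicodedata.ucd_3_2_0.category(ZWSP)
# General category under the data bundled with this Python build.
new_category = unicodedata.category(ZWSP)

print(unicodedata.ucd_3_2_0.unidata_version, old_category)
print(unicodedata.unidata_version, new_category)
```

An application that tested for space separators by general category alone would treat U+200B differently under the two data sets, which is the sort of breakage the question is about.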

_ Marco

Re: What is the principle?

From: Michael Everson [EMAIL PROTECTED]
 At 17:02 -0800 2004-03-30, Mike Ayers wrote:
 I feel obligated to take this one step further - these folks are
 forgetting that P stands for private.  Their use of this space
 is their own problem, in all senses.  It does not seem reasonable to
 me that *any* standard behavior could be expected of PUA code
 points, from operating systems or applications, as such may have
 chosen to, or may yet choose to, use those code points to
 encapsulate very un-font-rendering-like behavior, and such a
 decision, made past, present or future, is a perfectly valid private
 use.

 Which I assume means: it's wrong for Unicode to make ANY property
 pronouncements for ANY PUA characters, since that defines them, and
 removes the P from the Use.

Do you mean here that any properties currently defined in Unicode for PUAs
should be deprecated with their current normative value, and left to
implementers, so that no application can be called non-conforming if it implements
other defaults?
Maybe this would require some adjustments in the normative wording related to
Unicode conformance...

Likewise, variation selectors used on PUAs should not be constrained (the
current restrictions on variation selector usage should not apply to PUAs,
given that a VSn should still be fully ignorable, including for PUAs that have
no defined normative semantics in Unicode, meaning that the combination
PUA+VSn also has no defined normative semantics in Unicode itself).

Leave that for implementations, and maybe we'll ease the development of new
scripts, by allowing other groups to work on some interchangeable formats based
on PUAs, which could then be later integrated in Unicode after an easier phase
in which these scripts would have been experimented with. It would ease the
adoption of a later consensus, and would offer a great tool for developers and
researchers, who could safely base their work on Unicode encoding conventions.

Also this would be a good indicator that specialized 8-bit code sets are no
longer necessary, and IANA could then close its 8-bit encodings registry, in
favor of PUA-based encodings defined by some conventional rules which could then
become a standard and open extension mechanism...

This will have the advantage of avoiding pressure on Unicode to standardize new
scripts too fast, and longer open experimentation would avoid many future
errors in the newly standardized scripts.

The CSUR registry is one approach for the definition of new scripts, SIL.org has
its own, but for now I see little effort to allow specifying these properties
in a partially interchangeable format, and one reason can be that Unicode has
placed too many restrictions on the usage of PUAs, so that developers fear that
their protocols which need them will become non-conforming.

I do think that there must exist a way to have PUAs used safely without
ambiguities or risks of collisions, using extensions mechanisms similar to
namespaces in XML, and some normative declarations and possibly a registry of
PUA sets (why not the IANA charsets registry if it can reference the associated
properties with some URL to a script definition schema?).

Re: What is the principle?

On 30/03/2004 17:32, Michael Everson wrote:

At 17:02 -0800 2004-03-30, Mike Ayers wrote:

I feel obligated to take this one step further - these folks are 
forgetting that P stands for private.  Their use of this space is 
their own problem, in all senses.  It does not seem reasonable to me 
that *any* standard behavior could be expected of PUA code points, 
from operating systems or applications, as such may have chosen to, 
or may yet choose to, use those code points to encapsulate very 
un-font-rendering-like behavior, and such a decision, made past, 
present or future, is a perfectly valid private use.


Which I assume means: it's wrong for Unicode to make ANY property 
pronouncements for ANY PUA characters, since that defines them, and 
removes the P from the Use.

This is of course a principle which they have already broken, as they 
have defined default properties for all of them. Although in principle 
people can implement non-default properties, no one has, as far as I 
know. The result is that in practice the P has been removed from the PUA 
and it has been restricted to LTR base characters.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: What is the principle?

On 30/03/2004 16:46, Kenneth Whistler wrote:

...

Work it out. Any proposal to assign property ranges into the PUA
would run up on the rocks of all the details. And *then* it would
meet a stonewall in the UTC. And *then* it would meet another stonewall
in SC2.
Quit banging your head against the walls and look for alternatives
more likely to lead somewhere.
 

The only alternative I see is to rewrite from scratch the display 
routines of my favourite OS. I think banging my head against walls is 
likely to be faster. After all, even the hardest wall cracks eventually, 
and my head is quite hard.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)

2004-03-31 Thread Antoine Leca

On Tuesday, March 30, 2004 11:42 PM, Ernest Cline va escriure:

 The main usage is with compound words such as ice cream or
 Louis XIV or commercial phrases such as Camry SE where for
 esthetic reasons an author would prefer that the space not expand
 upon justification,

Well, as one that takes the pain to enter ALT+0160 here and there
(particularly around « and » in French), I should say that I certainly would
like the space between Louis and XIV, or between Camry and SE to stay of
fixed width; on the other hand, I would expect the one between ice and cream
to expand according to the rhythm of the paragraph, in order to not break the
reading. Like in

Mum,   I   want   an   ice   cream

against

Mum,   Iwant   anice cream

 I am not aware of any style guides that offer either
 normative or informative guidance for either choice.

The French guides of styles (after all, we can use Unicode to write French
as well as English, can't we?) generally say that NBSP should not be
expanded on justification. I do not know right now (I miss access to
definitive references) if this is general to all non-breaking spaces,
including those that do have fixed-width per se, or if it specifically
applies to U+00A0. It should be outlined that non-breaking spaces occur
rather frequently in French (around several punctuation characters), and
because many word processors are not rich enough to encode it as it should
(i.e., as ZWNBSP+THSP+ZWNBSP, \uFEFF\u2009\uFEFF), well they encode it as
U+00A0 :-(.


 NBSP ZWNJ breaks, but should it justify?
^^
This is an error, isn't it?


Antoine

RE: Why is U+17C1 of General category Mc while U+0E40 and U+0EC0 are of category Lo ?

2004-03-31 Thread Kent Karlsson


[EMAIL PROTECTED] wrote:
 Thai (and Lao, whose encoding closely parallels that of Thai) are
 encoded in Unicode on unique principles:  by a straight left-to-right
 typewriter-style encoding.  This was done for compatibility with the
 pervasive Thai 8-bit standard.  It also means that for collation
 purposes what are historically left-side vowels must be moved after
 the following consonant.

For more on collation of Thai, Lao, and Khmer, see the proposed update to
ISO/IEC 14651 CTT (and the UTS #10 DUCET), and a tailoring for the CTT,
in the two documents:
N2718 http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2718.doc
N2717 http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2717.doc

(Note that the swapping part for Thai/Lao of the tailoring is dealt with
by other means (in the prehandling) in the Unicode collation algorithm.)

 Note that the Thai characters are not labeled LETTER or VOWEL SIGN or
 what have you, but simply CHARACTER.

Yes, but that has no particular consequence. Note that the vowel signs
are in the documents referenced above treated as vowel signs, regardless
of if they are called LETTER, VOWEL SIGN, or CHARACTER (and,
actually, regardless of their general category, as it happens). There is
also the complication that some of the consonant characters are
logically used as vowel (parts), but the modern convention is to ignore
that in the collation rules, and always treat them as consonants in
collation.
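
A rough sketch of what that prehandling swap amounts to, in Python. This is a simplification under stated assumptions: the function name to_logical_order is hypothetical, and it swaps a preposed vowel with whatever character follows it, without checking that the follower is a consonant as a real collation implementation would.

```python
import unicodedata

# Thai U+0E40..U+0E44 and Lao U+0EC0..U+0EC4 are stored before the consonant
# they logically follow (typewriter order).
PREPOSED_VOWELS = set(range(0x0E40, 0x0E45)) | set(range(0x0EC0, 0x0EC5))

def to_logical_order(text: str) -> str:
    """Swap each preposed vowel with the character that follows it."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if ord(chars[i]) in PREPOSED_VOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return "".join(chars)

word = "\u0E40\u0E01"  # SARA E + KO KAI, in encoded (visual) order
print([unicodedata.name(c) for c in to_logical_order(word)])

# The general categories asked about in the subject line.
print(unicodedata.category("\u17C1"), unicodedata.category("\u0E40"),
      unicodedata.category("\u0EC0"))
```

Collation keys would then be built from the reordered string, so the consonant becomes the primary sort element regardless of the general category of the vowel.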

/kent k

Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)

On 30/03/2004 18:01, fantasai wrote:

Ernest Cline wrote:

The main usage is with compound words such as ice cream or
Louis XIV or commercial phrases such as Camry SE where for
esthetic reasons an author would prefer that the space not expand
upon justification,


Given wide enough measures, good text layout program should be able
to produce justified text without very noticeable changes in word
spacing.
NBSP doesn't break, but should it justify?


I believe NBSP should be, to the reader, indistinguishable from a
regular space. It does not have a semantic function as a compound-
word-joiner; it's just a space that doesn't break, and therefore
should be treated like any other space.
~fantasai

So perhaps the best thing to do in cases like Ernest's and mine, where a 
fixed width non-breaking space is required, is to use FIGURE SPACE, 
which I understand is non-breaking. But then perhaps this is too wide in 
some circumstances - in many fonts it is twice the regular width of SPACE.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

sara am ordering (was RE: Why is U+17C1 of General category Mc while U+0E40 and U+0EC0 are of category Lo ?)

2004-03-31 Thread Peter Constable

Kent:

Your doc says,

<quote, emphasis added>
And ำ should be ordered as า followed by ํ (**which is the logical sequence, despite
the Unicode compatibility decomposition**).
</quote>

What do you mean here by logical sequence? That that's how it should be interpreted 
phonologically and for sorting purposes, or that that is the correct encoded sequence 
for decomposed representations?

If the latter, that seems to me to be quite wrong: I would not expect *any* data that
includes a decomposed representation of sara am to have the sequence <C, sara aa,
nikkahit>: it would always be the other way around: <C, nikkahit, sara aa>.

Of course, if the former, I would agree.
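
The encoded order of the compatibility decomposition is easy to check. A minimal sketch, assuming Python's standard unicodedata module:

```python
import unicodedata

SARA_AM = "\u0E33"  # THAI CHARACTER SARA AM

# Compatibility decomposition of sara am, in encoded order.
decomposed = unicodedata.normalize("NFKD", SARA_AM)
print([unicodedata.name(c) for c in decomposed])
```

This shows only what the decomposition mapping yields as an encoded sequence; it says nothing about the order in which the parts should be weighted for sorting.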



Peter
 
Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division

Re: What is the principle?

2004-03-31 Thread Doug Ewell

Peter Kirk peterkirk at qaya dot org wrote:

 Which I assume means: it's wrong for Unicode to make ANY property
 pronouncements for ANY PUA characters, since that defines them, and
 removes the P from the Use.

 This is of course a principle which they have already broken, as they
 have defined default properties for all of them. Although in
 principle people can implement non-default properties, no one has, as
 far as I know. The result is that in practice the P has been removed
 from the PUA and it has been restricted to LTR base characters.

Unicode allows the properties of the PUA code points, unlike all others,
to be customized by the end user.  I've done so myself, on the Web page
I mentioned.  Characters are classified as General Category Lo, Nd, or
No, and the digits have numeric values.  Although all are still LTR base
characters, there's no reason they had to be (except that that's how my
script works); for Tengwar there would be both RTL digits and combining
marks.

The perception that no-one has yet implemented custom PUA properties
does not mean that doing so is prohibited or unworkable, any more than
the shortage of widely available rendering engines for the Tibetan and
Khmer encoding models implies that those models are unworkable.

Failure to see this distinction, between (a) what Unicode allows and
prohibits and (b) what software products do and do not support, is doing
more to convince us of the hardness of Peter's head than anything else.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

Re: French typographic thin space (was: Fixed Width Spaces)

From: Antoine Leca [EMAIL PROTECTED]
 The French guides of styles (after all, we can use Unicode to write French
 as well as English, can't we?) generally say that NBSP should not be
 expanded on justification. I do not know right now (I miss access to
 definitive references) if this is general to all non-breaking spaces,
 including those that do have fixed-width per se, or if it specifically
 applies to U+00A0. It should be outlined that non-breaking spaces occur
 rather frequently in French (around several punctuation characters), and
 because many word processors are not rich enough to encode it as it should
 (i.e., as ZWNBSP+THSP+ZWNBSP, \uFEFF\u2009\uFEFF), well they encode it as
 U+00A0 :-(.

In fact the French typographic tradition is to use a THIN
non-breaking space, which is not precisely what NBSP encodes; NBSP is used
as a common APPROXIMATION simply because the THIN non-justifiable and
non-breaking space is absent from legacy 8-bit sets (including ISO-8859-1,
ISO-8859-15, Windows 1252, CP850, for the most widely used ones).

The rule is to use this thin space (called une fine or une espace fine in
French) before punctuation marks composed of two separate glyphs: the colon,
semicolon, exclamation mark and question mark, and between « and the
quoted phrase, and between the quoted phrase and ».

A similar rule exists also in traditional English typography, however there's a
small variant here: the French thin space is a bit wider than the English one,
so the best approximation for French is to use NBSP, and for English to use
nothing (also because most fonts made by English typographers already
incorporate the additional very thin space within the spacing width of the
punctuation mark)...

There are pros and cons with the NBSP approximation used in French. Some have
argued that it would be better to not encode anything here, and instead to use
fonts containing punctuation marks that already include the appropriate
additional spacing within the glyph spacing width.

Still, many French typographic composition engines (notably those used by newspapers,
magazines, guides and diaries -- for example the French product Calligrame
distributed by X-Media in various countries, or other composition engines used
by regional or national newspapers) already recognize the sequence
NBSP+punctuation or punctuation+NBSP and interpret the NBSP code as meaning the
presence of the French espace fine, so printed books, newspapers and magazines
already apply the correct style (these newspapers in France have long used
SGML to create their laser masters, along with quite advanced,
precise and coherent stylesheets that are part of the signature of the
publication, i.e. its maquette design, which also incorporates many custom
logographs and symbols, notably in dictionaries, guides and newspapers).

So yes the correct code for French should be ZWNBSP+THSP+ZWNBSP (but beware of
the difference of spacing between the English and French thin space, with one at
1/6 em, the other at 1/8 em...)
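
A minimal sketch of a filter applying this rule to plain text, in Python. The function name french_thin_spaces is hypothetical. By default it uses U+2060 WORD JOINER around THIN SPACE, on the assumption that WORD JOINER is the preferred character for the ZWNBSP function that U+FEFF also carried; passing U+202F NARROW NO-BREAK SPACE gives a single-character alternative.

```python
import re

WJ = "\u2060"     # WORD JOINER (takes over the ZWNBSP role of U+FEFF)
THSP = "\u2009"   # THIN SPACE
NNBSP = "\u202F"  # NARROW NO-BREAK SPACE, a single-character alternative

def french_thin_spaces(text: str, glue: str = WJ + THSP + WJ) -> str:
    """Replace NBSP or SPACE next to French double punctuation with glue."""
    # Space before ; : ! ? and before the closing guillemet.
    text = re.sub("[\u00A0 ]([;:!?»])", lambda m: glue + m.group(1), text)
    # Space after the opening guillemet.
    text = re.sub("«[\u00A0 ]", lambda m: "«" + glue, text)
    return text

print(ascii(french_thin_spaces("Quoi\u00A0?")))
print(ascii(french_thin_spaces("«\u00A0oui\u00A0»", NNBSP)))
```

This only rewrites the code points. Whether the resulting space comes out at the right width still depends on the font, as noted above for the English and French thin spaces.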

Re: What is the principle?

2004-03-31 Thread Language Analysis Systems, Inc. Unicode list reader

On 31/03/2004 08:08, Doug Ewell wrote:

...

The perception that no-one has yet implemented custom PUA properties
does not mean that doing so is prohibited or unworkable, any more than
the shortage of widely available rendering engines for the Tibetan and
Khmer encoding models implies that those models are unworkable.
Failure to see this distinction, between (a) what Unicode allows and
prohibits and (b) what software products do and do not support, is doing
more to convince us of the hardness of Peter's head than anything else.
 

Doug, I don't know who you are accusing of failing to see this 
distinction, but it certainly isn't me. I have made it very clear 
several times that I understand that IN PRINCIPLE I am free to write my 
own operating system, or a large part of it, to display these characters 
as I wish. The problem is one IN PRACTICE.

Your advice reminds me of the advice that might have been given to 
Burbage (?) not to hire Shakespeare, but rather to use a team of monkeys 
because given enough time they would write the same plays - true, but 
not practical. The ones I am comparing to monkeys are would-be PUA users 
like myself who are no more capable than monkeys of writing OSs in a 
sensible time frame. (Sadly there are no OSs in the Shakespeare 
category.) :-)

But this practical problem would go away (in time, but a lot less time 
than it would take me to write an OS!) if Unicode specified different 
DEFAULT (read only ones supported in any commercial or open source 
software) properties for parts of the PUA, and the software companies 
implemented this - which would be trivial if specified.

You claim to have customised the properties of PUA characters. Do you 
mean that you have written software which processes them according to 
your customisations? It is easy to list properties. It is very hard to 
implement them, if one has to start from scratch, without any help from 
the established manufacturers.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

RE: Fixed Width Spaces (was: Printing and Displaying DependentVowels)

So perhaps the best thing to do in cases like Ernest's and mine, where a
fixed width non-breaking space is required, is to use FIGURE SPACE,
which I understand is non-breaking. But then perhaps this is too wide in
some circumstances - in many fonts it is twice the regular width of SPACE.

Going out on a limb here...

It sorta seems like the need to keep phrases like Louis XIV together
is a valid one that deserves a solution, but it also seems fairly
esoteric-- typesetters and people who give a lot of thought to the
presentation of their text might use this, but most people wouldn't.
This makes me wonder if it's a plain-text thing.

I'm not saying this is a problem that should be solved through markup,
but if you care enough about the presentation of the text to care about
this, you're probably also already using styled text to specify other
things you care about, such as the font you're using.  And if you know
what font you're using, you can use THREE-PER-EM SPACE or FOUR-PER-EM
SPACE (or maybe SIX-PER-EM SPACE or FIGURE SPACE), because you know
which one is the right width in your font.

For that matter, if a typical space is usually either a third of an em
or a quarter of an em wide, my guess is you could probably use either
THREE-PER-EM SPACE or FOUR-PER-EM SPACE anyway, and even if this didn't
exactly match the width of a space in the particular font used to render
your text, it'd probably still look okay.  But then again, I'm not a
typographer.

Fading back into the background...

--Rich Gillam
  Language Analysis Systems, Inc.

Re: What is the principle?

2004-03-31 Thread Youtie Effaight

On: 2004-03-31 06:43:38 -0800 Peter Kirk peterkirk at qaya.org scribed:
The only alternative I see is to rewrite from scratch the display
routines of my favourite OS. I think banging my head against walls is
likely to be faster. After all, even the hardest wall cracks eventually,
and my head is quite hard.


Bang on, O Mighty One!

Yer ol' Pal,
Youtie

Re: What is the principle?

On 30/03/2004 16:30, Kenneth Whistler wrote:

...

Uh, sorry, Peter, but the implications here are so much b, err, ...
baloney.
The majority of the world's scripts are left-to-right. They also
happen to be non-Western. There are more *Indic* scripts encoded
in the Unicode Standard than *Western* scripts.
The majority of *entities* that the majority of users put into
PUA characters in actual application usage are unencoded CJK
ideograph variants and symbols from Asian code pages. It was
primarily the need to accommodate those *Eastern* users that drove
the setting of default values for the PUA.
 

OK, in that case let's allocate properties to PUA characters in 
proportion to the number of RTL vs LTR scripts, and the proportion of 
combining marks vs. base characters, in actual encoded scripts. The 
majority of PUA characters are unchanged. A significant minority become 
RTL or non-spacing.

A lot of effort has gone into accommodating certain *Eastern* users. 
Something like 100,000 CJK characters have already been defined, and 
already that is not enough and they have requisitioned two more planes 
of PUA with LTR properties. Fair enough if they might be needed. But 
what if users of certain other scripts e.g. RTL scripts want just a 
handful of PUA characters with the properties they need? Why is 
preference given to CJK? This sounds like bias to me even if I was wrong 
to call it western.

This bias is also reflected in their 
system software which (as far as I know with no exceptions) does not 
allow users to specify properties for PUA characters other than the 
default decided by the UTC.
   

Bias? Or business sense?

If you want some specialized behavior for software, you either
write it yourself, or pay someone to write it, or convince someone
else that adding such a feature to the software *they* write
will pay for the investment cost in terms of incremental
increased sales.
You may not like how the software industry works, but thems
the breaks for any mature industry.
 

Well, I don't quite see why it is business sense for software companies 
to support the huge PUAs for variant CJK characters, outside the 100,000 
or so already defined by Unicode. I do understand that it is business 
sense not to support user specification of properties, because that 
would be hard work for little or no gain.

...

Scenario: The UTC listens to you and defines some section of the PUA
as strong right-to-left by default for use in PUA-defined bidirectional
scripts. Somebody else is *already* using that section of the PUA
for something else. Now they have an interoperability problem,
because the default behavior they were depending on changes over
in some future version of some software, not under their control,
and their data gets munged by bidi.
 

Well, they weren't supposed to rely on these default properties anyway, 
they were supposed to use the PUA at their own risk. They are not the 
only ones who are messed up by features of software which is not under 
their control. But it might be preferable in practice to define an 
additional PUA with RTL properties and one with default ignorable 
properties, outside all of the existing PUAs. I am not asking for a 
large space; very likely 256 characters of each type would be more than 
adequate.

This is the kind of stuff the UTC refuses to start up by trying
to provide some subdivision of semantics in the PUA. *That* is
the principle, by the way, which guides the UTC position on
the PUA: Use at your own risk, by private agreement.
 

What 
we do want is compatibility between our applications and the system 
software, and this proposal is the way to do that.
   

I don't see how any proposal to create some particular behavior
in the PUA is a way to accomplish that.
 

If a new PUA is created with default RTL properties, one can expect that 
system software will soon support it at least to the extent of defining 
these characters as RTL for bidi algorithm etc purposes. Similarly with 
default ignorable.

 

...

A default value for a property is not a requirement by the UTC
*ON AN IMPLEMENTER* that they use that value. They can use whatever
property values they desire, but if they depart from what system
platforms provide them (by default) then they are buying themselves
an implementation task to get characters to do what they want.
 

Ken, you are a master of understatement. The task they are buying 
themselves is a rewrite of the whole system. Companies don't provide the 
details needed for others to customise individual modules, and it would 
probably be a breach of copyright etc to attempt to do so. Open Source 
is different here, of course.



--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)

On 31/03/2004 08:49, Language Analysis Systems, Inc. Unicode list reader 
wrote:

So perhaps the best thing to do in cases like Ernest's and mine, where a
fixed width non-breaking space is required, is to use FIGURE SPACE,
which I understand is non-breaking. But then perhaps this is too wide in
some circumstances - in many fonts it is twice the regular width of SPACE.

Going out on a limb here...

It sorta seems like the need to keep phrases like Louis XIV together
is a valid one that deserves a solution, but it also seems fairly
esoteric-- typesetters and people who give a lot of thought to the
presentation of their text might use this, but most people wouldn't.
This makes me wonder if it's a plain-text thing.
I'm not saying this is a problem that should be solved through markup,
but if you care enough about the presentation of the text to care about
this, you're probably also already using styled text to specify other
things you care about, such as the font you're using.  And if you know
what font you're using, you can use THREE-PER-EM SPACE or FOUR-PER-EM
SPACE (or maybe SIX-PER-EM SPACE or FIGURE SPACE), because you know
which one is the right width in your font.
For that matter, if a typical space is usually either a third of an em
or a quarter of an em wide, my guess is you could probably use either
THREE-PER-EM SPACE or FOUR-PER-EM SPACE anyway, and even if this didn't
exactly match the width of a space in the particular font used to render
your text, it'd probably still look okay.  But then again, I'm not a
typographer.
Fading back into the background...

--Rich Gillam
 Language Analysis Systems, Inc.


 

Fair enough. To most people, a space is a space. To rather more, there 
is a second kind of space which they expect to be non-breaking and often 
also expect to be fixed width. (Those who had the latter expectation 
have had a nasty surprise today because with the release of 4.0.1 NBSP 
is suddenly no longer fixed width.) The problem is that when we get 
beyond that we get lost in a world of typography, and in uncertainty 
over which spaces are supposed to be breaking or non-breaking, fixed or 
variable width, and if fixed what width. It would be useful to have all 
of this clearly laid out somewhere, so that those of us who do care 
about what our text looks like, but are not professional typographers, 
know what we should use.

Louis<THREE-PER-EM SPACE>XVI may have lost his head, but we don't want
his number also to fall off on to the next line, or even to become too 
far separated from his name. We need to know what kind of space to use 
to resist the guillotine!
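
As a starting point for such a layout, here is a small sketch that lists the common space characters and whether the character database marks them as no-break. It assumes Python's standard unicodedata module and uses the <noBreak> decomposition tag as a proxy; the authoritative line-breaking classes live in UAX #14 and are not exposed by this module. The function name describe is hypothetical.

```python
import unicodedata

SPACES = [0x0020, 0x00A0, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006,
          0x2007, 0x2008, 0x2009, 0x200A, 0x202F]

def describe(cp: int) -> tuple[str, bool]:
    """Return (name, is_no_break) based on the <noBreak> decomposition tag."""
    ch = chr(cp)
    no_break = unicodedata.decomposition(ch).startswith("<noBreak>")
    return unicodedata.name(ch), no_break

for cp in SPACES:
    name, no_break = describe(cp)
    print(f"U+{cp:04X}  {name:<24} {'no-break' if no_break else 'breaking'}")
```

By this measure FIGURE SPACE is no-break while THREE-PER-EM SPACE is not, so the latter would need extra glue to protect Louis. The listing says nothing about width or about behaviour under justification, which remain up to the font and the layout engine.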

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

RE: What is the principle?

2004-03-31 Thread Mike Ayers


From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of Peter Kirk
Sent: Wednesday, March 31, 2004 9:12 AM

On 30/03/2004 16:30, Kenneth Whistler wrote:

Oh, yes, Peter, you have identified a clear bias against... against... against... uh, certain hypothetical situations?

If you want some specialized behavior for software, you either
write it yourself, or pay someone to write it, or convince someone
else that adding such a feature to the software *they* write
will pay for the investment cost in terms of incremental
increased sales.

You may not like how the software industry works, but thems
the breaks for any mature industry.

Well, I don't quite see why it is business sense for software companies
to support the huge PUAs for variant CJK characters, outside

Support? ROFL! Call up one of those companies and tell them that you are having trouble displaying PUA fonts, eastern or otherwise. I'd like to snoop on that call.

they were supposed to use the PUA at their own risk.

Well, gee, somebody understands that principle so clearly WHEN IT APPLIES TO SOMEONE ELSE.

This is the kind of stuff the UTC refuses to start up by trying
to provide some subdivision of semantics in the PUA. *That* is
the principle, by the way, which guides the UTC position on
the PUA: Use at your own risk, by private agreement.

...and quit bothering us about it. That's gotta be in there somewhere. If not, I have an amendment to propose.

What
we do want is compatibility between our applications and the system
software, and this proposal is the way to do that.

No. The *only* way to maintain compatibility between your applications and the system software is to ensure that your applications only do things that are supported by the system software. If you want RTL PUA, ask your system software vendor. Here, you're just whining into the wind.

/|/|ike

Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)

Language Analysis Systems, Inc. Unicode list reader scripsit:

 It sorta seems like the need to keep phrases like Louis XIV together
 is a valid one that deserves a solution, but it also seems fairly
 esoteric-- typesetters and people who give a lot of thought to the
 presentation of their text might use this, but most people wouldn't.
 This makes me wonder if it's a plain-text thing.

In the TeX typesetting tradition, at least, it *is* done by markup.

-- 
John Cowan  [EMAIL PROTECTED]  www.reutershealth.com  www.ccil.org/~cowan
Promises become binding when there is a meeting of the minds and consideration
is exchanged. So it was at King's Bench in common law England; so it was
under the common law in the American colonies; so it was through more than
two centuries of jurisprudence in this country; and so it is today. 
   --Specht v. Netscape

Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)

2004-03-31 Thread fantasai

Peter Kirk wrote:
Louis<THREE-PER-EM SPACE>XVI may have lost his head, but we don't want 
his number also to fall off on to the next line, or even to become too 
far separated from his name. We need to know what kind of space to use 
to resist the guillotine!
NBSP

You should not rely on fixed-width spaces to approximate regular spaces.
A simple switch from Arial Narrow to Verdana will demonstrate why: the
widths of normal spaces and non-breaking spaces are related to the width
of the fonts' glyphs, whereas the width of a fixed-width space is related
to the /height/ of the glyphs.
--
http://fantasai.inkedblade.net/contact
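
The NBSP suggestion above can be tried out with a small sketch. It assumes Python's standard `textwrap` module as a stand-in for a line breaker: `textwrap` only breaks at ASCII whitespace, so U+00A0 is never a break opportunity. It is not an implementation of UAX #14, and the helper name `wrap_plain` is made up for illustration.

```python
import textwrap

NBSP = "\u00a0"  # NO-BREAK SPACE


def wrap_plain(text: str, width: int) -> list[str]:
    """Wrap text at ASCII whitespace only; never split inside a word."""
    return textwrap.wrap(
        text, width=width, break_long_words=False, break_on_hyphens=False
    )


sentence_space = "Louis XVI was guillotined in 1793."
sentence_nbsp = f"Louis{NBSP}XVI was guillotined in 1793."

for width in (5, 9, 13):
    # With an ordinary space the name and the number can be separated;
    # with NBSP they always stay on the same line.
    print(width, wrap_plain(sentence_space, width))
    print(width, wrap_plain(sentence_nbsp, width))
```

This only demonstrates the line-breaking half of the question. How wide the NBSP is drawn, and whether it stretches under justification, is up to the renderer and the font.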

Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)

On 31/03/2004 11:57, Kenneth Whistler wrote:

... To most people, a space is a space. To rather more, there 
is a second kind of space which they expect to be non-breaking and often 
also expect to be fixed width. (Those who had the latter expectation 
have had a nasty surprise today because with the release of 4.0.1 NBSP 
is suddenly no longer fixed width.) 
   



Hardly. It has *always* been the intent and understanding of
the UTC that NBSP was comparable in all ways to a SPACE character,
except for disallowing line break opportunities.

...



 

Thanks for the clarification. I should say that the behaviour of NBSP 
suddenly reverted to what it had been in previous versions of the 
standard, although a perhaps inadvertent change was made in 4.0.0.

Nevertheless, there does seem to be a widespread misunderstanding that 
NBSP is intended to be fixed width, and in many systems it is 
implemented as such. Perhaps there is a need to clarify this further, 
perhaps by reinstating text similar to what was in Unicode 3.0.

I take your point about the advantages of having the drafters of the 
standard available to explain parts of the standard which are unclear. I 
certainly wish we could do that with other texts that you allude to. But 
there must also be controls here. If the text says black, we can't 
have the drafters saying that the text really means white. They can 
say that they made a mistake, and correct it in a new version, but there 
are limits on how far they can reinterpret even a text which they wrote 
themselves.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: What is the principle?

 [Original Message]
 From: Kenneth Whistler [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]

 Peter Kirk continued:

  You can do it privately. See above. But attempting to do such things
  in terms of formally specified usages of the PUA is an invitation
  to failure of interoperability.

  I don't understand this last comment. 

 Scenario: The UTC listens to you and defines some section of the PUA
 as strong right-to-left by default for use in PUA-defined bidirectional
 scripts. Somebody else is *already* using that section of the PUA
 for something else. Now they have an interoperability problem,
 because the default behavior they were depending on changes over
 in some future version of some software, not under their control,
 and their data gets munged by bidi.

 This is the kind of stuff the UTC refuses to start up by trying
 to provide some subdivision of semantics in the PUA. *That* is
 the principle, by the way, which guides the UTC position on
 the PUA: Use at your own risk, by private agreement.

Which is why, if any private use characters with default characteristics
other than those of the existing Private Use blocks are ever to be part of
Unicode, they will need to be added as additional Private Use blocks,
not by redefining existing PUAs.

There are currently some 10 totally unused planes, with not even any
tentative plans for them.  Allocating one or two of those to additional
Private Use Areas with a variety of default characteristics, instead of
the monotonous default characteristics of the existing Private Use
Areas, should not prove too difficult.  For example, 26 blocks of 128
Private Use Combining Marks each, each block corresponding to
one of the existing canonical combining classes (with perhaps a
larger block for class 0), would amply satisfy the needs of most
private use scripts for combining marks. Similarly, blocks for
additional characters that would have other properties should
be simple to define, and for most combinations of property values
128 characters should also prove to be exceedingly ample.

I'd have to take the time to list them, but a quick glance convinces
me that there are at most several hundred combinations that would
need to be supported if we limit things to just those combinations
already in use.  (It might take more if, for example, all 256 potential
combining classes were supported instead of the 26 listed in
UCD.html.)  At 128 characters per combination, plus more for a
few that might need them, it should prove possible to handle this
in 1 or 2 planes.

Re: What is the principle?

2004-03-31 Thread Rick McGowan

Peter Kirk wrote...

 ... I have a real requirement. The UTC has the power to meet my requirement,
 and to do so rather simply. I am asking them to meet it.

Actually, you are not asking UTC anything. You are discussing the PUA on a  
public-access mail list. There's a big difference. This *is* the place to  
discuss as you are doing, and a good place to formulate your positions for  
eventual submission of a proposal, if any.

Once you have formulated a position and you actually want to ask the UTC  
to do something or vote on something, then please fill out the form:

http://www.unicode.org/reporting.html

That's one place to start. If you have more text than will fit in the  
form, or you wish to submit a PDF or other bulky document with circles and  
arrows and such things to UTC, then please discuss it with me off list. I  
will assist you and see that your document is forwarded and submitted  
appropriately into the UTC process. See also RFC 3718, particularly  
sections 8 and 9.

As usual, this is all my own opinion and reflects no official policy or position.

Rick

Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)

Peter continued:

 Thanks for the clarification. I should say that the behaviour of NBSP 
 suddenly reverted to what it had been in previous versions of the 
 standard, although a perhaps inadvertent change was made in 4.0.0.

Even that is not correct.

The *Introduction* to UAX #14 was expanded by 3 paragraphs between
the Unicode 3.2.0 and the Unicode 4.0.0 version, in an attempt to
help explain the context of how a line break algorithm works, by
measuring lines and then seeking a locally optimal line break. In
that context, the issue of how compression or expansion of a line
works under justification was raised, and the author of UAX #14
added some explanatory qualifications regarding what spaces are
involved in the kinds of compression and expansion which can impact
line measurement and thus the choice of optimal line break positions.

That text omitted mention of NBSP as parallel to SPACE in that
context -- that was an oversight by the author and not caught in
editorial review. When it became clear that the paragraph in question
was being (erroneously) cited as proving that the intent of the
UTC was that NBSP be implemented as a fixed-width space, the
author acknowledged the oversight and quickly fixed the text.

There is *NO* UTC decision on record to make the NBSP be a fixed-width
space, in the history of its decision making.
 
 Nevertheless, there does seem to be a widespread misunderstanding that 
 NBSP is intended to be fixed width, and in many systems it is 
 implemented as such. Perhaps there is a need to clarify this further, 
 perhaps by reinstating text similar to what was in Unicode 3.0.

I didn't cite the parallel text from Unicode 4.0 along with the
Unicode 1.0, Unicode 2.0, and Unicode 3.0 text I quoted, for the
simple reason that it is almost word-for-word identical to
Unicode 3.0. There is no need to reinstate any text -- it was
unchanged and its intent was unchanged.

 
 I take your point about the advantages of having the drafters of the 
 standard available to explain parts of the standard which are unclear. I 
 certainly wish we could do that with other texts that you allude to. But 
 there must also be controls here. If the text says black, we can't 
 have the drafters saying that the text really means white. They can 
 say that they made a mistake, and correct it in a new version, but there 
 are limits on how far they can reinterpret even a text which they wrote 
 themselves.

Of course. Exegesis provided above.

Now please stop claiming that the status of NBSP has changed,
either pre- or post-4.0.0.

That some implementations treat NBSP as fixed-width is a matter
of those implementations. Note that even SPACE is treated as
fixed-width by some implementations, and has a long history of
that. Any implementation that is mono-pitch has a fixed-width
SPACE, and that goes back to the dark prehistory of SPACE as
a Teletype character.

The Unicode Standard does not require that SPACE or NBSP be
fixed-width, nor does it preclude an implementation which,
for whatever reason (limitations of mechanical rendering,
font design, or simply aesthetics) treats them as fixed-width.

The point the standard is making is that the nominally
*fixed-width* space characters (U+2000..U+200A, U+3000) are,
by their very character identity, associated with particular
display widths. But even for those, as UAX #14 notes, there are
typographical practices which may result, for example, in
an ideographic space character being compressed or a
thin space character being expanded. What *matters* is that
the encoded content of the text be correctly specified in
an interoperable manner and that proper typographic practice
be followed to produce the rendered results that people desire.
The Unicode Standard provides a large number of space
characters to assist that. But if even this most elaborate
set of encoded space characters in the history of character
encoding standards does not suffice, then, as for TeX, you
always have the option to move to mark-up to get the desired
results.

--Ken
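
For readers who want to see the space characters under discussion, here is a minimal sketch using Python's standard `unicodedata` module. It lists SPACE, NBSP, and the nominally fixed-width spaces U+2000..U+200A and U+3000 with their names, general category, and decomposition mapping. The module does not expose glyph widths or the Line_Break property, so this shows character identity only; the helper name `describe` is illustrative.

```python
import unicodedata

# SPACE, NO-BREAK SPACE, the U+2000..U+200A spaces, and IDEOGRAPHIC SPACE
SPACES = [0x0020, 0x00A0, *range(0x2000, 0x200B), 0x3000]


def describe(cp: int) -> tuple[str, str, str]:
    """Return (name, general category, decomposition) for a code point."""
    ch = chr(cp)
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.decomposition(ch),
    )


for cp in SPACES:
    name, category, decomposition = describe(cp)
    print(f"U+{cp:04X}  {category}  {name:<28} {decomposition}")
```

In the character database NBSP carries a `<noBreak>` compatibility mapping to U+0020, which is consistent with the description of NBSP as a SPACE without the line break opportunity.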

Re: What is the principle?

On 31/03/2004 10:44, Mike Ayers wrote:

...


 Well, I don't quite see why it is business sense for software
 companies
 to support the huge PUAs for variant CJK characters, outside
Support?  ROFL!  Call up one of those companies and tell them 
that you are having trouble displaying PUA fonts, eastern or 
otherwise.  I'd like to snoop on that call.

Well, "support" has a range of meanings. Call up one of those companies 
and tell them you are having trouble with one of the Indic or SE Asian 
scripts which they do claim to support, and I suspect you will discover 
what that support really means in practice - unless you can get through 
to the specialised development team. What I meant by "support" in this 
case was more that in the UTC they voted in favour of assigning two HUGE 
PUAs, consisting of more than one eighth of the entire Unicode code 
space, for variant CJK characters; and that in practice these characters 
can be displayed successfully with a variety of software from these 
companies. If CJK merits more than 100,000 PUA characters as well as a 
similar number of defined characters, why can't a measly two or three, 
or better 256 or so, be allowed for RTL languages and for combining marks?
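
The sizes being argued about can be checked with a few lines of arithmetic. This sketch uses only the code point ranges of the three Private Use Areas and the size of the code space. On these figures the two supplementary PUA planes come to a little under one eighth of the code space.

```python
CODE_SPACE = 0x110000                     # U+0000..U+10FFFF

BMP_PUA = 0xF8FF - 0xE000 + 1             # 6,400 code points
PLANE_15_PUA = 0xFFFFD - 0xF0000 + 1      # 65,534 code points
PLANE_16_PUA = 0x10FFFD - 0x100000 + 1    # 65,534 code points

supplementary = PLANE_15_PUA + PLANE_16_PUA
total = supplementary + BMP_PUA

print(f"supplementary PUA: {supplementary:,} ({supplementary / CODE_SPACE:.2%})")
print(f"all PUA:           {total:,} ({total / CODE_SPACE:.2%})")
```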

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: What is the principle?

On 31/03/2004 10:44, Mike Ayers wrote:

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of Peter Kirk
Sent: Wednesday, March 31, 2004 9:12 AM
On 30/03/2004 16:30, Kenneth Whistler wrote:

But
what if users of certain other scripts e.g. RTL scripts want just a
handful of PUA characters with the properties they need? Why is
preference given to CJK? This sounds like bias to me even if
I was wrong
to call it western.
Oh, yes, Peter, you have identified a clear bias against...
against... against... uh, certain hypothetical situations?

Well, if you haven't read it between the lines, the clear bias is
against RTL scripts and those scripts (including Indic by the way) which
use combining characters. There is no way currently (with the default
properties) to support PUA characters relating to such scripts, although
there is for western and CJK scripts.

...

they were supposed to use the PUA at their own risk.

Well, gee, somebody understands that principle so clearly WHEN
IT APPLIES TO SOMEONE ELSE.

Yes, Ken! Read the context and don't snip it. He is the one who said
(correctly) that what I get when I use the PUA must be at my own risk,
but, I quote:

Somebody else is *already* using that section of the PUA for something else. Now they have an interoperability problem...

Why is their interoperability problem something which the UTC cares
about, when mine isn't? Why doesn't the "use at your own risk" principle
apply to them just as much as to me?

...

No. The *only* way to maintain compatibility between your
applications and the system software is to ensure that your
applications only do things that are supported by the system
software. If you want RTL PUA, ask your system software vendor.
Here, you're just whining into the wind.

If you want me to quit whining, quit asking me to do things which you
and I know very well are a waste of time. System software vendors are
not going to do what I want, and we all know that very well. But I have
a real requirement. The UTC has the power to meet my requirement, and to
do so rather simply. I am asking them to meet it.

Actually my current requirement is not so much for RTL PUA as for PUA
variation selectors and/or combining characters which are default
ignorable. RTL PUA is not so much of a problem, because at least in
principle it should be possible to make PUA characters RTL by enclosing
them in RLO ... PDF. I am not sure how well this is actually supported
by system software. My current requirement could be met by defining a
probably quite small set of PUA combining characters (with combining
class zero) which would be default ignorable. For an example of why this
might be useful, see my posting today to the Unicode Hebrew list.
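
As a rough illustration of the two points above, the following sketch first reads the default properties of a PUA code point from Python's `unicodedata` module, then wraps a run of PUA characters in RLO ... PDF. Whether a given renderer honours the override for PUA text is exactly the open question raised above; the helper names are made up for this example.

```python
import unicodedata

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def default_properties(ch: str) -> tuple[str, str, int]:
    """Return (general category, bidi class, canonical combining class)."""
    return (
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
        unicodedata.combining(ch),
    )


def force_rtl(text: str) -> str:
    """Enclose text in RLO ... PDF so a bidi-aware renderer lays it out RTL."""
    return f"{RLO}{text}{PDF}"


pua_word = "\ue000\ue001\ue002"  # three arbitrary BMP private-use code points
print(default_properties(pua_word[0]))
print([f"U+{ord(c):04X}" for c in force_rtl(pua_word)])
```

The override covers direction only. It does nothing for the combining-mark and default-ignorable requirement described above, since plain text has no comparable control for those properties.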

/|/|ike

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: What is the principle?

On 31/03/2004 12:40, Rick McGowan wrote:

Peter Kirk wrote...

 

... I have a real requirement. The UTC has the power to meet my requirement,
and to do so rather simply. I am asking them to meet it.
   

Actually, you are not asking UTC anything. You are discussing the PUA on a  
public-access mail list. There's a big difference. This *is* the place to  
discuss as you are doing, and a good place to formulate your positions for  
eventual submission of a proposal, if any.
 

Thanks for the clarification. I was aware of the distinction, and was 
using "am asking" loosely. I am undecided yet whether to make a formal 
proposal. Ken seems to suggest that this would be a waste of time - 
although I can see some advantages in obtaining a formal rejection. I 
wonder if anyone else on the UTC or associated with it might give some 
hope for such a proposal?

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)

On 31/03/2004 12:27, fantasai wrote:

Peter Kirk wrote:

Louis<THREE-PER-EM SPACE>XVI may have lost his head, but we don't 
want his number also to fall off on to the next line, or even to 
become too far separated from his name. We need to know what kind of 
space to use to resist the guillotine!


NBSP

You should not rely on fixed-width spaces to approximate regular spaces.
A simple switch from Arial Narrow to Verdana will demonstrate why: the
widths of normal spaces and non-breaking spaces are related to the width
of the fonts' glyphs, whereas the width of a fixed-width space is related
to the /height/ of the glyphs.

But, as Ken has just clarified, with NBSP Louis' neck may be stretched 
rather uncomfortably, if not cut completely. Here is what I don't want 
to see (fixed width font required):

Louis   XVI   was
guillotined    in
1793.

Here is what I do want:

Louis XVI     was
guillotined    in
1793.

These columns are unrealistically narrow to make the point clear, 
although such narrow columns are sometimes found in newspapers.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

RE: Fixed Width Spaces (was: Printing and Displaying DependentVowels)

2004-03-31 Thread Jony Rosenne

The NBSP issue was extensively discussed a couple of years ago, I don't
remember in which list. In short, it was wrongly used by early web users as
a fixed width space, and there is such a vast legacy it cannot be changed.
However, there are other applications that use the intended meaning - see
ISO 8859. 

Jony

RE: What is the principle?

2004-03-31 Thread Peter Constable

No. The *only* way to maintain compatibility between your applications 
 and the system software is to ensure that your applications only do things 
 that are supported by the system software.

If what is meant here by "your applications" is "any applications running on your 
system", then that is correct. If it means "applications you have developed", then I'd 
suggest a revision: whenever your application depends upon system-supplied services, 
it must do things in the ways expected by those services; if those services don't 
serve the needs of an application, you must implement that functionality on your own.

E.g. SIL's Graphite technology can deal with RTL PUA characters, but then it isn't 
relying on system-supplied services to do complex-script shaping of text.



Peter
 
Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division

Re: What is the principle?

Ernest suggested:

 There are currently some 10 totally unused planes, with not even any
 tentative plans for them.  Allocating one or two of those to additional
 Private Use Areas with a variety of default characteristics instead of
 the monotonous default characteristics of the existing Private Use
 Areas should not prove too difficult.  

Fine. Make your formal proposal to the UTC and to SC2/WG2 and
see whether it is difficult or not to convince the committees
of the appropriateness of your approach.

 For example, 26 blocks of 128
 Private Use Combining Marks each, each block corresponding to
 one of the existing canonical combining classes (with perhaps a
 larger block for class 0) would amply satisfy the needs of most
 private use scripts for combining marks. Similarly, blocks for
 additional characters that would have other properties 


which would be what, exactly?

 should
 be simple to define and for most combinations of property values,
  
  
which would be what, exactly?

As of Unicode 4.0.1, PropertyAliases.txt now lists 82 distinct
character properties. Some of those, particularly those most
relevant to complex script behavior and rendering, such as
General_Category, Bidi_Class, Canonical_Combining_Class, Joining_Type,
etc., are multi-valued. Do you have any idea how big the numbers
start getting when combinatorics start to get involved here?

Or are you planning to do the research first, via a comprehensive
implementation of character properties such as ICU, to first
determine what the actual existing number of combinations of
property values is for the existing repertoire and properties
and then make a principled projection of that into the
uncertain world of characters for scripts which have not yet
been encoded or modeled?
  
 128 characters should also prove to be exceedingly ample

For what?

 I'd have to take the time to list them, but a quick glance convinces
 me that there are at most several hundred combinations that would
 need to be supported if we limit things to just those combinations
 already in use. 

This may be correct, but you'd have to make the case based
on the existing data from property implementations.
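
A first pass at that kind of count can be sketched with Python's standard `unicodedata` module. It considers only three properties (General_Category, Bidi_Class, Canonical_Combining_Class), which is an assumption made for brevity; a serious study would need Joining_Type and the rest, for example via ICU. The result depends on the Unicode version bundled with the Python build, so no figure is quoted here.

```python
import unicodedata
from collections import Counter


def property_combinations() -> Counter:
    """Count code points per (category, bidi class, combining class) triple."""
    combos: Counter = Counter()
    for cp in range(0x110000):
        ch = chr(cp)
        category = unicodedata.category(ch)
        if category in ("Cn", "Cs"):  # skip unassigned code points and surrogates
            continue
        key = (category, unicodedata.bidirectional(ch), unicodedata.combining(ch))
        combos[key] += 1
    return combos


combos = property_combinations()
print(len(combos), "distinct combinations in use")
for combo, count in combos.most_common(5):
    print(combo, count)
```

Even this three-property cut shows how a few combinations account for most characters while a long tail covers the rest, which is the data a proposal of this kind would have to present.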

 (It might take more if, for example, all 256 potential
 combining classes were supported instead of the 26 listed in
 UCD.html.)  At 128 characters per combination plus more for a
 few that might need them, it should prove possible to handle this
 in 1 or 2 planes.

Which still begs the fundamental questions:

Why this scheme instead of a much more flexible scheme, as
outlined by Rick, for having an implementation with API support
for establishing PUA properties on an as-needed basis? (Which
requires *no* action by the UTC at all, by the way.)
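
To make the as-needed approach concrete, here is a minimal sketch of what such an API could look like. Everything in it is hypothetical: the class name `PuaPropertyRegistry`, its methods, and the three-property record are invented for illustration and do not describe any existing platform interface. Lookups fall back to the standard defaults from Python's `unicodedata` module.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class CharProps:
    category: str
    bidi_class: str
    combining_class: int


class PuaPropertyRegistry:
    """Hypothetical per-application override table for private-use code points."""

    def __init__(self) -> None:
        self._overrides: dict[int, CharProps] = {}

    def define(self, first: int, last: int, props: CharProps) -> None:
        # Refuse to override anything that is not a private-use code point.
        for cp in range(first, last + 1):
            if unicodedata.category(chr(cp)) != "Co":
                raise ValueError(f"U+{cp:04X} is not a private-use code point")
        for cp in range(first, last + 1):
            self._overrides[cp] = props

    def lookup(self, ch: str) -> CharProps:
        override = self._overrides.get(ord(ch))
        if override is not None:
            return override
        return CharProps(
            unicodedata.category(ch),
            unicodedata.bidirectional(ch),
            unicodedata.combining(ch),
        )


registry = PuaPropertyRegistry()
registry.define(0xE000, 0xE01F, CharProps("Lo", "R", 0))      # RTL letters
registry.define(0xE020, 0xE02F, CharProps("Mn", "NSM", 230))  # combining marks
print(registry.lookup("\ue000"), registry.lookup("\ue030"))
```

The design point is that the overrides live with the private agreement rather than in the standard: two parties using the same PUA range for different scripts simply load different tables.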

What makes you think, once you have such a scheme of property
combinations worked out, and once you have convinced the UTC of
it (which I doubt), that you could also convince SC2/WG2 to
do something comparable in 10646 to keep the standards in synch?
Recall that SC2/WG2 has almost *no* concept of character properties --
those are added by the Unicode Standard. Bring in a proposal
that says, "We need to add two more planes of private use
characters, with these special properties, because XYZ..." and
you'll get a row of blank stares from the national body
representatives.

Finally, assuming that you could get something like this into
the standards, what makes you think that the platform vendors
would complicate and expand their character property tables
to support this speculative scheme? They have the option to
not support all characters in the standard, and a new plane or
two full of PUA characters with a checkerboard of speculative
property assignments strike me as prime candidates for the
kind of stuff they would simply say, "We have no interest in
supporting these things."

I think you're spitting into the wind if you think you can
force, through the character standardization process, the
major platform vendors to support the kind of PUA functionality
you are after, when they could do so *today* via much more
extensible and architecturally sensible means given the
existing PUA characters, but have not yet chosen to do so.

--Ken

Re: What is the principle?

From: Ernest Cline [EMAIL PROTECTED]
 I'd have to take the time to list them, but a quick glance convinces
 me that there are at most several hundred combinations that would
 need to be supported if we limit things to just those combinations
 already in use.  (It might take more if, for example, all 256 potential
 combining classes were supported instead of the 26 listed in
 UCD.html.)  At 128 characters per combination plus more for a
 few that might need them, it should prove possible to handle this
 in 1 or 2 planes.

This seems highly excessive. We already have plenty of PUA space. All we
need is a standard way (file format? protocol?) to transport PUA character
properties, and possibly to encode a reference (URI?) to the definition file or
service. If Unicode does not want to do this job, it could at least participate
in such independent development by commenting on the protocol/format used to
encode these properties (notably to make sure that the system remains extensible
and can encode new properties that may be added later).

This would also work alongside the evolution of the Unicode standard itself
(versioning), which could be handled correctly (though less efficiently) through a
sort of emulation layer that mimics the behavior of newly standardized
characters and properties. I would not expect every application to be able to
interpret this protocol or implement the emulation layer, but at least it
becomes possible to create less ambiguous interoperable solutions based on other
existing standards (that's why I think that, if such a separate development is
created, it should be based on the most advanced interoperability technologies
of today, notably XML and its schemas and namespaces).
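
A sketch of what such an XML property file might look like, and how an application could read it, follows. The element names, attribute names, and the namespace URI `urn:example:pua-properties` are all invented for this illustration; no such format is defined by Unicode or anyone else. Unknown attributes are passed through untouched, which is one simple way to leave room for properties added later.

```python
import xml.etree.ElementTree as ET

# Hypothetical interchange document; every name here is an assumption.
SAMPLE = """\
<pua-properties xmlns="urn:example:pua-properties" version="1">
  <range first="E000" last="E01F" gc="Lo" bc="R" ccc="0"/>
  <range first="E020" last="E02F" gc="Mn" bc="NSM" ccc="230"/>
</pua-properties>
"""

NS = {"p": "urn:example:pua-properties"}


def load_overrides(xml_text: str) -> dict[int, dict[str, str]]:
    """Map each code point in the file to its declared property values."""
    root = ET.fromstring(xml_text)
    overrides: dict[int, dict[str, str]] = {}
    for rng in root.findall("p:range", NS):
        first = int(rng.get("first"), 16)
        last = int(rng.get("last"), 16)
        # Keep every attribute except the range bounds, known or not.
        props = {k: v for k, v in rng.attrib.items() if k not in ("first", "last")}
        for cp in range(first, last + 1):
            overrides[cp] = props
    return overrides


overrides = load_overrides(SAMPLE)
print(len(overrides), overrides[0xE000])
```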

You think this is overkill? Well, in the near future I think it will be
difficult for applications to follow the evolution of the Unicode standard, and
version differences will soon become a nightmare unless there is a more formal
way to specify what is implicitly part of a Unicode version (and so needs no
complex protocol negotiation), clearly identified by an identifier resolvable
by online services, and what can be supported as completely as possible by
an emulation layer. XML schemas, because they are versionable, can really help
here (notably because modern XML parsers can use local
caches for definition data, including local built-in implementations, which
are the most efficient).

So I don't like the idea of adding more PUAs with other defaults. I much prefer
more freedom in the use of PUAs, and a way to turn what looks like a
deviation from the standard today into a conforming solution.

It will become more important with the remaining scripts to encode, simply
because we really lack some resources to be able to produce any standard for
them. What this means is that the evolution of Unicode will soon become
impossible without experimentation and gradual integration with some
interoperable services. With the current standard stability policy, this need is
even more important because further corrections of past errors will become
nearly impossible (and so this will stop any attempt to make significant
evolutions to the standard itself).

It's clear that there are needs for PUAs today, just because Unicode is becoming
a universal standard for more and more applications. If this universal standard
blocks evolution, then others will want to develop independent standards and there
will be a risk of splits caused by OS vendors themselves.

(See what happened to Unix 15 years ago, and how difficult it is today to
reunify what was initially a single standard. Thankfully, GNU and Linux have been the
motors of such reunification, because other proprietary *nix versions are now
converging on interoperability with Linux; but this unification is probably 15
to 20 years away, unless *nix vendors decide to preemptively abandon
some dead branches and keep only those that users want and are
ready to learn and support themselves.)

Doing Markup in Plain Text: A Modest Proposal for Planes 4-B of Unicode

XML has become the de facto standard for fancy text.  It is therefore
useful to explore ways and means of bringing XML into plain text,
since obviously plain text is simpler than, and superior to, fancy text.
The current method involving < and > and & and / and who knows what else
is obviously much too complicated, and cannot interoperate with even the
simplest plain text.  Fortunately, the characters in planes 4 through
B can come to our rescue.

Plane 4 will be divided into mini-blocks of 32 (or perhaps 64) characters.
The Unicode Consortium will allocate these on the usual basis (first come
first served, once and for all, and free) to users for the representation
of start-tags.  For example, supposing that block 40000 was allocated to
the W3C HTML WG, we might represent <html> as U+40000, <head> as U+40001,
<body> as U+40002, and so on.  In this way, the start-tag (exclusive
of attributes and attribute values) is reduced to a single character.
The last block will not be allocated; U+4FFFC will be used to indicate
the beginning of a comment, and U+4FFFD the beginning of a processing
instruction.

Plane 5 will be automatically assigned in parallel to plane 4 for the
representation of end-tags: thus, U+50000 would be </html>.  U+5FFFC and
U+5FFFD will have the obvious meanings.
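
Taking the proposal at face value for a moment, the plane 4 / plane 5 pairing can be sketched in a few lines. The sketch assumes the HTML mini-block starts at U+40000 and that tags are numbered in the order html, head, body, as the example above implies; the tag list and function names are otherwise invented.

```python
START_PLANE = 0x40000  # start-tags, per the proposal
END_PLANE = 0x50000    # end-tags, assigned in parallel

# Hypothetical allocation order within the W3C HTML WG's mini-block.
HTML_TAGS = ["html", "head", "body"]


def start_tag(name: str) -> str:
    return chr(START_PLANE + HTML_TAGS.index(name))


def end_tag(name: str) -> str:
    return chr(END_PLANE + HTML_TAGS.index(name))


def encode(name: str, content: str) -> str:
    """Wrap content in single-character start and end tags."""
    return start_tag(name) + content + end_tag(name)


doc = encode("html", encode("body", "Hello"))
print([f"U+{ord(c):04X}" for c in doc])
```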

Plane 6 will also be allocated as mini-blocks and used for the
representation of attribute names.  If a Plane 4 character is followed
by a Plane 6 character, then the start-tag has at least one attribute.
The last mini-block will not be allocated; U+6FFFD will be used to indicate
that the current tag has no more attributes.

Plane 7 is reserved for future use.

Planes 8 through B are clones of planes 0 through 3 respectively,
and are used to represent attribute value, comment, and processing
instruction text.  In this way, only character content is encoded using
traditional Unicode characters.

It is expected that a secondary market in mini-blocks would eventually
arise.

-- 
But I am the real Strider, fortunately,   John Cowan
he said, looking down at them with his face [EMAIL PROTECTED]
softened by a sudden smile.  I am Aragorn son  http://www.ccil.org/~/cowan
of Arathorn, and if by life or death I can  http://www.reutershealth.com
save you, I will.  --LotR Book I Chapter 10

Re: What is the principle?

2004-03-31 Thread Rick McGowan

Peter Kirk wrote...

 I am undecided yet whether to make a formal proposal.
 Ken seems to suggest that this would be a waste of time -

Yes. I also think it would be a waste of time, but...

 although I can see some advantages in obtaining a formal rejection.

... I can also see some value in a formal rejection of any notion to  
subdivide the PUA into different property areas or make new PUAs with other  
properties.

Rick

Re: What is the principle?

On 31/03/2004 12:28, Ernest Cline wrote:

 

...

This is the kind of stuff the UTC refuses to start up by trying
to provide some subdivision of semantics in the PUA. *That* is
the principle, by the way, which guides the UTC position on
the PUA: Use at your own risk, by private agreement.
   

Which is why if any private use characters with default characteristics
other than those of the existing Private Use blocks are ever to be part of
Unicode they will need to be added as additional Private Use blocks,
not by redefining existing PUAs.
There are currently some 10 totally unused planes, with not even any
tentative plans for them.  Allocating one or two of those into additional
Private Use Areas with a variety of default characteristics instead of
the monotonous default characteristics of the existing Private Use
Areas should not prove too difficult.  For example, 26 blocks of 128
Private Use Combining Marks each, each block corresponding to
one of the existing canonical combining classes (with perhaps a
larger block for class 0) would amply satisfy the needs of most
private use scripts for combining marks. Similarly, blocks for
additional characters that would have other properties should
be simple to define and for most combinations of property values,
128 characters should also prove to be exceedingly ample.
I'd have to take the time to list them, but a quick glance convinces
me that there are at most several hundred combinations that would
need to be supported if we limit things to just those combinations
already in use.  (It might take more, if for example all 256 potential
combining classes were supported instead of the 26 listed in
UCD.html.)  At 128 characters per combination plus more for a
few that might need them, it should prove possible to handle this
in 1 or 2 planes.
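The arithmetic behind the "1 or 2 planes" estimate can be checked with a short sketch. It assumes every combination gets exactly one 128-character block and ignores the two noncharacters at the end of each plane; `planes_needed` is a hypothetical helper name.

```python
PLANE_SIZE = 0x10000   # 65,536 code points per plane
BLOCK_SIZE = 128       # block size suggested in the message

# How many 128-character blocks fit in one plane.
blocks_per_plane = PLANE_SIZE // BLOCK_SIZE

# Code points used by 26 combining-class blocks.
marks_for_26_classes = 26 * BLOCK_SIZE

# Fraction of a plane used if all 256 potential classes get a block.
fraction_for_256_classes = 256 * BLOCK_SIZE / PLANE_SIZE


def planes_needed(combinations):
    """Planes required for the given number of 128-character blocks (ceiling division)."""
    return -(-combinations * BLOCK_SIZE // PLANE_SIZE)


print(blocks_per_plane, marks_for_26_classes, fraction_for_256_classes)
print(planes_needed(300), planes_needed(512), planes_needed(513))
```

Under these assumptions, up to 512 property combinations fit in one plane, and up to 1024 fit in two.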






 

Ernest, I support your general ideas here. But I am concerned about the 
implications of defining PUA characters with combining classes other 
than zero. I can see this causing some confusion with normalisation etc. 
And it does hugely multiply the number of PUA characters required.

Let's think when one might need PUA characters with cc>0. The relevant 
cases are all like B, M1, M2, where B is a base character and M1 and 
M2 are combining characters, one or both of them in your proposed 
extended PUA. And cc>0 is required only if you want this sequence to be 
canonically equivalent to B, M2, M1, and so want one of these to be 
converted to the other during normalisation - a reordering which can 
only happen if M1 and M2 both have cc>0 (and different).
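A minimal sketch of this point, using Python's standard `unicodedata` module. The choice of U+0301, U+0323, and U+E000 as stand-ins for M1 and M2 is an assumption made for illustration; any two marks with different nonzero combining classes would do.

```python
import unicodedata

B = "a"
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220
PUA_MARK = "\uE000"   # private use code point, default combining class 0


def nfd(s):
    """Canonical decomposition, which includes canonical reordering of marks."""
    return unicodedata.normalize("NFD", s)


# Two marks with different nonzero classes: both orders normalise identically.
both_nonzero_equivalent = nfd(B + ACUTE + DOT_BELOW) == nfd(B + DOT_BELOW + ACUTE)

# A class-0 private use "mark" is never reordered, so the two orders stay distinct.
pua_equivalent = nfd(B + PUA_MARK + DOT_BELOW) == nfd(B + DOT_BELOW + PUA_MARK)

print(unicodedata.combining(ACUTE), unicodedata.combining(DOT_BELOW),
      unicodedata.combining(PUA_MARK))
print(both_nonzero_equivalent, pua_equivalent)
```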

Is it really necessary to support to this level of detail the concept of 
canonical equivalence of PUA sequences? Would it not be enough for those 
specifying the PUA characters to specify one of the orderings as correct 
and the other as a spelling error? I really can't see this requirement 
being widespread enough to justify defining the thousands of PUA 
characters with different combining classes which you propose.

My proposal would rather be for a single group of PUA combining marks 
which all have cc=0, and are all default ignorable, with the result 
that they are not displayed when a regular font is selected. These could 
be used for non-standardised diacritics, mark-up (I mean this in the 
old-fashioned sense of marks added to the text rather than as a way of 
specifying formatting etc) etc, and also in effect as variation 
selectors if the private font specifies pseudo-digraphs. I don't know 
exactly how many might be required, but I am thinking tens or hundreds 
rather than thousands.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Doing Markup in Plain Text: A Modest Proposal for Planes 4-B of Unicode

2004-03-31 Thread Rick McGowan

Oops.
Well...
*That* was a day early.

Rick

RE: What is the principle?

2004-03-31 Thread D. Starner

Mike Ayers [EMAIL PROTECTED] writes: 
 
   Support?  ROFL!  Call up one of those companies and tell them that 
 you are having trouble displaying PUA fonts, eastern or otherwise.  I'd like 
 to snoop on that call. 
 
Apple seemed pretty concerned about displaying PUA fonts on Mac OS X  
recently on this mail list. Personally, I doubt if I could get Microsoft 
to care if Windows or Word was causing my monitor to spin around and 
spit out pea soup, but if, say, Xerox was having trouble displaying the 
correct spelling of its directors' names and mentioned that they might 
have to go to Open Office for this, I'm sure Microsoft would find it 
quite important. It has nothing to do with the PUA; it has to do with 
who's complaining and how much weight they carry. 
  
  This is the kind of stuff the UTC refuses to start up by trying 
  to provide some subdivision of semantics in the PUA. *That* is 
  the principle, by the way, which guides the UTC position on 
  the PUA: Use at your own risk, by private agreement. 
  
   ...and quit bothering us about it.  That's gotta be in there 
 somewhere.  If not, I have an amendment to propose. 
 
Why don't we add that note to other blocks? It'd be so much easier 
if we could just tell the people using, say, the Hebrew block that 
we've thrown something together for you, don't bother us if it doesn't 
work. Surely Unicode didn't waste two planes for something that 
no one can practically use. 
-- 
___
Sign-up for Ads Free at Mail.com
http://promo.mail.com/adsfreejump.htm

Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)

Peter Kirk scripsit:

 But, as Ken has just clarified, with NBSP Louis' neck may be stretched 
 rather uncomfortably, if not cut completely. Here is what I don't want 
 to see (fixed width font required):
 
 Louis   XVI   was
 guillotined    in
 1793.

This, however, is a matter of presentation rather than semantics, and as such
fitly belongs in the realm of presentational markup.  In HTML, one might
specify <tt>&nbsp;</tt> to generate a fixed-width space.

-- 
John Cowan [EMAIL PROTECTED] http://www.reutershealth.com
I amar prestar aen, han mathon ne nen,http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith.  --Galadriel, LOTR:FOTR

Re: What is the principle?

2004-03-31 Thread Mark Davis

While I disagree with most of what you've said on this list, it is not an
unreasonable proposal to change the default properties for some ranges of the
private use blocks. I don't think that this would, in practice, really disturb
any applications, because of #1 below.

I have, however, a few observations.

1. PUA properties, as is clear from Ken's excellent descriptions, are simply
defaults. With the exception of normalization, no Unicode implementation is
required to observe them. So even if this change is made, any conformant
implementation is free to simply ignore it and just assign its own properties.
This would not be a magic wand.

2. Unicode properties are not sufficient for rendering. With technologies such
as Apple's, all of the other work can be done in a font. With OpenType, most but
not all can -- in particular, reordering has to be done by the application/OS.
So complex scripts that require reordering still would not be interchangeable
without private agreement.

3. Even excluding the normalization properties and other obviously inapplicable
properties (such as name or age), there are some 50-odd possible character
properties, many of them with multiple possible values: see

http://www.unicode.org/Public/UNIDATA/PropertyAliases.txt
http://www.unicode.org/Public/UNIDATA/UCD.html#Properties
http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt

A concrete proposal would have to specify exactly which properties were
relevant, and what the values are for the proposed ranges. (Clearly an even
partition according to all the possible combinations would be completely
impractical.) If the goal is rendering, this means looking at the possible
combinations of properties that are relevant for rendering and proposing a
division that makes sense.
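Observation 1, that the PUA properties are simply defaults, can be inspected with a short sketch. It uses Python's standard `unicodedata` module, which ships a copy of the Unicode Character Database; the helper name `default_properties` is hypothetical, and the printed values depend on the UCD version bundled with the interpreter.

```python
import unicodedata


def default_properties(cp):
    """Return (general category, bidi class, combining class) for a code point."""
    ch = chr(cp)
    return (
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
        unicodedata.combining(ch),
    )


# First and last BMP private use code points, plus one each from Planes 15 and 16.
for cp in (0xE000, 0xF8FF, 0xF0000, 0x10FFFD):
    print(f"U+{cp:04X}", default_properties(cp))
```

An implementation that reads its tables straight from the UCD picks up these defaults; one that assigns its own properties to private use code points simply never consults them.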

Mark
__
http://www.macchiato.com
  

- Original Message - 
From: Peter Kirk [EMAIL PROTECTED]
To: Rick McGowan [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Wed, 2004 Mar 31 16:24
Subject: Re: What is the principle?


 On 31/03/2004 12:40, Rick McGowan wrote:

 Peter Kirk wrote...
 
 
 
 ... I have a real requirement. The UTC has the power to meet my requirement,
 and to do so rather simply. I am asking them to meet it.
 
 
 
 Actually, you are not asking UTC anything. You are discussing the PUA on a
 public-access mail list. There's a big difference. This *is* the place to
 discuss as you are doing, and a good place to formulate your positions for
 eventual submission of a proposal, if any.
 
 

 Thanks for the clarification. I was aware of the distinction, and was
 using "am asking" loosely. I am undecided yet whether to make a formal
 proposal. Ken seems to suggest that this would be a waste of time -
 although I can see some advantages in obtaining a formal rejection. I
 wonder if anyone else on the UTC or associated with it might give some
 hope for such a proposal?


 -- 
 Peter Kirk
 [EMAIL PROTECTED] (personal)
 [EMAIL PROTECTED] (work)
 http://www.qaya.org/

Re: Doing Markup in Plain Text: A Modest Proposal for Planes 4-B of Unicode

From: [EMAIL PROTECTED]


 XML has become the de facto standard for fancy text.  It is therefore
 useful to explore ways and means of bringing XML into plain text,
 since obviously plain text is simpler than, and superior to, fancy text.
 The current method involving < and > and & and / and who knows what else
 is obviously much too complicated, and cannot interoperate with even the
 simplest plain text.  Fortunately, the characters in planes 4 through
 B can come to our rescue.

Is it a joke? XML does not need it, and if anything is expected it is possibly a
structured binary representation of XML for easier processing with simplified
parsers.

Your proposal also ignores the current development of XML with namespaces and
parsed entities resolved according to interchangeable schemas. (These last can
be described in a structured binary representation that respects the XML
document InfoSet, but until there is a need and a formal definition for use in
XML RPC services, I see no use for such an obscure encoding in planes 4 to 11,
because such an encoding is unparsable, notably for document validation and
reusability with non-colliding namespaces.) Thankfully, the current text-based
XML syntax offers much more flexibility than what you propose here, which could
lead to new flaws (probably even more than with the current syntax supported by
existing XML parser implementations), including at the security level (think
about how you would break XML Signatures)...

Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)

2004-03-31 Thread D. Starner

Peter Kirk wrote: 

Louis   XVI   was 
guillotined    in 
1793. 
 
Here is what I do want: 
 
Louis XVI was 
guillotined in 
1793. 
Louis\ XVI was guillotined in 1793. If you aren't using TeX, 
and you're doing this type of justification in small columns, 
your program ought to provide a way to do this. This is approaching 
italics or small capitals; it's necessary to look right, but it's 
not plain text. 

Wasting Planes (was: RE: What is the principle?)

 Surely Unicode didn't waste two planes for something that 
 no one can practically use. 

Plane 15 and Plane 16 private use characters weren't the 
invention of the UTC, by the way. They derive from the
original specification of ISO/IEC 10646-1. From 
ISO/IEC 10646-1: 1993:

The code positions of 32 planes from Plane E0 to Plane FF
of Group 00 shall be for Private Use.

The code positions of the 32 groups from Group 60 to Group 7F
shall be for Private Use.

That would have been:

   U-00E00000..U-00FFFFFD
   U-60000000..U-7FFFFFFD
   
That was 8224 *planes* of private use code positions.

Amendment 1 (the one that defined UTF-16) amended that to
read:

The code positions of the 32 groups from Group 60 to
Group 7F shall be for private use.

The code positions of Plane 0F and Plane 10, and of the
32 planes from Plane E0 to Plane FF, of Group 00 shall
be for private use.

The 6400 code positions E000 to F8FF of the Basic
Multilingual Plane shall be for private use.

That was 8226 *planes* of private use code positions,
besides the 6400 code positions on the BMP (which had
been defined earlier, but not spelled out in the same
clause with the rest of the private use allocation).
The addition of Plane 0F and Plane 10 was so there were
some private use planes accessible via UTF-16.

In that grand proliferation of wastage, 10646 allowed for
539,089,084 private use code positions. That was a wee
tad more than anyone actually needed to use, by the way.

More recent amendments to 10646 have simply settled on
the principle that *all* code positions beyond U-0010FFFF
are reserved, leaving the 6400 private use code positions
on the BMP, plus Plane 0F and Plane 10. In the grand scheme
of things, that seems to be the Goldilocks solution -- not
too small (6400) and not too big (539,089,084) -- but just 
right (137,468).
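The plane and code position counts quoted above can be reproduced with a short sketch. It assumes that the last two code positions of every plane (xxFFFE and xxFFFF) are excluded from the private use totals; that assumption is not stated in the message but yields exactly the figures given.

```python
PLANE = 0x10000                    # code positions per plane
USABLE_PER_PLANE = PLANE - 2       # assumption: xxFFFE and xxFFFF are excluded
BMP_PUA = 0xF8FF - 0xE000 + 1      # E000..F8FF on the BMP

# 10646-1:1993: Planes E0..FF of Group 00, plus Groups 60..7F (256 planes each).
planes_1993 = 32 + 32 * 256

# Amendment 1 added Plane 0F and Plane 10.
planes_amd1 = planes_1993 + 2

# Total private use code positions after Amendment 1, including the BMP area.
total_amd1 = planes_amd1 * USABLE_PER_PLANE + BMP_PUA

# Today: the BMP area plus Planes 0F and 10 only.
total_now = 2 * USABLE_PER_PLANE + BMP_PUA

print(planes_1993, planes_amd1, BMP_PUA)
print(f"{total_amd1:,}", f"{total_now:,}")
```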

There are people who have valid reasons for making use
of Plane 0F or Plane 10 private use characters, by the
way, but most of those reasons have to do with CJK. And
the reason for that should be pretty obvious -- only the
CJK script deals with the kind of entity numbers (multiple
10's of thousands) that make the 6400 code points of
the BMP PUA seem cramped. *Any* other unencoded script,
for example, with the possible exceptions of Egyptian
hieroglyphics or Tangut ideographs, would fit into the
BMP PUA with plenty of room to spare.

--Ken

Re: What is the principle?




 [Original Message]
 From: Peter Kirk [EMAIL PROTECTED]

 Ernest, I support your general ideas here. But I am concerned about the 
 implications of defining PUA characters with combining classes other 
 than zero. I can see this causing some confusion with normalisation etc. 
 And it does hugely multiply the number of PUA characters required.
 
 snip
 
 Is it really necessary to support to this level of detail the concept of 
 canonical equivalence of PUA sequences? 

If you want them to be able to interact with the existing combining marks
then any proposal for more specific private use characters will need to
include combining characters for every existing combining class.  128
characters per class may prove to be overly generous, but it serves
as a starting point for discussion.  The number was chosen because
of the stated preference of assigning character blocks that line up
in groups of 128.  A detailed proposal would definitely need to examine
existing scripts as it would be wasteful to assign too many yet pointless
to assign too few.  I can't see any useful proposal for more specific
Private Use characters as using less than half a plane.  Any proposal
that uses more than one plane will need a lot of justifying to have any
chance, and even with ten unspoken-for planes out there, any proposal
that would call for more than two planes will not go anywhere.

Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)

On 31/03/2004 14:25, [EMAIL PROTECTED] wrote:

Peter Kirk scripsit:

 

But, as Ken has just clarified, with NBSP Louis' neck may be stretched 
rather uncomfortably, if not cut completely. Here is what I don't want 
to see (fixed width font required):

Louis   XVI   was
guillotined    in
1793.
   

This, however, is a matter of presentation rather than semantics, and as such
fitly belongs in the realm of presentational markup.  In HTML, one might
specify <tt>&nbsp;</tt> to generate a fixed-width space.
 

I disagree. Surely there is something SEMANTICALLY different about the 
space in Louis XVI. One semantic difference is that it is 
non-breaking. But another one is that these words should not be split 
apart. An additional semantic distinction might be that they should be 
treated as one word for the purposes of word breaking algorithms.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: What is the principle?

On 31/03/2004 13:30, Kenneth Whistler wrote:

...

I think you're spitting into the wind if you think you can
force, through the character standardization process, the
major platform vendors to support the kind of PUA functionality
you are after, when they could do so *today* via much more
extensible and architecturally sensible means given the
existing PUA characters, but have not yet chosen to do so.
--Ken

 

Ken, I take many of your points. But in this last paragraph you are 
comparing two very different things.

If Ernest's proposal were accepted, major platform vendors could 
(although that does not necessarily imply that they would) implement it 
rather simply by updating the tables of character properties within 
their systems. Indeed I would expect such tables to be updated more or 
less automatically by some process of importing and compiling the 
Unicode character database (including the default properties for the PUA).

That is a far easier task than the one which they have not yet chosen 
to do, to support tables of character properties within PUA fonts, 
because this latter requires significant software development effort and 
may not fit well within existing system architecture.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)



  Here is what I do want: 
   
  Louis XVI was 
  guillotined in 
  1793. 
  
 Louis\ XVI was guillotined in 1793. If you aren't using TeX, 
 and you're doing this type of justification in small columns, 
 your program ought to provide a way to do this. 

Other possible approaches that any industrial-strength
typesetting program ought to provide:

A. Select Louis XVI. Set 'Keep together on line' as a property
   to prevent inappropriate line breaking. Set 'Prevent inter-word
   space justification' to prevent the justification algorithm
   from adjusting the space width beyond the value provided by
   the SPACE in the font.
   
B. Select Louis XVI. Enter it into the hyphenation and line
   breaking dictionary used by the program and set appropriate
   properties on the entry in the dictionary.
   
C. Simply select the space in the text and set it to
   'no-break', 'no-adjust'.
   
Any of these alternatives could be implemented with just a
simple U+0020 SPACE character sitting in the text itself.

That is in addition to solutions that make use of actual
fixed-width space character codes surrounded by ZWNBSP characters
to prevent line breaking.
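A minimal sketch of that last approach, assuming U+2002 EN SPACE as the fixed-width space (any other fixed-width space character would serve equally). The helper name `glue` is hypothetical. Note that since Unicode 3.2, U+2060 WORD JOINER is the preferred character for the non-breaking function that ZWNBSP (U+FEFF) provides here.

```python
ZWNBSP = "\uFEFF"    # ZERO WIDTH NO-BREAK SPACE, as named in the message
EN_SPACE = "\u2002"  # EN SPACE, chosen here as the fixed-width space (assumption)


def glue(left, right, space=EN_SPACE):
    """Join two words with a fixed-width space guarded against line breaks on both sides."""
    return left + ZWNBSP + space + ZWNBSP + right


name = glue("Louis", "XVI")

# Show the three code points sitting between the two words.
print([f"U+{ord(c):04X}" for c in name[5:8]])
```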

The point is that looking to encode a special character in
Unicode for every distinct visual effect in typesetting is
not necessarily the first, best solution to settle on. It
might not even be seventh or eighth best on the possible
list of alternative approaches to solve the problem.

--Ken

Louis XVI  was
guillotined on
Jan.   21,   1793,
facing death  with
courage.

Re: Unicode 4.0.1 Released

Marco Cimarosti scripsit:

 So far, my understanding was that the normative properties of existing code
 points were carved in stone.

Not all normative properties are immutable.  A normative property is
simply one which you have to get right if you claim conformance to
that part of Unicode:  you cannot make PLUS SIGN a letter.  Immutable
properties are those which Unicode guarantees will never change; they
are a subset of the normative properties.

 Won't these fixes break applications out there? I.e., won't they turn
 previously conformant applications into non conformant ones?

They will conform to previous versions but not to newer versions.

-- 
BALIN FUNDINUL  UZBAD KHAZADDUMU        [EMAIL PROTECTED]
BALIN SON OF FUNDIN LORD OF KHAZAD-DUM  http://www.ccil.org/~cowan

Arabic Shaping Classes


Well, I've decided to start what is probably a quixotic quest for
a better set of private use characters.  Such a proposal will need
to be complete, but it had best be as simple as possible.  That
leads me to my first question.  Where is the Arabic Shaping
Class property normally taken care of?  Is this something that is
normally handled inside a font, and as such, something that can
be left with a generic default, or is this something which would
require explicitly setting aside ranges for each shaping class.
(With 54 existing shaping classes, I sincerely hope that not
having to set ranges for each class will prove feasible.)

Ernest Cline
[EMAIL PROTECTED]

Re: What is the principle?

On 31/03/2004 14:27, Mark Davis wrote:

While I disagree with most of what you've said on this list, it is not an
unreasonable proposal to change the default properties for some ranges of the
private use blocks. I don't think that this would, in practice, really disturb
any applications, because of #1 below.
I have, however, a few observations.

1. PUA properties, as is clear from Ken's excellent descriptions, are simply
defaults. With the exception of normalization, no Unicode implementation is
required to observe them. So even if this change is made, any conformant
implementation is free to simply ignore it and just assign its own properties.
This would not be a magic wand.
 

Understood. But I was rather thinking that at least some implementations 
base their character properties directly on the Unicode character 
database. Isn't this what ICU does? And so, if the PUA default 
properties are the ones in the UCD, they would automatically be used by 
implementations.

2. Unicode properties are not sufficient for rendering. With technologies such
as Apple's, all of the other work can be done in a font. With OpenType, most but
not all can -- in particular, reordering has to be done by the application/OS.
So complex scripts that require reordering still would not be interchangeable
without private agreement.
 

This is why the suggestions made for storing character properties in the 
font are unrealistic; they require major restructuring of system 
software (close to rewriting the whole OS, as I wrote earlier), not just 
tinkering. I accept that there may be some practical limitations on PUA 
complex scripts, but I would like them to be a lot less than they are now.

3. Even excluding the normalization properties and other obviously inapplicable
properties (such as name or age), there are some 50-odd possible character
properties, many of them with multiple possible values: see
http://www.unicode.org/Public/UNIDATA/PropertyAliases.txt
http://www.unicode.org/Public/UNIDATA/UCD.html#Properties
http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt
A concrete proposal would have to specify exactly which properties were
relevant, and what the values are for the proposed ranges. (Clearly an even
partition according to all the possible combinations would be completely
impractical.) If the goal is rendering, this means looking at the possible
combinations of properties that are relevant for rendering and proposing a
division that makes sense.
 

That is why I (rather than Ernest) have discussed only rendering related 
properties like bidi and default ignorable. I realise that there may be 
other properties which need to be considered, but I am not yet sure 
which these are.

I sense that you prefer to change the default properties of existing PUA 
characters rather than add new ones. Might it be sensible to adjust the 
properties in one of the PUA planes but leave the other one untouched? 
Has ANYONE actually defined characters in one or other of these planes, 
and if so, which? It would make more sense to change the default 
properties of a plane which no one is actually using.

Mark
__
http://www.macchiato.com
  
 



--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: What is the principle?

On 31/03/2004 15:32, Ernest Cline wrote:

 

[Original Message]
From: Peter Kirk [EMAIL PROTECTED]
Ernest, I support your general ideas here. But I am concerned about the 
implications of defining PUA characters with combining classes other 
than zero. I can see this causing some confusion with normalisation etc. 
And it does hugely multiply the number of PUA characters required.

snip

Is it really necessary to support to this level of detail the concept of 
canonical equivalence of PUA sequences? 
   

If you want them to be able to interact with the existing combining marks
then any proposal for more specific private use characters will need to
include combining characters for every existing combining class. ...
I don't see it. If the PUA combining marks have cc=0, they can never be 
reordered. As long as other marks are always written in canonical order, 
they will in practice never be moved relative to other marks.

Perhaps you are thinking of a sequence something like B, M1, M2, M3 in 
which M1 and M2 interact typographically, but M1 is PUA and M2 and M3 
are not. The normal Unicode rule would be that cc(M1)=cc(M2). But this 
is no guarantee against a reordering to B, M1, M3, M2, which is still 
canonically equivalent but M1 and M2 have been separated. If instead 
cc(M1)=0<cc(M2)<cc(M3), B, M1, M3, M2 is still canonically equivalent 
but with M1 and M2 separated, but the situation is no worse than by the 
normal Unicode rule.
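Both cases can be checked with Python's standard `unicodedata` module. The particular marks are assumptions chosen for illustration: U+0301 and U+0300 (both class 230) and U+0323 (class 220) for the first case, and the private use code point U+E000 (class 0) as M1 in the second.

```python
import unicodedata


def nfd(s):
    """Canonical decomposition, including canonical reordering of marks."""
    return unicodedata.normalize("NFD", s)


B = "a"

# Case 1: cc(M1) == cc(M2) == 230 and cc(M3) == 220.
M1, M2, M3 = "\u0301", "\u0300", "\u0323"
case1_equivalent = nfd(B + M1 + M2 + M3) == nfd(B + M1 + M3 + M2)

# Case 2: M1 is private use with cc 0, then cc(M2) == 220 < cc(M3) == 230.
P1, N2, N3 = "\uE000", "\u0323", "\u0301"
case2_equivalent = nfd(B + P1 + N2 + N3) == nfd(B + P1 + N3 + N2)

print(case1_equivalent, case2_equivalent)
```

In both cases the sequence with M1 and M2 separated is canonically equivalent to the one with them adjacent, which is the sense in which the class-0 alternative is no worse.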

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

RE: Unicode 4.0.1 Released


  * Changed: bidi class of several characters

 Won't these fixes break applications out there? I.e., won't they turn
 previously conformant applications into non conformant ones?

And the other thing to understand about this particular change
is that it is the outcome of a years-long debate and a
painstakingly negotiated settlement that reminded me of
other difficult negotiations involving Middle Eastern issues.

The upshot will be that Microsoft-based applications will
come *into* compliance with the Bidirectional Algorithm,
with the crucial few character property changes as stated.
And IBM and other vendors who had implemented based on the
prior property values agreed that it was worth the tweak
in order to bring the entire industry into an agreed,
interoperable state for bidirectional behavior (in the
absence of higher-level protocols) involving the crucial
ASCII characters, '+', '-', and '/', that interact with URL's,
dates, times, and numbers in a bidirectional context.
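One way to look at the property values involved is Python's standard `unicodedata` module, which exposes both its current database and a frozen Unicode 3.2 database (`unicodedata.ucd_3_2_0`) that predates this release. This is only a sketch: the helper name `bidi_before_after` is hypothetical, and the current values printed depend on the UCD version bundled with the interpreter.

```python
import unicodedata


def bidi_before_after(ch):
    """Return the bidi class of ch in Unicode 3.2 and in the interpreter's current UCD."""
    return (
        unicodedata.ucd_3_2_0.bidirectional(ch),
        unicodedata.bidirectional(ch),
    )


for ch in "+-/":
    old, new = bidi_before_after(ch)
    print(repr(ch), old, "->", new)
```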

--Ken

Re: What is the principle?

2004-03-31 Thread Mark Davis

comments below.

Mark
__
http://www.macchiato.com

- Original Message - 
From: Peter Kirk [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Wed, 2004 Mar 31 19:15
Subject: Re: What is the principle?

 On 31/03/2004 14:27, Mark Davis wrote:

 While I disagree with most of what you've said on this list, it is not an
 unreasonable proposal to change the default properties for some ranges of the
 private use blocks. I don't think that this would, in practice, really
disturb
 any applications, because of #1 below.

 I have, however, a few observations.

 1. PUA properties, as is clear from Ken's excellent descriptions, are simply
 defaults. With the exception of normalization, no Unicode implementation is
 required to observe them. So even if this change is made, any conformant
 implementation is free to simply ignore it and just assign its own
properties.
 This would not be a magic wand.

 Understood. But I was rather thinking that at least some implementations
 base their character properties directly on the Unicode character
 database. Isn't this what ICU does? And so, if the PUA default
 properties are the ones in the UCD, they would automatically be used by
 implementations.

Yes, some do (and ICU does pick up the default). Just pointing out that
implementations can freely choose the properties (except normalization).

BTW, you have been mentioning the combining class; you can have combining marks
in the PUA, but they have to have zero combining classes.

 2. Unicode properties are not sufficient for rendering. With technologies
such
 as Apple's, all of the other work can be done in a font. With OpenType, most
but
 not all can -- in particular, reordering has to be done by the
application/OS.
 So complex scripts that require reordering still would not be interchangeable
 without private agreement.

 This is why the suggestions made for storing character properties in the
 font are unrealistic; they require major restructuring of system
 software (close to rewriting the whole OS, as I wrote earlier), not just
 tinkering. I accept that there may be some practical limitations on PUA
 complex scripts, but I would like them to be a lot less than they are now.

ANY dynamic reassignment of properties requires a major overhaul. There have
been proposals over the years for exchange of PU property data. All of them have
died, and I never expect to see any succeed.

The reason is that most implementations just get properties with static calls,
e.g. isLetter(x). To change it to be dynamic, all of these calls in all programs
would have to be changed to reference a dynamic collection of properties. In a
single-threaded world, this wouldn't be too bad. But ours is a multi-threaded
world, where it is nasty, and horrible if the same
document is expected to contain different sets of PU properties. There are also
performance implications, since properties are used so heavily in processing.
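The difference between the two styles of lookup can be sketched as follows. The class `DocumentProperties` and its override table are hypothetical, invented here for illustration; they do not describe any real implementation.

```python
import unicodedata


def is_letter(ch):
    """Static lookup: one global answer per code point, callable from anywhere."""
    return unicodedata.category(ch).startswith("L")


class DocumentProperties:
    """Hypothetical per-document table overriding categories of private use code points."""

    def __init__(self, overrides=None):
        self.overrides = dict(overrides or {})

    def category(self, ch):
        # Fall back to the static database when no override is present.
        return self.overrides.get(ch, unicodedata.category(ch))

    def is_letter(self, ch):
        return self.category(ch).startswith("L")


# Two documents that disagree about the same private use code point.
doc_a = DocumentProperties({"\uE000": "Lo"})
doc_b = DocumentProperties({"\uE000": "Mn"})

print(is_letter("\uE000"), doc_a.is_letter("\uE000"), doc_b.is_letter("\uE000"))
```

Every caller of the static `is_letter` would have to be handed the right `DocumentProperties` instance instead, which is the retrofit cost described above; sharing such tables safely between threads adds further work that this sketch does not attempt.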

These are not whims of software vendors; they would be very expensive retrofits
for essentially no benefit.

 3. Even excluding the normalization properties and other obviously inapplicable
 properties (such as name or age), there are some 50-odd possible character
 properties, many of them with multiple possible values: see

 http://www.unicode.org/Public/UNIDATA/PropertyAliases.txt
 http://www.unicode.org/Public/UNIDATA/UCD.html#Properties
 http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt

 A concrete proposal would have to specify exactly which properties were
 relevant, and what the values are for the proposed ranges. (Clearly an even
 partition according to all the possible combinations would be completely
 impractical.) If the goal is rendering, this means looking at the possible
 combinations of properties that are relevant for rendering and proposing a
 division that makes sense.

 That is why I (rather than Ernest) have discussed only rendering related
 properties like bidi and default ignorable. I realise that there may be
 other properties which need to be considered, but I am not yet sure
 which these are.

Those alone won't work. If you want stuff to render right, then you have to
include *any* property that systems may use to affect display. You do want these
characters to linebreak correctly, eh? That's why I said that a complete
proposal would have to spell out all the properties that would be considered, and
give reasons for the inclusion/exclusions.

 I sense that you prefer to change the default properties of existing PUA
 characters rather than add new ones. Might it be sensible to adjust the
 properties in one of the PUA planes but leave the other one untouched?
 Has ANYONE actually defined characters in one or other of these planes,
 and if so, which? It would make more sense to change the default
 properties of a plane which no one is actually using.

1. There is no way I would advocate

Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)

2004-03-31 Thread fantasai

Peter Kirk wrote:

On 31/03/2004 14:25, [EMAIL PROTECTED] wrote:

Peter Kirk scripsit:

But, as Ken has just clarified, with NBSP Louis' neck may be 
stretched rather uncomfortably, if not cut completely. Here is what I 
don't want to see (fixed width font required):

Louis   XVI   was
guillotined    in
1793.

This, however, is a matter of presentation rather than semantics, and 
as such fitly belongs in the realm of presentational markup.  In HTML,
one might specify <tt>&nbsp;</tt> to generate a fixed-width space.

I disagree. Surely there is something SEMANTICALLY different about the 
space in Louis XVI. One semantic difference is that it is 
non-breaking. But another one is that these words should not be split 
apart. An additional semantic distinction might be that they should be 
treated as one word for the purposes of word breaking algorithms.

non-breaking and non-stretching are presentational properties, not
semantic ones. They don't change the meaning of the space: it's still
just a space, not a hyphen or the letter g. They don't affect
non-visual media; we don't break lines in spoken speech. Louis XVI
is semantically different from Louis' head because the former is a
bare noun whereas the latter is a noun phrase, but as far as the reader
is concerned, they're both separated with a space. Whether the space
breaks or not or stretches or not has no effect on either the meaning
or correctness of the text. It only affects its (visual) aesthetic
quality.
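
One data point that is at least consistent with this view: the Unicode
Character Database gives U+00A0 NO-BREAK SPACE a compatibility decomposition to
U+0020 SPACE tagged <noBreak>, so compatibility normalization folds it to an
ordinary space. A small Python sketch, assuming the standard unicodedata module
(the helper name describe_space is made up for illustration):

```python
import unicodedata

NBSP = "\u00a0"


def describe_space(ch):
    """Return (general category, decomposition mapping, NFKC form) for ch."""
    return (
        unicodedata.category(ch),
        unicodedata.decomposition(ch),
        unicodedata.normalize("NFKC", ch),
    )


if __name__ == "__main__":
    for ch in (" ", NBSP):
        print("U+%04X" % ord(ch), describe_space(ch))
```

Both characters are space separators (Zs); the only thing the decomposition
records about NBSP is the no-break behavior.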
~fantasai

--
http://fantasai.inkedblade.net/contact

Re: What is the principle?

2004-03-31 Thread Mark E. Shoulson

[Original Message]
From: Kenneth Whistler [EMAIL PROTECTED]
To: [EMAIL PROTECTED]

Scenario: The UTC listens to you and defines some section of the PUA
as strong right-to-left by default for use in PUA-defined bidirectional
scripts. Somebody else is *already* using that section of the PUA
for something else. Now they have an interoperability problem,
because the default behavior they were depending on changes over
in some future version of some software, not under their control,
and their data gets munged by bidi.

So?  Let *them* fix *their* software.  They should know, same as the 
rest of us, that you can't depend on the PUA.  If they wanted LTR base 
glyphs, then they should have coded that into their system.  Didn't 
someone just say that the normative properties of characters are not 
necessarily etched in stone?  If that's true of anything, it should be 
of the PUA.
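
For reference, the defaults under discussion can be inspected directly. The
sketch below assumes Python's standard unicodedata module, which reflects
whatever version of the Unicode Character Database the interpreter was built
with; the helper name pua_defaults and the sample code points are chosen only
for illustration. UnicodeData.txt lists all three private use ranges with
general category Co and bidirectional class L, which is the behavior existing
users would be relying on in the scenario above.

```python
import unicodedata

# First code point of each private use range (BMP, plane 15, plane 16).
PUA_SAMPLES = {
    "BMP PUA": 0xE000,
    "Plane 15 PUA": 0xF0000,
    "Plane 16 PUA": 0x100000,
}


def pua_defaults():
    """Map each sample range to its (general category, bidi class) default."""
    return {
        name: (unicodedata.category(chr(cp)), unicodedata.bidirectional(chr(cp)))
        for name, cp in PUA_SAMPLES.items()
    }


if __name__ == "__main__":
    for name, (category, bidi) in pua_defaults().items():
        print(name, category, bidi)
```

If a future version of the data changed the bidi class for one of these ranges,
the same lookup would silently start returning a different value, with no change
on the part of whoever was already using that range.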

~mark

Re: Doing Markup in Plain Text: A Modest Proposal for Planes 4-B of Unicode

2004-03-31 Thread Mark E. Shoulson

[EMAIL PROTECTED] wrote:

XML has become the de facto standard for fancy text.  It is therefore
useful to explore ways and means of bringing XML into plain text,
since obviously plain text is simpler than, and superior to, fancy text.
The current method involving < and > and & and / and who knows what else
is obviously much too complicated, and cannot interoperate with even the
simplest plain text.  Fortunately, the characters in planes 4 through
B can come to our rescue.

Heh... I've occasionally caught myself almost wishing for this kind of 
setup, ridiculous though it be.  It would be nice to be able to get just 
the *content* of the text without having to bother with all that mucking 
about with HTML rendering engines and whatnot.

I suppose only a programmer (and a semi-Luddite one at that, who won't 
or can't use existing packages) would really care, though.

Now, if we can just simplify ASCII down to *one* character and some 
variation selectors...

~mark

Line Break class of U+FE51 Small Ideographic Comma