RE: Definitions

2003-11-26 Thread Peter Constable
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf
> Of Peter Kirk

> a sequence of combining characters
> following ZWNJ is a defective combining sequence.

For now, yes. This may change.



Peter
 
Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division




Re: Definitions

2003-11-26 Thread Peter Kirk
On 26/11/2003 07:43, [EMAIL PROTECTED] wrote:

...

In all I would rather ban all defective sequences, by enforcing the W3C 
character model. I don't see much point for them. The only possible reason I 
can think of right now is to allow description of the character itself, though 
that would possibly best be done through an element that represents the concept 
of a Unicode character along the lines of .
 

Formally, as we discovered recently on this list, it is necessary to use 
defective combining sequences to write some Khmer words correctly as 
defined in TUS 4.0 p.282; for a sequence of combining characters 
following ZWNJ is a defective combining sequence.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




RE: Definitions

2003-11-26 Thread jon
> In all I would rather ban all defective sequences, by enforcing the W3C 
> character model.

Correction: by enforcing the use of full normalisation as defined in the W3C 
character model.



RE: Definitions

2003-11-26 Thread jon
Quoting Philippe Verdy <[EMAIL PROTECTED]>:

> Peter Kirk [mailto:[EMAIL PROTECTED] writes:
> > Why is this a problem? Quotes and ">" with combining marks are 
> > presumably not legal HTML or XML;
> 
> You're wrong: it is legal in both HTML and XML. What is not specified
> correctly is the behavior of HTML and XML parsers faced with an XML or HTML
> document claiming to be coded with a Unicode encoding scheme or any other
> Unicode-compatible CES (like GB18030, but not completely MacRoman, as it
> contains supplementary characters that are not part of the Unicode/ISO/IEC
> 10646 repertoire).
> 
> > and so the interpretation of a quotes 
> > or ">" followed by combining marks as a quote or ">" and a defective 
> > combining sequence is unambiguous, surely?
> 
> No it is not: there's a problem of precedence between XML/HTML/SGML parsing
> rules and Unicode parsing rules. Using character entities can solve this
> problem, but I would really prefer that the W3C accept a modification of its
> parsing rules so that any text element or attribute value starting with a
> defective combining sequence MUST NOT be interpreted as such using the
> simple encoding scheme.

The Character Model defines degrees of normalisation of text which go beyond 
NFC to prohibit the sequences you describe. Standards can use these definitions 
to prevent the issues associated with them.
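The Character Model's start condition can be made concrete: a text construct is not "fully-normalized" if it begins with a combining mark. A minimal sketch of that check, using Python's unicodedata as an assumed source of character properties (the function name is illustrative):

```python
import unicodedata

def starts_defectively(text: str) -> bool:
    """True if the text begins with a defective combining sequence,
    i.e. its first character is a combining mark (nonzero canonical
    combining class, or general category Mn/Mc/Me)."""
    if not text:
        return False
    first = text[0]
    return (unicodedata.combining(first) != 0
            or unicodedata.category(first) in ("Mn", "Mc", "Me"))

# A text node beginning with U+0301 COMBINING ACUTE ACCENT is defective:
assert starts_defectively("\u0301abc")
assert not starts_defectively("abc")
```

A standard that requires full normalisation would reject any element content or attribute value for which such a check returns true.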

> > Your proposed solution to the problem is messy in requiring the use of 
> > numeric entities, and unnecessary.
> 
> This is not that messy. Also I did not say that numeric entities must be
> used. Any parsed named entity could be used as well. This is not a problem
> of the Unicode standard, but a problem of the SGML, HTML 4.01, and XML
> standards. For SGML and HTML up to 4.01, you also have problems with the
> equal sign (because the quotes around element's attribute values are not
> mandatory, unlike in XML).

It is messy, because it would have to occur on serialisation from a model of an 
XML document which hid the use of entities. Hence if we parsed 
/ where / expanded to the single character U+0338 
followed by the text " is a reverse solidus character" then we might have that 
stored as a text node of that character, receive it as a text event of that 
character, etc. in expanded form.

On serialisation we would have to serialise as ̸ is a reverse 
solidus character which would be relatively difficult to produce, 
though considerably easier to produce than the original /

Of course in this case it's more crucial than in others (since inserting the 
character directly into the stream and then normalising it with NFC would 
produce a non-well-formed document, which isn't true with other combining 
characters).

In all I would rather ban all defective sequences, by enforcing the W3C 
character model. I don't see much point for them. The only possible reason I 
can think of right now is to allow description of the character itself, though 
that would possibly best be done through an element that represents the concept 
of a Unicode character along the lines of .



Re: Definitions

2003-11-26 Thread Peter Kirk
On 26/11/2003 06:17, Philippe Verdy wrote:

Peter Kirk [mailto:[EMAIL PROTECTED] writes:
 

Why is this a problem? Quotes and ">" with combining marks are 
presumably not legal HTML or XML;
   

You're wrong: it is legal in both HTML and XML. What is not specified
correctly is the behavior of HTML and XML parsers faced with an XML or HTML
document claiming to be coded with a Unicode encoding scheme or any other
Unicode-compatible CES (like GB18030, but not completely MacRoman, as it
contains supplementary characters that are not part of the Unicode/ISO/IEC
10646 repertoire).
 

OK, I used the wrong words here. A sequence of a quote or ">" followed 
by combining characters is legal HTML/XML with the interpretation of a 
quote or ">" introducing a quoted string or terminating a tag, followed 
by a defective combining sequence which is part of the quoted string or 
of the text following the tag. The question is, does such a sequence 
have any other legal interpretation, within the context of an HTML/XML 
tag? If not, there is no ambiguity.

...

There could of course be 
problems if there were any precomposed combinations of quotes or ">" 
with combining characters, but I don't think there are any, are there?
   

There are such precomposed sequences in Unicode. Look in
NormalizationTest.txt for the places where ">", single and double quotes are
used and part of a combining sequence... Notably look at sequences made with
the combining solidus overlay; add also the case of enclosing combining
characters, and of mathematical operators that can be created with a
combining sequence starting by ">" or "=" or single or double quotes, and
modified by diacritics.
 

According to John Cowan there is just one such precomposed character, 
U+226F. As an HTML/XML document (the whole file, not just the parts 
between tags) is a Unicode string, the Unicode conformance rules would 
seem to mandate that an HTML/XML parser should parse U+226F exactly as 
if it were the sequence <">", U+0338>, i.e. as end of tag followed by a 
defective combining sequence. Normalisation stability implies that this 
precomposed character will always be the only such problem case, at 
least apart from composition exceptions, and so it is possible to write 
it into parsers as a special case. A bit messy, but less messy than 
using numeric entities or named entities.
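The special case described above amounts to a one-line pre-pass before tokenization; a hypothetical sketch (the function name is illustrative):

```python
# Canonically split U+226F NOT GREATER-THAN back into ">" + U+0338
# before the tokenizer sees it, so the ">" can still close a tag and
# the overlay starts a (defective) combining sequence in the content.
def pre_tokenize(text: str) -> str:
    return text.replace("\u226f", ">\u0338")

assert pre_tokenize("<a\u226fx") == "<a>\u0338x"
```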

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: Definitions

2003-11-26 Thread John Cowan
Peter Kirk scripsit:

> There could of course be 
> problems if there were any precomposed combinations of quotes or ">" 
> with combining characters, but I don't think there are any, are there?

Just one:  U+226F NOT GREATER THAN is canonically equivalent to > followed
by U+0338 COMBINING LONG SOLIDUS OVERLAY.  Consequently, applying NF(K)C
to XML/HTML that contains this sequence such that the > is a closing
tag delimiter produces ill-formed XML/HTML.  This is considered to be
a corner case, because there is typically no reason to separate U+0338
from whatever it applies to by markup.

This problem is called out in the W3C Character Model draft.
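The corner case is easy to reproduce with any conformant normalizer; a small demonstration, using Python's unicodedata purely as an illustration:

```python
import unicodedata

fragment = "<i>\u0338not greater</i>"   # U+0338 immediately after the ">"
normalized = unicodedata.normalize("NFC", fragment)

# The ">" delimiter and the overlay fuse into U+226F NOT GREATER-THAN,
# so the closing delimiter of the start tag is gone after normalization.
assert "\u226f" in normalized
assert normalized.startswith("<i\u226f")
```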

-- 
John Cowan  [EMAIL PROTECTED]  www.reutershealth.com  www.ccil.org/~cowan
If a soldier is asked why he kills people who have done him no harm, or a
terrorist why he kills innocent people with his bombs, they can always
reply that war has been declared, and there are no innocent people in an
enemy country in wartime.  The answer is psychotic, but it is the answer
that humanity has given to every act of aggression in history.  --Northrop Frye



RE: Definitions

2003-11-26 Thread Philippe Verdy
Peter Kirk [mailto:[EMAIL PROTECTED] writes:
> Why is this a problem? Quotes and ">" with combining marks are 
> presumably not legal HTML or XML;

You're wrong: it is legal in both HTML and XML. What is not specified
correctly is the behavior of HTML and XML parsers faced with an XML or HTML
document claiming to be coded with a Unicode encoding scheme or any other
Unicode-compatible CES (like GB18030, but not completely MacRoman, as it
contains supplementary characters that are not part of the Unicode/ISO/IEC
10646 repertoire).

> and so the interpretation of a quotes 
> or ">" followed by combining marks as a quote or ">" and a defective 
> combining sequence is unambiguous, surely?

No it is not: there's a problem of precedence between XML/HTML/SGML parsing
rules and Unicode parsing rules. Using character entities can solve this
problem, but I would really prefer that the W3C accept a modification of its
parsing rules so that any text element or attribute value starting with a
defective combining sequence MUST NOT be interpreted as such using the
simple encoding scheme. If an XML document is serialized into a text file
with an encoding scheme, the generated file should (I would prefer "must")
not encode these defective sequences with the encoding scheme, but with
character references only.

This would make it possible to use exactly the SAME text parser used for
Unicode as the input for the lexical and grammatical analysis of the
XML/HTML/SGML parser. Within that model, the sequence ">" + combining
character would be seen as a single combining sequence coding an abstract
character, which breaks the syntax of an expected end of tag. The same goes
for the quotes delimiting the start of attribute values, or for the square
bracket delimiting the start of a CDATA section.

> There could of course be 
> problems if there were any precomposed combinations of quotes or ">" 
> with combining characters, but I don't think there are any, are there?

There are such precomposed sequences in Unicode. Look in
NormalizationTest.txt for the places where ">", single and double quotes are
used and part of a combining sequence... Notably look at sequences made with
the combining solidus overlay; add also the case of enclosing combining
characters, and of mathematical operators that can be created with a
combining sequence starting by ">" or "=" or single or double quotes, and
modified by diacritics.

> Your proposed solution to the problem is messy in requiring the use of 
> numeric entities, and unnecessary.

This is not that messy. Also, I did not say that numeric entities must be
used; any parsed named entity could be used as well. This is not a problem
of the Unicode standard, but a problem of the SGML, HTML 4.01, and XML
standards. For SGML and HTML up to 4.01, you also have problems with the
equals sign (because the quotes around an element's attribute values are not
mandatory, unlike in XML).

We don't have this problem for element names or attribute names, because
they must obey a stricter syntax and can't be any arbitrary Unicode string:
these names cannot contain defective combining sequences simply because
combining characters cannot be identifier starts.



Re: Definitions

2003-11-26 Thread Peter Kirk
On 26/11/2003 02:29, Philippe Verdy wrote:

[EMAIL PROTECTED] wrote:
 

Briefly, it's my opinion that applications which claim to support
and comply with Unicode should not 'step on' Unicode text.  Any
loopholes in the 'letter of the law' which allow applications to
mung or reject Unicode text should be plugged.
   

If this "plugging" request must be done, it should also be the case for HTML
and XML.
For now, combining characters can be encoded directly just after a quote
character (single or double) used to mark the beginning of an attribute
value, or just after a tag-closing ">". HTML and XML parsers will parse
these quotes or greater-than signs while ignoring the combining sequence,
creating defective sequences, and this is a problem.
...
 

Why is this a problem? Quotes and ">" with combining marks are 
presumably not legal HTML or XML; and so the interpretation of a quotes 
or ">" followed by combining marks as a quote or ">" and a defective 
combining sequence is unambiguous, surely? There could of course be 
problems if there were any precomposed combinations of quotes or ">" 
with combining characters, but I don't think there are any, are there?

Your proposed solution to the problem is messy in requiring the use of 
numeric entities, and unnecessary.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




RE: Definitions

2003-11-26 Thread Philippe Verdy
[EMAIL PROTECTED] wrote:
> Briefly, it's my opinion that applications which claim to support
> and comply with Unicode should not 'step on' Unicode text.  Any
> loopholes in the 'letter of the law' which allow applications to
> mung or reject Unicode text should be plugged.

If this "plugging" request must be done, it should also be the case for HTML
and XML.
For now, combining characters can be encoded directly just after a quote
character (single or double) used to mark the beginning of an attribute
value, or just after a tag-closing ">". HTML and XML parsers will parse
these quotes or greater-than signs while ignoring the combining sequence,
creating defective sequences, and this is a problem.

My opinion is that HTML and XML parsers should not take the quote or
greater-than sign in isolation, without considering the whole combining
sequence. This means that such occurrences should be considered syntax
errors. If one really wants to create a Unicode-compliant XML/HTML document
containing defective sequences, these sequences should be encoded with
character entities...

An XML/HTML code generator that produces a serialized document should then
know the list of combining characters, and encode them with numeric entities
when their use is defective (at the beginning of a CDATA section, of an
attribute value, or of a text element...). This would completely "plug the
hole".
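The serialization rule above can be sketched as a helper that escapes a leading combining mark as a numeric character reference (the function name is illustrative, not from any standard):

```python
import unicodedata

def escape_leading_combiner(text: str) -> str:
    """If a serialized text run would begin with a combining mark
    (a defective combining sequence), emit that mark as a numeric
    character reference so it can never attach to a preceding '>'
    or quote in the serialized form."""
    if text and unicodedata.combining(text[0]) != 0:
        return "&#x%04X;%s" % (ord(text[0]), text[1:])
    return text

assert escape_leading_combiner("\u0338 overlay") == "&#x0338; overlay"
assert escape_leading_combiner("plain text") == "plain text"
```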


RE: Definitions

2003-11-25 Thread jameskass
.
Peter Constable wrote,

> James: 
>
> > Inside a program, for instance...
> 
> This is *very* faulty logic. ...

Jeepers!

> ... Variable names exist in source code only,
> and have nothing whatsoever to do with the data actually processed.

Exactly.  Variable names are always internal while data may be
external.

> You're also referring to an assigned character in your example, not a
> PUA codepoint. ...
>

Since it was supposed to draw a correlation between "ASCII-conformant"
and Unicode-conformant, an assigned ASCII character was used in the
example.  After all, ASCII didn't have much to offer in the way
of Private Use Areas or unassigned code points.

> A software product could assign every single PUA codepoint to mean some
> kind of formatting instruction, and insert these into the text like
> markup. In that case, a user's PUA characters will be re-interpreted by
> that software as formatting instructions. 

HTML manages to use ASCII characters as formatting mark-up yet
still allows ASCII text to be processed as expected.

Briefly, it's my opinion that applications which claim to support
and comply with Unicode should not 'step on' Unicode text.  Any
loopholes in the 'letter of the law' which allow applications to
mung or reject Unicode text should be plugged.

Best regards,

James Kass
.



Re: Definitions

2003-11-19 Thread Philippe Verdy
From: "Peter Kirk" <[EMAIL PROTECTED]>
> The problem here is surely that the application is conformant even if it
> doesn't claim or admit to supporting only this one character. It can
> print on the box "Now Unicode conformant!" and make this a major
> advertising feature, and no one can do anything about it. Now this is a
> ridiculous example. But a less ridiculous one is that an application or
> rendering system can claim to support Cyrillic, Greek, Hebrew and/or
> Arabic scripts according to Unicode when in fact it supports only small
> subsets (e.g. those required for major modern languages without
> diacritics), and still be conformant. It becomes very difficult for
> those of us who need support for ancient and/or minority languages to
> find conformant software.

Being conformant to Unicode means nothing more than not making false
interpretations of conforming data, and not generating non-conforming data,
when it is claimed that the operations performed are done according to
Unicode.

Exhibiting Unicode conformance is not enough: users want support for their
languages, whatever encoding is used. That is not Unicode's place, and other
standards exist, or can be developed, to specify this conformance
requirement.

One example is the set of MES subsets, or other common subsets which are
developed to exhibit minimum support for several classes of languages. I
think this is a matter for other ISO standards relevant to each language, or
for the ISO 10646-1 list of subsets.




Re: Definitions

2003-11-19 Thread Philippe Verdy
From: "Peter Constable" <[EMAIL PROTECTED]>
> A software product could assign every single PUA codepoint to mean some
> kind of formatting instruction, and insert these into the text like
> markup. In that case, a user's PUA characters will be re-interpreted by
> that software as formatting instructions. Is that product conformant?
> Yes. Is it useful? Not for that user.

With a very simple transcoder, you could remap all HTML markup and the
supplementary line breaks used in markup into 256 PUA code points. You would
get a file that contains ALL the HTML markup but still complies with the
Unicode plain-text definition. Rendering it back to HTML would use a reverse
filter, and would create an HTML file without any PUA, so it would be
rendered correctly.
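The transcoder described above can be sketched in a few lines (the choice of markup characters and the PUA range starting at U+E000 are illustrative assumptions, not part of any standard):

```python
# Shift the HTML syntax characters into the Private Use Area so the
# result looks like plain text to any Unicode process, then map them
# back losslessly with the reverse filter.
MARKUP_CHARS = "<>&\"'"
TO_PUA = {ch: chr(0xE000 + i) for i, ch in enumerate(MARKUP_CHARS)}
FROM_PUA = {pua: ch for ch, pua in TO_PUA.items()}

def to_plain(html: str) -> str:
    return "".join(TO_PUA.get(ch, ch) for ch in html)

def to_html(plain: str) -> str:
    return "".join(FROM_PUA.get(ch, ch) for ch in plain)

sample = '<p class="x">a &amp; b</p>'
assert "<" not in to_plain(sample)          # markup hidden in the PUA
assert to_html(to_plain(sample)) == sample  # lossless round trip
```

As the thread notes, the catch is that a renderer unaware of this private convention has no defined default behavior for the PUA code points it encounters.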

The only problem is that PUA characters have no defined rendering, and
Unicode does not specify ranges of PUA code points for distinct uses, with
distinct but predefined _default_ character properties: why isn't there a
range for Mn diacritics, a range for ideographic letters or symbols, and a
range for ignorable formatting controls (all of them with combining class
0)? At least that would have allowed applications and renderers to behave
correctly even in the absence of support for those PUA characters, by using
a correct _default_ rendering, instead of just displaying narrow white
boxes, or nothing...

I don't know why this would break anything: documents can still use the PUA
the way they want, with their own semantics and behavior. But suggesting
distinct ranges for the default behavior would be a real bonus, helping
applications adopt a coherent behavior when faced with unknown or
unspecified PUA characters.




Re: Definitions

2003-11-19 Thread Peter Kirk
On 18/11/2003 16:02, [EMAIL PROTECTED] wrote:

...

... A conformant application can even display every
character except (say) U+26A0 as a default "not supported" glyph and
still be called conformant. ...
Of course, such apps are not the kind of thing most of us find useful.
   

Indeed.  So, if an application claims to only interpret U+26A0, we
might take this as kind of a "warning sign" that the application is
fairly useless, eh?
 

The problem here is surely that the application is conformant even if it 
doesn't claim or admit to supporting only this one character. It can 
print on the box "Now Unicode conformant!" and make this a major 
advertising feature, and no one can do anything about it. Now this is a 
ridiculous example. But a less ridiculous one is that an application or 
rendering system can claim to support Cyrillic, Greek, Hebrew and/or 
Arabic scripts according to Unicode when in fact it supports only small 
subsets (e.g. those required for major modern languages without 
diacritics), and still be conformant. It becomes very difficult for 
those of us who need support for ancient and/or minority languages to 
find conformant software.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: Definitions

2003-11-19 Thread Peter Kirk
On 18/11/2003 17:03, Peter Constable wrote:

...

So, we need to decide: are we going to debate what follows the letter of
the conformance laws, or what is useful?
 

Maybe we should debate how the conformance laws might be tightened up to 
make them correspond better to what is useful.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




RE: Definitions

2003-11-19 Thread Peter Constable
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf
> Of [EMAIL PROTECTED]

James: 

> Inside a program, for instance...

This is *very* faulty logic. Variable names exist in source code only,
and have nothing whatsoever to do with the data actually processed.
You're also referring to an assigned character in your example, not a
PUA codepoint. ...

A software product could assign every single PUA codepoint to mean some
kind of formatting instruction, and insert these into the text like
markup. In that case, a user's PUA characters will be re-interpreted by
that software as formatting instructions. Is that product conformant?
Yes. Is it useful? Not for that user.


Peter
 
Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division




RE: Definitions

2003-11-18 Thread jameskass
.
Peter Constable wrote,

> So, we need to decide: are we going to debate what follows the letter of
> the conformance laws, or what is useful?

Since we seem to agree on what is useful, it might be difficult
to debate.

(I had thought we were debating what is the letter of the conformance
laws and drifting a bit into what should be the letter of the 
conformance laws.)

Earlier, Peter Constable wrote,

> It is perfectly acceptable for a conformant application to use every
> single PUA codepoint for its own internal purposes, and to reject
> incoming PUA codepoints or display them with some default "not
> supported" glyph. ...

And, I'd made some kind of response.  Here's another try at it...

Isn't it acceptable for any application to use any byte sequence
internally?

Inside a program, for instance:

 a = 0

The letter "a" can be used as a memory variable.  The programmer has
just re-assigned its value internally.  It is no longer the letter "a", it
is now the number zero.

This variable might be used as a counter for a loop, or, whatever.

Later in the program, there could be an opportunity for the user
to enter a choice with the keyboard.  The screen could look 
something like:

***

 Please enter one of the following choices:

 "a" = Sarasvati gets to run the e-mail list as she pleases.

  - or -

 "b" = Sarasvati gets to run the e-mail list as she pleases.

 Please type the letter "a" or "b":  | |

Press any key to continue...

***

It isn't necessary to release the internal memory variable "a" in order for
the user to externally indicate a choice by typing in the letter "a" on the 
keyboard.

If the memory variable "a" still equals zero, we don't expect the zero
character to display on the screen when the user enters the "a" letter.
We shouldn't expect some kind of default glyph meaning "this code
point is being used internally, so you can't have it."

If the program calls for the display of an external text file, the letter
"a" really needs to appear where it's expected.

Otherwise, I'd not consider the program to be "ASCII-conformant".

Likewise, any application claiming Unicode conformance which mungs
text or distorts displays for any valid Unicode range...
well, you know where I'm going with this

Best regards,

James Kass
.









RE: Definitions

2003-11-18 Thread Peter Constable
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf
> Of [EMAIL PROTECTED]


> We agree that it is acceptable for any conformant application to
> use all of the PUA internally...

> > Of course, such apps are not the kind of thing most of us find
useful.
> 
> Indeed.  So, if an application claims to only interpret U+26A0, we
> might take this as kind of a "warning sign" that the application is
> fairly useless, eh?


So, we need to decide: are we going to debate what follows the letter of
the conformance laws, or what is useful?



Peter
 
Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division




RE: Definitions

2003-11-18 Thread jameskass
.
Peter Constable wrote,


> It is perfectly acceptable for a conformant application to use every
> single PUA codepoint for its own internal purposes, and to reject
> incoming PUA codepoints or display them with some default "not
> supported" glyph. ...

We agree that it is acceptable for any conformant application to
use all of the PUA internally.  However, if such an application is 
designed to display Unicode text, then it needs to be able to 
distinguish that which is internal from that which isn't (IMO).  

IOW, if the old DOS applications can tell when to implement x04 as
part of a control sequence from when to display a diamond-shaped 
glyph, then modern apps ought to be able to behave likewise.

> ... A conformant application can even display every
> character except (say) U+26A0 as a default "not supported" glyph and
> still be called conformant. ...
>
> Of course, such apps are not the kind of thing most of us find useful.

Indeed.  So, if an application claims to only interpret U+26A0, we
might take this as kind of a "warning sign" that the application is
fairly useless, eh?

Best regards,

James Kass
.



RE: Definitions

2003-11-18 Thread Peter Constable
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf
> Of Peter Kirk


> If the channel starts messing around with the
> characters sent through it, that is what is non-conformant.

Only if it messes with those characters while claiming it is not.

Folks, there is nothing non-conformant about processes that convert
"abc" into "ABC" or "day" into "jour" as long as they are doing what
their owners claim they are doing.



Peter
 
Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division




RE: Definitions

2003-11-18 Thread Peter Constable
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf
> Of [EMAIL PROTECTED]


> Any application which bans or prevents the interchange or storage
> of PUA code points should be considered non conformant.

I agree entirely with your ultimate intent, but I must say you are
wrong. 

It is perfectly acceptable for a conformant application to use every
single PUA codepoint for its own internal purposes, and to reject
incoming PUA codepoints or display them with some default "not
supported" glyph. A conformant application can even display every
character except (say) U+26A0 as a default "not supported" glyph and
still be called conformant. The kinds of things a conformant app cannot
do is to take in (say) U+0047 "G" and display it as "F" or pass it on as
U+0042 "B" while claiming it is not transforming that text.

Of course, such apps are not the kind of thing most of us find useful.
And it would certainly be A Good Thing if applications did not hinder
(or even took steps to facilitate) the use of PUA characters according
to semantics defined by a given user.


Peter
 
Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division





Re: Definitions

2003-11-14 Thread Jim Allan
James Kass wrote:

In TrueType/OpenType, the first glyph in the font is used
as the "missing glyph".
And for PostScript fonts, systems often take the bullet character in the 
font as the "missing glyph" symbol.

However printers are very erratic in what they do.

Whether a character like ASCII 0x01 (probably in a data file in error or 
by mistransliteration) will be rendered by a printer as a missing glyph 
symbol or space seems to depend on which printer driver is used and 
often on settings within that driver.

Jim Allan









Re: Definitions

2003-11-14 Thread Philippe Verdy
From: <[EMAIL PROTECTED]>
> Philippe Verdy wrote,
> The font specs strongly recommend that font developers use the
> narrow white box, or something very similar, for the missing glyph.
> But, some developers

Including Microsoft itself in some of its fonts installed with Windows XP...
Look at Symbol, David, Miriam, Terminal...

> do make up some interesting alternatives,
> and, especially in older fonts with "custom" ("hack") encodings,

The above fonts, and many others in the supplementary fonts coming with
Office, are not "hacks" but they often display the "?" glyph...

> the developer might not have known that the first glyph in the
> font would be used as the missing glyph.  (Or the developer might
> have known but disregarded the recommendations.)

My opinion is that this is an interaction with the legacy support for
Windows 3.x/95 .FON bitmap fonts. When a glyph is not found in the TrueType
font, Windows may look into the .FON font and find a glyph there for "?"...

There are other combinations also coming from printer settings (which are
used in WYSIWYG documents that try to map the display font immediately to
its corresponding printer font).




Re: Definitions

2003-11-14 Thread jameskass
.
Philippe Verdy wrote,

> But there are several fonts in Windows and Office that still display a
> normal question mark for this glyph ID, instead of a narrow white box as
> expected (this may be a caveat within the system compatibility font mappings
> with system fonts which are not TrueType but simple .FON bitmap fonts)...
> 

Like Peter Kirk mentioned, this can be a code page issue.

In TrueType/OpenType, the first glyph in the font is used
as the "missing glyph".  So, if the font maker put the question
mark as the first glyph, it would be the "missing glyph" for that
font.  But, this would be a rare case, AFAICT.

The font specs strongly recommend that font developers use the
narrow white box, or something very similar, for the missing glyph.
But, some developers do make up some interesting alternatives,
and, especially in older fonts with "custom" ("hack") encodings,
the developer might not have known that the first glyph in the
font would be used as the missing glyph.  (Or the developer might
have known but disregarded the recommendations.)

Best regards,

James Kass
.



Re: Definitions

2003-11-14 Thread jon
>  - -
> |  User   |   |  User   |
>  - -
> |   App   |   |   App   |
>  - -
> | Unicode |   | Unicode |
>  ---
> | Communication channel |
>  ---
> 
> In this model, Unicode ... Unicode offers as defined a transparent 
> channel for all characters including PUA (although normalisation etc is 
> permitted), and if an implementation is not transparent it is 
> non-conformant. The communicating applications built on top of Unicode 
> are free to do what they want with PUA characters, including refusing to 
> handle them at all; indeed they can refuse to handle any other character 
> as there is no obligation to support any characters. But if they are to 
> be useful applications for many users, they would be well advised to 
> offer support for as many characters as possible.
> 
Aye, this is exactly what I was talking about, I was just using "application" 
to refer to any piece of software involved in any stage, including the 
communicating applications.
--
Jon Hanna

*Thought provoking quote goes here*



Re: Definitions

2003-11-14 Thread Peter Kirk
On 14/11/2003 00:20, Philippe Verdy wrote:

> From: <[EMAIL PROTECTED]>
>> Please see
>> http://www.microsoft.com/typography/otspec/recom.htm
>> ... the section about "Shape of .notdef glyph"
>
> Thanks for pointing out a Microsoft recommendation for the undefined glyph
> (glyph id=0) that every TT font should implement (so this would affect also
> OT fonts).
> But there are several fonts in Windows and Office that still display a
> normal question mark for this glyph ID, instead of a narrow white box as
> expected (this may be a caveat within the system compatibility font mappings
> with system fonts which are not TrueType but simple .FON bitmap fonts)...

This is probably because Windows is first mapping your Unicode text on 
to a system code page. The normal behaviour with characters which don't 
have a compatibility type mapping is to map them to "?".
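Peter's code-page explanation is easy to reproduce. Python's 'replace' error handler substitutes '?' when a character has no mapping in the target code page, much as Windows substitutes its default character (typically '?') when converting to a legacy code page; a sketch, not the actual Windows rendering path:

```python
# Encoding a PUA character to a legacy Windows code page: with no
# mapping available, the 'replace' error handler falls back to '?'.
pua = "\uE000"                                   # a Private Use Area character
print(pua.encode("cp1252", errors="replace"))    # b'?'
```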

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: Definitions

2003-11-14 Thread Philippe Verdy
From: "Jim Allan" <[EMAIL PROTECTED]>
> I take this to mean that any application can refuse to interpret PUA
> code points and still be conformant.

I would not say that; it would be excessive.
An application can use PUA the way it wants, but not like ill-formed
encoding sequences or non-characters.

Instead the application should choose the way to manage them. I think that
many applications choose to handle them like non-combining spacing
characters with "weak" LTR directionality (like punctuation, or ideographic
characters), and display a narrow white box.




Re: Definitions

2003-11-14 Thread Philippe Verdy
From: <[EMAIL PROTECTED]>
> Please see
> http://www.microsoft.com/typography/otspec/recom.htm
> ... the section about "Shape of .notdef glyph"

Thanks for pointing out a Microsoft recommendation for the undefined glyph
(glyph id=0) that every TT font should implement (so this would affect also
OT fonts).

But there are several fonts in Windows and Office that still display a
normal question mark for this glyph ID, instead of a narrow white box as
expected (this may be a caveat within the system compatibility font mappings
with system fonts which are not TrueType but simple .FON bitmap fonts)...




Re: Definitions

2003-11-13 Thread jameskass
.
Jim Allan wrote,

> Probably the best solution would be to display a special glyph with the 
> meaning "character not supported".

TUS seems to suggest (4.0 on page 110) that various control pictures 
can be used in these special circumstances.  It might even be helpful 
for an application to use a special "character forbidden" glyph
if appropriate.

Earlier in this thread, I'd said I was responding to Jim Allan's
original post; of course, it was Jon Hanna's.  It's just one of
those days.

Kent Karlsson wrote,

> And indeed IDN (Internationalised domain names) does so.
> Basically, IDNs aren't private, or, if you will, the established
> agreement for IDNs is not to interpret PUA characters at all,
> except for prohibiting them, as are surrogate code points
> (when not properly paired in UTF-16), non-characters, and
> code points that weren't assigned in Unicode 3.2 (the latter
> will change with a new version of IDN).

It seems unlikely that the existence of a ".com" would
unravel the fabric of the universe...

(In ASCII cipher, that would be "Qapla'.com", trying to send
CSUR PUA UTF-8 example, having problems with e-mailers.  Still.)

Best regards,

James Kass
.



Re: Definitions

2003-11-13 Thread Jim Allan
[EMAIL PROTECTED] wrote:

> Unicode probably shouldn't impose any such requirement, the missing
> glyph is not part of Unicode and is not mapped to any character.
> The purpose and semantics of the missing glyph are:  'this is the
> glyph that will be displayed by every application when the font
> in use lacks a glyph assigned to the code point being called.'
> Any other use of the missing glyph would be illegitimate and it
> would also be highly misleading.

I quite agree.

Displaying either the specific missing glyph indicator in a particular 
font (most often an open rectangle) or displaying the glyph associated 
with U+FFFD would be misleading.

But in fact applications aren't consistent in their use of these or in 
the use of "?" as yet a third way of indicating a glyph that the 
application can't reproduce.

Probably the best solution would be to display a special glyph with the 
meaning "character not supported".

Jim Allan


Please see
http://www.microsoft.com/typography/otspec/recom.htm
... the section about "Shape of .notdef glyph"
Best regards,

James Kass
.
 






Re: Definitions

2003-11-13 Thread Jim Allan
[EMAIL PROTECTED] wrote:

> Now as I review this thread (and find one of my very own typos),
> I wonder if Jim Allan and I are "on the same page" when we
> speak of "missing glyph"?  It means something very specific
> in the font jargon.

I understand "missing glyph".

But these days different applications behave differently when there is a 
missing glyph, sometimes doing something other than showing the missing 
glyph symbol within the font.

Jim Allan






Re: Definitions

2003-11-13 Thread jameskass
.
> (Please note that in my original post, I was only expressing an
> opinion of the way things "should be", rather than stating that
> "this is way it is".)

Sigh, '...the way it is...'.

To clarify (or make another try at it), Jim Allan's original post
made it clear that he was expressing his interpretation of the
requirements.  I figured his take was probably "spot on", so I
wasn't objecting to his interpretation, but rather taking issue
with any requirements which would lead to this interpretation.

Now as I review this thread (and find one of my very own typos),
I wonder if Jim Allan and I are "on the same page" when we
speak of "missing glyph"?  It means something very specific
in the font jargon.

Best regards,

James Kass
.



Re: Definitions

2003-11-13 Thread jameskass
.
Jim Allan wrote,

> I take this to mean that any application can refuse to interpret PUA 
> code points and still be conformant.

(Please note that in my original post, I was only expressing an
opinion of the way things "should be", rather than stating that
"this is way it is".)

Quoting from TUS 4.0, page 110, section 5.3 Unknown and missing
characters:

"There are two classes of code points that even a "complete"
implementation of the Unicode Standard cannot necessarily 
interpret correctly: ..."

(One of the two classes is "PUA code points for which no private 
agreement exists". -- Since no application is truly sentient or
omniscient, no application can determine that "no private
agreement exists.")

"An implementation should not attempt to interpret such
code points.  However, in practice, applications must deal 
with unassigned code points or private use characters. ..."

"... An implementation should not blindly delete such characters, 
nor should it unintentionally transform them into something else."

In my book, the "unintentionally" should probably be dropped
from the above sentence.

> I do not find any rules as to what an application ought to do with code 
> points that it does not interpret. Unless I'm missing something, 
> substitution of a missing glyph indication would be conformant.
> 
> I think it would be better if such an application indicated this in some 
> other way than by the same missing glyph that it would use to indicate a 
> character was not found in the current font, but I don't see that 
> Unicode imposes any such requirement.

Unicode probably shouldn't impose any such requirement, the missing
glyph is not part of Unicode and is not mapped to any character.

The purpose and semantics of the missing glyph are:  'this is the
glyph that will be displayed by every application when the font
in use lacks a glyph assigned to the code point being called.'

Any other use of the missing glyph would be illegitimate and it
would also be highly misleading.

Please see
http://www.microsoft.com/typography/otspec/recom.htm
... the section about "Shape of .notdef glyph"

Best regards,

James Kass
.



Re: Definitions

2003-11-13 Thread Peter Kirk
On 13/11/2003 09:39, [EMAIL PROTECTED] wrote:

>> The source and the sink are higher level entities with their
>> own higher level protocols.
>
> Yes, and they would be examples of the first and second case I gave in my first
> mail on this thread.
>
>> The channel between source and sink, which
>> is the Unicode level and below, should be transparent to PUA characters,
>> indeed to all characters apart from defined transformations. That surely
>> is the point of the PUA. If the channel starts messing around with the
>> characters sent through it, that is what is non-conformant.
>
> Such applications would match the third case.
>
> The fourth case can be used to build any of the others.

But I am not offering alternatives. I am offering a single architecture. 
And you seem to be confusing applications with system and communication 
support for Unicode.

To cover your original cases, we need another layer. Look at it like 
this, in a monospace font:

 ---------     ---------
|  User   |   |  User   |
 ---------     ---------
|   App   |   |   App   |
 ---------     ---------
| Unicode |   | Unicode |
 -----------------------
| Communication channel |
 -----------------------
In this model, Unicode ... Unicode offers as defined a transparent 
channel for all characters including PUA (although normalisation etc is 
permitted), and if an implementation is not transparent it is 
non-conformant. The communicating applications built on top of Unicode 
are free to do what they want with PUA characters, including refusing to 
handle them at all; indeed they can refuse to handle any other character 
as there is no obligation to support any characters. But if they are to 
be useful applications for many users, they would be well advised to 
offer support for as many characters as possible.
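That layering can be sketched in a few lines of Python (the function names are illustrative): the Unicode layer may normalize but stays transparent to PUA characters, while an application layer is free to refuse them.

```python
import unicodedata

def unicode_channel(text: str) -> str:
    # The "Unicode" layer: normalization is permitted, but every
    # character, PUA included, passes through untouched otherwise.
    return unicodedata.normalize("NFC", text)

def picky_app(text: str) -> str:
    # An application layer may refuse PUA characters entirely;
    # that refusal lives above the Unicode layer.
    if any(unicodedata.category(ch) == "Co" for ch in text):
        raise ValueError("this application does not accept PUA characters")
    return text

msg = "hello \uE000"
print(unicode_channel(msg) == msg)   # True: the channel is transparent
```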

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: Definitions

2003-11-13 Thread Jim Allan
James Kass posted:

> Any application which substitutes missing glyphs for PUA characters,
> when a valid font which covers those code points is active,
> should be considered non conformant.


From Unicode 4.0, Chapter 3, Conformance, rule C8:

<<
_C8  A process shall not assume that it is required to interpret any 
particular coded character representation._

o Processes that interpret only a subset of Unicode characters are 
allowed; there is no blanket requirement to interpret _all_ Unicode 
characters.

o Any means for specifying a subset of characters that a process can 
interpret is outside the scope of this standard.
>>

I take this to mean that any application can refuse to interpret PUA 
code points and still be conformant.

I do not find any rules as to what an application ought to do with code 
points that it does not interpret. Unless I'm missing something, 
substitution of a missing glyph indication would be conformant.

I think it would be better if such an application indicated this in some 
other way than by the same missing glyph that it would use to indicate a 
character was not found in the current font, but I don't see that 
Unicode imposes any such requirement.

Jim Allan





Re: Definitions

2003-11-13 Thread jon
> The source and the sink are higher level entities with their 
> own higher level protocols.

Yes, and they would be examples of the first and second case I gave in my first 
mail on this thread.

> The channel between source and sink, which 
> is the Unicode level and below, should be transparent to PUA characters, 
> indeed to all characters apart from defined transformations. That surely 
> is the point of the PUA. If the channel starts messing around with the 
> characters sent through it, that is what is non-conformant.

Such applications would match the third case.

The fourth case can be used to build any of the others.



Re: Definitions

2003-11-13 Thread Peter Kirk
On 13/11/2003 07:51, [EMAIL PROTECTED] wrote:

> ... and the only
> conformant applications are those which pass PUA characters through untouched,
> though they would generally do so with a source and/or sink that assigns
> meaning and hence the system as a whole is still non-conformant.

Not if the source and the sink are consenting adults, or children, or 
processes, which can assign meaning to PUA characters by private 
agreement. The source and the sink are higher level entities with their 
own higher level protocols. The channel between source and sink, which 
is the Unicode level and below, should be transparent to PUA characters, 
indeed to all characters apart from defined transformations. That surely 
is the point of the PUA. If the channel starts messing around with the 
characters sent through it, that is what is non-conformant.

If the higher level protocol chooses not to use PUA characters, it is of 
course entitled not to, and in that case to treat as a protocol error 
any PUA characters it receives from a lower layer.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




RE: Definitions

2003-11-13 Thread jon
> /Adults/ can say no (as indeed can non-adults), but /consenting/ adults 
> are, by definition, adults who say yes. If they say no, they are not 
> consenting. Consenting, by definition, means saying yes.

Consenting means saying yes when you can say no. Saying yes when a no won't be 
listened to is just non-resistance.

> James's statement ("any application which restricts PUA use is 
> effectively precluding consenting adults from reaching and implementing 
> their private agreements") is correct. If you choose to redefine the 
> word "consenting" to mean "one who consents to using an application 
> which restricts the PUA" then I would argue that's just a silly 
> redefinition.

No, it's deciding what to do with the PUA. By this logic any application which 
does apply semantics to characters in the PUA is equally non-conformant because 
it is restricting the use of the PUA to the defined behaviour - and the only 
conformant applications are those which pass PUA characters through untouched, 
though they would generally do so with a source and/or sink that assigns 
meaning and hence the system as a whole is still non-conformant.



RE: Definitions

2003-11-13 Thread Kent Karlsson

> I see no reason why a protocol cannot introduce a 
> higher-level rule which prohibits the use of PUA characters. 

And indeed IDN (Internationalised domain names) does so.
Basically, IDNs aren't private, or, if you will, the established
agreement for IDNs is not to interpret PUA characters at all,
except for prohibiting them, as are surrogate code points
(when not properly paired in UTF-16), non-characters, and
code points that weren't assigned in Unicode 3.2 (the latter
will change with a new version of IDN).

/kent k

http://www.ietf.org/rfc/rfc3454.txt
http://www.ietf.org/rfc/rfc3491.txt
http://www.ietf.org/rfc/rfc3490.txt
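Python's standard-library stringprep module implements the RFC 3454 tables, so the prohibition is easy to check; table C.3 is the private use table that nameprep (RFC 3491) forbids:

```python
import stringprep

# RFC 3454 (stringprep) table C.3 lists the private use ranges;
# nameprep, the stringprep profile used by IDN, prohibits them.
print(stringprep.in_table_c3("\uE000"))   # True  (BMP PUA code point)
print(stringprep.in_table_c3("a"))        # False (ordinary letter)
```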




RE: Definitions

2003-11-13 Thread Jill Ramonsky





Adults can say no (as indeed can non-adults), but consenting
adults are, by definition, adults who say yes. If they say no, they are
not consenting. Consenting, by definition, means saying yes.

James's statement ("any application which restricts PUA use is
effectively precluding consenting adults from reaching and implementing
their private agreements") is correct. If you choose to redefine the
word "consenting" to mean "one who consents to using an application
which restricts the PUA" then I would argue that's just a silly
redefinition. A bit like defining a non-brothel as a place where
consenting adults can choose not to pay each other for sex.
It's not what most of us mean by "consenting". I argue that James is
correct, by any reasonable definition.

Jill


> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, November 13, 2003 12:38 PM
> To: Unicode List
> Subject: Re: Definitions
 
> Consenting adults can say no.





Re: Definitions

2003-11-13 Thread jon
> Any application which bans or prevents the interchange or storage
> of PUA code points should be considered non conformant.
> 
> Any application which substitutes missing glyphs for PUA characters,
> when a valid font which covers those code points is active, 
> should be considered non conformant.

I see no reason why a protocol cannot introduce a higher-level rule which 
prohibits the use of PUA characters. Not only do I see this as being implied by 
the fact that a protocol can apply whatever meaning is appropriate to PUA 
characters (therefore why not "non-character"), but I can see a point in doing 
so in a protocol which by its nature is not suitable for customisation.

This isn't to say that I think its necessarily a good idea to do this when 
designing a protocol - I would want to see a clear problem clearly solved by 
doing so.

I'll also add that if a protocol was intended to allow free exchange of textual 
content then it should not prohibit use of the PUA. However if it is intended 
that any user of the protocol would be able to infer the same meaning from a 
message as any other user then it might be worth considering.

> Since the PUA is for consenting adults, any application which restricts
> PUA use is effectively precluding consenting adults from reaching
> and implementing their private agreements.

Consenting adults can say no.



Re: Definitions

2003-11-13 Thread jameskass
.
Jon Hanna wrote,

> As I see it the following behaviours would all be conformant:
> 

Jon offered opinions about PUA and conformance and was gracious
enough to indicate that it was opinion.

Here's my take, FWIW:

Any application which bans or prevents the interchange or storage
of PUA code points should be considered non conformant.

Any application which substitutes missing glyphs for PUA characters,
when a valid font which covers those code points is active, 
should be considered non conformant.

Like it or not, the PUA is part of Unicode.  It's a free zone, and the
*only* restriction on its use is that TUC will never assign characters
in the PUA ranges.
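Those ranges can be checked directly; a small sketch, with the range constants taken from the standard (U+E000..U+F8FF plus planes 15 and 16):

```python
import unicodedata

# The three Private Use ranges reserved by the standard; these code
# points will never be assigned characters by Unicode.
PUA_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

def is_pua(cp: int) -> bool:
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)

print(is_pua(0xE000), is_pua(0x41))            # True False
print(unicodedata.category(chr(0xF0000)))      # Co
```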

Since the PUA is for consenting adults, any application which restricts
PUA use is effectively precluding consenting adults from reaching
and implementing their private agreements.

We should never allow those who disdain the PUA to determine its
boundaries.

Best regards,

James Kass
.



Re: Definitions

2003-11-13 Thread jon
Quoting Chris Jacobs <[EMAIL PROTECTED]>:

> "The interpretation of private use characters (Co) as graphic characters or
> not is determined by private agreement."
> "The interpretation of private use characters (Co) as base characters or not
> is determined by private agreement."
> "The interpretation of Private Use characters (Co) as combining characters or
> not is determined by private agreement. "
> 
> Is this just another way of saying that this is left undefined, or does it
> imply that a conformant application should be able to detect if private
> agreements exist?

In my reading a bit of both. These properties for these characters are 
undefined by Unicode.

As I see it the following behaviours would all be conformant:

1. Your application uses these characters in accordance to a private agreement 
between other users of the protocol the application was built to support.

2. Your application works with a protocol which has a rule against the use of 
private use characters (such a rule would be at a higher level than Unicode) 
and it throws an error in such cases (this is really a variant on the first 
possibility - essentially a private agreement that these characters are 
non-characters and should not occur).

3. Your application has no "knowledge" of any private agreement. It treats 
private use characters as graphic non-combining characters, rendering them with 
an indicator of a unrenderable character ("box" shapes and question marks are 
common glyphs for use here, the Last Resort font offers a glyph which indicates 
that it is a private use character rather than any other unknown character) 
and/or passing them to the next processing step unchanged.

4. Your application behaves as in the item above, but offers a mechanism to 
override this behaviour (particularly useful if it were a library rather than 
an application per se).

It is not conformant to "fix" these characters by replacing them with other 
characters, though obviously an application like item 1 or 2 can do whatever 
operation it is meant to do with them.
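Case 4 might look like this in a library (a hedged sketch; the handler name is invented): the default leaves PUA characters untouched, and a caller can supply its own private agreement.

```python
import unicodedata

# Sketch of item 4: a library routine whose PUA handling can be
# overridden by the caller.  By default PUA characters pass through.
def render(text, pua_handler=lambda ch: ch):
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Co":
            out.append(pua_handler(ch))   # caller-overridable behaviour
        else:
            out.append(ch)
    return "".join(out)

print(render("a\uE000b") == "a\uE000b")   # True: default leaves PUA alone
# A caller's "private agreement" might map PUA to a replacement:
print(render("a\uE000b", pua_handler=lambda ch: "?"))   # a?b
```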

--
Jon Hanna

*Thought provoking quote goes here*