Re: Canonical equivalence in rendering: mandatory or recommended?

2003-10-15 Thread Asmus Freytag
At 01:44 PM 10/15/03 -0700, Peter Kirk wrote:
The guidelines are concerned with the average case: displaying the 
characters as *text*.

[The use of the word 'must' in a guideline is always awkward, since that 
word has such a strong meaning in the normative part of the standard.]
So, are you saying that for normal display of characters as text these 
guidelines must be followed?


No guidelines 'must' ever be followed. But if you want to follow the letter 
of the guidelines, you 'must' do this. ;-)

A better phrasing would have been: "These guidelines strongly recommend 
that you always do this, unless you have very good reasons not to, but in 
that case we hope that you are thinking of something else than ordinary 
display of characters as text, as you will sorely violate the expectations 
of your users and endanger the interoperability of Unicode encoded text 
from different sources."

Would that have been better?

A./

PS: perhaps we should have a little icon for that.





Re: Java char and Unicode 3.0+ (was:Canonical equivalence in rendering: mandatory or recommended?)

2003-10-15 Thread John Cowan
Philippe Verdy scripsit:

> [...] char, whose values are 16-bit unsigned integers
> representing Unicode characters (section 2.1).

Despite your ingenious special pleading, I don't see how this can mean
anything except that chars must be 16-bit unsigned integers.

> The Java language still lacks a way to specify a literal for a character out
> of the BMP. Of course one can use the syntax '\uD800\uDC00' but this would
> not compile with the current _compilers_, which expect only one char in the
> literal. In a String literal "\uD800\uDC00" becomes the 4-byte UTF-8
> sequence for _one_ Unicode codepoint in the compiled class.

Character literals are crocky anyhow.  IMHO modern programming languages
should not have a Character type, but deal only in Strings.
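For concreteness, here is roughly what the quoted limitation looks like in
practice. A minimal sketch; the codePointAt call assumes a JDK release newer
than those current at the time of this thread, where it was added:

public class SupplementaryLiteral {
    public static void main(String[] args) {
        // '\uD800\uDC00' is not a legal char literal, but the same surrogate
        // pair is fine inside a String literal and encodes U+10000:
        String s = "\uD800\uDC00";
        System.out.println(s.length());       // 2 -- length counts UTF-16 code units
        System.out.println(s.codePointAt(0)); // 65536 (0x10000), the single code point
    }
}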

> 2. The initial spec of UTF-32 and UTF-8 by ISO allowed many more planes with
> 31-bit codepoints, and maybe there will be an agreement sometime in the
> future between ISO and Unicode to define new codepoints outside the current
> standard first 17 planes that can be safely converted with UTF-16, 

I doubt it very much.  17 planes is way more than sufficient.

-- 
John Cowan  [EMAIL PROTECTED]  www.reutershealth.com  www.ccil.org/~cowan
Assent may be registered by a signature, a handshake, or a click of a computer
mouse transmitted across the invisible ether of the Internet. Formality
is not a requisite; any sign, symbol or action, or even willful inaction,
as long as it is unequivocally referable to the promise, may create a contract.
   --_Specht v. Netscape_



Re: Nothing at all to do with: Canonical equivalence in rendering: mandatory or recommended?

2003-10-15 Thread John Cowan
Peter Kirk scripsit:

> You told me then that normalisation is not mandatory and so 
> effectively only recommended.  But if a reader is recommended to reject 
> non-normalised input, the effect is that normalisation is mandated 
> except for private communication between a group of cooperating 
> processes.

The intention is that it's cheaper overall for the creator of the document
to normalize it once than for every receiver of the document to normalize
it, potentially many times over.

> So, while for example I may put a non-normalised text on my 
> website, it would be rather pointless because any browser following 
> recommendations would reject my page. Is that correct? 

Yes, but I think it *very* unlikely that any general-use browser would
ever enforce that recommendation.  In general, browsers are written to
accept as much as possible, at least in the HTML environment.
Even in a purely XML 1.1 world (which is unlikely to arrive for a number
of years!), I think that browsers would be built to perhaps warn the
user about non-normalized content, but by no means to reject it out of
hand.

The importance of normalization arises in machine-to-machine communication,
where the danger of being spoofed by non-normalized content that passes
unsubtly written filters is great.  XML does not consider documents
equivalent merely because they are canonically equivalent; an element or
attribute name must be identical at the codepoint level to be correctly
recognized.

> Am I in fact 
> forced to work on the basis that normalisation is mandatory?

No.

-- 
Not to perambulate John Cowan <[EMAIL PROTECTED]>
the corridors  http://www.reutershealth.com
during the hours of repose http://www.ccil.org/~cowan
in the boots of ascension.   --Sign in Austrian ski-resort hotel  



Re: Canonical equivalence in rendering: mandatory or recommended?

2003-10-15 Thread Peter Kirk
On 15/10/2003 10:48, Asmus Freytag wrote:

I'm going to answer some of Peter's points, leaving aside the 
interesting digressions into Java subclassing etc. that have developed 
later in the discussion.
Thank you, Asmus. If people want to discuss normalisation and string 
handling in Java, they are welcome to do so, but they should use a 
different subject heading and not my (copyrighted :-) ) text.

At 04:19 AM 10/15/03 -0700, Peter Kirk wrote:

I note the following text from section 5.13, p.127, of the Unicode 
standard v.4:

Canonical equivalence must be taken into account in rendering 
multiple accents, so that any two canonically equivalent sequences 
display as the same.

This statement goes to the core of Unicode. If it is followed, it 
guarantees that normalizing a string does not change its appearance 
(and therefore it remains the 'same' string as far as the user is 
concerned.)

...

The guidelines are concerned with the average case: displaying the 
characters as *text*.

[The use of the word 'must' in a guideline is always awkward, since 
that word has such a strong meaning in the normative part of the 
standard.]
So, are you saying that for normal display of characters as text these 
guidelines must be followed?


Rendering systems should handle any of the canonically equivalent 
orders of combining
marks. This is not a performance issue: The amount of time necessary 
to reorder combining
marks is insignificant compared to the time necessary to carry out 
other work required
for rendering.

The interesting digressions on string libraries aside, the statement 
made here is in the context of the tasks needed for rendering. If you 
take a rendering library and add a normalization pass on the front of 
it, you'll be hard-pressed to notice a difference in performance, 
especially for any complex scripts.

So we conclude: "rendering any string as if it was normalized" is 
*not* a performance issue.
Thank you. This is the clarification I was looking for, and confirms my 
own suspicions. But are there any other views on this? I have heard them 
from implementers of rendering systems, but I have wondered whether this 
is because of their reluctance to do the extra work required to conform 
to this requirement.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: Java char and Unicode 3.0+ (was:Canonical equivalence in rendering: mandatory or recommended?)

2003-10-15 Thread Philippe Verdy
From: "Nelson H. F. Beebe" <[EMAIL PROTECTED]>
To: "Philippe Verdy" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; "Jill Ramonsky" <[EMAIL PROTECTED]>
Sent: Wednesday, October 15, 2003 5:34 PM
Subject: Re: Canonical equivalence in rendering: mandatory or recommended?

> [This is off the unicode list.]
>
> Philippe Verdy wrote on the unicode list on Wed, 15 Oct 2003 15:44:44
> +0200:
>
> >> ...
> >> Looking at the Java VM machine specification, there does not
> >> seem to be something implying that a Java "char" is necessarily a
> >> 16-bit entity. So I think that there will be sometime a conforming
> >> Java VM that will return UTF-32 codepoints in a single char, or
> >> some derived representation using 24-bit storage units.
> >> ...
>
> I disagree: look at p. 62 of
>
> @String{pub-AW  = "Ad{\-d}i{\-s}on-Wes{\-l}ey"}
> @String{pub-AW:adr  = "Reading, MA, USA"}
>
> @Book{Lindholm:1999:JVM,
>   author =   "Tim Lindholm and Frank Yellin",
>   title ="The {Java} Virtual Machine Specification",
>   publisher =pub-AW,
>   address =  pub-AW:adr,
>   edition =  "Second",
>   pages ="xv + 473",
>   year = "1999",
>   ISBN = "0-201-43294-3",
>   LCCN = "QA76.73.J38L56 1999",
>   bibdate =  "Tue May 11 07:30:11 1999",
>   price ="US\$42.95",
>   acknowledgement = ack-nhfb,
> }
>
> where it states:
>
> >> * char, whose values are 16-bit unsigned integers representing Unicode
> >>   characters (section 2.1)

Personally, I read this reference from the second edition of the Java VM spec:
http://java.sun.com/docs/books/vmspec/2nd-edition/html/VMSpecTOC.doc.html

Notably:
http://java.sun.com/docs/books/vmspec/2nd-edition/html/Concepts.doc.htm
which states clearly that "char" is not an integer type, but a numeric type
without a constraint on the number of bits it contains (yes, there have
existed JVM implementations with 9-bit chars!).

It states:
[quote]
2.4.1 Primitive Types and Values

A primitive type is a type that is predefined by the Java programming
language and named by a reserved keyword. Primitive values do not share
state with other primitive values. A variable whose type is a primitive type
always holds a primitive value of that type.(*2)
[...]
The integral types are byte, short, int, and long, whose values are
8-bit, 16-bit, 32-bit, and 64-bit signed two's-complement integers,
respectively, and char, whose values are 16-bit unsigned integers
representing Unicode characters (section 2.1).
*2: Note that a local variable is not initialized on its creation and is
considered to hold a value only once it is assigned (section 2.5.1).
[/quote]

Then it defines the important rules for char conversions or promotions,
for arithmetic operations, assignments or method invocation:

[quote]
2.6.2 Widening Primitive Conversions [...]
* char to int, long, float, or double [...]
Widening conversions do not lose information about the sign or order of
magnitude of a numeric value. Conversions widening from an integral type to
another integral type do not lose any information at all; the numeric value
is preserved exactly. [...]
According to this rule, a widening conversion of a signed integer value to
an integral type simply sign-extends the two's-complement representation of
the integer value to fill the wider format. A widening conversion of a value
of type char to an integral type zero-extends the representation of the
character value to fill the wider format.
Despite the fact that loss of precision may occur, widening conversions
among primitive types never result in a runtime exception (section 2.16).
[/quote]

So a 'char' MUST have AT MOST the same number of bits as an 'int', i.e. it
cannot have more than 32 bits. If char were defined to have 32 bits, no
zero-extension would occur, but the above rule would still be valid.

and:

[quote]
2.6.3 Narrowing Primitive Conversions [...]
* char to byte or short [...]
Narrowing conversions may lose information about the sign or order of
magnitude, or both, of a numeric value (for example, narrowing an int value
32763 to type byte produces the value -5). Narrowing conversions may also
lose precision.[...]
Despite the fact that overflow, underflow, or loss of precision may occur,
narrowing conversions among primitive types never result in a runtime
exception.
[/quote]

Yes, this paragraph says that char to short may lose information. This
currently does not occur with unsigned 16-bit chars, but this could happen
safely with 32-bit chars without violating the rule.
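The quoted rules are easy to check with today's 16-bit char; a small
self-contained sketch (nothing here depends on the hypothetical 32-bit char):

public class CharConversions {
    public static void main(String[] args) {
        char c = '\uFFFF';
        int widened = c;            // widening conversion: zero-extends, gives 65535
        short narrowed = (short) c; // narrowing conversion: same bits reinterpreted, gives -1
        System.out.println(widened + " " + narrowed);
    }
}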

However, as the sign must be kept when converting char to int, this means
that the new 32-bit char would have to be signed. Yes, this means that there
would now exist negative values for chars, but none of them are currently
used with 16-bit chars, which are in the range [\u0000..\uFFFF] and promoted
as if they were integers in the range [0..65535]. That's why a narrowing
conversion occurs when converting to short (the sign may change).

Nothing at all to do with: Canonical equivalence in rendering: mandatory or recommended?

2003-10-15 Thread Peter Kirk
On 15/10/2003 10:00, John Cowan wrote:

...

The W3C recommends, however, that non-normalized input be rejected rather
than forcibly normalized, on the ground that the supplier of the input
is not meeting his contract.
 

This is nothing at all to do with my canonical equivalence question, but 
does touch on my other question today about normalisation in XML and 
HTML. You told me then that normalisation is not mandatory and so 
effectively only recommended.  But if a reader is recommended to reject 
non-normalised input, the effect is that normalisation is mandated 
except for private communication between a group of cooperating 
processes. So, while for example I may put a non-normalised text on my 
website, it would be rather pointless because any browser following 
recommendations would reject my page. Is that correct? Am I in fact 
forced to work on the basis that normalisation is mandatory?

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: Canonical equivalence in rendering: mandatory or recommended?

2003-10-15 Thread Philippe Verdy
From: "Nelson H. F. Beebe" <[EMAIL PROTECTED]>
> >> * char, whose values are 16-bit unsigned integers representing Unicode
> >>   characters (section 2.1)

Maybe I have missed this line, but the Java bytecode instruction set does
make a native char not equivalent to a short, as there is an explicit
conversion instruction between chars and other integer types.

Those Java programs that assume that a char is 16-bit wide would have a
big problem because of the way String constants are serialized within
class files.

Of course it will be hard to include new primitive types in the JVM (because
it would require adding new instructions). And for the same reason, the Java
language "char" type is bound to the JVM native char type, and it cannot
be changed, meaning that String objects are bound to contain only "char"
elements (but this is done only through the String and StringBuffer methods,
as the backing array is normally not accessible).

I don't see where the issue is if char is changed from 16-bit to 32-bit
wide, even with the existing bytecode instruction set and the String
API (for compatibility, the existing methods should use indices as if the
backing array were storing 16-bit entities, even if in fact it would now
store 32-bit chars).

But an augmented set of methods could use directly the new indices in
the native 32-bit char. Additionally, there may be an option bit set in
compiled classes to specify the behavior of the String API: with this bit
set, the class loader would bind the String API methods to the new 32-bit
version of the new core library, and without this bit set (legacy compiled
classes), they would use the compatibility 16-bit API.

The javac compiler already sets version information for the target JVM, and
thus can be used to compile a class to use the new 32-bit API instead of the
legacy one: in this mode, for example, the String.length() method in the
source would be compiled to call the String.length32() method of the new
JVM, or to remap it to String32.length(), with a replacement class name
(I use String32 here, but in fact it could be the UString class of ICU).

I am not convinced that Java must be bound to a fixed size for its "char"
type, as it already includes "byte", "short", "int", "long" for integer
types
with known sizes (respectively 8, 16, 32, 64 bits), and the JVM bytecode
instruction set clearly separates the native char type from the native
integer type, and does not allow arithmetic operations between chars and
integer types without an explicit conversion.

Note also that JNI is not altered with this change: when a JNI program uses
the .getStringUTF() method, it expects a UTF-8 string. When it uses the
.getString(), it expects a UTF-16 string (not necessarily the native char
encoding seen in Java), and an augmented JNI interface could be defined
to use .getStringUTF32() if one wants maximum performance with no
conversion with the internal backing store used in the JVM.


For me the "caload" instruction, for example just expects to return the
"char" (whatever its size) at a fixed and know index. There's no way to
break this "char" into its bit components without using an explicit
conversion
to a native integer type: the char is unbreakable, and the same is true for
String constants (and that's why they can be reencoded internally into
UTF-8 in compiled classes and why String constants can't be used to store
reliably any array of 16-bit integers, because of surrogates, as a String
constant cannot include invalid surrogate pairs)

Apart from String constants, all other strings in Java can be built from
serialized data only through an encoding converter with arrays of native
integers. It's up to the converter (not to the JVM itself) to parse the
native integers in the array to build the internal String. The same is true
for StringBuffers, which could as well use an internal backing store with
32-bit chars.

So the real question when doing this change from 16-bit to 32-bit is whether
and how it will affect the performance of existing applications: for Strings,
this may be an issue if conversions are needed to get a UTF-16 view of a
String internally using a UTF-32 backing store. But in fact, an internal
(private) attribute "form" could store this indicator, so that construction
of Strings will not suffer from performance issues. In fact, if the JVM
could internally manage several alternate encoding forms for Strings to
reduce the footprint of a large application and just provide a "char" view
through an emulation API to applications, this could benefit performance
(notably in server/networking applications using lots of strings, such as
SQL engines like Oracle that include a Java VM).

What would I see if I were a programmer in such an environment? The main
change would be in the character properties, where new Unicode block
identifiers would become accessible outside the BMP, and no "char" would
be identified as a low or high surrogate.

It would be quite simple to know if the 32-b

Re: Canonical equivalence in rendering: mandatory or recommended?

2003-10-15 Thread Asmus Freytag
I'm going to answer some of Peter's points, leaving aside the interesting 
digressions into Java subclassing etc. that have developed later in the 
discussion.

At 04:19 AM 10/15/03 -0700, Peter Kirk wrote:
I note the following text from section 5.13, p.127, of the Unicode 
standard v.4:

Canonical equivalence must be taken into account in rendering multiple 
accents, so that any two canonically equivalent sequences display as the same.
This statement goes to the core of Unicode. If it is followed, it 
guarantees that normalizing a string does not change its appearance (and 
therefore it remains the 'same' string as far as the user is concerned.)

The word "must" is used here. But this is part of the "Implementation 
Guidelines" chapter which is generally not normative. Should this sentence 
with "must" be considered mandatory, or just a recommendation although in 
certain cases a "particularly important" one?
If you read the conformance requirements you deduce that any normalized or 
unnormalized form of a string  must represent the same 'content' on 
interchange. However, the designers of the standard wanted to make even 
specialized uses, such as 'reveal character codes' explicitly conformant. 
Therefore you are free to show to a user whether a string is precomposed or 
composed of combining characters, e.g. by using a different font color for 
each character code.

The guidelines are concerned with the average case: displaying the 
characters as *text*.

[The use of the word 'must' in a guideline is always awkward, since that 
word has such a strong meaning in the normative part of the standard.]

Rendering systems should handle any of the canonically equivalent orders 
of combining
marks. This is not a performance issue: The amount of time necessary to 
reorder combining
marks is insignificant compared to the time necessary to carry out other 
work required
for rendering.
The interesting digressions on string libraries aside, the statement made 
here is in the context of the tasks needed for rendering. If you take a 
rendering library and add a normalization pass on the front of it, you'll 
be hard-pressed to notice a difference in performance, especially for any 
complex scripts.

So we conclude: "rendering any string as if it was normalized" is *not* a 
performance issue.

However, from the other messages on this thread we conclude: normalizing 
*every* string, *every time* it gets touched, *is* a performance issue.

A few things: Unicode provides data that allow one to perform a 'Normalization 
Quick Check', which simply determines whether there is anything that might 
be affected by normalization. (For example, nothing in this e-mail message 
is affected by normalization, no matter to which form, since it's all in 
ASCII.)

With a quick check like that you should be able to reduce the cost of 
normalization dramatically, unless your data needs normalization throughout. 
Even then, if there is a chance that the data is already normalized, 
verifying that is faster than normalizing (since verification doesn't 
re-order).
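As a rough illustration of the quick check, assuming ICU4J is on the
classpath (the Normalizer2 API shown here postdates this thread; older ICU
versions expose the same idea through Normalizer.quickCheck):

import com.ibm.icu.text.Normalizer;
import com.ibm.icu.text.Normalizer2;

public class QuickCheckDemo {
    public static void main(String[] args) {
        Normalizer2 nfc = Normalizer2.getNFCInstance();
        // Pure ASCII: nothing can be affected by NFC normalization.
        System.out.println(nfc.quickCheck("plain ASCII") == Normalizer.YES);  // true
        // 'e' + COMBINING ACUTE ACCENT: the quick check alone cannot decide,
        // because U+0301 may compose with what precedes it.
        System.out.println(nfc.quickCheck("e\u0301") == Normalizer.MAYBE);    // true
        // Verification settles it without re-ordering or copying anything:
        System.out.println(nfc.isNormalized("e\u0301"));                      // false
    }
}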

Then, after that, as others have pointed out, if you can keep track of a 
normalized state, either by recordkeeping or by having interfaces inside 
which the data is guaranteed to be normalized, then you cut your costs further.

A./





Re: Canonical equivalence in rendering: mandatory or recommended?

2003-10-15 Thread Markus Scherer
Philippe Verdy wrote:
... In fact, to further optimize and reduce the
memory footprint of Java strings, I chose to store
the String in an array of bytes with UTF-8, instead of an
array of chars with UTF-16. The internal representation is
This does or does not save space and time depending on the average string contents and on what kind 
of processing you do.

chosen dynamically, depending on usage of that string: if
the string is not accessed often with char indices (which in
Java does not return actual Unicode codepoint indices as
there may be surrogates) the UTF-8 representation uses less
memory in most cases.
It is possible, with a custom class loader, to override the default
String class used in the Java core libraries (note that compiled
Java .class files use UTF-8 for internally stored String constants,
No. It's close to UTF-8, but .class files use a proprietary encoding instead of UTF-8. See the 
.class file documentation from Sun.

as this allows independence from the architecture, and this is the
class loader that transforms the bytes storage of String constants
into actual chars storage, i.e. currently UTF-16 at runtime.)
Looking at the Java VM machine specification, there does not
seem to be something implying that a Java "char" is necessarily a
16-bit entity. So I think that there will be sometime a conforming
Java VM that will return UTF-32 codepoints in a single char, or
some derived representation using 24-bit storage units.
I don't know about the VM spec, but the language and its APIs have 16-bit chars wired deeply into 
them. It would be possible to _add_ a new char32 type, but that is not planned, as far as I know. 
_Changing_ char would break all sorts of code. However, as far as I have heard, a future Java 
release may provide access to Unicode code points and use ints for them.

(And please do not confuse using a single integer for a code point with UTF-32 - UTF-32 is an 
encoding form for _strings_ requiring a certain bit pattern. Integers containing code points are 
just that, integers containing code points, not any UTF.)

So there already are some changes of representation for Strings in
Java, and similar techniques could be used as well in C#, ECMAScript,
and so on...
I am quite confident that existing languages like these will keep using 16-bit Unicode strings, for 
the same reasons as for Java: Changing the string units would break all kinds of code.

Besides, most software with good Unicode support and non-trivial string handling uses 16-bit Unicode 
strings, which avoids transformations where software components meet.

... Depending on runtime
tuning parameters, the internal representation of String objects may
(should) become transparent to applications. One future goal
The internal representation is already transparent in languages like Java. The API behavior has to 
match the documentation, though, and cannot be changed on a whim.

would be that a full Unicode String API will return real characters
as grapheme clusters of varying length, in a way that can be
comparable, orderable, etc... to better match what the users
consider as a string "length" (i.e. a number of grapheme clusters,
if not simply a combining sequence if we exclude the more complex
case of Hangul Jamos and Brahmic clusters).
This is overkill for low-level string handling, and is available via library functions. Such library 
functions might be part of a language's standard libraries, but won't replace low-level access 
functions.

Best regards,
markus
--
Opinions expressed here may not reflect my company's positions unless otherwise noted.



Re: Canonical equivalence in rendering: mandatory or recommended?

2003-10-15 Thread Markus Scherer
Jill Ramonsky wrote:
I had to write an API for my employer last year to handle some aspects 
of Unicode. We normalised everything to NFD, not NFC (but that's easier, 
not harder). Nonetheless, all the string handling routines were not 
allowed to assume that the input was in NFD, but they had to guarantee 
that the output was. These routines, therefore, had to do a "convert to 
NFD" on every input, even if the input were already in NFD. This did 
have a significant performance hit, since we were handling (Unicode) 
strings throughout the app.
Note that, in addition to "is normalized" flags, it is much faster to check whether a string is 
normalized, and to normalize it only if it's not. This holds at least if there is a good chance that 
the string is already normalized - as appears to be true in your application, and is usually true where 
most other applications check for NFC on input. See UAX #15 for details. ICU has quick check and 
normalization functions.

markus




Re: Canonical equivalence in rendering: mandatory or recommended?

2003-10-15 Thread John Cowan
Jill Ramonsky scripsit:

> I had to write an API for my employer last year to handle some aspects 
> of Unicode. We normalised everything to NFD, not NFC (but that's easier, 
> not harder). Nonetheless, all the string handling routines were not 
> allowed to /assume/ that the input was in NFD, but they had to guarantee 
> that the output was. These routines, therefore, had to do a "convert to 
> NFD" on every input, even if the input were already in NFD. This did 
> have a significant performance hit, since we were handling (Unicode) 
> strings throughout the app.

Indeed it would.  However, checking for normalization is cheaper than
normalizing, and Unicode makes properties available that allow a streamlined
but incomplete check that returns "not normalized" or "maybe normalized".
So input can be handled as follows:

if maybeNormalized(input)
then if normalized(input)
     then doTheWork(input)
     else doTheWork(normalize(input))
     fi
else doTheWork(normalize(input))
fi
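In Java, using ICU4J's normalization support (an assumption of this sketch,
not part of the recommendation itself), the same control flow might look
like this; isNormalized already performs the quick check internally, so the
explicit quickCheck call is kept only to mirror the pseudocode:

import com.ibm.icu.text.Normalizer;
import com.ibm.icu.text.Normalizer2;

public class HandleInput {
    static void handle(String input) {
        Normalizer2 nfc = Normalizer2.getNFCInstance();
        // quickCheck is the streamlined "no / maybe" test; isNormalized is the full check.
        if (nfc.quickCheck(input) != Normalizer.NO && nfc.isNormalized(input)) {
            doTheWork(input);                 // already normalized, use as-is
        } else {
            doTheWork(nfc.normalize(input));  // normalize first
        }
    }

    static void doTheWork(String s) { /* application-specific processing */ }
}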

The W3C recommends, however, that non-normalized input be rejected rather
than forcibly normalized, on the ground that the supplier of the input
is not meeting his contract.

> I think that next time I write a similar API, I will deal with 
> (string+bool) pairs, instead of plain strings, with the bool  meaning 
> "already normalised". This would definitely speed things up. Of course, 
> for any strings coming in from "outside", I'd still have to assume they 
> were not normalised, just in case.

W3C refers to this concept as "certified text".  It's a good idea.

> Jill
> 

-- 
Verbogeny is one of the pleasurettes    John Cowan <[EMAIL PROTECTED]>
of a creatific thinkerizer.             http://www.reutershealth.com
   -- Peter da Silva                    http://www.ccil.org/~cowan



Re: [OT] Adelphia mojibake problem solved

2003-10-15 Thread Doug Ewell
Philippe Verdy  wrote:

> The standard default setting is normally:
>  AddType "text/html" html
> without a charset indicator.
>
> With this setting, you are forcing _all_ HTML pages to be declared
> with UTF-8. If this is true for your site, then that's good. But if
> you need to have some pages declared differently (for example when
> showing sample pages encoded with "shift_jis"), you'll get another
> similar problem...

Not a problem for me.  I'm committed to using Unicode.  And while there
are some interesting sites out there that present the same information
in different encodings (so you can see, for example, which font your
browser chooses), mine won't be one of them.

> I don't know which webserver they use,

Apache 2.0

> but recent versions of Apache can read and interpret the content of
> HTML pages to autodetect the UTF forms or use the <meta>
> tags to set or change additional HTTP headers, according to what
> authors desired on their pages. Same thing for XML files that are sent
> according to the charset found in the leading XML declaration line.

IF the administrators don't sabotage the whole deal by including the
line "AddDefaultCharset ISO-8859-1".  Contrary to the normal meaning of
"default," this option apparently forces ALL pages to be served as ISO
8859-1, including XHTML pages like mine that specify UTF-8 in both the
XML declaration AND in the <meta> tag.  Even adding a U+FEFF
signature wasn't enough to convince Apache that the page was UTF-8
(though it did convince Internet Explorer).

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




RE: Canonical equivalence in rendering: mandatory or recommended?

2003-10-15 Thread Marco Cimarosti
Jill Ramonsky wrote:
> In my experience, there is a performance hit.
>
> I had to write an API for my employer last year to handle
> some aspects of Unicode. We normalised everything to NFD,
> not NFC (but that's easier, not harder). Nonetheless, all
> the string handling routines were not allowed to assume
> that the input was in NFD, but they had to guarantee that
> the output was. These routines, therefore, had to do a
> "convert to NFD" on every input, even if the input were
> already in NFD. This did have a significant performance
> hit, since we were handling (Unicode) strings throughout
> the app.
>
> I think that next time I write a similar API, I will deal
> with (string+bool) pairs, instead of plain strings, with
> the bool  meaning "already normalised". This would
> definitely speed things up. Of course, for any strings
> coming in from "outside", I'd still have to assume they
> were not normalised, just in case.

You could have split the NFD process in two separate steps:

1) Decomposition per se;

2) Reordering of combining classes.

You could have performed step 1 (which is presumably much heavier than 2)
only on strings coming from "outside", and step 2 on every pass.

In a further enhancement, step 2 could be called only upon operations which
could produce non-canonical order: e.g. when concatenating strings but not
when trimming them.

To gain even more speed, you could implement an ad-hoc version of step 2
which only operates on out-of order characters adjacent to a specified
location in the string (e.g., the joining point of a concatenation
operation).
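A rough sketch of step 2 in isolation, assuming ICU4J for the combining-class
lookup (a real implementation would restrict the scan to the region around
the joining point, as suggested above):

import com.ibm.icu.lang.UCharacter;

public class CanonicalReorder {
    // Put combining marks into canonical order by combining class,
    // assuming the text is already decomposed (step 1 above).
    static String reorder(String s) {
        int[] cps = s.codePoints().toArray();   // JDK 8+ convenience
        boolean swapped = true;
        while (swapped) {                       // simple exchange sort, stable for equal classes
            swapped = false;
            for (int i = 1; i < cps.length; i++) {
                int ccPrev = UCharacter.getCombiningClass(cps[i - 1]);
                int ccCur  = UCharacter.getCombiningClass(cps[i]);
                if (ccPrev > ccCur && ccCur != 0) {   // out of canonical order: exchange
                    int tmp = cps[i - 1]; cps[i - 1] = cps[i]; cps[i] = tmp;
                    swapped = true;
                }
            }
        }
        return new String(cps, 0, cps.length);
    }

    public static void main(String[] args) {
        // 'a' + COMBINING ACUTE ACCENT (class 230) + COMBINING DOT BELOW (class 220):
        String typed = "a\u0301\u0323";
        System.out.println(reorder(typed).equals("a\u0323\u0301"));  // true
    }
}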

Just my 0.02 euros.

_ Marco



RE: Canonical equivalence in rendering: mandatory or recommended?

2003-10-15 Thread Jill Ramonsky





> -Original Message-
> From: Philippe Verdy [mailto:[EMAIL PROTECTED]]
>
> The same
> optimization can be done in Java by subclassing the String
> class to add a "form" field and related form conversion (getters)
> and tests methods.

Only slightly confused about this. The Java String class is declared final
in the API, and therefore cannot be subclassed. One would have to write
an alternative String class (not rocket science of course, but still a
tad more involved than subclassing).

> In fact, to further optimize and reduce the
> memory footprint of Java strings, in fact I choosed to store
> the String in a array of bytes 

Okay. That explains that then.


> It is possible, with a custom class loader to overide the default
> String class used in the Java core libraries

Ouch. Never taken Java that far myself. I like the idea though. Is it
difficult?


> Looking at the Java VM machine specification, there does not
> seem to be something implying that a Java "char" is necessarily a
> 16-bit entity. So I think that there will be sometime a conforming
> Java VM that will return UTF-32 codepoints in a single char, or
> some derived representation using 24-bit storage units.

I've wondered about that ever since Unicode went to 21 bits. Actually
of course, it's C (and C++), not Java,  which has the real problem. C
is (supposed to be) portable, but fast on all architectures, so all of
the built-in types have platform-dependent widths. (So far so good).
The annoying thing is that, BY DEFINITION, the sizeof()
operator returns the size of an object measured in chars.
Therefore, it is a violation of the rules of C to have an addressable
object smaller than a char. One can have 32-bit chars, but only
if you disallow bytes and 16-bit words. sizeof() is not allowed
to return a fraction. Sigh! If only C had seen fit to measure
addressable locations in bits, or even architecture-specific-atoms
(which would have been 8-bits wide on most systems), then we could have
had sizeof(char) returning 4 or something. Ah well.


 
> This leads to many discussions about what is a "character"

I think we just had that discussion. If it happens again I'm probably
not going to join in (though it was quite amusing).

Jill







Re: Canonical equivalence in rendering: mandatory or recommended?

2003-10-15 Thread Philippe Verdy
From: Jill Ramonsky 

> I think that next time I write a similar API, I will deal with
> (string+bool) pairs, instead of plain strings, with the bool
> meaning "already normalised". This would definitely speed
> things up. Of course, for any strings coming in from
> "outside", I'd still have to assume they were not
> normalised, just in case.

I had the same experience, and I solved it by using an
additional byte in string objects that contains the
current normalization form(s) as a bitfield with bits set
if the string is either in NFC, NFD, NFKC or NFKD form.

This bitfield is all zeroes for unknown (still unparsed)
forms, and one bit is set if the string has already been
parsed: I don't always have to require a string to be in
any NF* form, so I perform normalization only when needed,
by testing this bit first, which indicates whether parsing is required.

This saves lots of unnecessary string allocations and copies
and greatly reduces the process VM footprint. The same
optimization can be done in Java by subclassing the String
class to add a "form" field and related form conversion (getter)
and test methods. In fact, to further optimize and reduce the
memory footprint of Java strings, I chose to store
the String in an array of bytes with UTF-8, instead of an
array of chars with UTF-16. The internal representation is
chosen dynamically, depending on usage of that string: if
the string is not accessed often with char indices (which in
Java does not return actual Unicode codepoint indices as
there may be surrogates) the UTF-8 representation uses less
memory in most cases.
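Going back to the bitfield idea at the top of this message, a rough Java
sketch might look like the following (hypothetical names; since
java.lang.String is declared final, this is written as a wrapper rather than
a subclass, and ICU4J is assumed for the normalization calls):

import com.ibm.icu.text.Normalizer2;

public final class TrackedString {
    private static final int NFC = 1, NFD = 2, NFKC = 4, NFKD = 8;

    private final String value;
    private int knownForms;   // bits set for forms the value is known to be in; 0 = unchecked

    public TrackedString(String value) { this.value = value; }

    // True if the value is in NFC; a positive answer is remembered so the
    // check runs at most once (a negative answer is not cached in this sketch).
    public boolean isNFC() {
        if ((knownForms & NFC) == 0 && Normalizer2.getNFCInstance().isNormalized(value)) {
            knownForms |= NFC;
        }
        return (knownForms & NFC) != 0;
    }

    // Allocate an NFC copy only when one is actually needed.
    public String asNFC() {
        return isNFC() ? value : Normalizer2.getNFCInstance().normalize(value);
    }

    public String toString() { return value; }
}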

It is possible, with a custom class loader, to override the default
String class used in the Java core libraries (note that compiled
Java .class files use UTF-8 for internally stored String constants,
as this allows independence from the architecture, and this is the
class loader that transforms the bytes storage of String constants
into actual chars storage, i.e. currently UTF-16 at runtime.)

Looking at the Java VM machine specification, there does not
seem to be something implying that a Java "char" is necessarily a
16-bit entity. So I think that there will be sometime a conforming
Java VM that will return UTF-32 codepoints in a single char, or
some derived representation using 24-bit storage units.

So there already are some changes of representation for Strings in
Java, and similar techniques could be used as well in C#, ECMAScript,
and so on... Handling UTF-16 surrogates will then be something of
the past, except if one uses the legacy String APIs that will
continue to emulate UTF-16 code unit indices. Depending on runtime
tuning parameters, the internal representation of String objects may
(should) become transparent to applications. One future goal
would be that a full Unicode String API will return real characters
as grapheme clusters of varying length, in a way that can be
comparable, orderable, etc... to better match what the users
consider as a string "length" (i.e. a number of grapheme clusters,
if not simply a combining sequence if we exclude the more complex
case of Hangul Jamos and Brahmic clusters).

This leads to many discussions about what is a "character"... This
may be context specific (depending on the application needs, the
system locale, or user preferences)... For XML, which recommends
(but does not mandate) the NFC form, it seems that the definition
of a character is mostly the combining sequence. It is very strange,
however, that this is a SHOULD and not a MUST, as this may return
unpredictable results in XML applications depending on whether the
SHOULD is implemented or not in the XML parser or transformation
engine.




Re: Canonical equivalence in rendering: mandatory or recommended?

2003-10-15 Thread Peter Kirk
On 15/10/2003 05:08, Jill Ramonsky wrote:

> -Original Message-
> From: Peter Kirk [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, October 15, 2003 12:19 PM
> To: Unicode List
> Subject: Canonical equivalence in rendering: mandatory or recommended?
>
>
> Does everyone agree that "This is not a performance issue"?
In my experience, there /is/ a performance hit.

...

Thank you, Jill. Clearly there is a performance hit in this rather 
general case or in an application in which string handling is dominant 
and speed critical. My question was more specific to rendering processes 
for complex scripts, where string handling is not already a major part 
of the processing but matters of glyph selection and positioning are. My 
instinct would be that in such circumstances the extra processing 
required for normalisation is almost trivial, especially with 
appropriate caching etc. But I have heard other opinions.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




RE: Canonical equivalence in rendering: mandatory or recommended?

2003-10-15 Thread Jill Ramonsky





> -Original Message-
> From: Peter Kirk [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, October 15, 2003 12:19 PM
> To: Unicode List
> Subject: Canonical equivalence in rendering: mandatory or
recommended?
> 
> 
> Does everyone agree that "This is not a performance issue"?

In my experience, there is a performance hit.

I had to write an API for my employer last year to handle some aspects
of Unicode. We normalised everything to NFD, not NFC (but that's
easier, not harder). Nonetheless, all the string handling routines were
not allowed to assume that the input was in NFD, but they had
to guarantee that the output was. These routines, therefore, had to do
a "convert to NFD" on every input, even if the input were already in
NFD. This did have a significant performance hit, since we were
handling (Unicode) strings throughout the app.

I think that next time I write a similar API, I will deal with
(string+bool) pairs, instead of plain strings, with the bool  meaning
"already normalised". This would definitely speed things up. Of course,
for any strings coming in from "outside", I'd still have to assume they
were not normalised, just in case.

Jill





Re: Normalisation in XML, HTML etc

2003-10-15 Thread John Cowan
Peter Kirk scripsit:

> I have heard it mentioned in general terms that W3C has specified that 
> text should be normalised according to NFC. What actually is the scope 
> of this specification? Does it apply to all XML, HTML etc? Is it 
> mandatory or just a recommendation?

It is not mandatory.  It is a SHOULD, which is between MUST (mandatory)
and MAY (permissive); it means that "there may exist valid reasons
in particular circumstances to ignore a particular item, but the full
implications must be understood and carefully weighed before choosing
a different course."

XML 1.0 is silent on the subject.  XML 1.1 (not yet finalized) says
that XML parsers SHOULD (in the sense above) verify that their input is
normalized, and explains exactly what "normalized" means in connection
with various XML constructs; for example, the character just after a
start-tag SHOULD not be a combining character.

> I would also like to know if this is actually applied or enforced by 
> products such as OpenOffice and Microsoft Office 2003 which use XML as 
> one of their native document formats. Will text saved in these formats 
> be normalised to NFC? Should it be?

Output SHOULD be normalized; input SHOULD be verified as normalized,
but not forcibly normalized (doing so is a security hole).  Whether
any particular product does this is up to the people who make the
product, and I have no information on either of those.
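A minimal sketch of the "verify, don't fix" policy on input, assuming ICU4J
(the name requireNFC is illustrative, not any standard API):

import com.ibm.icu.text.Normalizer2;

class InputPolicy {
    // Reject non-normalized input rather than silently normalizing it.
    static String requireNFC(String input) {
        if (!Normalizer2.getNFCInstance().isNormalized(input)) {
            throw new IllegalArgumentException("input is not in NFC");
        }
        return input;
    }
}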

-- 
One art / There is  John Cowan <[EMAIL PROTECTED]>
No less / No more   http://www.reutershealth.com
All things / To do  http://www.ccil.org/~cowan
With sparks / Galore -- Douglas Hofstadter



Canonical equivalence in rendering: mandatory or recommended?

2003-10-15 Thread Peter Kirk
I note the following text from section 5.13, p.127, of the Unicode 
standard v.4:

Canonical equivalence must be taken into account in rendering multiple 
accents, so that any two canonically equivalent sequences display as 
the same. This is particularly important when the canonical order is 
not the customary keyboarding order, which happens in Arabic with 
vowel signs, or in Hebrew with points. In those cases, a rendering 
system may be presented with either the typical typing order or the 
canonical order resulting from normalization, ...

Rendering systems should handle any of the canonically equivalent 
orders of combining
marks. This is not a performance issue: The amount of time necessary 
to reorder combining
marks is insignificant compared to the time necessary to carry out 
other work required
for rendering.


The word "must" is used here. But this is part of the "Implementation 
Guidelines" chapter which is generally not normative. Should this 
sentence with "must" be considered mandatory, or just a recommendation 
although in certain cases a "particularly important" one?

The conformance chapter does state the following, p.82, which can be 
understood as implying the same thing, and refers to section 5.13 in a 
way which suggests that the "information" there is relevant to conformance:

If combining characters have different combining classes... then no 
distinction of graphic form or semantic will result. This principle 
can be crucial for the correct appearance of combining characters. For 
more information, see “Canonical Equivalence” in /Section 5.13, 
Rendering Nonspacing Marks./
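As a concrete illustration of the equivalence in question (assuming ICU4J for
the normalization call; the marks chosen here are arbitrary examples, not the
Hebrew or Arabic cases mentioned above):

import com.ibm.icu.text.Normalizer2;

public class EquivalentOrders {
    public static void main(String[] args) {
        // Two combining marks with different combining classes (acute above = 230,
        // dot below = 220) in typing order and in canonical order:
        String typed     = "a\u0301\u0323";
        String canonical = "a\u0323\u0301";
        Normalizer2 nfd = Normalizer2.getNFDInstance();
        // Canonically equivalent: both normalize to the same sequence, so a
        // renderer that takes canonical equivalence into account displays them alike.
        System.out.println(nfd.normalize(typed).equals(nfd.normalize(canonical)));  // true
    }
}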


Does everyone agree that "This is not a performance issue"?

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Normalisation in XML, HTML etc

2003-10-15 Thread Peter Kirk
I have heard it mentioned in general terms that W3C has specified that 
text should be normalised according to NFC. What actually is the scope 
of this specification? Does it apply to all XML, HTML etc? Is it 
mandatory or just a recommendation?

I would also like to know if this is actually applied or enforced by 
products such as OpenOffice and Microsoft Office 2003 which use XML as 
one of their native document formats. Will text saved in these formats 
be normalised to NFC? Should it be?

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: [OT] Adelphia mojibake problem solved

2003-10-15 Thread Philippe Verdy
From: "Doug Ewell" <[EMAIL PROTECTED]>
> Thanks to the advice of Unicode list members, I finally added the
> necessary setting to my Web site that overrides Adelphia's blind
> propensity to serve all pages, even correctly tagged UTF-8 pages, as ISO
> 8859-1.  So my site is back in working order, with all non-Basic Latin
> characters displaying properly.
>
> Following the advice provided by James Kass and Richard Ishida, I added
> a file called .htaccess that contains the following line:
>
> AddType "text/html; charset=UTF-8" html
>
> (The double quotes here apparently work as well as Richard's suggested
> single quotes.)  I didn't even have to petition Adelphia for any special
> FileInfo permission, which is a good thing.

The standard default setting is normally:
 AddType "text/html" html
without a charset indicator.

With this setting, you are forcing _all_ HTML pages to be declared with
UTF-8. If this is true for your site, then that's good. But if you need to
have some pages declared differently (for example when showing sample pages
encoded with "shift_jis"), you'll get another similar problem...

I don't know which webserver they use, but recent versions of Apache can
read and interpret the content of HTML pages to autodetect the UTF forms or
use the <meta> tags to set or change additional HTTP headers, according to
what authors desired on their pages. Same thing for XML files that are sent
according to the charset found in the leading XML declaration line.