Re: Corrigendum #9

2014-07-14 Thread Karl Williamson

I ran across this in Section 3.7.4 of
http://www.unicode.org/reports/tr36/

"Use pairs of noncharacter code points in the range FDD0..FDEF. These 
are "super" private-use characters, and are discouraged for general 
interchange. The transformation would take each nibble of a byte Y, and 
add to FDD0 and FDE0, respectively. However, noncharacter code points 
may be replaced by U+FFFD ( � ) REPLACEMENT CHARACTER by some 
implementations, especially when they use them internally. (Again, 
incoming characters must never be deleted, because that can cause 
security problems.)"


I'm not sure if this affects the calculus of the Corrigendum.




Re: Corrigendum #9

2014-07-04 Thread Richard COOK
On Jul 3, 2014, at 1:48 PM, Asmus Freytag  wrote:

> On 7/3/2014 11:02 AM, Richard COOK wrote:
>> On Jul 2, 2014, at 8:02 AM, Karl Williamson  wrote:
>> 
>>> Corrigendum #9 has changed this so much that people are coming to me and 
>>> saying that inputs may very well have non-characters, and that the default 
>>> should be to pass them through.  Since we have no published wording for how 
>>> the TUS will absorb Corrigendum #9, I don't know how this will play out.  
>>> But this abrupt a change seems wrong to me, and it was done without public 
>>> input or really adequate time to consider its effects.
>> Asmus,
>> 
>> I think you will recall that in late 2012 and early 2013, when the subject 
>> of the proposed changes (or clarifications) to text relating to 
>> noncharacters first arose, we (at Wenlin) expressed our concerns. Some 
>> concerns were grave, and some of the discussion and comments were captured 
>> in this web page:
>> 
>> http://wenlininstitute.org/UnicodeNoncharacters/
>> 
>> There was much back and forth on the editorial list. Discussion clarified 
>> some of the issues for me, and mollified some of my concerns.
>> 
>> At that time we did implement support for noncharacters in Wenlin, 
>> controlled by an Advanced Option to:
>> 
>>  Replace noncharacters with [U+FFFD]
>> 
>> This user preference is turned on by default.
>> 
>> Not sure if revisiting any of our prior discussion would help clarify the 
>> evolution of thinking on this issue.
>> 
>> But I did want to mention that the comment “without public input” is not 
>> quite correct.
> 
> Richard,
> 
> "public input" is best understood as PRI or similar process, not discussions 
> by members or other people closely associated with the project.  Also, in 
> particular, discussions on the editorial list are invisible to the public.

Asmus,

The document (L2/13-015, see link above) which we submitted to UTC in response 
to the original proposal (L2/13-006) advocated caution. When L2/13-006 came to 
our attention it was perhaps rather late in the game (as Karl suggests in his 
reply). The changes were perhaps already a foregone conclusion in the minds of 
the proposers. I don't recall whether anyone even proposed doing a PRI, but in 
retrospect a PRI would have been ideal, and someone should have suggested it.

> 
>> As is so often the case, and as the web page above shows, there was input 
>> and discussion. Whether the amount of time given to this was really adequate 
>> is another question. Work required may expand to fill the available time, 
>> and perhaps more time is now available.
> 
> Given the wide ranging nature of implementations this "clarification" 
> affected, I believe the process failed to provide the necessary safeguards.
> 
> Conformance changes are really significant, and a Corrigendum, no matter how 
> much it is presented as harmless clarification, does affect conformance.
> 
> The UTC would be well served to formally adopt a process that requires a PRI 
> as well as resolutions taken at two separate UTCs to approve any Corrigendum.
> 
> There are changes to properties and algorithms that would also benefit from 
> such an extended process that has a guaranteed minimum number of times for 
> the change to be debated, to surface in minutes and to surface in calls for 
> public input, rather than sailing quietly and quickly into the standard.
> 
> The threshold for this should really be rather low -- as the standard has 
> matured, the number and nature of implementations that depend on it have 
> multiplied, to the point where even a diverse membership is no guarantee that 
> issues can be correctly identified and averted.
> 
> With the minutes from the UTC only recording decisions, one change, to 
> require an initial and a confirming resolution at separate meetings would 
> allow more issues to surface. It would also help if proposal documents were 
> updated to reflect the initial discussion, much as it is done with character 
> encoding proposals that are updated to address additional concerns identified 
> or resolved.
> 
> That said, I could imagine a possible exception for true errata (typos), 
> where correcting a clear mistake should not be unnecessarily drawn out, so 
> the error can be removed promptly. Such cases usually are turning on facts 
> (was there an editing mistake, was there new data about how a character is 
> used that makes an original property assignment a mistake (rather than a less 
> than optimal choice).
> 
> Despite being called a "clarification" this corrigendum is not in the nature 
> of an erratum.

So, there can be a continuum of cases between erratum and corrigendum. 
Corrigenda are at the severe end of the spectrum. It should be harder to issue 
a corrigendum, since it affects conformance.

Gray areas and thresholds dictate transparent process and caution. Judgement 
calls in critical cases require entertaining more opinions, and more 
second-guessing.

Re: Corrigendum #9

2014-07-03 Thread Karl Williamson

On 07/03/2014 02:48 PM, Asmus Freytag wrote:

> On 7/3/2014 11:02 AM, Richard COOK wrote:
>
>> On Jul 2, 2014, at 8:02 AM, Karl Williamson  wrote:
>>
>>> Corrigendum #9 has changed this so much that people are coming to me
>>> and saying that inputs may very well have non-characters, and that
>>> the default should be to pass them through.  Since we have no
>>> published wording for how the TUS will absorb Corrigendum #9, I don't
>>> know how this will play out.  But this abrupt a change seems wrong to
>>> me, and it was done without public input or really adequate time to
>>> consider its effects.
>>
>> Asmus,
>>
>> I think you will recall that in late 2012 and early 2013, when the
>> subject of the proposed changes (or clarifications) to text relating
>> to noncharacters first arose, we (at Wenlin) expressed our concerns.
>> Some concerns were grave, and some of the discussion and comments were
>> captured in this web page:
>>
>> http://wenlininstitute.org/UnicodeNoncharacters/
>>
>> There was much back and forth on the editorial list. Discussion
>> clarified some of the issues for me, and mollified some of my concerns.
>>
>> At that time we did implement support for noncharacters in Wenlin,
>> controlled by an Advanced Option to:
>>
>>     Replace noncharacters with [U+FFFD]
>>
>> This user preference is turned on by default.
>>
>> Not sure if revisiting any of our prior discussion would help clarify
>> the evolution of thinking on this issue.
>>
>> But I did want to mention that the comment “without public input” is
>> not quite correct.
>
> Richard,
>
> "public input" is best understood as PRI or similar process, not
> discussions by members or other people closely associated with the
> project.  Also, in particular, discussions on the editorial list are
> invisible to the public.
>
>> As is so often the case, and as the web page above shows, there was
>> input and discussion. Whether the amount of time given to this was
>> really adequate is another question. Work required may expand to fill
>> the available time, and perhaps more time is now available.
>
> Given the wide ranging nature of implementations this "clarification"
> affected, I believe the process failed to provide the necessary safeguards.
>
> Conformance changes are really significant, and a Corrigendum, no matter
> how much it is presented as harmless clarification, does affect
> conformance.
>
> The UTC would be well served to formally adopt a process that requires a
> PRI as well as resolutions taken at two separate UTCs to approve any
> Corrigendum.
>
> There are changes to properties and algorithms that would also benefit
> from such an extended process that has a guaranteed minimum number of
> times for the change to be debated, to surface in minutes and to surface
> in calls for public input, rather than sailing quietly and quickly into
> the standard.
>
> The threshold for this should really be rather low -- as the standard
> has matured, the number and nature of implementations that depend on it
> have multiplied, to the point where even a diverse membership is no
> guarantee that issues can be correctly identified and averted.
>
> With the minutes from the UTC only recording decisions, one change, to
> require an initial and a confirming resolution at separate meetings
> would allow more issues to surface. It would also help if proposal
> documents were updated to reflect the initial discussion, much as it is
> done with character encoding proposals that are updated to address
> additional concerns identified or resolved.
>
> That said, I could imagine a possible exception for true errata (typos),
> where correcting a clear mistake should not be unnecessarily drawn out,
> so the error can be removed promptly. Such cases usually turn on facts:
> was there an editing mistake, or was there new data about how a
> character is used that makes an original property assignment a mistake
> (rather than a less than optimal choice)?
>
> Despite being called a "clarification" this corrigendum is not in the
> nature of an erratum.
>
> A./


Exactly.  There should have been a PRI before this was approved.  I read 
the unicore list, and I was not aware of the change until after the fact.  
The first sentence of your more contemporaneous web page

http://wenlininstitute.org/UnicodeNoncharacters/

indicates that you too did not know about this until after the fact, and 
that you undertook that effort, upon finding out about it, to understand 
the magnitude of the change and cope with it; as Asmus said, it is indeed 
a change and not a clarification.




Re: Corrigendum #9

2014-07-03 Thread Asmus Freytag

On 7/3/2014 11:02 AM, Richard COOK wrote:

> On Jul 2, 2014, at 8:02 AM, Karl Williamson  wrote:
>
>> Corrigendum #9 has changed this so much that people are coming to me and
>> saying that inputs may very well have non-characters, and that the default
>> should be to pass them through.  Since we have no published wording for how
>> the TUS will absorb Corrigendum #9, I don't know how this will play out.
>> But this abrupt a change seems wrong to me, and it was done without public
>> input or really adequate time to consider its effects.
>
> Asmus,
>
> I think you will recall that in late 2012 and early 2013, when the subject of
> the proposed changes (or clarifications) to text relating to noncharacters
> first arose, we (at Wenlin) expressed our concerns. Some concerns were grave,
> and some of the discussion and comments were captured in this web page:
>
> http://wenlininstitute.org/UnicodeNoncharacters/
>
> There was much back and forth on the editorial list. Discussion clarified some
> of the issues for me, and mollified some of my concerns.
>
> At that time we did implement support for noncharacters in Wenlin, controlled
> by an Advanced Option to:
>
>     Replace noncharacters with [U+FFFD]
>
> This user preference is turned on by default.
>
> Not sure if revisiting any of our prior discussion would help clarify the
> evolution of thinking on this issue.
>
> But I did want to mention that the comment “without public input” is not quite
> correct.

Richard,

"public input" is best understood as PRI or similar process, not 
discussions by members or other people closely associated with the 
project.  Also, in particular, discussions on the editorial list are 
invisible to the public.

> As is so often the case, and as the web page above shows, there was input and
> discussion. Whether the amount of time given to this was really adequate is
> another question. Work required may expand to fill the available time, and
> perhaps more time is now available.

Given the wide ranging nature of implementations this "clarification" 
affected, I believe the process failed to provide the necessary safeguards.

Conformance changes are really significant, and a Corrigendum, no matter 
how much it is presented as harmless clarification, does affect conformance.

The UTC would be well served to formally adopt a process that requires a 
PRI as well as resolutions taken at two separate UTCs to approve any 
Corrigendum.

There are changes to properties and algorithms that would also benefit 
from such an extended process that has a guaranteed minimum number of 
times for the change to be debated, to surface in minutes and to surface 
in calls for public input, rather than sailing quietly and quickly into 
the standard.

The threshold for this should really be rather low -- as the standard 
has matured, the number and nature of implementations that depend on it 
have multiplied, to the point where even a diverse membership is no 
guarantee that issues can be correctly identified and averted.

With the minutes from the UTC only recording decisions, one change, to 
require an initial and a confirming resolution at separate meetings 
would allow more issues to surface. It would also help if proposal 
documents were updated to reflect the initial discussion, much as it is 
done with character encoding proposals that are updated to address 
additional concerns identified or resolved.

That said, I could imagine a possible exception for true errata (typos), 
where correcting a clear mistake should not be unnecessarily drawn out, 
so the error can be removed promptly. Such cases usually turn on facts: 
was there an editing mistake, or was there new data about how a 
character is used that makes an original property assignment a mistake 
(rather than a less than optimal choice)?

Despite being called a "clarification" this corrigendum is not in the 
nature of an erratum.

A./






Re: Corrigendum #9

2014-07-03 Thread Richard COOK
On Jul 2, 2014, at 8:02 AM, Karl Williamson  wrote:

> Corrigendum #9 has changed this so much that people are coming to me and 
> saying that inputs may very well have non-characters, and that the default 
> should be to pass them through.  Since we have no published wording for how 
> the TUS will absorb Corrigendum #9, I don't know how this will play out.  But 
> this abrupt a change seems wrong to me, and it was done without public input 
> or really adequate time to consider its effects.

Asmus,

I think you will recall that in late 2012 and early 2013, when the subject of 
the proposed changes (or clarifications) to text relating to noncharacters 
first arose, we (at Wenlin) expressed our concerns. Some concerns were grave, 
and some of the discussion and comments were captured in this web page:

http://wenlininstitute.org/UnicodeNoncharacters/

There was much back and forth on the editorial list. Discussion clarified some 
of the issues for me, and mollified some of my concerns.

At that time we did implement support for noncharacters in Wenlin, controlled 
by an Advanced Option to:

Replace noncharacters with [U+FFFD]

This user preference is turned on by default.

Not sure if revisiting any of our prior discussion would help clarify the 
evolution of thinking on this issue.

But I did want to mention that the comment “without public input” is not quite 
correct. As is so often the case, and as the web page above shows, there was 
input and discussion. Whether the amount of time given to this was really 
adequate is another question. Work required may expand to fill the available 
time, and perhaps more time is now available.

-Richard






Re: Corrigendum #9

2014-07-02 Thread Richard Wordingham
On Wed, 2 Jul 2014 21:19:16 +0200
Philippe Verdy  wrote:

> 2014-07-02 20:19 GMT+02:00 David Starner :
> 
> > I might argue 0b11111111 for 0x00 in UTF-8 would be technically
> > legal
 
> But the same C libraries are also using -1 as end-of-stream values
> and if they are converted to bytes, they will be indistinguishable from
> the NULL character that could be stored anywhere in the stream.

A 0xFF byte in a narrow character stream is converted to 0x00FF (int is
at least 16 bits wide) in the interfaces while the narrow character
end-of-stream value EOF is required to be negative.  Unfortunately, the
wide character end-of-stream marker WEOF is not required to be
negative, but it is not allowed to be a representable character.  C
appears to prohibit U+FFFF as well as supplementary characters if
wchar_t is only 16 bits wide.
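
A minimal C illustration of the distinction (assuming 8-bit chars):

    #include <stdio.h>

    /* Count 0xFF bytes in a stream.  c must be int, not char: fgetc()
       returns bytes as 0..255 (so a 0xFF byte arrives as 0x00FF), while
       EOF is a negative value that no byte can collide with. */
    long count_ff_bytes(FILE *f) {
        int c;
        long n = 0;
        while ((c = fgetc(f)) != EOF)
            if (c == 0xFF)
                n++;
        return n;
    }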

Richard.


Re: Corrigendum #9

2014-07-02 Thread Philippe Verdy
2014-07-02 20:19 GMT+02:00 David Starner :

> I might argue 0b11111111 for 0x00 in UTF-8 would be technically
> legal

It is not. UTF-8 specifies the effective value of each 8-bit byte; if you
store 0b11111111 in that byte you have exactly the same result as when
storing 0xFF or -1 (unless your system uses "bytes" larger than 8 bits --
the era of PDP mainframes with bytes other than 8 bits is long over; all
devices around use 8-bit byte values on their interfaces, even if they may
internally encode the exposed bits with longer sequences, such as with MFM
encodings, or by adding extra control and clock/sync bits, or could use
rotating sequences of 3 states with automatic synchronization by
negative or positive transitions at every encoded bit position, plus some
breaking rules on some bits to find the start of packets).

> the standard never specifies which bit sequences correspond to
> which byte values--but \xC0\x80 would probably be more reliably
> processed by existing code.

But the same C libraries are also using -1 as an end-of-stream value, and
if it is converted to a byte, it will be indistinguishable from such an
encoded NULL character that could be stored anywhere in the stream.

The main reason why 0xC0,0x80 was chosen instead of 0x00 is historic, from
Java, whose JNI interface only used strings encoded as 8-bit sequences
without a separate parameter to specify the length of the encoded sequence.
0x00 was then used as the terminator, as in the basic ANSI C string library
(string.h and stdio.h), and Java was ported to heterogeneous systems
(including small devices whose system I/O APIs blocked the use of BOTH 0x00
and 0xFF).

At least 0xC0,0x80 was safe (and not used by UTF-8 -- but at that time
UTF-8 was still not a precisely defined standard, and it was legal to
represent U+0000 as 0xC0,0x80; the prohibition of overlong sequences in
UTF-8 or Unicode came many years later. Java used the early
informative-only RFC specification, which was also supported by ISO, before
ISO/IEC 10646-1 and Unicode 1.1 were aligned).

Unicode and ISO/IEC 10646 have both changed (each in incompatible ways),
but it was necessary to have the two standards compatible with each other.
Java could not change its ABI for JNI; it was too late.

However, Java added another, UTF-16-based string interface to JNI. But
still this interface does not enforce UTF-16 rules about paired surrogates
(just like C, C++ or even Javascript). The added 16-bit string
interface for JNI has a separate field storing the encoded string length
(in 16-bit code units), so that interface uses the standard 0x0000 value
for U+0000. As much as possible, JNI extension libraries should use that
16-bit interface (which is also simpler to handle with modern OS APIs
compatible with Unicode, notably on Windows). But the 8-bit JNI interface
is still commonly used in JNI extension libraries for Unix/Linux (because
it is safer to handle the conversion from 16-bit to 8-bit in the JVM than
in an external JNI library using its own memory allocation and unable to
use the garbage collector of the managed memory of the JVM).

The Java-modified-UTF8 encoding is still used in the binary encoding of
compiled class files (this is invisible to applications, which only see
16-bit encoded strings, unless they have to parse or generate compiled
class files).


Re: Corrigendum #9

2014-07-02 Thread David Starner
On Wed, Jul 2, 2014 at 8:02 AM, Karl Williamson  wrote:
> In
> UTF-8, an example would be that Sun, I'm told, and for reasons I've
> forgotten or never knew, did not want raw NUL bytes to appear in text
> streams, so used the overlong sequence \xC0\x80 to represent them; overlong
> sequences generally being considered "bad" because they could be used to
> insert malicious payloads into the input.

In C, NUL ends a string. If you have to run data that may have NUL
characters through C functions, you can't store the NULs as \0. I
might argue 0b11111111 for 0x00 in UTF-8 would be technically
legal--the standard never specifies which bit sequences correspond to
which byte values--but \xC0\x80 would probably be more reliably
processed by existing code.
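
A sketch of that workaround (this is the same trick as Java's "modified
UTF-8"; the function name is illustrative):

    #include <stddef.h>

    /* Copy 'len' bytes of UTF-8, escaping each NUL as the overlong pair
       0xC0 0x80 so the result can pass through NUL-terminated C string
       APIs.  'out' must have room for up to 2*len bytes plus a terminator. */
    size_t escape_nuls(const unsigned char *in, size_t len, unsigned char *out) {
        size_t o = 0;
        for (size_t i = 0; i < len; i++) {
            if (in[i] == 0x00) {
                out[o++] = 0xC0;  /* overlong lead byte */
                out[o++] = 0x80;  /* continuation byte */
            } else {
                out[o++] = in[i];
            }
        }
        out[o] = '\0';  /* safe: no real NULs remain in the body */
        return o;
    }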

-- 
Where there is life, there is hope.


Re: Corrigendum #9

2014-07-02 Thread Asmus Freytag

On 7/2/2014 8:02 AM, Karl Williamson wrote:

> Corrigendum #9 has changed this so much that people are coming to me
> and saying that inputs may very well have non-characters, and that the
> default should be to pass them through.  Since we have no published
> wording for how the TUS will absorb Corrigendum #9, I don't know how
> this will play out.  But this abrupt a change seems wrong to me, and
> it was done without public input or really adequate time to consider
> its effects.
>
> Non-characters are still designed solely for internal use, and hence I
> think the default for a gatekeeper should still be to exclude them.

This is the crux of this issue.

The Corrigendum was introduced with the intent to allow users to lean on 
library and tool writers to adopt a permissive attitude - by removing 
what many among the developers of such software had seen as language 
that endorsed or even encouraged strong filtering.

> On 06/12/2014 11:14 PM, Peter Constable wrote:
>
>> I get the impression that you think that Unicode conformance
>> requirements have historically provided that guarantee, and that
>> Corrigendum #9 broke that. If so, then that is a mistaken
>> understanding of Unicode conformance.

Not so much an issue of "guarantee", but language that was treating 
strong filtering as the default, and that was understood as such in the 
community.

A./


Re: Corrigendum #9

2014-07-02 Thread Karl Williamson

On 06/12/2014 11:14 PM, Peter Constable wrote:

> From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Karl Williamson
> Sent: Wednesday, June 11, 2014 9:30 PM
>
>> I have something like a library that was written a long time ago
>> (not by me) assuming that noncharacters were illegal in open interchange.
>> Programs that use the library were guaranteed that they would not receive
>> noncharacters in their input.
>
> I haven't read every post in the thread, so forgive me if I'm making incorrect
> inferences.
>
> I get the impression that you think that Unicode conformance requirements have
> historically provided that guarantee, and that Corrigendum #9 broke that. If
> so, then that is a mistaken understanding of Unicode conformance.


Any real-world application dealing with Unicode inputs needs to be 
protected from "bad" inputs.  These can come in the form of malicious 
attacks, or the result of a noisy transmission, or just plain mistakes. 
 It doesn't matter.  Generally, a gatekeeper application is employed to 
furnish this protection, so that the other application doesn't have to 
keep checking things at every turn.  And, since software is expensive to 
write and prone to error, a generic gatekeeper is usually used, shared 
among many applications.  Such a gatekeeper may very well be 
configurable to let through some inputs that would normally be 
considered bad, to accommodate rare special cases.  In UTF-8, an example 
would be that Sun, I'm told, and for reasons I've forgotten or never 
knew, did not want raw NUL bytes to appear in text streams, so used the 
overlong sequence \xC0\x80 to represent them; overlong sequences 
generally being considered "bad" because they could be used to insert 
malicious payloads into the input.


The original wording of the non-character text "should never be 
interchanged" doesn't necessarily indicate that they will never be valid 
in input, but that their deliberate appearance there would be something 
quite rare, and a gatekeeper application should default to not passing 
them through.  A protected application could indicate to the gatekeeper 
that it is prepared to handle non-character inputs, but the default 
should be to not accept them.
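
One plausible shape for such a gatekeeper (a sketch, not any particular
library's API):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* True for the 66 noncharacters: U+FDD0..U+FDEF plus the last two
       code points of each of the 17 planes.  Assumes cp is already a
       valid code point (<= 0x10FFFF). */
    static bool is_noncharacter(uint32_t cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    /* Default-deny gatekeeper: callers that are prepared to handle
       noncharacters must opt in explicitly. */
    bool gatekeep(const uint32_t *cps, size_t n, bool allow_noncharacters) {
        for (size_t i = 0; i < n; i++)
            if (is_noncharacter(cps[i]) && !allow_noncharacters)
                return false;  /* reject the whole input; never silently delete */
        return true;
    }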


Corrigendum #9 has changed this so much that people are coming to me and 
saying that inputs may very well have non-characters, and that the 
default should be to pass them through.  Since we have no published 
wording for how the TUS will absorb Corrigendum #9, I don't know how 
this will play out.  But this abrupt a change seems wrong to me, and it 
was done without public input or really adequate time to consider its 
effects.


Non-characters are still designed solely for internal use, and hence I 
think the default for a gatekeeper should still be to exclude them.




> Here is what has historically been said in the way of conformance
> requirements related to non-characters:
>
> TUS 1.0: There were no conformance requirements stated. This recommendation
> was given:
> "U+FFFF and U+FFFE are reserved and should not be transmitted or stored."
>
> This same recommendation was repeated in later versions. However, it must be
> recognized that "should" statements are never absolute requirements.
>
> Conformance requirements first appeared in TUS 2.0:
>
> TUS 2.0, TUS 3.0:
> "C5  A process shall not interpret either U+FFFE or U+FFFF as an abstract
> character."
>
> TUS 4.0:
> "C5  A process shall not interpret a noncharacter code point as an abstract
> character."
>
> "C10  When a process purports not to modify the interpretation of a valid
> coded character representation, it shall make no change to that coded
> character representation other than the possible replacement of character
> sequences by their canonical-equivalent sequences or the deletion of
> noncharacter code points."
>
> Btw, note that C10 makes the assumption that a valid coded character sequence
> can include non-character code points.
>
> TUS 5.0 (trivially different from TUS 4.0):
> C2 = TUS 4.0, C5
>
> "C7  When a process purports not to modify the interpretation of a valid
> coded character sequence, it shall make no change to that coded character
> sequence other than the possible replacement of character sequences by their
> canonical-equivalent sequences or the deletion of noncharacter code points."
>
> TUS 6.0:
> C2 = TUS 5.0, C2
>
> "C7  When a process purports not to modify the interpretation of a valid
> coded character sequence, it shall make no change to that coded character
> sequence other than the possible replacement of character sequences by their
> canonical-equivalent sequences."
>
> Interestingly, the change to C7 does not permit non-characters to be replaced
> or removed at all while claiming not to have left the interpretation intact.
>
> So, there was a change in 6.0 that could impact conformance claims of
> existing implementations. But there has never been any guarantee made _by
> Unicode_ that non-character code points will never occur in open interchange.
> Interchange has always been discouraged, but never prohibited.

Re: Corrigendum #9

2014-06-26 Thread Doug Ewell
Richard Wordingham  wrote:

> At present there is no certainty as to whether
> an interchanged file in the UTF-16 encoding scheme that appears to
> contain a BOM contains a BOM or starts with U+FFFE. The only
> promise is that such a file contains an even number of data bytes.
> Any such sequence is valid! Will the UTF-16 encoding scheme be
> withdrawn?

One might wonder, given how frequently we hear that unpaired surrogates
also occur in the wild and need to be tolerated.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell




Re: Corrigendum #9

2014-06-26 Thread CE Whitehead
From: Richard Wordingham
Date: Wed, 25 Jun 2014 18:58:55 +0100

> On Tue, 24 Jun 2014 09:16:00 -0400, CE Whitehead wrote:
>
>> ME: if two sequences are canonically equivalent except that one has
>> noncharacters in it, are these still canonically equivalent?
>
> Canonical equivalences are defined for all sequences of scalar values;
> it is just that it changes from version to version for most unassigned
> characters. Non-characters only decompose to themselves and do not
> occur in the canonical (or indeed compatibility) decomposition of
> anything else, so a sequence containing a non-character cannot be
> canonically equivalent to a sequence not containing a non-character.

My mistake: it's not "canonical equivalence" that Peter was talking about but 
"conformance" to the standard, so that a process can claim a character 
sequence is the same character sequence as that which was passed to it. (Thus 
I assume that a process can treat these two sequences -- containing 
canonically equivalent characters, but one with noncharacters -- as different 
character sequences, but does not have to do so.)

Best,

--C. E. Whitehead
cewcat...@hotmail.com






Re: Corrigendum #9

2014-06-25 Thread Richard Wordingham
On Tue, 24 Jun 2014 09:16:00 -0400
CE Whitehead  wrote:

> ME: if two sequences are canonically equivalent except that one has
> noncharacters in it, are these still canonically equivalent?

Canonical equivalences are defined for all sequences of scalar values;
it is just that it changes from version to version for most unassigned
characters.

Non-characters only decompose to themselves and do not
occur in the canonical (or indeed compatibility) decomposition of
anything else, so a sequence containing a non-character cannot be
canonically equivalent to a sequence not containing a non-character.

> Regarding the sentinels; I am an outsider but assume that with
> Corrigendum 9 U+FFFE will continue to be mentioned as having
> generally (not always?) standard use throughout; in Chapter 16.7 it
> is currently mentioned; I assume it will still be -- according to
> info. in the FAQ and elsewhere:
> http://www.unicode.org/faq/private_use.html "U+FFFE. The 16-bit
> unsigned hexadecimal value U+FFFE is not a Unicode character value,
> and should be taken as a signal that Unicode characters should be
> byte-swapped before interpretation. U+FFFE should only be intepreted
> as an incorrectly byte-swapped version of U+FEFF" 

There is a lot of untruth in that FAQ entry, alas.  I think U+FFFE
and possibly U+FFFF should be treated differently to the other 64
non-characters.  At present there is no certainty as to whether
an interchanged file in the UTF-16 encoding scheme that appears to
contain a BOM contains a BOM or starts with U+FFFE.  The only
promise is that such a file contains an even number of data bytes.
Any such sequence is valid!  Will the UTF-16 encoding scheme be
withdrawn?
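
The ambiguity is easy to see in a sniffing routine (a sketch of the
conventional guess):

    #include <stddef.h>
    #include <stdint.h>

    /* Conventional sniff of the UTF-16 encoding scheme.  An initial FF FE
       is read as a little-endian BOM -- but, per the above, the same two
       bytes could be a big-endian stream whose first character is U+FFFE;
       the bytes alone cannot distinguish the two readings. */
    int utf16_is_little_endian(const uint8_t *b, size_t n) {
        if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
            return 1;   /* or big-endian U+FFFE: unknowable */
        return 0;       /* the scheme's default is big-endian */
    }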

Richard.


Re: Corrigendum #9

2014-06-24 Thread CE Whitehead


Markus Scherer said what sounds right to me to recommend (maybe what he says 
should be said in Corrigendum 9):
http://www.unicode.org/mail-arch/unicode-ml/y2014-m06/0148.html

From: Markus Scherer
Date: Thu, 12 Jun 2014 01:37:49 -0700

> If your library makes an explicit promise to remove noncharacters, then it
> should continue to do so.
> However, if your library is understood to pass through any strings, except
> for the advertised processing, then noncharacters should probably be
> preserved.

ME: Am I to believe from the above that, regarding 
www.unicode.org/L2/L2013/13015-nonchars.pdf (which rejects the bold 
interpretation -- but I don't think that's what Markus's email does), the 
"'bold interpretation' of internal exchange of noncharacters" may continue, 
where deletion of a noncharacter is never a good idea and should not happen, 
and unrecognized noncharacters should simply be silently ignored, with "all 
Unicode scalar values, including those corresponding to noncharacter code 
points and unassigned code points," thus "mapped to unique code unit 
sequences"; while, at the same time (albeit, as I understand things, only if 
the type of encoding is recognized), noncharacters may be replaced with the 
replacement character (U+FFFD)? In this latter case the noncharacter is no 
longer mapped one-to-one to a scalar, as all noncharacters will have been 
replaced with U+FFFD. So is that one-to-one mapping recommendation going to 
be changed or not?

* * *

I also have a question about Peter's notes on TUS 6.0 rule C7 (which followed 
the Unicode 4.0 correction, if I understand correctly; maybe I should have 
sent this question as a separate email):

http://www.unicode.org/mail-arch/unicode-ml/y2014-m06/0151.html
From: Peter Constable
Date: Fri, 13 Jun 2014 05:14:30 +0000

> TUS 6.0:
> C2 = TUS 5.0, C2
>
> "C7  When a process purports not to modify the interpretation of a valid
> coded character sequence, it shall make no change to that coded character
> sequence other than the possible replacement of character sequences by
> their canonical-equivalent sequences."
>
> Interestingly, the change to C7 does not permit non-characters to be
> replaced or removed at all while claiming not to have left the
> interpretation intact.

ME: if two sequences are canonically equivalent except that one has 
noncharacters in it, are these still canonically equivalent? (Just a wild 
question; it would be nice to have an answer in the FAQ on noncharacters or 
somewhere; maybe I missed the answer and it was there.)

* * *

Sentinels, Security

Regarding the sentinels: I am an outsider, but assume that with Corrigendum 9, 
U+FFFE will continue to be mentioned as generally (not always?) having a 
standard use throughout; it is currently mentioned in Chapter 16.7, and I 
assume it still will be, according to info in the FAQ and elsewhere:

http://www.unicode.org/faq/private_use.html
"U+FFFE. The 16-bit unsigned hexadecimal value U+FFFE is not a Unicode 
character value, and should be taken as a signal that Unicode characters 
should be byte-swapped before interpretation. U+FFFE should only be 
interpreted as an incorrectly byte-swapped version of U+FEFF."

Yes, I agree it would also be nice to have info about the security effects of 
any other sentinels, particularly U+FFFF and U+10FFFF -- but I envision most 
security effects would be caused by removing, without replacing, one of these 
(is that right?).

Hope these questions are helpful.
Best,

--C. E. Whitehead
cewcat...@hotmail.com




RE: Corrigendum #9

2014-06-12 Thread Peter Constable
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Karl Williamson
Sent: Wednesday, June 11, 2014 9:30 PM

> I have something like a library that was written a long time ago 
> (not by me) assuming that noncharacters were illegal in open interchange. 
> Programs that use the library were guaranteed that they would not receive 
> noncharacters in their input.

I haven't read every post in the thread, so forgive me if I'm making incorrect 
inferences. 

I get the impression that you think that Unicode conformance requirements have 
historically provided that guarantee, and that Corrigendum #9 broke that. If 
so, then that is a mistaken understanding of Unicode conformance.

Here is what has historically been said in the way of conformance requirements 
related to non-characters:

TUS 1.0: There were no conformance requirements stated. This recommendation was 
given:
"U+ and U+FFFE are reserved and should not be transmitted or stored."

This same recommendation was repeated in later versions. However, it must be 
recognized that "should" statements are never absolute requirements.

Conformance requirements first appeared in TUS 2.0:

TUS 2.0, TUS 3.0: 
"C5 A process shall not interpret either U+FFFE or U+ as an abstract 
character."


TUS 4.0:
"C5 A process shall not interpret a noncharacter code point as an abstract 
character."

"C10When a process purports not to modify the interpretation of a valid 
coded character representation, it shall make no change to that coded character 
representation other than the possible replacement of character sequences by 
their canonical-equivalent sequences or the deletion of noncharacter code 
points."

Btw, note that C10 makes the assumption that a valid coded character sequence 
can include non-character code points.


TUS 5.0 (trivially different from TUS4.0):
C2 = TUS4.0, C5

"C7 When a process purports not to modify the interpretation of a valid 
coded character sequence, it shall make no change to that coded character 
sequence other than the possible replacement of character sequences by their 
canonical-equivalent sequences or the deletion of noncharacter code points."


TUS 6.0:
C2 = TUS5.0, C2

"C7 When a process purports not to modify the interpretation of a valid 
coded character
sequence, it shall make no change to that coded character sequence other than 
the possible
replacement of character sequences by their canonical-equivalent sequences."

Interestingly, the change to C7 does not permit non-characters to be replaced 
or removed at all while claiming not to have left the interpretation intact. 


So, there was a change in 6.0 that could impact conformance claims of existing 
implementations. But there has never been any guarantee made _by Unicode_ that 
non-character code points will never occur in open interchange. Interchange has 
always been discouraged, but never prohibited.




Peter



Re: Corrigendum #9

2014-06-12 Thread Richard Wordingham
On Thu, 12 Jun 2014 01:37:49 -0700
Markus Scherer  wrote:

> On Wed, Jun 11, 2014 at 9:29 PM, Karl Williamson
>  wrote:

> > The FAQ mentions using 0x7FFFFFFF as a possible sentinel.  I did not
> > realize that that was considered representable in any UTF.
> > Likewise -1.

> No, and that's the point of using those. Integer values that are not
> code points make for great sentinels in API functions, such as a
> next() iterator returning -1 when there is no next character.

They work fine as alternatives to scalar values.  They don't work so
well in 8-bit and 16-bit Unicode strings.  A general purpose routine
extracting scalar values from Unicode strings is likely to treat them
as errors rather than just returning the scalar value as it would for
a non-character.  The only way to use them directly in 8- and
16-bit Unicode strings is to deliberately create ill-formed Unicode
strings.

Thus, these 'sentinels' are not full blown sentinels like U+0000 in the
C conventions for 'strings', as opposed to arrays of char.

There is a get-out clause - just never accept that a Unicode string is
purported to be in a Unicode character encoding form.

Richard.



Re: Corrigendum #9

2014-06-12 Thread David Starner
On Thu, Jun 12, 2014 at 1:37 AM, Markus Scherer  wrote:
> If your library makes an explict promise to remove noncharacters, then it
> should continue to do so.

There is rarely so much frustration as when a library or utility
changes behavior and the justification is that well-understood
practice was not explicit. I suspect few groups could bring the world
to a halt with work-to-rule as quick as programmers.

> I disagree. If svn or git choked on noncharacters or control codes or
> private use characters or unassigned code points etc., I would complain.
> Likewise, I expect to be able to use plain text or programming editors
> (gedit, kate, vi, emacs, Visual Studio) to handle files with such characters
> just fine.

I don't expect plain text editors to handle arbitrary control codes,
much less noncharacters, unless they really handle whatever binary
junk is shoved at them, which a generic plain text editor can not be
relied upon to do. I believe that programming editors should scream
bloody murder over noncharacters and unusual control codes; they have
no place in source code at all.

-- 
Where there is life, there is hope.


Re: Corrigendum #9

2014-06-12 Thread Markus Scherer
On Wed, Jun 11, 2014 at 9:29 PM, Karl Williamson 
wrote:

> I have a something like a library that was written a long time ago (not by
> me) assuming that noncharacters were illegal in open interchange. Programs
> that use the library were guaranteed that they would not receive
> noncharacters in their input.  They thus are free to use any noncharacter
> internally as they wish.  Now that Corrigendum #9 has come out, I'm getting
> requests to update the library to not reject noncharacters.  The library
> itself does not use noncharacters.  If I (or someone else) makes the
> requested change, it may silently cause security holes in those programs
> that were depending on it doing the rejection, and who upgrade to use the
> new version.
>

If your library makes an explicit promise to remove noncharacters, then it
should continue to do so.
However, if your library is understood to pass through any strings, except
for the advertised processing, then noncharacters should probably be
preserved.

> I don't see anything in the FAQ that really addresses this situation.  I
> think there should be an answer that addresses code written before the
> Corrigendum, and that goes into detail about the security issues. My guess
> is that the UTC did not really consider the potential for security holes
> when making this Corrigendum.
>

There is nothing really new in the corrigendum. The UTC felt that some
implementers had misinterpreted inconsistent and misleading statements in
and around the standard, and clarified the situation.

Any process that requires certain characters or sequences to not occur in
the input must explicitly check for those, regardless of whether they are
noncharacter, private use characters, unassigned code points, control
codes, deprecated language tag characters, discouraged stateful formatting
controls, stacks of hundreds of diacritics, or whatever.

In a sense, noncharacters are much like the old control codes. Some
terminals say "beep" when they see U+0007, or go into strange modes when
they see U+001B; on Windows, when you read a text file that contains
U+001A, it is interpreted as an end-of-file marker. If your process
depended on those things not happening, then you would have to strip those
control codes on input. But a pass-through-style library will be
universally expected not to do anything special with them.

> I agree that CLDR should be able to use noncharacters for internal
> processing, and that they should be able to be stored in files and edited.
>  But I believe that version control systems and editors have just as much
> right to use noncharacters for their internal purposes.


I disagree. If svn or git choked on noncharacters or control codes or
private use characters or unassigned code points etc., I would complain.
Likewise, I expect to be able to use plain text or programming editors
(gedit, kate, vi, emacs, Visual Studio) to handle files with such
characters just fine.

I do *not* necessarily expect Word, OpenOffice, or Google Docs to handle
all of these.

> Is CLDR constructed so there is no potential for conflicts here?  That is,
> does it reserve certain noncharacters for its own use?
>

I believe that CLDR only uses noncharacters for special purposes in
collation. In CLDR data files, there are at most contraction mappings that
start with noncharacters for purposes of building alphabetic-index tables.
(And those noncharacters are \u-escaped in CLDR XML files since CLDR 24.)
There is no mechanism to remove them from any input, but the worst thing
that would happen is that you get a sequence of code points to sort
interestingly.

> The FAQ mentions using 0x7FFFFFFF as a possible sentinel.  I did not
> realize that that was considered representable in any UTF.  Likewise -1.
>

No, and that's the point of using those. Integer values that are not code
points make for great sentinels in API functions, such as a next() iterator
returning -1 when there is no next character.
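
For example, a next()-style iterator in C might look like this (a sketch;
ICU's actual APIs differ in detail):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        const uint32_t *cps;
        size_t len, pos;
    } CpIter;

    /* Returns the next code point, or -1 when iteration is done.  -1 is a
       safe sentinel precisely because it is not a code point, so it can
       never be confused with real data. */
    int32_t cp_next(CpIter *it) {
        if (it->pos >= it->len)
            return -1;
        return (int32_t)it->cps[it->pos++];
    }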

markus


Re: Corrigendum #9

2014-06-11 Thread Karl Williamson

On 06/02/2014 09:48 AM, Markus Scherer wrote:

> On Mon, Jun 2, 2014 at 8:27 AM, Doug Ewell <d...@ewellic.org> wrote:
>
>> I suspect everyone can agree on the edge cases, that noncharacters are
>> harmless in internal processing, but probably should not appear in
>> random text shipped around on the web.
>
> Right, in principle. However, it should be ok to include noncharacters
> in CLDR data files for processing by CLDR implementations, and it should
> be possible to edit and diff and version-control and web-view those
> files etc.
>
> It seems that trying to define "interchange" and "public" in ways that
> satisfy everyone will not be successful.
>
> The FAQ already gives some examples of where noncharacters might be
> used, should be preserved, or could be stripped, starting with "Q: Are
> noncharacters intended for interchange?"
>
> In my view, those Q/A pairs explain noncharacters quite well. If there
> are further examples of where noncharacters might be used, should be
> preserved, or could be stripped, and that would be particularly useful
> to add to the examples already there, then we could add them.
>
> markus




I was unaware of this FAQ.  Having read it and re-read this entire 
thread, I am still troubled.


I have something like a library that was written a long time ago (not 
by me) assuming that noncharacters were illegal in open interchange. 
Programs that use the library were guaranteed that they would not 
receive noncharacters in their input.  They thus are free to use any 
noncharacter internally as they wish.  Now that Corrigendum #9 has come 
out, I'm getting requests to update the library to not reject 
noncharacters.  The library itself does not use noncharacters.  If I (or 
someone else) makes the requested change, it may silently cause security 
holes in those programs that were depending on it doing the rejection, 
and who upgrade to use the new version. Some of these programs may have 
been written many years ago.  The original authors are now dead in some 
instances, or have turned the code over to someone else, or haven't 
thought about it in years.  The current maintainers of those programs 
may be unaware of this dependence, and hence may upgrade without 
realizing the consequences.  Further, the old versions of the library 
will soon be unsupported, so there is pressure to upgrade to get bug 
fixes and the promise of future support.  This means there could be 
security holes that a hacker who gets a hold of the source can exploit.


I don't see anything in the FAQ that really addresses this situation.  I 
think there should be an answer that addresses code written before the 
Corrigendum, and that goes into detail about the security issues. My 
guess is that the UTC did not really consider the potential for security 
holes when making this Corrigendum.


I agree that CLDR should be able to use noncharacters for internal 
processing, and that they should be able to be stored in files and 
edited.  But I believe that version control systems and editors have 
just as much right to use noncharacters for their internal purposes.  I 
disagree with the FAQ that seems to say if you write a utility you 
should avoid using noncharacters in its implementation.  It might be 
that competitive pressure, or just that the particular implementations 
don't need non-characters, would cause some such utilities to accept 
some or all non-characters as inputs.  But If I were writing such code, 
I can see now how using noncharacters for my purposes would be quite 
convenient.  CLDR could be considered to be a utility, and its users 
might want to use noncharacters for their purposes.  Is CLDR constructed 
so there is no potential for conflicts here?  That is, does it reserve 
certain noncharacters for its own use?


The FAQ talks about how various now-noncharacter code points were touted 
as sentinel candidates in earlier Unicode versions, and that they are no 
longer so.  But it really should emphasize that old code may very well 
want to continue to use them as sentinels.  The answer "Well, the short 
answer is no, that is not true—at least, not entirely true."  is 
misleading in this regard.


The FAQ mentions using 0x7FFFFFFF as a possible sentinel.  I did not 
realize that that was considered representable in any UTF.  Likewise -1.




RE: Corrigendum #9

2014-06-08 Thread Shawn Steele
> I should note that this front-end to 'diff' changes the input files, writes 
> the modified versions out, and calls 'diff' with those modified files as its 
> inputs.  By using noncharacters, it would be depending on 'diff' to 1) not 
> use them, and 2) to not filter them out, and 3) for the system to be able to 
> store and retrieve them in files.

In my view that is still "internal" to your apps use of these characters :)

The original text doesn't say that my application cannot store & retrieve them 
from files for internal use.  On the contrary, I'd expect proprietary formats 
for internal use to require that.  I agree that the original text is a bit 
vague on the question of tools to inspect/modify/whatever your internal use.

-Shawn



Re: Corrigendum #9

2014-06-08 Thread Karl Williamson

On 06/07/2014 10:33 PM, Asmus Freytag wrote:

> On 6/7/2014 9:19 PM, Karl Williamson wrote:
>
>> On 06/02/2014 11:00 AM, Shawn Steele wrote:
>>
>>> To further my understanding, can someone provide examples of how
>>> these are used in actual practice?  I can't think of any offhand and
>>> the closest I get is like the old escape characters to get a dot
>>> matrix printer to shift modes, or old word processor internal
>>> formatting sequences.
>>
>> Here's an example of a possible use.  20 some years ago I wrote a
>> front-end to the Unix diff utility.  Showing the differences between
>> files (usually 2 versions of the same program's code) is an extremely
>> common programming activity.  I do it many times a day.  One reason is
>> to try to find out why a bug has crept in.  In doing so, there are
>> some differences that are not relevant to the task at hand, and their
>> being shown is a significant distraction.  For example, in programming,
>> one might have renamed a variable (identifier) because its purpose has
>> changed somewhat and the name should accurately reflect its new
>> function so the reader is not subconsciously misled.  It would be nice
>> to be able to suppress the variable name changes from the difference
>> display.  There could be thousands of them.  By changing the name in
>> each file version to the same noncharacter during the diff, these
>> differences won't be displayed, and there would not be any possible
>> conflict with the input files having that noncharacter in them.  (For
>> display the noncharacter is changed back to the original value in its
>> respective file.)  Further, one might want to ignore the name changes
>> of two variables.  Just use a second noncharacter, up to 66.
>>
>> I wrote this long before noncharacters were available.  What I do
>> instead is scan the files for rarely used characters until I find
>> enough ones that aren't in the files.  For example U+9F is unlikely to
>> appear.  Scanning the files takes time.  This step could be omitted
>> for noncharacters that are known to be illegal in the input.
>
> This "illegal in the input" so "I'm free to assume I can use them for my
> purposes" was definitely the primary(!) design goal discussed when the
> set of 32 were added to Unicode. Having UTC backpedal from that, many
> years after the original design, based on a single meeting and without
> public review is really a breakdown of the process.
>
> A./


I should note that this front-end to 'diff' changes the input files, 
writes the modified versions out, and calls 'diff' with those modified 
files as its inputs.  By using noncharacters, it would be depending on 
'diff' to 1) not use them, and 2) to not filter them out, and 3) for the 
system to be able to store and retrieve them in files.


I think a revision to the text was advisable to clarify that 2) and 3) 
were acceptable.  I haven't heard anybody on this thread disagree with 
that.


But item 1) shows how tricky this issue really is.  My utility looks 
like a fancier 'diff' to those people who call it, so they would be 
justified in wanting it not to use noncharacters because they have their 
own purposes for them.  If some of those callers were themselves 
utilities, their callers might want to use noncharacters for their own 
purposes.  And so on and so on.


I don't have a good answer, except to say that Asmus' characterization 
above looks reasonable.


The purpose of public reviews is to try to get a broad range of ideas, 
and if none are forthcoming, then the fact that there was such a review 
should be an adequate defense of the ultimate decision.  Not holding a 
review is an invitation to lingering suspicions on the part of the 
public about the motives behind any such decision.  These can fester and 
the trust level is permanently diminished.  There will always be people 
who won't like the decision, and who will assume that the deciders are 
malevolent.  But the vast majority will accept a decision that seems to 
have been made in good faith after public input.


This is just how things work, no matter what the venue or issue.  It may 
be that the UTC thought this was minor enough to not require a review, 
but if so, time has shown that to have been an incorrect perception.



Re: Corrigendum #9

2014-06-07 Thread Asmus Freytag

On 6/7/2014 9:19 PM, Karl Williamson wrote:

> On 06/02/2014 11:00 AM, Shawn Steele wrote:
>
>> To further my understanding, can someone provide examples of how
>> these are used in actual practice?  I can't think of any offhand and
>> the closest I get is like the old escape characters to get a dot
>> matrix printer to shift modes, or old word processor internal
>> formatting sequences.
>
> Here's an example of a possible use.  20 some years ago I wrote a
> front-end to the Unix diff utility.  Showing the differences between
> files (usually 2 versions of the same program's code) is an extremely
> common programming activity.  I do it many times a day.  One reason is
> to try to find out why a bug has crept in.  In doing so, there are
> some differences that are not relevant to the task at hand, and their
> being shown is a significant distraction.  For example, in programming,
> one might have renamed a variable (identifier) because its purpose has
> changed somewhat and the name should accurately reflect its new
> function so the reader is not subconsciously misled.  It would be nice
> to be able to suppress the variable name changes from the difference
> display.  There could be thousands of them.  By changing the name in
> each file version to the same noncharacter during the diff, these
> differences won't be displayed, and there would not be any possible
> conflict with the input files having that noncharacter in them.  (For
> display the noncharacter is changed back to the original value in its
> respective file.)  Further, one might want to ignore the name changes
> of two variables.  Just use a second noncharacter, up to 66.
>
> I wrote this long before noncharacters were available.  What I do
> instead is scan the files for rarely used characters until I find
> enough ones that aren't in the files.  For example U+9F is unlikely to
> appear.  Scanning the files takes time.  This step could be omitted
> for noncharacters that are known to be illegal in the input.



This "illegal in the input, so I'm free to assume I can use them for my 
purposes" was definitely the primary(!) design goal discussed when the 
set of 32 was added to Unicode. Having the UTC backpedal from that, many 
years after the original design, based on a single meeting and without 
public review, is really a breakdown of the process.


A./
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-07 Thread Karl Williamson

On 06/02/2014 11:00 AM, Shawn Steele wrote:

To further my understanding, can someone provide examples of how these are used 
in actual practice?  I can't think of any offhand and the closest I get is like 
the old escape characters to get a dot matrix printer to shift modes, or old 
word processor internal formatting sequences.



Here's an example of a possible use.  20 some years ago I wrote a 
front-end to the Unix diff utility.  Showing the differences between 
files (usually 2 versions of the same program's code) is an extremely 
common programming activity.  I do it many times a day.  One reason is 
to try to find out why a bug has crept in.  In doing so, there are some 
differences that are not relevant to the task at hand, and their being 
shown is a significant distraction.  For example, in programming, one 
might have renamed a variable (identifier) because its purpose has 
changed somewhat and the name should accurately reflect its new function 
so the reader is not subconsciously misled.  It would be nice to be able 
to suppress the variable name changes from the difference display. 
There could be thousands of them.  By changing the name in each file 
version to the same noncharacter during the diff, these differences 
won't be displayed, and there would not be any possible conflict with 
the input files having that noncharacter in them.  (For display, the 
noncharacter is changed back to the original value in its respective 
file.)  Further, one might want to ignore the name changes of two 
variables; just use a second noncharacter, and so on, up to 66.


I wrote this long before noncharacters were available.  What I do 
instead is scan the files for rarely used characters until I find enough 
ones that aren't in the files.  For example U+9F is unlikely to appear. 
 Scanning the files takes time.  This step could be omitted for 
noncharacters that are known to be illegal in the input.
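
To make the trick concrete, here is a minimal sketch in Java of the 
masking step described above. The class and method names are invented 
for illustration, and U+FDD0 is an arbitrary pick from the 66 
noncharacters; a real tool would use one sentinel per renamed 
identifier:

public class DiffSentinel {
    // U+FDD0: one of the 66 noncharacters, assumed absent from the input.
    private static final String SENTINEL = "\uFDD0";

    // Replace a whole identifier with the sentinel before running diff.
    static String mask(String source, String identifier) {
        return source.replaceAll(
                "\\b" + java.util.regex.Pattern.quote(identifier) + "\\b",
                SENTINEL);
    }

    // Restore the original name for display after diffing.
    static String unmask(String masked, String identifier) {
        return masked.replace(SENTINEL, identifier);
    }

    public static void main(String[] args) {
        String oldLine = "int count = count + 1;";
        String newLine = "int total = total + 1;";
        // Masking both names to the same sentinel hides the rename from diff.
        System.out.println(mask(oldLine, "count")
                .equals(mask(newLine, "total"))); // true
    }
}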


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-04 Thread Mark Davis ☕️
The characters are present, but are escaped in the source for readability.

Here is a sample from collation/zh.xml:


  [XML sample not preserved by the mail archive]

Re: Corrigendum #9

2014-06-04 Thread Martin J. Dürst

On 2014/06/04 03:59, Richard Wordingham wrote:

On Tue, 03 Jun 2014 16:09:27 +0900
"Martin J. Dürst"  wrote:


I'd strongly suggest that completely independent of when and how
Corrigendum #9 gets tweaked or fixed, a quick and firm plan gets
worked out for how to get rid of these codepoints in CLDR data. The
sooner, the better.


I suspect this has already been done.  I know of no CLDR text files
still containing them.


Really great if that's true! Regards,   Martin.

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-03 Thread Richard Wordingham
On Tue, 03 Jun 2014 16:09:27 +0900
"Martin J. Dürst"  wrote:

> I'd strongly suggest that completely independent of when and how 
> Corrigendum #9 gets tweaked or fixed, a quick and firm plan gets
> worked out for how to get rid of these codepoints in CLDR data. The
> sooner, the better.

I suspect this has already been done.  I know of no CLDR text files
still containing them.

Richard.

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-03 Thread Asmus Freytag

Nicely put.

A./

On 6/3/2014 12:09 AM, "Martin J. Dürst" wrote:

On 2014/06/03 07:08, Asmus Freytag wrote:

On 6/2/2014 2:53 PM, Markus Scherer wrote:

On Mon, Jun 2, 2014 at 1:32 PM, David Starner <prosfil...@gmail.com> wrote:

I would especially discourage any web browser from handling
these; they're noncharacters used for unknown purposes that are
undisplayable and if used carelessly for their stated purpose, can
probably trigger serious bugs in some lamebrained utility.


I don't expect "handling these" in web browsers and lamebrained
utilities. I expect "treat like unassigned code points".


Expecting them to be treated like unassigned code points shows that 
their use is a bad idea: Since when does the Unicode Consortium use 
unassigned code points (and the like) in plain sight?



I can't shake the suspicion that Corrigendum #9 is not actually solving
a general problem, ...


I have to fully agree with Asmus, Richard, Shawn and others that the 
use of non-characters in CLDR is a very bad and dangerous example.


However convenient the misuse of some of these codepoints in CLDR may 
be, it sets a very bad example for everybody else. Unicode itself 
should not just be twice as careful with the use of its own 
codepoints, but 10 times as careful.


I'd strongly suggest that completely independent of when and how 
Corrigendum #9 gets tweaked or fixed, a quick and firm plan gets 
worked out for how to get rid of these codepoints in CLDR data. The 
sooner, the better.


Regards,   Martin.
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode



___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-03 Thread Asmus Freytag

On 6/2/2014 3:08 PM, Asmus Freytag wrote:

On 6/2/2014 2:53 PM, Markus Scherer wrote:
On Mon, Jun 2, 2014 at 1:32 PM, David Starner wrote:


I would especially discourage any web browser from handling
these; they're noncharacters used for unknown purposes that are
undisplayable and if used carelessly for their stated purpose, can
probably trigger serious bugs in some lamebrained utility.


I don't expect "handling these" in web browsers and lamebrained 
utilities. I expect "treat like unassigned code points".




I can't shake the suspicion that Corrigendum #9 is not actually 
solving a general problem, but is a special favor to CLDR as being run 
by insiders, and in the process muddying the waters for everyone else.


Clarifying:

I still haven't heard from anyone that this solves a general problem 
that is widespread. The only actual example has always been CLDR, and 
its decision to ship these code points in XML. Shipping these code 
points in files was pretty far down the list of "what not to do" when 
they were originally adopted. My view continues to be that this was a 
questionable design decision by CLDR, given what was on the record. The 
reaction of several outside implementers during this discussion makes 
clear that viewing that design as problematic is not just my personal view.


Usually, if there's a discrepancy between an implementation and Unicode, 
the reaction is not to retract conformance language. I think arriving at 
this decision was easier for the UTC, because CLDR is not a random, 
unrelated implementation. And, as in any group, it's perhaps easier to 
not be as keenly aware of the impact on external implementations.


So, I'd like to clarify that this is the sense in which I meant 
"special favor", which was therefore not the most felicitous 
expression for what I had in mind.


A./





___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-03 Thread Philippe Verdy
I think his point is that an application may want to encapsulate in a valid
text any arbitrary stream of code points (including noncharacters, PUAs,
or isolated surrogate code units found in 16-bit or 32-bit streams that are
invalid UTF-16 or UTF-32 streams, or even arbitrary invalid 8-bit bytes in
streams that are not valid UTF-8).

For 8-bit streams, using ESC or \ is generally a good choice of escape to
derive a valid UTF-8 text stream. But for 16-bit and 32-bit streams, PUAs
are more economical (but PUA code units found in the stream still need to
be escaped).

If you think about the Java regexp "\\uD800", it does not designate a code
point but only a code unit, which is not valid plain text alone as it
violates UTF-16 encoding rules. Trying to match it in a valid UTF-16 stream
can work only if you can represent isolated code units for a specific
encoding like UTF-16, even if the target stream to search uses any other
valid UTF (not necessarily UTF-16: decode the target text, re-encode it to
UTF-16 to generate a 16-bit stream in which you'll look for isolated 16-bit
code units with the regexp).

So yes, the regexp "\\u" (in Java source) is not used to match a single
valid character.
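
A small Java demonstration of that point, assuming only standard
java.util.regex behavior: a String can hold an unpaired surrogate code
unit, and a pattern can match that single code unit even though the
string is not valid Unicode text:

import java.util.regex.Pattern;

public class LoneSurrogate {
    public static void main(String[] args) {
        // A Java String is a sequence of UTF-16 code units, so it can
        // carry a lone lead surrogate that no valid UTF could encode.
        String s = "a\uD800b";
        // This pattern contains the single code unit U+D800.
        Pattern p = Pattern.compile("\uD800");
        System.out.println(p.matcher(s).find());        // true
        System.out.println(s.codePointAt(1) == 0xD800); // true: unpaired, reported as-is
    }
}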


2014-06-03 8:21 GMT+02:00 David Starner :

> On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham
>  wrote:
> > Much as I don't like their uninvited use, it is possible to pass them
> > and other undesirables through most applications by a slight bit of
> > recoding at the application's boundaries.  Using 99 = (3 + 32 + 64) PUA
> > characters, one can ape UTF-16 surrogates and encode:
>
> What's the point? If we can use the PUA, then we don't need the
> noncharacters; we can just use the PUA directly. If we have to play
> around with remapping them, they're pointless; they're no easier to
> use in that case then ESC or '\' or PUA characters.
>
> --
> Kie ekzistas vivo, ekzistas espero.
> ___
> Unicode mailing list
> Unicode@unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-03 Thread David Starner
On Tue, Jun 3, 2014 at 1:02 AM, Richard Wordingham
 wrote:
> On Tue, 3 Jun 2014 00:42:54 -0700
> David Starner  wrote:
>
>> No, the PUA is not. Then where are you getting the 99 PUA characters
>> you suggested using?
>
> By escaping them as well.  The point of the complex scheme is to keep
> searching simple.  Using a general escape character doesn't work so
> well.

The point is, instead of escaping the PUA so you can use the
noncharacters, why not just escape the PUA so you can use the PUA
characters? The latter is simpler and more flexible.
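
A minimal sketch of that alternative; every choice here (U+E000 as the
escape character, the decoding convention) is invented for illustration.
Incoming PUA characters are prefixed with the escape, so any bare PUA
character in the recoded text is guaranteed to be one of the
application's own markers:

public class PuaEscape {
    // Arbitrary PUA escape; reserved, so not usable as a bare marker itself.
    private static final char ESC = '\uE000';

    static boolean isPua(char c) {
        return c >= '\uE000' && c <= '\uF8FF'; // BMP private use area
    }

    // Prefix every incoming PUA character (including ESC itself) with ESC.
    static String encode(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isPua(c)) sb.append(ESC);
            sb.append(c);
        }
        return sb.toString();
    }

    // Drop the escapes; bare PUA markers are left for the application.
    static String decode(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == ESC && i + 1 < s.length()) c = s.charAt(++i);
            sb.append(c);
        }
        return sb.toString();
    }
}

Richard's objection still applies to this sketch: a naive substring
search over the encoded form can match across an escape, which is what
his pair-based scheme elsewhere in this thread is designed to avoid.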

-- 
Kie ekzistas vivo, ekzistas espero.
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-03 Thread Richard Wordingham
On Tue, 3 Jun 2014 00:42:54 -0700
David Starner  wrote:

> On Tue, Jun 3, 2014 at 12:31 AM, Richard Wordingham
>  wrote:
> > On Mon, 2 Jun 2014 23:21:38 -0700
> > David Starner  wrote:
> >
> >> On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham
> >>  wrote:
> >> > Using 99 = (3 +
> >> > 32 + 64) PUA characters, one can ape UTF-16 surrogates and
> >> > encode:
> >
> > The PUA is in general not available for
> > general utilities to make special use of.
> 
> No, the PUA is not. Then where are you getting the 99 PUA characters
> you suggested using?

By escaping them as well.  The point of the complex scheme is to keep
searching simple.  Using a general escape character doesn't work so
well.

Richard.

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-03 Thread Mark Davis ☕️
On Tue, Jun 3, 2014 at 9:41 AM, David Starner  wrote:

> Thinking that a utility would never mangle them if encountered in
> input text was a pipe-dream.
>

I didn't say "not mangle", I said "break", as in "crash".

​I don't think this thread is going anywhere productive, so​ I'm signing
off from it.
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-03 Thread Richard Wordingham
On Tue, 3 Jun 2014 08:55:09 +0200
Mark Davis ☕️  wrote:

> On Mon, Jun 2, 2014 at 10:32 PM, David Starner 
> wrote:
> 
> > Why? It seems you're changing the rules
> > ​...
> >
> >
> This isn't "are changing", it is "has changed". The Corrigendum was
> issued at the start of 2013, about 16 months ago; applicable to all
> relevant earlier versions. It was the result of fairly extensive
> debate inside the UTC; there hasn't been a single issue on this
> thread that wasn't considered during the discussions there. And as
> far back as 2001, the UTC made it clear that noncharacters *are*
> scalar values, and are to be converted by UTF converters. Eg, see
> http://www.unicode.org/mail-arch/unicode-ml/y2001-m09/0149.html (by
> chance, one day before 9/11).

But that says U+FDD0 is not to be externally interchanged!

Richard.

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-03 Thread David Starner
On Tue, Jun 3, 2014 at 12:31 AM, Richard Wordingham
 wrote:
> On Mon, 2 Jun 2014 23:21:38 -0700
> David Starner  wrote:
>
>> On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham
>>  wrote:
>> > Using 99 = (3 +
>> > 32 + 64) PUA characters, one can ape UTF-16 surrogates and encode:
>
> The PUA is in general not available for
> general utilities to make special use of.

No, the PUA is not. Then where are you getting the 99 PUA characters
you suggested using?

-- 
Kie ekzistas vivo, ekzistas espero.
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-03 Thread David Starner
On Mon, Jun 2, 2014 at 11:55 PM, Mark Davis ☕️  wrote:
> Thinking that a utility would never encounter them in input text
> was a pipe-dream.

Thinking that a utility would never mangle them if encountered in
input text was a pipe-dream.

> If a utility or library is so fragile that it breaks on
> input of any valid UTF sequence, then it is a "lamebrained" utility.

And?  The world is filled with lamebrained utilities, and being
cautious about what you take in can prevent one of those lamebrained
utilities from turning into an exploit.

> A good
> unit test for any production chain would be to check there is no crash on
> any input scalar value (and for that matter, any ill-formed UTF text).

Right; and if you filter out stuff at the frontend, like ill-formed
UTF text and noncharacters, you don't have to worry about what the
middle end will do with them.
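
A sketch of such a front-end filter; the helper names are invented,
but the range test is the standard definition of the 66 noncharacters
(U+FDD0..U+FDEF plus the last two code points of each plane), and the
filter replaces rather than deletes, since silent deletion is the
riskier behavior:

public class NoncharFilter {
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Map every noncharacter to U+FFFD before the rest of the pipeline
    // sees the text.
    static String sanitize(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp ->
                sb.appendCodePoint(isNoncharacter(cp) ? 0xFFFD : cp));
        return sb.toString();
    }
}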

I don't get what the goal of these changes was. It seems you've taken
these characters away from programmers who would use them in programs, and
given them to CLDR and anyone else willing to make their "plain text
files" skirt the limits.

-- 
Kie ekzistas vivo, ekzistas espero.

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-03 Thread Richard Wordingham
On Mon, 2 Jun 2014 23:21:38 -0700
David Starner  wrote:

> On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham
>  wrote:
> > Using 99 = (3 +
> > 32 + 64) PUA characters, one can ape UTF-16 surrogates and encode:

> What's the point? If we can use the PUA, then we don't need the
> noncharacters; we can just use the PUA directly. If we have to play
> around with remapping them, they're pointless; they're no easier to
> use in that case then ESC or '\' or PUA characters.

A search for two 2-character string '\n' would also find a substring
of 4-character string 'a\\n'.  The PUA is in general not available for
general utilities to make special use of.

Richard. 

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-03 Thread Martin J. Dürst

On 2014/06/03 07:08, Asmus Freytag wrote:

On 6/2/2014 2:53 PM, Markus Scherer wrote:

On Mon, Jun 2, 2014 at 1:32 PM, David Starner <prosfil...@gmail.com> wrote:

I would especially discourage any web browser from handling
these; they're noncharacters used for unknown purposes that are
undisplayable and if used carelessly for their stated purpose, can
probably trigger serious bugs in some lamebrained utility.


I don't expect "handling these" in web browsers and lamebrained
utilities. I expect "treat like unassigned code points".


Expecting them to be treated like unassigned code points shows that 
their use is a bad idea: Since when does the Unicode Consortium use 
unassigned code points (and the like) in plain sight?



I can't shake the suspicion that Corrigendum #9 is not actually solving
a general problem, but is a special favor to CLDR as being run by
insiders, and in the process muddying the waters for everyone else.


I have to fully agree with Asmus, Richard, Shawn and others that the use 
of non-characters in CLDR is a very bad and dangerous example.


However convenient the misuse of some of these codepoints in CLDR may 
be, it sets a very bad example for everybody else. Unicode itself should 
not just be twice as careful with the use of its own codepoints, but 10 
times as careful.


I'd strongly suggest that completely independent of when and how 
Corrigendum #9 gets tweaked or fixed, a quick and firm plan gets worked 
out for how to get rid of these codepoints in CLDR data. The sooner, the 
better.


Regards,   Martin.
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-02 Thread Mark Davis ☕️
On Mon, Jun 2, 2014 at 10:32 PM, David Starner  wrote:

> Why? It seems you're changing the rules
> ​...
>
>
This isn't "are changing", it is "has changed". The Corrigendum was issued
at the start of 2013, about 16 months ago; applicable to all relevant
earlier versions. It was the result of fairly extensive debate inside the
UTC; there hasn't been a single issue on this thread that wasn't considered
during the discussions there. And as far back as 2001, the UTC made it
clear that noncharacters *are* scalar values, and are to be converted by
UTF converters. Eg, see
http://www.unicode.org/mail-arch/unicode-ml/y2001-m09/0149.html (by chance,
one day before 9/11).

> probably trigger serious bugs in some lamebrained utility.

There were already plenty of programs that passed the noncharacters
through; very few would filter them (some would delete them, which is
horrible for security). Thinking that a utility would never encounter them
in input text was a pipe-dream. If a utility or library is so fragile that
it *breaks* on input of any valid UTF sequence, then it *is* a "lamebrained"
utility. A good unit test for any production chain would be to check there
is no crash on any input scalar value (and for that matter, any ill-formed
UTF text).
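
A sketch of such a unit test, with process() standing in for whatever
the production chain under test actually does:

public class ScalarValueTest {
    static String process(String s) { return s; } // stand-in for the real chain

    public static void main(String[] args) {
        for (int cp = 0; cp <= 0x10FFFF; cp++) {
            if (cp >= 0xD800 && cp <= 0xDFFF) continue; // surrogates are not scalar values
            process(new String(Character.toChars(cp))); // must not throw for any cp
        }
        System.out.println("no crash on any scalar value");
    }
}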
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-02 Thread David Starner
On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham
 wrote:
> Much as I don't like their uninvited use, it is possible to pass them
> and other undesirables through most applications by a slight bit of
> recoding at the application's boundaries.  Using 99 = (3 + 32 + 64) PUA
> characters, one can ape UTF-16 surrogates and encode:

What's the point? If we can use the PUA, then we don't need the
noncharacters; we can just use the PUA directly. If we have to play
around with remapping them, they're pointless; they're no easier to
use in that case then ESC or '\' or PUA characters.

-- 
Kie ekzistas vivo, ekzistas espero.
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-02 Thread Richard Wordingham
On Mon, 2 Jun 2014 15:09:21 -0700
David Starner  wrote:

> So certain programs can't use noncharacters internally because some
> people want to interchange them? That doesn't seem like what
> noncharacters should be used for.

Much as I don't like their uninvited use, it is possible to pass them
and other undesirables through most applications by a slight bit of
recoding at the application's boundaries.  Using 99 = (3 + 32 + 64) PUA
characters, one can ape UTF-16 surrogates and encode:

32 × 64 pairs for lone surrogates
 1 × 64 pairs to replace some of the PUA characters
 1 × 35 pairs to replace the rest of the PUA characters
 1 ×  4 pairs for incoming FFFC to FFFF
 1 × 32 pairs for the other BMP non-characters
 1 × 32 pairs for the supplementary plane non-characters.

This then frees up non-characters for the application's use.

Richard.
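
One way to flesh out the arithmetic for the lone-surrogate rows of
that tally; the particular PUA ranges (32 leads from U+E000, 64 trails
from U+E020) are arbitrary choices for illustration, and a complete
boundary recoder would add the remaining pair ranges listed above:

public class BoundaryRecode {
    static final int LEAD = 0xE000;  // base of 32 lead characters
    static final int TRAIL = 0xE020; // base of 64 trail characters

    // Encode a lone surrogate (U+D800..U+DFFF) as a pair of PUA characters.
    static String encodeLoneSurrogate(char c) {
        int index = c - 0xD800; // 0..2047 = 32 * 64
        return new String(new char[] {
                (char) (LEAD + (index >> 6)),
                (char) (TRAIL + (index & 0x3F))});
    }

    // Invert the pair back to the original lone surrogate.
    static char decode(char lead, char trail) {
        return (char) (0xD800 + ((lead - LEAD) << 6) + (trail - TRAIL));
    }
}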

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-02 Thread Lisa Moore
I would like to point out to Asmus that this decision was reached 
unanimously at the UTC by Adobe, Apple, Google, IBM, Microsoft, SAP, UC 
Berkeley, and Yahoo!

One might disagree with the decision, but there were no special favors 
involved.

Lisa 

> 
> 
> I can't shake the suspicion that Corrigendum #9 is not actually 
> solving a general problem, but is a special favor to CLDR as being 
> run by insiders, and in the process muddying the waters for everyone 
else.
> 
> A./___
> Unicode mailing list
> Unicode@unicode.org
> http://unicode.org/mailman/listinfo/unicode
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-02 Thread Philippe Verdy
"Reserved for CLDR" would be wrong in TUS; you have reached a borderline
where you are no longer handling plain text (a stream of scalar values
assigned to code points), but binary data via a binary interface outside
TUS (handling streams of collation elements, whose representation is not
even bound to the ICU implementation of CLDR for its own definitions and
syntax for its tailorings).

CLDR data defines its own interface and protocol; it can reserve these code
points only for itself, not in TUS, and no other conforming plain-text
application is expected to accept these reservations, so they can
**freely** mark them as errors, replace them, filter them out, or
interpret them differently for their own usage, using their own
specification and encapsulation mechanisms and specific **non-plain-text**
data types.

CLDR data transmitted in binary form that embeds these code points is
not transporting plain text; it is still a binary datatype specific to
this application. CLDR data must remain isolated in its scope without
forcing other protocols or TUS to follow its practices.

Other applications may develop "gateway" interfaces to convert them to be
interoperable with ICU but they are not required to do that. If they do,
they will follow the ICU specifications, not TUS and this should not
influence their own way to handle what TUS describe as plain-text.

To make it clear, it is preferable to just say in TUS that the behavior of
applications with non-characters is completely undefined and unpredictable
without an external specification, and that these entities should not even
be considered encodable in any standard UTF (which can freely be replaced
by another one without causing any loss or modification of the represented
plain text). It should be possible to define other (non-standard)
conforming UTFs which are completely unable to represent these
non-characters (as well as any unpaired surrogate). A conforming UTF just
needs to be able to represent streams of scalar values in their full
standard range (even without knowing whether they are assigned, and
without knowing their character properties).

You can/should even design CLDR to completely avoid the use of
non-characters: it's up to it to define an encapsulation/escaping mechanism
that clearly separates what is standard plain text in the content and what
is not and is used for a specific purpose in CLDR or ICU implementations.




2014-06-03 0:07 GMT+02:00 Shawn Steele :

>  Except that, particularly the max-weight ones, mean that developers can
> be expected to use this as sentinels in code using ICU, which would
> preclude their use for other things?
>
>
>
> Which makes them more like “reserved for use in CLDR” than “noncharacters”?
>
>
>
> -Shawn
>
>
>
> *From:* Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of *Markus
> Scherer
> *Sent:* Monday, June 2, 2014 2:53 PM
> *To:* David Starner
> *Cc:* Unicode Mailing List
> *Subject:* Re: Corrigendum #9
>
>
>
> On Mon, Jun 2, 2014 at 1:32 PM, David Starner 
> wrote:
>
>  I would especially discourage any web browser from handling
>
> these; they're noncharacters used for unknown purposes that are
> undisplayable and if used carelessly for their stated purpose, can
> probably trigger serious bugs in some lamebrained utility.
>
>
>
> I don't expect "handling these" in web browsers and lamebrained utilities.
> I expect "treat like unassigned code points".
>
>
>
> markus
>
> ___
> Unicode mailing list
> Unicode@unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
>
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-02 Thread Philippe Verdy
I would rather expect: "treat them as you like"; there will never be any
warranty of interoperability, everyone is allowed to use them as they want
and even change that at any time. The behavior is not defined in TUS, and
users cannot expect that TUS will define this behavior.
There's no clear solution about what to do if you encounter them in data
supposed to be text. For me they are not text, so the whole data could be
rejected, or the text remaining after some filtering may be falsely
interpreted. You need an external specification outside TUS.

I certainly do not consider non-characters like unassigned valid code
points, where applications are strongly encouraged not to apply any kind
of filter if they want to remain compatible with evolutions of the
standard that may assign them. (The best you can do with unassigned code
points is treat them as symbols, with the minimal properties defined in
the standard, notably Bidi properties where a direction is defined for
their range, or else treat them as symbols with weak direction, even if
applications still cannot render them; renderers will find a way to show
them, generally using a .notdef glyph like an empty box.) Normalizers
will also not mix them (the default combining class should be 0).

Only applications that want to ensure that the text conforms to a
specific version of the standard are allowed to filter out, or signal as
errors, the presence of unassigned code points. But all applications can
do that kind of thing with non-characters (or with any code unit whose
value falls outside the valid range of a defined UTF). This is an
important difference: non-characters are not like unassigned code points;
they are assigned to be considered invalid and filterable by design by
any Unicode-conforming process for handling text.
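
The distinction is easy to miss in code, because noncharacters carry
the same general category (Cn) as unassigned code points; only an
explicit range test separates the two cases. A small Java check (the
helper is invented for illustration):

public class NoncharVsUnassigned {
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    public static void main(String[] args) {
        int unassigned = 0x0378; // an unassigned code point (as of 2014)
        int nonchar = 0xFDD0;
        // Both report the UNASSIGNED general category...
        System.out.println(Character.getType(unassigned) == Character.UNASSIGNED); // true
        System.out.println(Character.getType(nonchar) == Character.UNASSIGNED);    // true
        // ...so only the explicit test tells a filter which is which.
        System.out.println(isNoncharacter(unassigned)); // false
        System.out.println(isNoncharacter(nonchar));    // true
    }
}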





2014-06-02 23:53 GMT+02:00 Markus Scherer :

> On Mon, Jun 2, 2014 at 1:32 PM, David Starner 
> wrote:
>
>> I would especially discourage any web browser from handling
>> these; they're noncharacters used for unknown purposes that are
>> undisplayable and if used carelessly for their stated purpose, can
>> probably trigger serious bugs in some lamebrained utility.
>>
>
> I don't expect "handling these" in web browsers and lamebrained utilities.
> I expect "treat like unassigned code points".
>
> markus
>
> ___
> Unicode mailing list
> Unicode@unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
>
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
> I can't shake the suspicion that Corrigendum #9 is not actually solving a 
> general problem, but is a special favor to CLDR as being run by insiders, and 
> in the process muddying the waters for everyone else

I think we could generalize to other scenarios so it wasn’t necessarily an 
insider scenario.  For example, I could have a string manipulation library that 
used FFFE to indicate the beginning of an identifier for a localizable 
sentence, terminated by .  Any system using FFFEid1234 would likely 
expect to be able to read the tokens in their favorite code editor.

But I’m concerned that these “conflict” with each other, and embedding the 
behavior in major programming languages doesn’t smell to me like “internal” 
use.  Clearly if I wanted to use that library in a CLDR-aware app, there is a 
potential risk for a conflict.

In the CLDR case, there *IS* a special relationship with Unicode, and perhaps 
it would be warranted to explicitly encode character(s) with the necessary 
meaning(s) to handle edge-case collation scenarios.

-Shawn
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-02 Thread David Starner
On Mon, Jun 2, 2014 at 2:53 PM, Markus Scherer  wrote:
> On Mon, Jun 2, 2014 at 1:32 PM, David Starner  wrote:
>>
>> I would especially discourage any web browser from handling
>> these; they're noncharacters used for unknown purposes that are
>> undisplayable and if used carelessly for their stated purpose, can
>> probably trigger serious bugs in some lamebrained utility.
>
>
> I don't expect "handling these" in web browsers and lamebrained utilities. I
> expect "treat like unassigned code points".

So certain programs can't use noncharacters internally because some
people want to interchange them? That doesn't seem like what
noncharacters should be used for.

Unix utilities shouldn't usually go to the trouble of messing with
them; limiting the number of changes needed for Unicode was the whole
point of UTF-8. Any program transferring them across the Internet as
text should filter them, IMO; either some lamebrained utility will
open a security hole by using them and not filtering first, or
something will filter them after security checks have been done, or
something. Unless it's a completely trusted system, text files with
these characters should be treated with extreme prejudice by the first
thing that receives them over the net.

-- 
Kie ekzistas vivo, ekzistas espero.
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-02 Thread Asmus Freytag

On 6/2/2014 2:53 PM, Markus Scherer wrote:
On Mon, Jun 2, 2014 at 1:32 PM, David Starner > wrote:


I would especially discourage any web browser from handling
these; they're noncharacters used for unknown purposes that are
undisplayable and if used carelessly for their stated purpose, can
probably trigger serious bugs in some lamebrained utility.


I don't expect "handling these" in web browsers and lamebrained 
utilities. I expect "treat like unassigned code points".




I can't shake the suspicion that Corrigendum #9 is not actually solving 
a general problem, but is a special favor to CLDR as being run by 
insiders, and in the process muddying the waters for everyone else.


A./
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-02 Thread Philippe Verdy
We can still draw a line: interchange should be understood so that other
non-Unicode standards can find a way not to mix random data into plain
text without defining a clear encapsulation and escaping mechanism that
ensures that the plain text remains isolatable.
In other words, design separate layers of representation and processing,
and be more imaginative when you design an application or protocol, with
better modeling.
If an application really internally needs some non-characters, this is not
really for encoding text but for the application/protocol-specific system
of encapsulation, which should be clearly identified:
- these protocols can use separate APIs for handling objects that are
composite and contain some text but that are not text by themselves.
- they should isolate data types (or MIME types)
- they should use some "magic" identifiers in the headers of their data,
including versioning in their protocol
- they should document internally their own encapsulation/escaping
mechanisms
- they should test them to make sure they preserve valid text content
without breaking it
As this kind of data is not text, we fall within the design of binary data
formats.

These kinds of statements mean that protocols and APIs will be improved
for better separation of layers, working more as separate black boxes. But
it's not up to the Unicode standard to explain how they will do it.

So for me non-characters are not Unicode text, they are not text at all,
and we should not attempt to make them legal if we want to allow strong
designs of isolation mechanisms that allow this separation of layers. The
Unicode standard offers enough space for this separation, with
non-characters (invalid in all standard UTFs), and with invalid code
sequences in standard UTFs that allow building up specific encodings that
must not be called "UTFs" (or "Unicode" or "UCS" or other terms defined in
TUS) and are identified as such in API/protocol designs.

Things would simply be better if TUS did not even define what a
non-character is, and if it did not even suggest that they are legal in
"some" circumstances of text "interchange".



2014-06-02 18:08 GMT+02:00 Mark Davis ☕️ :

> The problem is where to draw the line. In today's world, what's an app?
> You may have a cooperating system of "apps", where it is perfectly
> reasonable to interchange sentinel values (for example).
>
> I agree with Markus; I think the FAQ is pretty clear. (And if not, that's
> where we should make it clearer.)
>
>
> Mark 
>
>  *— Il meglio è l’inimico del bene —*
>
>
> On Mon, Jun 2, 2014 at 6:02 PM, Shawn Steele 
> wrote:
>
>>  I also think that the verbiage swung too far the other way.  Sure, I
>> might need to save or transmit a file to talk to myself later, but apps
>> should be strongly discouraged for using these for interchange with other
>> apps.
>>
>>
>>
>> Interchange bugs are why nearly any news web site ends up with at least a
>> few articles with mangled apostrophes or whatever (because of encoding
>> differences).  Should authors’ tools or feeds or databases or whatever
>> start emitting non-characters from internal use, then we’re going to have
>> ugly leak into text “everywhere”.
>>
>>
>>
>> So I’d prefer to see text that better permitted interchange with other
>> components of an application’s internal system or partner system, yet
>> discouraged use for interchange with “foreign” apps.
>>
>>
>>
>> -Shawn
>>
>>
>>
>> ___
>> Unicode mailing list
>> Unicode@unicode.org
>> http://unicode.org/mailman/listinfo/unicode
>>
>>
>
> ___
> Unicode mailing list
> Unicode@unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
>
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
Except that, particularly the max-weight ones, mean that developers can be 
expected to use this as sentinels in code using ICU, which would preclude their 
use for other things?

Which makes them more like “reserved for use in CLDR” than “noncharacters”?

-Shawn

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Markus Scherer
Sent: Monday, June 2, 2014 2:53 PM
To: David Starner
Cc: Unicode Mailing List
Subject: Re: Corrigendum #9

On Mon, Jun 2, 2014 at 1:32 PM, David Starner <prosfil...@gmail.com> wrote:
I would especially discourage any web browser from handling
these; they're noncharacters used for unknown purposes that are
undisplayable and if used carelessly for their stated purpose, can
probably trigger serious bugs in some lamebrained utility.

I don't expect "handling these" in web browsers and lamebrained utilities. I 
expect "treat like unassigned code points".

markus
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-02 Thread Markus Scherer
On Mon, Jun 2, 2014 at 1:32 PM, David Starner  wrote:

> I would especially discourage any web browser from handling
> these; they're noncharacters used for unknown purposes that are
> undisplayable and if used carelessly for their stated purpose, can
> probably trigger serious bugs in some lamebrained utility.
>

I don't expect "handling these" in web browsers and lamebrained utilities.
I expect "treat like unassigned code points".

markus
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-02 Thread David Starner
On Mon, Jun 2, 2014 at 8:48 AM, Markus Scherer  wrote:
> Right, in principle. However, it should be ok to include noncharacters in
> CLDR data files for processing by CLDR implementations, and it should be
> possible to edit and diff and version-control and web-view those files etc.

Why? It seems you're changing the rules so some Unicode guys can get
oversmart in using Unicode in their systems. You could do the same
thing everyone else does and use special tags or symbols you have to
escape. I would especially discourage any web browser from handling
these; they're noncharacters used for unknown purposes that are
undisplayable and if used carelessly for their stated purpose, can
probably trigger serious bugs in some lamebrained utility.

-- 
Kie ekzistas vivo, ekzistas espero.
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-02 Thread Richard Wordingham
On Mon, 2 Jun 2014 10:17:04 -0700
Markus Scherer  wrote:

> CLDR collation data defines special contraction mappings that start
> with a noncharacter, for
> http://www.unicode.org/reports/tr35/tr35-collation.html#CJK_Index_Markers

> In CLDR 23 and before (when we were still using XML collation syntax),
> these were raw noncharacters in the .xml files.

> As I said earlier:
> it should be ok to include noncharacters in CLDR data files for
> processing by CLDR implementations, and it should be possible to edit
> and diff and version-control and web-view those files etc.

They come as a nasty shock when someone thinks XML files are marked-up
text files.  I'm still surprised that the published human-readable form
of CLDR files should contain automatically applied non-Unicode copyright
claims.

Richard.
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
Hmm, I find that disconcerting.  I’d prefer a real Unicode character with 
special weights if that concept’s needed.  And I guess that goes a long way to 
explaining the interchange problem since clearly the code editor’s going to 
need these ☹

From: Markus Scherer [mailto:markus@gmail.com]
Sent: Monday, June 2, 2014 10:17 AM
To: Shawn Steele
Cc: Asmus Freytag; Doug Ewell; Mark Davis ☕️; Unicode Mailing List
Subject: Re: Corrigendum #9

On Mon, Jun 2, 2014 at 10:00 AM, Shawn Steele <shawn.ste...@microsoft.com> wrote:
To further my understanding, can someone provide examples of how these are used 
in actual practice?

CLDR collation data defines special contraction mappings that start with a 
noncharacter, for 
http://www.unicode.org/reports/tr35/tr35-collation.html#CJK_Index_Markers

In CLDR 23 and before (when we were still using XML collation syntax), these 
were raw noncharacters in the .xml files.

As I said earlier:
it should be ok to include noncharacters in CLDR data files for processing by 
CLDR implementations, and it should be possible to edit and diff and 
version-control and web-view those files etc.

markus
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-02 Thread Markus Scherer
On Mon, Jun 2, 2014 at 10:00 AM, Shawn Steele 
wrote:

> To further my understanding, can someone provide examples of how these are
> used in actual practice?
>

CLDR collation data defines special contraction mappings that start with a
noncharacter, for
http://www.unicode.org/reports/tr35/tr35-collation.html#CJK_Index_Markers

In CLDR 23 and before (when we were still using XML collation syntax),
these were raw noncharacters in the .xml files.

As I said earlier:
it should be ok to include noncharacters in CLDR data files for processing
by CLDR implementations, and it should be possible to edit and diff and
version-control and web-view those files etc.

markus
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
> Oh, look. My mail system converted those nice noncharacters into U+FFFD.
> Was that compliant? Did I deserve what I got? Are those two different 
> questions?

I think I just got spaces.

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
To further my understanding, can someone provide examples of how these are used 
in actual practice?  I can't think of any offhand and the closest I get is like 
the old escape characters to get a dot matrix printer to shift modes, or old 
word processor internal formatting sequences.

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-02 Thread Asmus Freytag

On 6/2/2014 9:38 AM, Shawn Steele wrote:

I agree with Markus; I think the FAQ is pretty clear. (And if not,
that's where we should make it clearer.)

But the formal wording of the standard should reflect that clarity, right?

I don't tend to read the FAQ :)


FAQs are useful, but they are not binding. They are even less binding 
than the general explanation in the text of the Core specification, which 
itself doesn't rise to the level of conformance clauses and definitions...


Doug's unease about the "upside-down" nature of the wording regarding 
PUA and noncharacters is something that should be addressed in revised 
text in the core specification.


A./


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode



___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-02 Thread Mark Davis ☕️
I disagree with that characterization, of course.

The recommendation for libraries and low-level tools to pass them through
rather than screw with them makes them usable. The recommendation to check
for noncharacters from unknown sources and fix them was good advice then,
and is good advice now. Any app where input of noncharacters causes
security problems or crashes is and was not a very good app.


Mark 

 *— Il meglio è l’inimico del bene —*


On Mon, Jun 2, 2014 at 6:37 PM, Asmus Freytag  wrote:

>  On 6/2/2014 9:27 AM, Mark Davis ☕️ wrote:
>
>
> On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele 
> wrote:
>
>> The “problem” is now that previously these characters were illegal
>
>
>  The problem was that we were inconsistent in standard and related
> material about just what the status was for these things.
>
>
>   And threw the baby out to fix it.
>
> A./
>
>
>  Mark 
>
>  *— Il meglio è l’inimico del bene —*
>
>
> ___
> Unicode mailing list
> Unicode@unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
>
>
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


RE: Corrigendum #9

2014-06-02 Thread Doug Ewell
I wrote, sort of:
 
> Correct. Most people wouldn't consider a cooperating system like that
> quite the same as true public interchange, like throwing this ���
> into a message on a public mailing list.

Oh, look. My mail system converted those nice noncharacters into U+FFFD.
Was that compliant? Did I deserve what I got? Are those two different
questions?

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
> > I agree with Markus; I think the FAQ is pretty clear. (And if not, 
> > that's where we should make it clearer.)

> But the formal wording of the standard should reflect that clarity, right?

I don't tend to read the FAQ :)

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-02 Thread Asmus Freytag

On 6/2/2014 9:27 AM, Mark Davis ☕️ wrote:


On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele <shawn.ste...@microsoft.com> wrote:


The “problem” is now that previously these characters were illegal


The problem was that we were inconsistent in standard and related 
material about just what the status was for these things.




And threw the baby out to fix it.

A./


Mark

— Il meglio è l’inimico del bene —


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-02 Thread Asmus Freytag

On 6/2/2014 9:08 AM, Mark Davis ☕️ wrote:
The problem is where to draw the line. In today's world, what's an 
app? You may have a cooperating system of "apps", where it is 
perfectly reasonable to interchange sentinel values (for example).


The way to draw the line is to insist on there being an agreement 
between sender and ultimate receiver, and a pass-through agreement (if 
you will) for any intermediate stage, so that the coast is clear.


What defines an "implementation" in this scenario, is the existence of 
the agreement.


What got us into trouble is that the negative case (pass-through) was 
not well defined, and led people to assume that they had to filter 
any incoming noncharacters.


Because noncharacters can have any interpretation (not limited to 
interpretations as characters), it is much riskier to send them out 
oblivious of whether the intended recipient is part of the same agreement 
on their interpretation as the sender. In that sense, they are not mere 
PUA code points.


The other aspect of their original design was to allow code points that 
recipients were free not to honor or preserve, if they were not part of 
the agreement (and hadn't made an explicit or implicit pass-through 
agreement). Otherwise, if anyone expected them to be preserved, no 
application like Word would be free to use these for purely internal use.


Word thus would not be a tool to handle CLDR data; which may be 
disappointing to some, but should be fine.


A./


I agree with Markus; I think the FAQ is pretty clear. (And if not, 
that's where we should make it clearer.)



Mark

— Il meglio è l’inimico del bene —


On Mon, Jun 2, 2014 at 6:02 PM, Shawn Steele <shawn.ste...@microsoft.com> wrote:


I also think that the verbiage swung too far the other way.  Sure,
I might need to save or transmit a file to talk to myself later,
but apps should be strongly discouraged for using these for
interchange with other apps.

Interchange bugs are why nearly any news web site ends up with at
least a few articles with mangled apostrophes or whatever (because
of encoding differences).  Should authors’ tools or feeds or
databases or whatever start emitting non-characters from internal
use, then we’re going to have ugly leak into text “everywhere”.

So I’d prefer to see text that better permitted interchange with
other components of an application’s internal system or partner
system, yet discouraged use for interchange with “foreign” apps.

-Shawn


___
Unicode mailing list
Unicode@unicode.org 
http://unicode.org/mailman/listinfo/unicode




___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


RE: Corrigendum #9

2014-06-02 Thread Doug Ewell
Shawn Steele  wrote:

> So I’d prefer to see text that better permitted interchange with other
> components of an application’s internal system or partner system, yet
> discouraged use for interchange with "foreign" apps.

If any wording is to be revised, while we're at it, I'd also like to see
a reaffirmation of the proper relationship between private-use
characters and noncharacters. I still hear arguments that private-use
characters are to be avoided in public interchange at all costs, as if
lack of knowledge of the private agreement, or conflicting
interpretations, will cause some kind of major security breach. At the
same time, the Corrigendum seems to imply that noncharacters in public
interchange are no big deal. That seems upside-down.

Mark Davis 🍝  replied:

> The problem is where to draw the line. In today's world, what's an
> app? You may have a cooperating system of "apps", where it is
> perfectly reasonable to interchange sentinel values (for example).

Correct. Most people wouldn't consider a cooperating system like that
quite the same as true public interchange, like throwing this ���
into a message on a public mailing list.

Since the Corrigendum deals with recommendations rather than hard
requirements, SHOULDs rather than MUSTs, it doesn't seem that a bright
line is really needed.

> I agree with Markus; I think the FAQ is pretty clear. (And if not,
> that's where we should make it clearer.)

But the formal wording of the standard should reflect that clarity,
right?

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-02 Thread Mark Davis ☕️
On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele 
wrote:

> The “problem” is now that previously these characters were illegal


The problem was that we were inconsistent in standard and related material
about just what the status was for these things.



Mark 

 *— Il meglio è l’inimico del bene —*
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
That’s exactly what I think should be clarified.  A cooperating system 
of apps should likely use some other markup; however, if they want to use  
to say “OK to insert ad here” (or whatever), that’s up to them.

I fear that the current wording says “Because you might have a cooperating 
system of apps that all agree  is ‘OK to insert ad here’, you may as well 
emit  all the time just in case some other app happens to use the same 
sentinel”.

The “problem” is now that previously these characters were illegal, so my 
application didn’t have to explicitly remove them when importing external stuff 
because they weren’t allowed to be there.  With the wording of the corrigendum, 
the onus is on every app importing data to filter out these code points because 
they are “suddenly” legal in foreign data streams.

That is a breaking change for applications, and, worse, it isn’t in the control 
of the applications that take advantage of the newly laxer wording, but rather 
all the other applications on the planet, which may have been stable for years.

My interpretation of “interchanged” was “interchanged outside of a system that 
understood your private use of the noncharacters”.  I can see where that may 
not have been everyone’s interpretation, and maybe should be updated.  My 
interpretation of what you’re saying below is “sentinel values with a private 
meaning can be exchanged between apps”, which is what the PUA’s for.

I don’t mind at all if the definition is loosened somewhat, but if we’re 
turning them into PUA characters we should just turn them into PUA characters.

-Shawn

From: mark.edward.da...@gmail.com [mailto:mark.edward.da...@gmail.com] On 
Behalf Of Mark Davis ??
Sent: Monday, June 2, 2014 9:08 AM
To: Shawn Steele
Cc: Markus Scherer; Doug Ewell; Unicode Mailing List
Subject: Re: Corrigendum #9

The problem is where to draw the line. In today's world, what's an app? You may 
have a cooperating system of "apps", where it is perfectly reasonable to 
interchange sentinel values (for example).

I agree with Markus; I think the FAQ is pretty clear. (And if not, that's where 
we should make it clearer.)


Mark <https://google.com/+MarkDavis>

— Il meglio è l’inimico del bene —

On Mon, Jun 2, 2014 at 6:02 PM, Shawn Steele <shawn.ste...@microsoft.com> wrote:
I also think that the verbiage swung too far the other way.  Sure, I might need 
to save or transmit a file to talk to myself later, but apps should be strongly 
discouraged for using these for interchange with other apps.

Interchange bugs are why nearly any news web site ends up with at least a few 
articles with mangled apostrophes or whatever (because of encoding 
differences).  Should authors’ tools or feeds or databases or whatever start 
emitting non-characters from internal use, then we’re going to have ugly leak 
into text “everywhere”.

So I’d prefer to see text that better permitted interchange with other 
components of an application’s internal system or partner system, yet 
discouraged use for interchange with “foreign” apps.

-Shawn


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-02 Thread Mark Davis ☕️
The problem is where to draw the line. In today's world, what's an app? You
may have a cooperating system of "apps", where it is perfectly reasonable
to interchange sentinel values (for example).

I agree with Markus; I think the FAQ is pretty clear. (And if not, that's
where we should make it clearer.)


Mark 

 *— Il meglio è l’inimico del bene —*


On Mon, Jun 2, 2014 at 6:02 PM, Shawn Steele 
wrote:

>  I also think that the verbiage swung too far the other way.  Sure, I
> might need to save or transmit a file to talk to myself later, but apps
> should be strongly discouraged for using these for interchange with other
> apps.
>
>
>
> Interchange bugs are why nearly any news web site ends up with at least a
> few articles with mangled apostrophes or whatever (because of encoding
> differences).  Should authors’ tools or feeds or databases or whatever
> start emitting non-characters from internal use, then we’re going to have
> ugly leak into text “everywhere”.
>
>
>
> So I’d prefer to see text that better permitted interchange with other
> components of an application’s internal system or partner system, yet
> discouraged use for interchange with “foreign” apps.
>
>
>
> -Shawn
>
>
>
> ___
> Unicode mailing list
> Unicode@unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
>
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
I also think that the verbiage swung too far the other way.  Sure, I might need 
to save or transmit a file to talk to myself later, but apps should be strongly 
discouraged for using these for interchange with other apps.

Interchange bugs are why nearly any news web site ends up with at least a few 
articles with mangled apostrophes or whatever (because of encoding 
differences).  Should authors’ tools or feeds or databases or whatever start 
emitting non-characters from internal use, then we’re going to have ugly leak 
into text “everywhere”.

So I’d prefer to see text that better permitted interchange with other 
components of an application’s internal system or partner system, yet 
discouraged use for interchange with “foreign” apps.

-Shawn

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-02 Thread Markus Scherer
On Mon, Jun 2, 2014 at 8:27 AM, Doug Ewell  wrote:

> I suspect everyone can agree on the edge cases, that noncharacters are
> harmless in internal processing, but probably should not appear in
> random text shipped around on the web.
>

Right, in principle. However, it should be ok to include noncharacters in
CLDR data files for processing by CLDR implementations, and it should be
possible to edit and diff and version-control and web-view those files etc.

It seems that trying to define "interchange" and "public" in ways that
satisfy everyone will not be successful.

The FAQ already gives some examples of where noncharacters might be used,
should be preserved, or could be stripped, starting with "Q: Are
noncharacters intended for interchange?"

In my view, those Q/A pairs explain noncharacters quite well. If there are
further examples of where noncharacters might be used, should be preserved,
or could be stripped, and that would be particularly useful to add to the
examples already there, then we could add them.

markus
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-02 Thread Doug Ewell
It seems that the broadening of the term "interchange" in this
corrigendum to mean "almost any type of processing imaginable," below,
is what caused the trouble. This is the decision that would need to be
reconsidered if the real intent of noncharacters is to be expressed.

I suspect everyone can agree on the edge cases, that noncharacters are
harmless in internal processing, but probably should not appear in
random text shipped around on the web.

> This is necessary for the effective use of noncharacters, because
> anytime a Unicode string crosses an API boundary, it is in effect
> being "interchanged". Furthermore, for distributed software, it is
> often very difficult to determine what constitutes an "internal"
> versus an "external" context for any particular software process.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-01 Thread Asmus Freytag

On 6/1/2014 9:07 AM, Markus Scherer wrote:
On Sun, Jun 1, 2014 at 7:49 AM, Karl Williamson  wrote:


Thanks, I had not thought about that.  I'm thinking wording
something like this is more appropriate

"Noncharacters may be openly interchanged, but it is inadvisable
to do so without prior agreement, since at each stage any of them
might be replaced by a REPLACEMENT CHARACTER or otherwise disposed
of, at the sole discretion of that stage's implementation."


I think that would invite again the kinds of implementations that 
triggered Corrigendum #9, where you couldn't use CLDR files with 
Gnome-based tools (plain text editors, file diff tools, command-line 
terminal) if the files contained noncharacters. (CLDR data uses them 
for boundary mappings in collation data.)



The new text triggers some really unwarranted interpretations, which can 
invalidate the use of noncharacters for their stated purpose.


Please see my suggested text that attempts to describe both intent and 
differences in use.


A./
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-01 Thread Asmus Freytag

On 6/1/2014 7:49 AM, Karl Williamson wrote:

On 05/30/2014 12:49 PM, Asmus Freytag wrote:

One of the concerns was that people felt that they had to have "data
pipeline" style implementations (tools) go and filter these out - even
if there was no intent for the implementation to use them internally in
any way. Making clear that the standard does not require filtering
allows for cleaner implementations of such ("pass-through") tools.


Thanks, I had not thought about that.  I'm thinking wording something 
like this is more appropriate


"Noncharacters may be openly interchanged, but it is inadvisable to do 
so without prior agreement, since at each stage any of them might be 
replaced by a REPLACEMENT CHARACTER or otherwise disposed of, at the 
sole discretion of that stage's implementation."



Karl,

I think you should address the pass-through style of implementation 
explicitly.


"Noncharacters are designed to be used for special, 
implementation-internal purposes, that puts them outside the text 
content of the data. Some implementations, by necessity, use a 
distributed architecture, and rely on yet other implementations for 
services like transport, code conversion, and so on. For such 
"pass-through" implementations, it would be inadvisable to rely on, or 
replace any noncharacter, and certainly not to reject or filter them. 
Doing so would make such an implementation a poor choice to serve as a 
"pass through" in a distributed architecture that makes use of 
noncharcters for internal purposes. In other words such an 
implementation would make it impossible to bridge between the partners 
in a prior agreement on the use of noncharacters, which would severely 
undercut its utility."


You might want to check whether some statement like this isn't already 
part of the FAQ. If it isn't, it would be the easiest to retrofit (and 
the easiest place to lay out usage guidelines).


Alternatively, or in conjunction, you could propose that the text in the 
core specification be tweaked to help set better expectations.


A./
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-01 Thread Karl Williamson

On 06/01/2014 10:07 AM, Markus Scherer wrote:

On Sun, Jun 1, 2014 at 7:49 AM, Karl Williamson  wrote:

Thanks, I had not thought about that.  I'm thinking wording
something like this is more appropriate

"Noncharacters may be openly interchanged, but it is inadvisable to
do so without prior agreement, since at each stage any of them might
be replaced by a REPLACEMENT CHARACTER or otherwise disposed of, at
the sole discretion of that stage's implementation."


I think that would invite again the kinds of implementations that
triggered Corrigendum #9, where you couldn't use CLDR files with
Gnome-based tools (plain text editors, file diff tools, command-line
terminal) if the files contained noncharacters. (CLDR data uses them for
boundary mappings in collation data.)

markus


I don't understand your point.  Are you saying that Gnome should not 
have the discretion to rid its inputs of noncharacters?  If so, then 
noncharacters really are just like Gc=Co (private-use) characters.

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-01 Thread Markus Scherer
On Sun, Jun 1, 2014 at 7:49 AM, Karl Williamson 
wrote:

> Thanks, I had not thought about that.  I'm thinking wording something like
> this is more appropriate
>
> "Noncharacters may be openly interchanged, but it is inadvisable to do so
> without prior agreement, since at each stage any of them might be replaced
> by a REPLACEMENT CHARACTER or otherwise disposed of, at the sole discretion
> of that stage's implementation."


I think that would invite again the kinds of implementations that triggered
Corrigendum #9, where you couldn't use CLDR files with Gnome-based tools
(plain text editors, file diff tools, command-line terminal) if the files
contained noncharacters. (CLDR data uses them for boundary mappings in
collation data.)

markus
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-01 Thread Karl Williamson

On 05/30/2014 12:49 PM, Asmus Freytag wrote:

One of the concerns was that people felt that they had to have "data
pipeline" style implementations (tools) go and filter these out - even
if there was no intent for the implementation to use them internally in
any way. Making clear that the standard does not require filtering
allows for cleaner implementations of such ("pass-through") tools.


Thanks, I had not thought about that.  I'm thinking wording something 
like this is more appropriate


"Noncharacters may be openly interchanged, but it is inadvisable to do 
so without prior agreement, since at each stage any of them might be 
replaced by a REPLACEMENT CHARACTER or otherwise disposed of, at the 
sole discretion of that stage's implementation."
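
As a concrete sketch of the replacement step this wording contemplates
(C over UTF-32 code points; the function name and buffer convention are
illustrative only, and replacement rather than deletion follows the
wording above):

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative: replace every noncharacter in a UTF-32 buffer with
       U+FFFD REPLACEMENT CHARACTER, leaving the length unchanged. */
    static void replace_noncharacters(uint32_t *buf, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            uint32_t cp = buf[i];
            if ((cp >= 0xFDD0 && cp <= 0xFDEF) ||
                ((cp & 0xFFFE) == 0xFFFE && cp <= 0x10FFFF))
                buf[i] = 0xFFFD;  /* REPLACEMENT CHARACTER */
        }
    }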

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-01 Thread Philippe Verdy
Ok then, the definitions still do not say that blocks cannot be split (in
fact it has already occurred many times across versions, by reevaluating the
need for new blocks and by densifying the BMP, up to the point that
sometimes a single addition to the same script required allocating columns
in multiple subblocks as small as a column of 16 code points).

Blocks are in fact artefacts of the encoding process; they are provisional
until the characters needed are effectively allocated. Later, any unused
area may be reallocated to another block.

On the BMP, for example, there remains a quite large area in a block
initially described for supplemental arrows that could host a new full
alphabetic script (most probably one of the remaining Indic or African
modern scripts still to encode), or symbols used in common software or
devices for their UI and its documentation (such as the window
minimize/maximize/close buttons or resize corner, or the refresh button, or
the microphone symbol to initiate a voice call, or the radio-wave symbol
for accessing a wireless network), or conventional symbols for accessibility
devices, or marks of dangers/hazards or restrictions/prohibitions that could
be used as widely as currency symbols (often encoded in emergency but in
isolation, unlike other symbols coming in small related groups; if these
collections are large, like emoticons/emojis, they'll go directly to the SMP).

Blocks are not immutable in size, even if they keep their initial position
(because allocations in blocks start at the leading position, skipping only a
few entries that were balloted for possible later allocation to the same
script, or for former proposals of characters that were balloted in favor
of unification with another character, or just to align the block with the
layout of another legacy encoding chart, or because the initial beta fonts
submitted to support the script allocated other characters that were not
approved, and the fonts were not updated to use a new layout).

Maybe in some future we will see a few more allocations made in the BMP
using half columns (this is *already* the case at the end of the BMP, where
a single column is split in two parts, containing Armenian presentation
forms, and Hebrew presentation forms for Yiddish...), or filling some
random holes for which it is definitively decided that the initial
reservations in the roadmap will never be used for the initially intended
purpose.



2014-06-01 8:20 GMT+02:00 Asmus Freytag :

> On 5/31/2014 10:06 PM, Philippe Verdy wrote:
>
>> I've not proposed to move these characters elsewhere (or to reencode
>> them), why do you think that?
>>
>> I just challenge your statement that a block cannot be discontinuous,
>>
>
> Well, go ahead and challenge that.
>
> As implemented in the current nameslist and file blocks.txt a block would
> have this definition. "A block is a uniquely named, continuous,
> non-overlapping range of code points, containing a multiple of 16 code
> points, and starting at a location that is a multiple of 16."
>
> Per chapter 3 the definition of the property block is given in Section
> 17.1 (Code Charts) - which contains no actual definition, only tells you
> how they are used in organizing the code charts, so, effectively, a block
> is what blocks.txt (and therefore the names list) say it is. The way blocks
> are assigned, has been following the empirically derived definition I gave
> above, and at this point, the production process for the code charts has
> some of these restrictions built in.
>
> Chapter 3 calls blocks an enumerated property, meaning that the names must
> be unique, and blocks.txt associates a single range with a name, in
> concurrence with the glossary, which says blocks represent a range of
> characters (not a collection of ranges). Likewise, changing blocks to not
> starting at or containing multiples of 16 code points (sometimes called a
> "column") is equally not in the cards - it would break the very production
> process for chart production. The description of how blocks are used does
> not contemplate that they can be mutually overlapping, so that becomes part
> of their implicit definition as well.
>
> There's reason behind the madness of not providing an explicit definition
> of "block" in the standard. It has to do with discouraging people from
> relying on what is largely an editorial device (headers on charts).
> However, it does not mean that arbitrary redefinition of a block from a
> single to multiple ranges is something that can or should be contemplated.
>
> So, the chances that UTC would agree to such changes, even if not formally
> guaranteed, is de facto nil.
>
> A./
>
>
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-05-31 Thread Asmus Freytag

On 5/31/2014 10:06 PM, Philippe Verdy wrote:
I've not proposed to move these characters elsewhere (or to reencode 
them), why do you think that?


I just challenge your statement that a block cannot be discontinuous,


Well, go ahead and challenge that.

As implemented in the current nameslist and file blocks.txt a block 
would have this definition. "A block is a uniquely named, continuous, 
non-overlapping range of code points, containing a multiple of 16 code 
points, and starting at a location that is a multiple of 16."
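
That empirical definition is easy to check mechanically; a hedged C sketch
of the invariants for one Blocks.txt-style range (parsing assumed, names
illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    /* Invariants from the definition above: the range starts on a
       multiple of 16 and contains a multiple of 16 code points
       (equivalently, it ends on an xxxF boundary). */
    static bool block_range_ok(uint32_t first, uint32_t last)
    {
        return first <= last &&
               first % 16 == 0 &&
               (last - first + 1) % 16 == 0;
    }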


Per chapter 3 the definition of the property block is given in Section 
17.1 (Code Charts) - which contains no actual definition, only tells you 
how they are used in organizing the code charts, so, effectively, a 
block is what blocks.txt (and therefore the names list) say it is. The 
way blocks are assigned, has been following the empirically derived 
definition I gave above, and at this point, the production process for 
the code charts has some of these restrictions built in.


Chapter 3 calls blocks an enumerated property, meaning that the names 
must be unique, and blocks.txt associates a single range with a name, in 
concurrence with the glossary, which says blocks represent a range of 
characters (not a collection of ranges). Likewise, changing blocks to 
not starting at or containing multiples of 16 code points (sometimes 
called a "column") is equally not in the cards - it would break the very 
production process for chart production. The description of how blocks 
are used does not contemplate that they can be mutually overlapping, so 
that becomes part of their implicit definition as well.


There's reason behind the madness of not providing an explicit 
definition of "block" in the standard. It has to do with discouraging 
people from relying on what is largely an editorial device (headers on 
charts). However, it does not mean that arbitrary redefinition of a 
block from a single to multiple ranges is something that can or should 
be contemplated.


So, the chances that UTC would agree to such changes, even if not 
formally guaranteed, is de facto nil.


A./

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-05-31 Thread Philippe Verdy
I've not proposed to move these characters elsewhere (or to reencode them),
why do you think that?

I just challenge your statement that a block cannot be discontinuous,
something that is unique among all Unicode properties and completely absent
from ISO 10646, which does not define any real properties beside a name at
a specific code point and some informative glyph, plus historic reference
links documenting its intended usage. (Where is it written in the
Unicode-only stability rules that a block is continuous, when allocation of
code points in these blocks has always been discontinuous?...) Other
properties are much more important than this legacy one, which has
absolutely no use in regexes, as you stated.

Even the set of non-characters is also discontinuous, as are the blocks for
the Arabic script, the blocks for presentation forms, and the blocks for
compatibility characters. Every property in Unicode is fragmented over
multiple ranges (whose lengths are also extremely frequently discontinuous
within each block, or even within the same encoding column).

In other words, IsInArabicPresentation(x) would still remain true for all
assigned characters in that block; it would just be false for non-characters
considered outside of it. But non-characters don't have any useful property
except being non-characters (the block where they are allocated does not
matter at all).

The alternative is to not restrict these characters as being non-characters
and to allow them to be present in files without raising any error, i.e.
treat them like PUA, also with a few possible default properties (this keeps
them somewhat interoperable with limited private agreements, possibly
implicit in the transport interface or envelope format).

2014-06-01 4:15 GMT+02:00 Asmus Freytag :

>  More importantly, while a regex that uses an expression that is
> equivalent to "IsInArabicPresentation(x)" may or may not be well-defined,
> there is no reason to break it by splitting the block.
>
> As blocks cannot be discontiguous (unlike other properties), some Arabic
> Presentation forms would have to be put into a new block (Arabic
> Presentation Forms C). This is what would break such expressions - it has,
> in fact, nothing to do with the status of the noncharacters.
>
> There's no reason to contemplate breaking changes of any kind at this
> point.
>
> A./
>
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-05-31 Thread Asmus Freytag

On 5/31/2014 12:36 PM, Philippe Verdy wrote:
Maybe; but there's real doubt that a regular expression that would 
need this property would be severely broken if that property was 
corrected. There are many other properties that are more useful (and 
much more used) whose associated set of code points changes regularly 
across versions.


we have learned that there are always more implementations of a feature 
than we might have predicted. That has been true, for Unicode, from day one.


More importantly, while a regex that uses an expression that is 
equivalent to "IsInArabicPresentation(x)" may or may not be well-defined, 
there is no reason to break it by splitting the block.


As blocks cannot be discontiguous (unlike other properties), some Arabic 
Presentation forms would have to be put into a new block (Arabic 
Presentation Forms C). This is what would break such expressions - it 
has, in fact, nothing to do with the status of the noncharacters.


There's no reason to contemplate breaking changes of any kind at this point.

A./
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-05-31 Thread Mark Davis ☕️
A few quick items. (I admit to only skimming your response, Philippe; there
is only so much time in the day.)

Any discussion of changing non-characters is really pointless. See
http://www.unicode.org/policies/property_value_stability_table.html

As to breaking up the block, that is not forbidden: but one would have to
give pretty compelling arguments that the benefits would outweigh any
likely problems, especially since we already don't recommend the use of the
block property in regexes.

> And regular expressions trying to use character properties have many more
caveats to handle, the most serious being canonical equivalences and
discontinuous or partial matches.

The UTC, after quite a bit of work, concluded that it was not feasible with
today's regex engines to handle normalization automatically, instead
recommending the approach in
http://www.unicode.org/reports/tr18/#Canonical_Equivalents

> Regexps are still a very experimental proposal, they are still very
difficult to make interoperable except in a small set of tested cases

I have no idea where this is coming from. Regexes using Unicode properties
are in widespread and successful use. It is not that hard to make them
interoperable (as long as both implementations are using the same version
of Unicode).


Mark 

 *— Il meglio è l’inimico del bene —*


On Sat, May 31, 2014 at 9:36 PM, Philippe Verdy  wrote:

> Maybe; but there's real doubt that a regular expression that would need
> this property would be severely broken if that property was corrected.
> There are many other properties that are more useful (and much more used)
> whose associated set of code points changes regularly across versions.
>
> I don't see any specific interest in maintaining non-characters in that
> block, as it effectively reduces the reusability of this property.
> And in fact it would be highly preferable to no longer state that these
> non-characters in Arabic Presentation Forms be treated like C1 controls or
> PUA (because they will never be reassigned to something more useful).
> Making them PUA would not change radically the fact that these characters
> are not recommended, but we would no longer have to bother about checking
> whether they are valid or not. They remain there only as a legacy of old
> outdated versions of Unicode, for a mysterious need that I've not clearly
> identified.
>
> Let's assume we change them into PUA; some applications will start
> accepting them while some others won't. Not a problem, given that they are
> already not interoperable.
>
> And regular expressions trying to use character properties have many more
> caveats to handle, the most serious being canonical equivalences and
> discontinuous or partial matches: when searches focus only on exact sets
> of code points instead of sets of canonically equivalent texts; or the
> complication coming from the effect of collation and its variable
> strength, matching more or fewer parts of text spanning ignorable
> collation elements, i.e. possibly also discontinuous runs of ignorable
> code points, if we want to get consistent results independent of the
> normalization form. More complicated still is how to handle "partial
> matches", such as a combining character within a precomposed character
> which is canonically equivalent to a string where this combining character
> appears.
>
> And even more tricky is how to handle substitution with regexps, for
> example when performing a search at primary collation level ignoring
> lettercase, but when we want to replace base letters yet preserve case in
> the substituted string: this requires specific lookup of characters using
> properties **not** specified in the UCD but in the collation tailoring
> data, and then how to ensure that the result of the substitution in the
> plain-text source will remain a valid text, not creating new unexpected
> canonical equivalences, that it will also not break basic orthographic
> properties such as syllabic structures in a specific pair of
> language+script, and that it will not produce unexpected collation
> equivalents at the same collation strength, causing later unexpected
> never-ending loops of substitutions, for example on large websites with
> bots performing text corrections.
>
> Regexps are still a very experimental proposal; they are still very
> difficult to make interoperable except in a small set of tested cases, and
> for this reason I really doubt that the "character encoding block"
> property is very productive for now with regexps (and notably not with
> this "compatibility" block, whose characters will remain used in
> isolation, independently of their context, if they are still used in rare
> cases).
>
> I see little value in keeping this old complication in this block, but
> just more interoperability problems for implementations. So these
> non-characters should be treated mostly like PUA, except that they have a
> few more properties: direction=RTL, script=Arabic, and starters working in
> isolation for the Arabic joining type (these properties can help limit
> their generic reusability like regular PUAs, but at least all other
> processes, and notably generic validators, won't have to bother about
> them).

Re: Corrigendum #9

2014-05-31 Thread Philippe Verdy
Maybe; but there's real doubt that a regular expression that would need
this property would be severely broken if that property was corrected.
There are many other properties that are more useful (and much more used)
whose associated set of code points changes regularly across versions.

I don't see any specific interest in maintaining non-characters in that
block, as it effectively reduces the reusability of this property.
And in fact it would be highly preferable to no longer state that these
non-characters in Arabic Presentation Forms be treated like C1 controls or
PUA (because they will never be reassigned to something more useful). Making
them PUA would not change radically the fact that these characters are not
recommended, but we would no longer have to bother about checking whether
they are valid or not. They remain there only as a legacy of old outdated
versions of Unicode, for a mysterious need that I've not clearly identified.

Let's assume we change them into PUA; some applications will start
accepting them while some others won't. Not a problem, given that they are
already not interoperable.

And regular expressions trying to use character properties have many more
caveats to handle, the most serious being canonical equivalences and
discontinuous or partial matches: when searches focus only on exact sets of
code points instead of sets of canonically equivalent texts; or the
complication coming from the effect of collation and its variable strength,
matching more or fewer parts of text spanning ignorable collation elements,
i.e. possibly also discontinuous runs of ignorable code points, if we want
to get consistent results independent of the normalization form. More
complicated still is how to handle "partial matches", such as a combining
character within a precomposed character which is canonically equivalent to
a string where this combining character appears.

And even more tricky is how to handle substitution with regexps, for
example when performing a search at primary collation level ignoring
lettercase, but when we want to replace base letters yet preserve case in
the substituted string: this requires specific lookup of characters using
properties **not** specified in the UCD but in the collation tailoring
data, and then how to ensure that the result of the substitution in the
plain-text source will remain a valid text, not creating new unexpected
canonical equivalences, that it will also not break basic orthographic
properties such as syllabic structures in a specific pair of
language+script, and that it will not produce unexpected collation
equivalents at the same collation strength, causing later unexpected
never-ending loops of substitutions, for example on large websites with
bots performing text corrections.

Regexps are still a very experimental proposal; they are still very
difficult to make interoperable except in a small set of tested cases, and
for this reason I really doubt that the "character encoding block"
property is very productive for now with regexps (and notably not with this
"compatibility" block, whose characters will remain used in isolation,
independently of their context, if they are still used in rare cases).

I see little value in keeping this old complication in this block, but just
more interoperability problems for implementations. So these non-characters
should be treated mostly like PUA, except that they have a few more
properties: direction=RTL, script=Arabic, and starters working in isolation
for the Arabic joining type (these properties can help limit their generic
reusability like regular PUAs, but at least all other processes, and
notably generic validators, won't have to bother about them).

2014-05-31 18:17 GMT+02:00 Asmus Freytag :

>  On 5/31/2014 4:09 AM, Philippe Verdy wrote:
>
>  2014-05-30 20:49 GMT+02:00 Asmus Freytag :
>
>> This might have been possible at the time these were added, but now it is
>> probably not feasible. One of the reasons is that block names are exposed
>> (for better or for worse) as character properties and as such are also
>> exposed in regular expressions. While not recommended, it would be really
>> bad if the expression with pseudo-code "IsInArabicPresentationFormB(x)"
>> were to fail, because we split the block into three (with the middle one
>> being the noncharacters).
>>
>
>  If you think about pseudocode testing for properties then nothing
> forbids the test IsInArabicPresentationFormB(x) from checking two ranges
> instead of just one.
>
> Besides the point.
>
> The issue is that the result of evaluation of an expression would change
> over time.
>
> A./
>
>
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-05-31 Thread Asmus Freytag

On 5/31/2014 4:09 AM, Philippe Verdy wrote:
2014-05-30 20:49 GMT+02:00 Asmus Freytag :


This might have been possible at the time these were added, but
now it is probably not feasible. One of the reasons is that block
names are exposed (for better or for worse) as character
properties and as such are also exposed in regular expressions.
While not recommended, it would be really bad if the expression
with pseudo-code "IsInArabicPresentationFormB(x)" were to fail,
because we split the block into three (with the middle one being
the noncharacters).


If you think about pseudocode testing for properties then nothing 
forbids the test IsInArabicPresentationFormB(x) from checking two ranges 
instead of just one.

Besides the point.

The issue is that the result of evaluation of an expression would change 
over time.


A./

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-05-31 Thread Philippe Verdy
2014-05-30 20:49 GMT+02:00 Asmus Freytag :

> This might have been possible at the time these were added, but now it is
> probably not feasible. One of the reasons is that block names are exposed
> (for better or for worse) as character properties and as such are also
> exposed in regular expressions. While not recommended, it would be really
> bad if the expression with pseudo-code "IsInArabicPresentationFormB(x)"
> were to fail, because we split the block into three (with the middle one
> being the noncharacters).
>

If you think about pseudocode testing for properties then nothing forbids
the test IsInArabicPresentationFormB(x) from checking two ranges instead of
just one. Almost all character properties use multiple ranges of
characters (including the more useful properties needed in lots of places
in the code), so updating the test so that the property covers two ranges
is not a major change.
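
For illustration only (this split is hypothetical, and the ranges below
merely assume Arabic Presentation Forms-A, FB50..FDFF, were divided around
the FDD0..FDEF noncharacters), the two-range form of such a test in C:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical: a block test checking two ranges instead of one,
       if FB50..FDFF were split around the noncharacters FDD0..FDEF. */
    static bool in_arabic_presentation_forms_a(uint32_t cp)
    {
        return (cp >= 0xFB50 && cp <= 0xFDCF) ||
               (cp >= 0xFDF0 && cp <= 0xFDFF);
    }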

But anyway, I have never seen the non-characters in the Arabic presentation
forms used anywhere other than within legacy Arabic fonts, which use these
code points to map... Arabic presentation forms.

OK, text documents do not need to encode these legacy forms in order to use
these fonts (text renderers don't need them with modern OpenType fonts, but
will still use them in legacy non-OpenType TTF fonts, as a tentative
fallback to render these contextual forms).

So basically there's no interchange of *text*, but the fonts using these
code points are still interchanged.

I think it would be better to just reassign these characters as
compatibility characters (or even as PUA) and not as non-characters. I see
no rationale for keeping them illegal, when it just causes unnecessary
complications for document validation.

After all, most C0 and C1 controls also don't have any interchangeable
semantics except being "controls", which are always application- and
protocol-dependent (not meant for encoding texts, except in legacy more or
less "rich" encodings, e.g. for storing escape sequences, not standardized
and fully dependent on the protocol or terminal type, or on various legacy
standards that did not separate text from style, or for the many protocols
that need them for special purposes, such as tagging content, switching
code pages, changing colors and font styles, positioning on a screen or
input form, adding formatting metadata, implementing out-of-band commands,
starting/stopping records, pacing the bandwidth use,
starting/ending/redirecting/splitting/merging sessions, embedding non-text
content such as bitmap images or structured data, changing transport
protocol options such as compression schemes, exchanging
encryption/decryption keys, adding checksum controls or autocorrection
data, marking redundant data copies, or inserting resynchronization points
for error recovery...)

So these "non-characters" in Arabic presentation forms are to be treated
more or less like most C1 controls that have undefined behavior. Saying
that there's a need for a "prior agreement" the agreement may be explicit
by the fact that they are used in some old font formats (the same is true
about old fonts using PUA assignments: the kind of agreement is basically
the same, and in both cases, fonts are not plain-text documents).

So the good question for us is only to be able to answer this: "is this
document valid and conforming plain text?"

If:
  * (1) your document contains
    - any of most of the C0 or C1 controls (except CR, LF, VT, FF, and NEL
      from C1),
    - anything in the PUA,
    - any non-characters, or
    - any unpaired surrogates,
  * or (2) your document does not validate against its encoding scheme,
then it is not plain text (to be interchangeable it also needs a recognized
standard encoding, which also requires an agreement or a specification in
the protocol or file format used to transport it).
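
One way to express that checklist over decoded scalar values, as a C
sketch (which C0/C1 controls to allow is exactly the policy question being
argued here; the function name is illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    /* Sketch of the checklist above: may this scalar value appear in
       interchanged plain text?  Lines marked "policy" follow the list
       above rather than any normative rule. */
    static bool plain_text_scalar_ok(uint32_t cp)
    {
        if (cp > 0x10FFFF) return false;                      /* out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;       /* surrogates */
        if ((cp >= 0xFDD0 && cp <= 0xFDEF) ||
            (cp & 0xFFFE) == 0xFFFE) return false;            /* noncharacters */
        if ((cp >= 0xE000 && cp <= 0xF8FF) ||
            (cp >= 0xF0000 && cp <= 0xFFFFD) ||
            (cp >= 0x100000 && cp <= 0x10FFFD)) return false; /* PUA (policy) */
        if (cp < 0x20)                                        /* C0 controls */
            return cp == 0x0A || cp == 0x0B ||
                   cp == 0x0C || cp == 0x0D;                  /* policy: LF/VT/FF/CR */
        if (cp >= 0x7F && cp <= 0x9F)                         /* DEL and C1 */
            return cp == 0x85;                                /* policy: NEL */
        return true;
    }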

Personally, I think that surrogates are also non-characters. They are not
assigned to any character, even if a pair of encodings uses them
internally to represent code units (not code points directly, which are
first converted into two code units); this means that some documents are
valid UTF-16 and UTF-32 documents even if they are not plain text under the
current system (I don't like this situation, because UTF-16 and UTF-32
documents are supposed to be interchangeable, even if they are not all
convertible to UTF-8).

But with the non-characters in the Arabic presentation forms, all is made
as if they were reserved for a possible future encoding that could use them
internally for representing some text using sequences of code units
containing or starting with them, or for some still mysterious encoding of
a PUA agreement with an unspecified protocol (exactly the same situation as
with most C1 controls), or as a possible replacement for some code units
that could collide with the internal use of some standard controls in some
protocols (e.g. to reencode a NULL, or to delimit the end of a
variable-length escape sequence, when all other C0 and C1 controls are
already used).

Re: Corrigendum #9

2014-05-31 Thread Richard Wordingham
On Fri, 30 May 2014 12:26:18 -0600
Karl Williamson  wrote:

> I'm having a problem with this
> http://www.unicode.org/versions/corrigendum9.html

> Some people now think it means that noncharacters are really no 
> different from private-use characters, and should be treated very 
> similarly if not identically.

> It seems to me that they should be illegal in open interchange, or 
> perhaps illegal in interchange without prior agreement.

So one just puts a notice on the web site saying that by downloading
CLDR files one agrees to accept non-characters.  Part of the original
problem is that the CLDR mechanism for identifying Unicode scalar
values in XML rather than quoting them (albeit by numeric entities) was
broken.

> Thus, I don't see how noncharacters can be considered to be valid in 
> public interchange, given that the producers have to assume that the 
> consumers will not accept them.

The publishing of the CLDR data was strictly limited to the Milky Way,
and will remain so for several decades at the very least.  Therefore it
was not public interchange.

Practically, there is the very real issue that a system may be useful
enough to be used as part of a larger system, and therefore called
upon to handle any Unicode scalar value.  One possible solution is to
use, instead of non-characters, lone low surrogates.  These have the
advantage of having obvious representations for use with all three
encoding forms. Of course, internal checks on the well-formedness of
Unicode strings would have to be relaxed, and one might prefer to use
them doubled in UTF-16 so as not to weaken checks for broken strings.

Richard.
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Block Boundaries (was: RE: Corrigendum #9)

2014-05-30 Thread Richard Wordingham
On Fri, 30 May 2014 13:22:58 -0700
Markus Scherer  wrote:

> In addition, the Block property is not particularly useful even in
> regular expressions or other processing. It is almost always more
> useful to use Script, Alphabetic, Unified_Ideograph, etc.
> Blocks help with planning and allocation but little else.

They also help with the code charts.

Richard.

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Block Boundaries (was: RE: Corrigendum #9)

2014-05-30 Thread Markus Scherer
In addition, the Block property is not particularly useful even in regular
expressions or other processing. It is almost always more useful to use
Script, Alphabetic, Unified_Ideograph, etc.
Blocks help with planning and allocation but little else.
markus
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Block Boundaries (was: RE: Corrigendum #9)

2014-05-30 Thread Whistler, Ken
Skipping over the wording related to noncharacters for the moment,
let me address the block stability issue:

> I also am curious as to why the consecutive group of 32 noncharacters
> can't be split off into its own block instead of being part of an Arabic
> one.  I'm unaware of any stability policy forbidding this.  Another
> block is to be split, if I recall correctly, to accommodate the new
> Cherokee characters.

Actually, this is *not* correct.

The Latin Extended-E block *will* be first published in Unicode 7.0
next month. In the charts for that version and in Blocks.txt, the
range for Latin Extended-E is AB30..AB6F.

True, it was initially approved with a more extended range, and was
long shown with the longer range in the Roadmap. But the Roadmap
is just a "roadmap", and not the standard. The new range allocated
to the Cherokee Supplement (AB70..ABBF) is in ballot now, so that
allocation is not final, although I personally consider it unlikely to change
before publication next year.

At any rate, the revision of the range for the Latin Extended-E block
occurred before actual publication of that block.

The net net here is that the last major churning of block boundaries dates
all the way back to Unicode 1.1 times and the great Hangul Catastrophe.
And the last time any formal block boundary was touched was in 2002,
when all blocks were firmly ended on xxxF boundaries as part of synchronizing
documentation between the Unicode Standard and 10646.
And while there is indeed no actual stability guarantee in place that would
absolutely prevent the UTC or SC2 from adjusting a block boundary if it
decided to, the committees are very unlikely to do so, for the reasons
that Asmus cited.

Keep in mind that even if the UTC, for some reason, decided it would be
a cool idea to split the Arabic Presentation Forms-A block into a new, shorter
range and two new blocks, just so FDD0..FDEF could have its own
block identity for the noncharacter range, it would be rather likely that
a fight would then ensue in the SC2 framework over balloting for such
a change to be synchronized in 10646. Nobody has the stomach for
that kind of a pointless fight over something with such marginal relevance
and benefit.

If people want to *fix* this, assuming that "this" is an actual problem,
then the issue, as I see it, isn't really block ranges per se, which don't
mean a whole lot outside of regex expressions that may use them.
Instead, the issue is the de facto alignment of chart presentation with
block boundaries. Jiggering the chart production to *present* the
range FB50..FDFF as three *chart* units, instead of one, would solve
most of the problem for all but the most hardcore Unicode metaphysicians
out there. ;-)

BTW, for those worried about the FDD0..FDEF range on noncharacters
having to live in a mixed neighborhood in the Arabic Presentation Forms-A
block, remember that we have lived since 2002 with the BOM itself 
residing in the Arabic Presentation Forms-B
block. Nobody seems to get too worked up any more about that particular
funky address.

--Ken




___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-05-30 Thread Asmus Freytag

On 5/30/2014 11:26 AM, Karl Williamson wrote:

I'm having a problem with this
http://www.unicode.org/versions/corrigendum9.html


You are not alone.


Some people now think it means that noncharacters are really no 
different from private-use characters, and should be treated very 
similarly if not identically.


It seems to me that they should be illegal in open interchange, or 
perhaps illegal in interchange without prior agreement.


Any system (process or group of related, cooperating processes) that 
uses noncharacters will want to not have any of the ones it uses 
present in its inputs.  It will want to filter them out of those 
inputs, likely turning each into a REPLACEMENT CHARACTER. If it fails 
to do that, it leaves itself vulnerable to an attack by hackers, who 
can fool it into thinking the input data is different from what it 
really is.


Hence, a system that creates outputs containing noncharacters cannot 
be assured that any other system will accept those noncharacters.


Thus, I don't see how noncharacters can be considered to be valid in 
public interchange, given that the producers have to assume that the 
consumers will not accept them.  Producers can assume that consumers 
will accept private-use characters, though they may not know their 
intent.


This is an important distinction.

One of the concerns was that people felt that they had to have "data 
pipeline" style implementations (tools) go and filter these out - even 
if there was no intent for the implementation to use them internally in 
any way. Making clear that the standard does not require filtering 
allows for cleaner implementations of such ("pass-through") tools.


However, like you, I feel that the corrigendum went too far.


I think the text in 6.2 section 16.7 is good and does not need to be 
changed: "Noncharacters ... are forbidden for use in open interchange 
of Unicode text data"


Perhaps a bit better wording would be, "are forbidden for use in 
interchange of Unicode text data without prior agreement"


The only reason I can think of for your too-large (in my opinion) 
backing away from what TUS has said about noncharacters since their 
inception is to accommodate processes that conform to C7, "that 
purports to not modify the interpretation of a valid coded character 
sequence". But, I think there is a better way to do that than what 
Corrigendum #9 currently says.


I also am curious as to why the consecutive group of 32 noncharacters 
can't be split off into its own block instead of being part of an 
Arabic one.  I'm unaware of any stability policy forbidding this.  
Another block is to be split, if I recall correctly, to accommodate 
the new Cherokee characters.


This might have been possible at the time these were added, but now it 
is probably not feasible. One of the reasons is that block names are 
exposed (for better or for worse) as character properties and as such 
are also exposed in regular expressions. While not recommended, it would 
be really bad if the expression with pseudo-code 
"IsInArabicPresentationFormB(x)" were to fail, because we split the 
block into three (with the middle one being the noncharacters).


It's the usual dance: is it better to prevent such breakage, or is it 
better to not pile up more "exceptions" like noncharacters being filed 
under Arabic Presentation forms. The damage from the former is direct 
and immediate and eventually decays. The damage from the latter is 
subtle and cumulative over time.


Tough choice.

A./

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode



___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9 clarifies noncharacter usage in Unicode

2013-02-22 Thread Richard Wordingham
On Thu, 21 Feb 2013 15:26:09 -0800
Markus Scherer  wrote:

> On Thu, Feb 21, 2013 at 2:12 PM, Richard Wordingham <
> richard.wording...@ntlworld.com> wrote:

> Nothing requires a library that processes 16-bit Unicode strings to
> have a 16-bit type for a single-character return value. Just like the
> C standard getc() returns a *negative* EOF value, in an integer type
> that is wider than a byte.

0xFFFF for WEOF looks like a hang-over from 16-bit int; changing from
it does not seem easy.  Fortunately, one can successfully read past
U+FFFF in a file, unlike ctrl/Z in a DOS text file.

> > The UTC is now applying additional pressure for the making of the
> > distinction between UTF-16 and UTF-16LE.
 
> The UTC is doing no such thing. Nothing has changed with regard to the
> UTF-16 encoding scheme and the BOM.

I didn't say the application of pressure was deliberate.

> U+FFFE has always been a code point that will never have a real
> character assigned to it, that's why it is *unlikely* to appear as
> the first character in a text file and thus useful as a "reverse
> BOM". However, it was never forbidden from occurring in the text.

Its support was not encouraged, and it was forbidden in interchanged
text.  This particular noncharacter is still forbidden in XML Version
1.0.

TUS 1.0.0 Section 2.4 forbade U+FFFE and U+FFFF.  TUS 2.0.0 Section 2.3
is less strict:

"Two codes are not used to encode characters: U+FFFF is reserved for
internal use (as a sentinel) and should not be transmitted or stored
as part of plain text.  U+FFFE is also reserved.  Its presence may
indicate byte-swapped Unicode data."

That paragraph legitimised the use of 0xFFFF for WEOF. Note that wint_t
and wchar_t are explicitly allowed to be the same type; what is
required is that no character be encoded by WEOF.

> Best practice for file encodings has always been to declare the
> encoding.

In general it can't be declared in the plainest of plain text, except
possibly as a file attribute separate to the file content.

> Second best for UTF-16 is to always include the BOM, even if the byte
> order is big-endian. And since most computers are little-endian, they
> need to include the BOM in UTF-16 file encodings anyway (if they use
> their native endianness).

A higher-order protocol seems to work fine.  At least, it did with
Notepad on Windows XP: Windows 7 seems to be applying some
content-based checking.

Richard.



Re: Corrigendum #9 clarifies noncharacter usage in Unicode

2013-02-21 Thread Markus Scherer
On Thu, Feb 21, 2013 at 2:12 PM, Richard Wordingham <
richard.wording...@ntlworld.com> wrote:

> Microsoft chose WEOF=0xFFFF.  I don't think it can easily be changed to
> a better value until an incompatible processor architecture is used.
> Changing it is likely to break existing executables and object
> libraries.
>

If this is true, it's certainly a poor choice, and might violate the C
standard. (I have not checked the actual standard for wgetc(), wint_t &
WEOF.)

> 16-bit wchar_t doesn't exactly support 21-bit Unicode.


Right -- that's why the standard library uses a separate type, wint_t,
which can be wider if necessary.

Nothing requires a library that processes 16-bit Unicode strings to have a
16-bit type for a single-character return value. Just like the C standard
getc() returns a *negative* EOF value, in an integer type that is wider
than a byte.

> The UTC is now applying additional pressure for the making of the
> distinction between UTF-16 and UTF-16LE.


The UTC is doing no such thing. Nothing has changed with regard to the
UTF-16 encoding scheme and the BOM.

U+FFFE has always been a code point that will never have a real character
assigned to it, that's why it is *unlikely* to appear as the first
character in a text file and thus useful as a "reverse BOM". However, it
was never forbidden from occurring in the text.

Best practice for file encodings has always been to declare the encoding.

Second best for UTF-16 is to always include the BOM, even if the byte order
is big-endian. And since most computers are little-endian, they need to
include the BOM in UTF-16 file encodings anyway (if they use their native
endianness).

markus


Re: Corrigendum #9 clarifies noncharacter usage in Unicode

2013-02-21 Thread Richard Wordingham
On Thu, 21 Feb 2013 11:52:07 -0800
Markus Scherer  wrote:

> On Thu, Feb 21, 2013 at 11:06 AM, Richard Wordingham <
> richard.wording...@ntlworld.com> wrote:

> "fgetwc returns, as a
> wint_t,
> the wide character that corresponds to the character read or returns
> WEOF to indicate an error or end of file. For both functions, use
> feof orferror to distinguish between an error and an end-of-file
> condition." http://msdn.microsoft.com/en-us/library/c7sskzc1.aspx

> In other words, the wint_t value WEOF is supposed to be out-of-range
> for normal characters, and if in doubt, the API docs tell you to call
> feof().

Actually, you have to call both!  If both return zero, then you have
U+FFFF.  Just calling feof() would lead one, by UTC ruling, to
misdiagnose an error.
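
A sketch of the resulting read loop in C, assuming a platform where
wchar_t is 16 bits and WEOF is 0xFFFF, so a WEOF return is ambiguous
until both feof() and ferror() have been consulted (function name
illustrative):

    #include <stdio.h>
    #include <wchar.h>

    void process_units(FILE *fp)
    {
        wint_t wc;
        /* Stop only when WEOF really means end-of-file or error;
           otherwise a WEOF-valued return is a genuine U+FFFF unit. */
        while ((wc = fgetwc(fp)) != WEOF || (!feof(fp) && !ferror(fp))) {
            /* ... process wc, possibly U+FFFF ... */
        }
    }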

> On my Ubuntu laptop, wchar.h defines WEOF=0xffffffffu which is
> thoroughly out of range for Unicode.

Microsoft chose WEOF=0xFFFF.  I don't think it can easily be changed to
a better value until an incompatible processor architecture is used.
Changing it is likely to break existing executables and object
libraries.

> The comment for *wint_t* says
> /* Integral type unchanged by default argument promotions that can
>    hold any value corresponding to members of the extended character
>    set, as well as *at least one value that does not correspond to any
>    member of the extended character set*.  */
> 
> I don't have a Windows system handy to check for the value there. I
> assume that it follows the standard:

16-bit wchar_t doesn't exactly support 21-bit Unicode.  Hitherto, one
could always have tried claiming that reading U+FFFF when expecting
ordinary characters was tantamount to interchanging code containing it,
or claimed that this part of internal usage was one of the restrictions
of the system.  The 'correction' destroys that defence.  One can still
note that U+FFFF is not an assigned character and never will be!

>> U+FFFE at the start of a UTF-16 file should also cause some
>> headaches!
>> Doesn't Microsoft Windows still interpret this as a byte-order mark
>> without asking whether there may be a byte-order mark?
 
> In the UTF-16 *encoding scheme*, such as in an otherwise unmarked
> file, the leading bytes FF FE and FE FF have special meaning. Again,
> this has nothing to do with the first character in a string in code.
> None of this has changed.

Those believing the restrictive interpretation would not expect UTF-16LE
or UTF-16BE files to start with U+FFFE, so if the first character
appeared to be U+FFFE, they could get away with assuming it was actually
a UTF-16 file and deducing that it was not in the default endianity
assigned by the higher protocol.

The UTC is now applying additional pressure for the making of the
distinction between UTF-16 and UTF-16LE.  To be precise, if the text of
a file using the UTF-16 encoding scheme with x-endian content is to
start with U+FFFE as its first character, it must start with what would
be interpreted as U+FEFF U+FFFE if it were declared to be in the
UTF-16xE encoding scheme.  What has changed is that, before, such a file
could be regarded as erroneous - it should not have escaped from the
application that spawned it.  Now the question of whether it is in
the UTF-16 encoding scheme or the UTF-16xE encoding scheme needs to be
resolved.

Richard.



Re: Corrigendum #9 clarifies noncharacter usage in Unicode

2013-02-21 Thread Markus Scherer
On Thu, Feb 21, 2013 at 11:06 AM, Richard Wordingham <
richard.wording...@ntlworld.com> wrote:

> On Wed, 20 Feb 2013 12:49:39 -0800
> announceme...@unicode.org wrote:
>
> > They should be supported by APIs, components, and
> > applications that handle (i.e., either process or pass through) all
> > Unicode strings, such as a text editor or string class. Where an
> > application does make internal use of a noncharacter, it should take
> > some measures to sanitize input text from unknown sources.
>
> Does this mean that a general purpose application written in C that uses
> Microsoft's 16-bit wchar_t to handle little-endian UTF-16 input using
> the fgetwc() function should be regarded as broken?  The problem is
> that a return value of 0xFFFF means not non-character U+FFFF, but end
> of file!
>

"fgetwc returns, as a
wint_t,
the wide character that corresponds to the character read or returns WEOF to
indicate an error or end of file. For both functions, use feof orferror to
distinguish between an error and an end-of-file condition."
http://msdn.microsoft.com/en-us/library/c7sskzc1.aspx

In other words, the wint_t value WEOF is supposed to be out-of-range for
normal characters, and if in doubt, the API docs tell you to call feof().

On my Ubuntu laptop, wchar.h defines WEOF=0xffffffffu which is thoroughly
out of range for Unicode.

The comment for *wint_t* says
/* Integral type unchanged by default argument promotions that can
   hold any value corresponding to members of the extended character
   set, as well as *at least one value that does not correspond to any
   member of the extended character set*.  */

I don't have a Windows system handy to check for the value there. I assume
that it follows the standard:

http://pubs.opengroup.org/onlinepubs/7908799/xsh/wchar.h.html says:

*wint_t* - An integral type capable of storing any valid value of
*wchar_t*, or *WEOF*.

*WEOF* - Constant expression of type *wint_t* that is returned by several
WP functions to indicate end-of-file.

Similarly, the C standard library defines EOF=*-1*, precisely so that it
cannot be mistaken for a real contents byte.

A negative sentinel value has the benefit that you need not check for
equality but can just test "<0" which makes for shorter source code and
also slightly smaller and faster machine code.

If you use an in-range value for end-of-input or something like that, then
you get into trouble. That is trivially the case, and has nothing to do
with Unicode.
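
The byte-level counterpart, where the int-typed return keeps the negative
sentinel out of band (a trivial sketch, function name illustrative):

    #include <stdio.h>

    /* getc() returns an int: either a byte value 0..255 or the
       negative EOF, so "< 0" cleanly separates the sentinel. */
    long count_bytes(FILE *fp)
    {
        long n = 0;
        int c;
        while ((c = getc(fp)) >= 0)
            n++;
        return n;
    }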

> U+FFFE at the start of a UTF-16 file should also cause some headaches!
> Doesn't Microsoft Windows still interpret this as a byte-order mark
> without asking whether there may be a byte-order mark?
>

In the UTF-16 *encoding scheme*, such as in an otherwise unmarked file, the
leading bytes FF FE and FE FF have special meaning. Again, this has nothing
to do with the first character in a string in code. None of this has
changed.
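
A sketch of that leading-bytes check for a stream labeled simply "UTF-16"
(the encoding scheme), with names illustrative; in the absence of a BOM,
the caller falls back to the scheme default (big-endian) or whatever a
higher-level protocol declares:

    #include <stddef.h>
    #include <stdint.h>

    enum byte_order { BO_UNKNOWN, BO_BE, BO_LE };

    /* Classify the first two bytes of a UTF-16 encoding-scheme stream:
       FE FF => big-endian BOM, FF FE => little-endian BOM. */
    static enum byte_order sniff_utf16_bom(const uint8_t *p, size_t n)
    {
        if (n >= 2) {
            if (p[0] == 0xFE && p[1] == 0xFF) return BO_BE;
            if (p[0] == 0xFF && p[1] == 0xFE) return BO_LE;
        }
        return BO_UNKNOWN;  /* no BOM: apply the default byte order */
    }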

markus


Re: Corrigendum #9 clarifies noncharacter usage in Unicode

2013-02-21 Thread Richard Wordingham
On Wed, 20 Feb 2013 12:49:39 -0800
announceme...@unicode.org wrote:

> They should be supported by APIs, components, and
> applications that handle (i.e., either process or pass through) all
> Unicode strings, such as a text editor or string class. Where an
> application does make internal use of a noncharacter, it should take
> some measures to sanitize input text from unknown sources.

Does this mean that a general purpose application written in C that uses
Microsoft's 16-bit wchar_t to handle little-endian UTF-16 input using
the fgetwc() function should be regarded as broken?  The problem is
that a return value of 0x means not non-character U+, but end
of file! 

U+FFFE at the start of a UTF-16 file should also cause some headaches!
Doesn't Microsoft Windows still interpret this as a byte-order mark
without asking whether there may be a byte-order mark?

Richard.



Re: Corrigendum #9 clarifies noncharacter usage in Unicode

2013-02-21 Thread Steven Atreju
  The UTF-8, UTF-16, UTF-32 & BOM FAQ
  has also been updated for clarity,

Very nice, but I wonder why the paragraph on noncharacters can be
found under UTF-16 instead of under some generic, non-Microsoft-specific
topic.
Thanks

  Steven