date:20140602

Re: Corrigendum #9

2014-06-02 Thread Mark Davis ☕️

On Mon, Jun 2, 2014 at 10:32 PM, David Starner  wrote:

> Why? It seems you're changing the rules
> ...
>
>
This isn't "are changing", it is "has changed". The Corrigendum was issued
at the start of 2013, about 16 months ago; applicable to all relevant
earlier versions. It was the result of fairly extensive debate inside the
UTC; there hasn't been a single issue on this thread that wasn't considered
during the discussions there. And as far back as 2001, the UTC made it
clear that noncharacters *are* scalar values, and are to be converted by
UTF converters. Eg, see
http://www.unicode.org/mail-arch/unicode-ml/y2001-m09/0149.html (by chance,
one day before 9/11).

> probably trigger serious bugs in some lamebrained utility.

There were already plenty of programs that passed the noncharacters
through; very few would filter them (some would delete them, which is
horrible for security). Thinking that a utility would never encounter them
in input text was a pipe-dream. If a utility or library is so fragile that
it *breaks* on input of any valid UTF sequence, then it *is* a "lamebrained"
utility. A good unit test for any production chain would be to check there
is no crash on any input scalar value (and for that matter, any ill-formed
UTF text).
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Corrigendum #9

2014-06-02 Thread David Starner

On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham
 wrote:
> Much as I don't like their uninvited use, it is possible to pass them
> and other undesirables through most applications by a slight bit of
> recoding at the application's boundaries.  Using 99 = (3 + 32 + 64) PUA
> characters, one can ape UTF-16 surrogates and encode:

What's the point? If we can use the PUA, then we don't need the
noncharacters; we can just use the PUA directly. If we have to play
around with remapping them, they're pointless; they're no easier to
use in that case then ESC or '\' or PUA characters.

-- 
Kie ekzistas vivo, ekzistas espero.
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Corrigendum #9

2014-06-02 Thread Richard Wordingham

On Mon, 2 Jun 2014 15:09:21 -0700
David Starner  wrote:

> So certain programs can't use noncharacters internally because some
> people want to interchange them? That doesn't seem like what
> noncharacters should be used for.

Much as I don't like their uninvited use, it is possible to pass them
and other undesirables through most applications by a slight bit of
recoding at the application's boundaries.  Using 99 = (3 + 32 + 64) PUA
characters, one can ape UTF-16 surrogates and encode:

32 × 64 pairs for lone surrogates
 1 × 64 pairs to replace some of the PUA characters
 1 × 35 pairs to replace the rest of the PUA characters
 1 ×  4 pairs for incoming FFFC to 
 1 × 32 pairs for the other BMP non-characters
 1 × 32 pairs for the supplementary plane non-characters.

This then frees up non-characters for the application's use.

Richard.

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Corrigendum #9

2014-06-02 Thread Lisa Moore

I would like to point out to Asmus that this decision was reached 
unanimously at the UTC by Adobe, Apple, Google, IBM, Microsoft, SAP, UC 
Berkeley, and Yahoo!

One might disagree with the decision, but there were no special favors 
involved.

Lisa 

> 
> 
> I can't shake the suspicion that Corrigendum #9 is not actually 
> solving a general problem, but is a special favor to CLDR as being 
> run by insiders, and in the process muddying the waters for everyone 
else.
> 
> A./___
> Unicode mailing list
> Unicode@unicode.org
> http://unicode.org/mailman/listinfo/unicode
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Corrigendum #9

2014-06-02 Thread Philippe Verdy

"reserved for CLDR" would be wrong in TUS, you have reached a borderline
where you are no longer handling plain text (stream of scalar values
assigned to code points), but binary data via a binary interface outside
TUS (handling streams of collation elements, whose representation is not
even bound to the ICU implementation of CLDR for its own definitions and
syntax for its tailorings).

CLDR data defines its own interface and protocol, it can reserve these code
points only for itself but not in TUS and no other conforming plain-text
application is expected to accept these reservations, so they can
**freely** mark them in error, replace them, or filter them out, or
interpret them differently for their own usage, using their own
specification and encapsulation mechanisms and specific **non-plain-text**
data types.

CLDR data transmitted in binary form that would embed these code points are
not transporting plain-text, this is still a binary datatype specific to
this application. CLDR data must remain isolated in its scope without
forcing other protocols or TUS to follow its practices.

Other applications may develop "gateway" interfaces to convert them to be
interoperable with ICU but they are not required to do that. If they do,
they will follow the ICU specifications, not TUS and this should not
influence their own way to handle what TUS describe as plain-text.

To make it clear, it is referable to just say in TUS that the behavior of
applications with non-characters is completely undefined and unpredictable
without an external specification, and these entities should not even be
considered as encodable in any standard UTFs (which can be freely be
replaced by another one without causing any loss or modification of the
represented plain-text). It should be possible to define other (non
standard) conforming UTFs which are completely unable to represent these
non-characters (as well as any unpaired surrogate). A conforming UTF just
needs to be able to represent streams of scalar values in their full
standard range (even without knowing if they are assigned or not or without
knowing their character properties).

You can/should even design CLDR to completely ovoid the use of
non-characters: it's up to it to define an encapsulation/escaping mechanism
that clearly separates what is standard plain-text in the content and what
is not and used for specific purpose in CLDR or ICU implementations.

2014-06-03 0:07 GMT+02:00 Shawn Steele :

>  Except that, particularly the max-weight ones, mean that developers can
> be expected to use this as sentinels in code using ICU, which would
> preclude their use for other things?
>
>
>
> Which makes them more like “reserved for use in CLDR” than “noncharacters”?
>
>
>
> -Shawn
>
>
>
> *From:* Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of *Markus
> Scherer
> *Sent:* Monday, June 2, 2014 2:53 PM
> *To:* David Starner
> *Cc:* Unicode Mailing List
> *Subject:* Re: Corrigendum #9
>
>
>
> On Mon, Jun 2, 2014 at 1:32 PM, David Starner 
> wrote:
>
>  I would especially discourage any web browser from handling
>
> these; they're noncharacters used for unknown purposes that are
> undisplayable and if used carelessly for their stated purpose, can
> probably trigger serious bugs in some lamebrained utility.
>
>
>
> I don't expect "handling these" in web browsers and lamebrained utilities.
> I expect "treat like unassigned code points".
>
>
>
> markus
>
> ___
> Unicode mailing list
> Unicode@unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
>
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Corrigendum #9

2014-06-02 Thread Philippe Verdy

I better expect: "treat them as you like", there will never be any warranty
of interoperability, everyone is allowed to use them as they want and even
change it at any time. The behavior is not defined in TUS, and users cannot
expect that TUS will define this behavior.
There's no clear solution about what to do if you encounter them in data
supposed to be text. For me they are not text, so the whole data could be
rejected or the text remaining after some filtering may be galsely
interpreted. You need an external specification outside TUS.

I certainly do not consider non-characters like unassigned valid code
points where applications are strongly encouraged to not apply any kinf of
filter if they want to remain compatible with evolutions of the standard
that may assign them (the best you can do with unassigned code points is
treat them as symbols, with the minimial properties defined in the standard
(notably Bidi properties according to their range, where this direction is
defined in some ranges, or treat them as symbols with weak direction), even
if applications cannot still render them (renderers will find a way to show
them, generally using a .notdef glyph like empty boxes). Normalizers will
also not mix them (the default combining class should be 0).

Only applications that want to ensure that the text conforms to a specific
version of the standard are allowed to filter out or signal as errors the
presence of unassigned code points. But all applications can do that kind
of things on non-characters (or any code unit whose value falls outside the
valid range of a defined UTFà. This is an important difference.
non-characters are not like unassigned code points, they are assigned to be
considered invalid and filterable by design by any Unicode conforming
process for handling text.

2014-06-02 23:53 GMT+02:00 Markus Scherer :

> On Mon, Jun 2, 2014 at 1:32 PM, David Starner 
> wrote:
>
>> I would especially discourage any web browser from handling
>> these; they're noncharacters used for unknown purposes that are
>> undisplayable and if used carelessly for their stated purpose, can
>> probably trigger serious bugs in some lamebrained utility.
>>
>
> I don't expect "handling these" in web browsers and lamebrained utilities.
> I expect "treat like unassigned code points".
>
> markus
>
> ___
> Unicode mailing list
> Unicode@unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
>
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

RE: Corrigendum #9

2014-06-02 Thread Shawn Steele

Ø  I can't shake the suspicion that Corrigendum #9 is not actually solving a 
general problem, but is a special favor to CLDR as being run by insiders, and 
in the process muddying the waters for everyone else

I think we could generalize to other scenarios so it wasn’t necessarily an 
insider scenario.  For example, I could have a string manipulation library that 
used FFFE to indicate the beginning of an identifier for a localizable 
sentence, terminated by .  Any system using FFFEid1234 would likely 
expect to be able to read the tokens in their favorite code editor.

But I’m concerned that these “conflict” with each other, and embedding the 
behavior in major programming languages doesn’t smell to me like “internal” 
use.  Clearly if I wanted to use that library in a CLDR-aware app, there is a 
potential risk for a conflict.

In the CLDR case, there *IS* a special relationship with Unicode, and perhaps 
it would be warranted to explicitly encode character(s) with the necessary 
meaning(s) to handle edge-case collation scenarios.

-Shawn
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Corrigendum #9

2014-06-02 Thread David Starner

On Mon, Jun 2, 2014 at 2:53 PM, Markus Scherer  wrote:
> On Mon, Jun 2, 2014 at 1:32 PM, David Starner  wrote:
>>
>> I would especially discourage any web browser from handling
>> these; they're noncharacters used for unknown purposes that are
>> undisplayable and if used carelessly for their stated purpose, can
>> probably trigger serious bugs in some lamebrained utility.
>
>
> I don't expect "handling these" in web browsers and lamebrained utilities. I
> expect "treat like unassigned code points".

So certain programs can't use noncharacters internally because some
people want to interchange them? That doesn't seem like what
noncharacters should be used for.

Unix utilities shouldn't usually go to the trouble of messing with
them; limiting the number of changes needed for Unicode was the whole
point of UTF-8. Any program transferring them across the Internet as
text should filter them, IMO; either some lamebrained utility will
open a security hole by using them and not filtering first, or
something will filter them after security checks have been done, or
something. Unless it's a completely trusted system, text files with
these characters should be treated with extreme prejudice by the first
thing that receives them over the net.

-- 
Kie ekzistas vivo, ekzistas espero.
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Corrigendum #9

2014-06-02 Thread Asmus Freytag


On 6/2/2014 2:53 PM, Markus Scherer wrote:
On Mon, Jun 2, 2014 at 1:32 PM, David Starner > wrote:


I would especially discourage any web browser from handling
these; they're noncharacters used for unknown purposes that are
undisplayable and if used carelessly for their stated purpose, can
probably trigger serious bugs in some lamebrained utility.


I don't expect "handling these" in web browsers and lamebrained 
utilities. I expect "treat like unassigned code points".




I can't shake the suspicion that Corrigendum #9 is not actually solving 
a general problem, but is a special favor to CLDR as being run by 
insiders, and in the process muddying the waters for everyone else.


A./
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Corrigendum #9

2014-06-02 Thread Philippe Verdy

We can still draw a line : interchange should be meant so that other
non-Unicode standards should find their way to not mixup random data within
plain-text without defining a clear encapsulation and escaping mechanism
that ensures that plain text remains isolatable.
In other words, desieng separate layers of representation and processing,
and be more imaginative when you design an application or protocol with a
better modeling.
If an application really internaly needs some non-characters, this is not
reallyfor encoding text but for the application/protocol-specific system of
encapsulation, which should be clearly identified:
- these protocols can use separate APIs for handling objects that are
composite and contain some text but that are not text by themselves.
- they should isolate data types (or MIME types)
- they should use some "magic" identifiers in the headers of their data,
including versioning in their protocol
- they should document internally their own encapsulation/escaping
mechanisms
- they should test them to make sure they preserve the valid text content
without breaking them
As the kind of data is not text, we fall within the design of binary data
formats.

These kinds of statements mean that protocols and API will be improved for
better separation of layers, working more as separate blackboxes. But it's
not up to the Unicode standard to explain how they will do it.

So for me non-characters are not Unicode text, they are not text at all and
we should not attempt to make them legal if we want to allow string designs
of isolation mechanisms that allow this separation of layers. The Unicode
standard offers enough space for this separation, with non-characters
(invalid in all standard UTFs), with onvalid code sequences in standard
UTFs that allow building up specific encodings that must not be called
"UTFs" (or "Unicode" or "UCS" or other terms defined in TUS) and identified
as such in API/protocol designs.

Thnigs would be simply better is TUS did not even define what is a
non-character and if it dd not even suggest that they are legal in "some"
circonstance of text "interchange".



2014-06-02 18:08 GMT+02:00 Mark Davis ☕️ :

> The problem is where to draw the line. In today's world, what's an app?
> You may have a cooperating system of "apps", where it is perfectly
> reasonable to interchange sentinel values (for example).
>
> I agree with Markus; I think the FAQ is pretty clear. (And if not, that's
> where we should make it clearer.)
>
>
> Mark 
>
>  *— Il meglio è l’inimico del bene —*
>
>
> On Mon, Jun 2, 2014 at 6:02 PM, Shawn Steele 
> wrote:
>
>>  I also think that the verbiage swung too far the other way.  Sure, I
>> might need to save or transmit a file to talk to myself later, but apps
>> should be strongly discouraged for using these for interchange with other
>> apps.
>>
>>
>>
>> Interchange bugs are why nearly any news web site ends up with at least a
>> few articles with mangled apostrophes or whatever (because of encoding
>> differences).  Should authors’ tools or feeds or databases or whatever
>> start emitting non-characters from internal use, then we’re going to have
>> ugly leak into text “everywhere”.
>>
>>
>>
>> So I’d prefer to see text that better permitted interchange with other
>> components of an application’s internal system or partner system, yet
>> discouraged use for interchange with “foreign” apps.
>>
>>
>>
>> -Shawn
>>
>>
>>
>> ___
>> Unicode mailing list
>> Unicode@unicode.org
>> http://unicode.org/mailman/listinfo/unicode
>>
>>
>
> ___
> Unicode mailing list
> Unicode@unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
>
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

RE: Corrigendum #9

2014-06-02 Thread Shawn Steele

Except that, particularly the max-weight ones, mean that developers can be 
expected to use this as sentinels in code using ICU, which would preclude their 
use for other things?

Which makes them more like “reserved for use in CLDR” than “noncharacters”?

-Shawn

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Markus Scherer
Sent: Monday, June 2, 2014 2:53 PM
To: David Starner
Cc: Unicode Mailing List
Subject: Re: Corrigendum #9

On Mon, Jun 2, 2014 at 1:32 PM, David Starner 
mailto:prosfil...@gmail.com>> wrote:
I would especially discourage any web browser from handling
these; they're noncharacters used for unknown purposes that are
undisplayable and if used carelessly for their stated purpose, can
probably trigger serious bugs in some lamebrained utility.

I don't expect "handling these" in web browsers and lamebrained utilities. I 
expect "treat like unassigned code points".

markus
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Corrigendum #9

2014-06-02 Thread Markus Scherer

On Mon, Jun 2, 2014 at 1:32 PM, David Starner  wrote:

> I would especially discourage any web browser from handling
> these; they're noncharacters used for unknown purposes that are
> undisplayable and if used carelessly for their stated purpose, can
> probably trigger serious bugs in some lamebrained utility.
>

I don't expect "handling these" in web browsers and lamebrained utilities.
I expect "treat like unassigned code points".

markus
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Corrigendum #9

2014-06-02 Thread David Starner

On Mon, Jun 2, 2014 at 8:48 AM, Markus Scherer  wrote:
> Right, in principle. However, it should be ok to include noncharacters in
> CLDR data files for processing by CLDR implementations, and it should be
> possible to edit and diff and version-control and web-view those files etc.

Why? It seems you're changing the rules so some Unicode guys can get
oversmart in using Unicode in their systems. You could do the same
thing everyone else does and use special tags or symbols you have to
escape. I would especially discourage any web browser from handling
these; they're noncharacters used for unknown purposes that are
undisplayable and if used carelessly for their stated purpose, can
probably trigger serious bugs in some lamebrained utility.

-- 
Kie ekzistas vivo, ekzistas espero.
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Unicode Regular Expressions, Surrogate Points and UTF-8

2014-06-02 Thread Richard Wordingham

On Mon, 2 Jun 2014 11:29:09 +0200
Mark Davis ☕️  wrote:

> > \uD808\uDF45 specifies a sequence of two codepoints.
> 
> That is simply incorrect.

The above is in the sample notation of UTS #18 Version 17 Section 1.1.

From what I can make out, the corresponding Java notation would be
\x{D808}\x{DF45}.  I don't *know* what \x{D808} and \x{DF45} match in
Java, or whether they are even acceptable.  The only thing UTS #18
RL1.7 permits them to match in Java is lone surrogates, but I don't
know if Java complies.

All UTS #18 says for sure about regular expressions matching code units
is that they don't satisfy RL1.1, though Section 1.7 appears to ban
them when it says, "A fundamental requirement is that Unicode text be
interpreted semantically by code point, not code units".  Perhaps it's
a fundamental requirement of something other than UTS #18.  I thought
matching parts of characters in terms of their canonical equivalences
was awkward enough, without having the additional option of matching
some of the code units!

Richard.

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Corrigendum #9

2014-06-02 Thread Richard Wordingham

On Mon, 2 Jun 2014 10:17:04 -0700
Markus Scherer  wrote:

> CLDR collation data defines special contraction mappings that start
> with a noncharacter, for
> http://www.unicode.org/reports/tr35/tr35-collation.html#CJK_Index_Markers

> In CLDR 23 and before (when we were still using XML collation syntax),
> these were raw noncharacters in the .xml files.

> As I said earlier:
> it should be ok to include noncharacters in CLDR data files for
> processing by CLDR implementations, and it should be possible to edit
> and diff and version-control and web-view those files etc.

They come as a nasty shock when someone thinks XML files are marked-up
text files.  I'm still surprised that the published human-readable form
of CLDR files should contain automatically applied non-Unicode copyright
claims.

Richard.
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

RE: Corrigendum #9

2014-06-02 Thread Shawn Steele

Hmm, I find that disconcerting.  I’d prefer a real Unicode character with 
special weights if that concept’s needed.  And I guess that goes a long ways to 
explaining the interchange problem since clearly the code editor’s going to 
need these ☹

From: Markus Scherer [mailto:markus@gmail.com]
Sent: Monday, June 2, 2014 10:17 AM
To: Shawn Steele
Cc: Asmus Freytag; Doug Ewell; Mark Davis ☕️; Unicode Mailing List
Subject: Re: Corrigendum #9

On Mon, Jun 2, 2014 at 10:00 AM, Shawn Steele 
mailto:shawn.ste...@microsoft.com>> wrote:
To further my understanding, can someone provide examples of how these are used 
in actual practice?

CLDR collation data defines special contraction mappings that start with a 
noncharacter, for 
http://www.unicode.org/reports/tr35/tr35-collation.html#CJK_Index_Markers

In CLDR 23 and before (when we were still using XML collation syntax), these 
were raw noncharacters in the .xml files.

As I said earlier:
it should be ok to include noncharacters in CLDR data files for processing by 
CLDR implementations, and it should be possible to edit and diff and 
version-control and web-view those files etc.

markus
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Corrigendum #9

2014-06-02 Thread Markus Scherer

On Mon, Jun 2, 2014 at 10:00 AM, Shawn Steele 
wrote:

> To further my understanding, can someone provide examples of how these are
> used in actual practice?
>

CLDR collation data defines special contraction mappings that start with a
noncharacter, for
http://www.unicode.org/reports/tr35/tr35-collation.html#CJK_Index_Markers

In CLDR 23 and before (when we were still using XML collation syntax),
these were raw noncharacters in the .xml files.

As I said earlier:
it should be ok to include noncharacters in CLDR data files for processing
by CLDR implementations, and it should be possible to edit and diff and
version-control and web-view those files etc.

markus
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

RE: Corrigendum #9

2014-06-02 Thread Shawn Steele

> Oh, look. My mail system converted those nice noncharacters into U+FFFD.
> Was that compliant? Did I deserve what I got? Are those two different 
> questions?

I think I just got spaces.

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

RE: Corrigendum #9

2014-06-02 Thread Shawn Steele

To further my understanding, can someone provide examples of how these are used 
in actual practice?  I can't think of any offhand and the closest I get is like 
the old escape characters to get a dot matrix printer to shift modes, or old 
word processor internal formatting sequences.

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Corrigendum #9

2014-06-02 Thread Asmus Freytag


On 6/2/2014 9:38 AM, Shawn Steele wrote:

I agree with Markus; I think the FAQ is pretty clear. (And if not,
that's where we should make it clearer.)

But the formal wording of the standard should reflect that clarity, right?

I don't tend to read the FAQ :)


FAQ's are useful, but they are not binding. They are even less binding 
than general explanation in the text of the Core specification, which 
itself doesn't rise to the that of conformance clauses and definition...


Doug's unease about the "upside-down" nature of the wording regarding 
PUA and noncharacters is something that should be addressed in revised 
text in the core specification.


A./


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode



___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Corrigendum #9

2014-06-02 Thread Mark Davis ☕️

I disagree with that characterization, of course.

The recommendation for libraries and low-level tools to pass them through
rather than screw with them makes them usable. The recommendation to check
for noncharacters from unknown sources and fix them was good advice then,
and is good advice now. Any app where input of noncharacters causes
security problems or crashes is and was not a very good app.

Mark 

 *— Il meglio è l’inimico del bene —*

On Mon, Jun 2, 2014 at 6:37 PM, Asmus Freytag  wrote:

>  On 6/2/2014 9:27 AM, Mark Davis ☕️ wrote:
>
>
> On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele 
> wrote:
>
>> The “problem” is now that previously these characters were illegal
>
>
>  The problem was that we were inconsistent in standard and related
> material about just what the status was for these things.
>
>
>   And threw the baby out to fix it.
>
> A./
>
>
>  Mark 
>
>  *— Il meglio è l’inimico del bene —*
>
>
> ___
> Unicode mailing 
> listUnicode@unicode.orghttp://unicode.org/mailman/listinfo/unicode
>
>
>
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

RE: Corrigendum #9

2014-06-02 Thread Doug Ewell

I wrote, sort of:
 
> Correct. Most people wouldn't consider a cooperating system like that
> quite the same as true public interchange, like throwing this ���
> into a message on a public mailing list.

Oh, look. My mail system converted those nice noncharacters into U+FFFD.
Was that compliant? Did I deserve what I got? Are those two different
questions?

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

RE: Corrigendum #9

2014-06-02 Thread Shawn Steele

> > I agree with Markus; I think the FAQ is pretty clear. (And if not, 
> > that's where we should make it clearer.)

> But the formal wording of the standard should reflect that clarity, right?

I don't tend to read the FAQ :)

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Corrigendum #9

2014-06-02 Thread Asmus Freytag


On 6/2/2014 9:27 AM, Mark Davis ☕️ wrote:


On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele 
mailto:shawn.ste...@microsoft.com>> wrote:


The “problem” is now that previously these characters were illegal


The problem was that we were inconsistent in standard and related 
material about just what the status was for these things.




And threw the baby out to fix it.

A./


Mark 
/
/
/— Il meglio è l’inimico del bene —/
//


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Corrigendum #9

2014-06-02 Thread Asmus Freytag


On 6/2/2014 9:08 AM, Mark Davis ☕️ wrote:
The problem is where to draw the line. In today's world, what's an 
app? You may have a cooperating system of "apps", where it is 
perfectly reasonable to interchange sentinel values (for example).


The way to draw the line is to insist on there being an agreement 
between sender and ultimate receiver, and an pass-through agreement (if 
you will) for any intermediate stage, so that the coast is clear.


What defines an "implementation" in this scenario, is the existence of 
the agreement.


What got us into trouble is that the negative case (pass-through) was 
not well-defined, and lead to people assuming that they had to filter 
any incoming noncharacters.


Because noncharacters can have any interpretation (not limited to 
interpretations as characters), it is much riskier to send then out 
oblivious whether the intended recipient is part of the same agreement 
on their interpretation as the sender. In that sense, they are not mere 
PUA code points.


The other aspect of their original design was to allow code points that  
recipients were free no to honor or preserve, if they were not part of 
the agreement (and hadn't made an explicit or implicit pass-through 
agreement). Otherwise, if anyone expects them to be preserved, no 
application like Word, would be free to use these for purely internal use.


Word thus would not be a tool to handle CLDR data; which may be 
disappointing to some, but should be fine.


A./


I agree with Markus; I think the FAQ is pretty clear. (And if not, 
that's where we should make it clearer.)



Mark 
/
/
/— Il meglio è l’inimico del bene —/
//


On Mon, Jun 2, 2014 at 6:02 PM, Shawn Steele 
mailto:shawn.ste...@microsoft.com>> wrote:


I also think that the verbiage swung too far the other way.  Sure,
I might need to save or transmit a file to talk to myself later,
but apps should be strongly discouraged for using these for
interchange with other apps.

Interchange bugs are why nearly any news web site ends up with at
least a few articles with mangled apostrophes or whatever (because
of encoding differences).  Should authors’ tools or feeds or
databases or whatever start emitting non-characters from internal
use, then we’re going to have ugly leak into text “everywhere”.

So I’d prefer to see text that better permitted interchange with
other components of an application’s internal system or partner
system, yet discouraged use for interchange with “foreign” apps.

-Shawn


___
Unicode mailing list
Unicode@unicode.org 
http://unicode.org/mailman/listinfo/unicode




___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

RE: Corrigendum #9

2014-06-02 Thread Doug Ewell

Shawn Steele  wrote:

> So I’d prefer to see text that better permitted interchange with other
> components of an application’s internal system or partner system, yet
> discouraged use for interchange with "foreign" apps.

If any wording is to be revised, while we're at it, I'd also like to see
a reaffirmation of the proper relationship between private-use
characters and noncharacters. I still hear arguments that private-use
characters are to be avoided in public interchange at all costs, as if
lack of knowledge of the private agreement, or conflicting
interpretations, will cause some kind of major security breach. At the
same time, the Corrigendum seems to imply that noncharacters in public
interchange are no big deal. That seems upside-down.

Mark Davis 🍝  replied:

> The problem is where to draw the line. In today's world, what's an
> app? You may have a cooperating system of "apps", where it is
> perfectly reasonable to interchange sentinel values (for example).

Correct. Most people wouldn't consider a cooperating system like that
quite the same as true public interchange, like throwing this ���
into a message on a public mailing list.

Since the Corrigendum deals with recommendations rather than hard
requirements, SHOULDs rather than MUSTs, it doesn't seem that a bright
line is really needed.

> I agree with Markus; I think the FAQ is pretty clear. (And if not,
> that's where we should make it clearer.)

But the formal wording of the standard should reflect that clarity,
right?

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Corrigendum #9

2014-06-02 Thread Mark Davis ☕️

On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele 
wrote:

> The “problem” is now that previously these characters were illegal


The problem was that we were inconsistent in standard and related material
about just what the status was for these things.



Mark 

 *— Il meglio è l’inimico del bene —*
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

RE: Corrigendum #9

2014-06-02 Thread Shawn Steele

That’s what I think is exactly what should be clarified.  A cooperating system 
of apps should likely use some other markup, however if they want to use  
to say “OK to insert ad here” (or whatever), that’s up to them.

I fear that the current wording says “Because you might have a cooperating 
system of apps that all agree  is ‘OK to insert ad here’, you may as well 
emit  all the time just in case some other app happens to use the same 
sentinel”.

The “problem” is now that previously these characters were illegal, so my 
application didn’t have to explicitly remove them when importing external stuff 
because they weren’t allowed to be there.  With the wording of the corrigendum, 
the onus is on every app importing data to filter out these code points because 
they are “suddenly” legal in foreign data streams.

That is a breaking change for applications, and, worse, it isn’t in the control 
of the applications that take advantage of the newly laxer wording, but rather 
all the other applications on the planet, which may have been stable for years.

My interpretation of “interchanged” was “interchanged outside of a system that 
understood your private use of the noncharacters”.  I can see where that may 
not have been everyone’s interpretation, and maybe should be updated.  My 
interpretation of what you’re saying below is “sentinel values with a private 
meaning can be exchanged between apps”, which is what the PUA’s for.

I don’t mind at all if the definition is loosened somewhat, but if we’re 
turning them into PUA characters we should just turn them into PUA characters.

-Shawn

From: mark.edward.da...@gmail.com [mailto:mark.edward.da...@gmail.com] On 
Behalf Of Mark Davis ??
Sent: Monday, June 2, 2014 9:08 AM
To: Shawn Steele
Cc: Markus Scherer; Doug Ewell; Unicode Mailing List
Subject: Re: Corrigendum #9

The problem is where to draw the line. In today's world, what's an app? You may 
have a cooperating system of "apps", where it is perfectly reasonable to 
interchange sentinel values (for example).

I agree with Markus; I think the FAQ is pretty clear. (And if not, that's where 
we should make it clearer.)


Mark

— Il meglio è l’inimico del bene —

On Mon, Jun 2, 2014 at 6:02 PM, Shawn Steele 
mailto:shawn.ste...@microsoft.com>> wrote:
I also think that the verbiage swung too far the other way.  Sure, I might need 
to save or transmit a file to talk to myself later, but apps should be strongly 
discouraged for using these for interchange with other apps.

Interchange bugs are why nearly any news web site ends up with at least a few 
articles with mangled apostrophes or whatever (because of encoding 
differences).  Should authors’ tools or feeds or databases or whatever start 
emitting non-characters from internal use, then we’re going to have ugly leak 
into text “everywhere”.

So I’d prefer to see text that better permitted interchange with other 
components of an application’s internal system or partner system, yet 
discouraged use for interchange with “foreign” apps.

-Shawn


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Corrigendum #9

2014-06-02 Thread Mark Davis ☕️

The problem is where to draw the line. In today's world, what's an app? You
may have a cooperating system of "apps", where it is perfectly reasonable
to interchange sentinel values (for example).

I agree with Markus; I think the FAQ is pretty clear. (And if not, that's
where we should make it clearer.)


Mark 

 *— Il meglio è l’inimico del bene —*


On Mon, Jun 2, 2014 at 6:02 PM, Shawn Steele 
wrote:

>  I also think that the verbiage swung too far the other way.  Sure, I
> might need to save or transmit a file to talk to myself later, but apps
> should be strongly discouraged for using these for interchange with other
> apps.
>
>
>
> Interchange bugs are why nearly any news web site ends up with at least a
> few articles with mangled apostrophes or whatever (because of encoding
> differences).  Should authors’ tools or feeds or databases or whatever
> start emitting non-characters from internal use, then we’re going to have
> ugly leak into text “everywhere”.
>
>
>
> So I’d prefer to see text that better permitted interchange with other
> components of an application’s internal system or partner system, yet
> discouraged use for interchange with “foreign” apps.
>
>
>
> -Shawn
>
>
>
> ___
> Unicode mailing list
> Unicode@unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
>
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

RE: Corrigendum #9

2014-06-02 Thread Shawn Steele

I also think that the verbiage swung too far the other way.  Sure, I might need 
to save or transmit a file to talk to myself later, but apps should be strongly 
discouraged for using these for interchange with other apps.

Interchange bugs are why nearly any news web site ends up with at least a few 
articles with mangled apostrophes or whatever (because of encoding 
differences).  Should authors’ tools or feeds or databases or whatever start 
emitting non-characters from internal use, then we’re going to have ugly leak 
into text “everywhere”.

So I’d prefer to see text that better permitted interchange with other 
components of an application’s internal system or partner system, yet 
discouraged use for interchange with “foreign” apps.

-Shawn

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Corrigendum #9

2014-06-02 Thread Markus Scherer

On Mon, Jun 2, 2014 at 8:27 AM, Doug Ewell  wrote:

> I suspect everyone can agree on the edge cases, that noncharacters are
> harmless in internal processing, but probably should not appear in
> random text shipped around on the web.
>

Right, in principle. However, it should be ok to include noncharacters in
CLDR data files for processing by CLDR implementations, and it should be
possible to edit and diff and version-control and web-view those files etc.

It seems that trying to define "interchange" and "public" in ways that
satisfy everyone will not be successful.

The FAQ already gives some examples of where noncharacters might be used,
should be preserved, or could be stripped, starting with "Q: Are
noncharacters intended for interchange?
"

In my view, those Q/A pairs explain noncharacters quite well. If there are
further examples of where noncharacters might be used, should be preserved,
or could be stripped, and that would be particularly useful to add to the
examples already there, then we could add them.

markus
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Corrigendum #9

2014-06-02 Thread Doug Ewell

It seems that the broadening of the term "interchange" in this
corrigendum to mean "almost any type of processing imaginable," below,
is what caused the trouble. This is the decision that would need to be
reconsidered if the real intent of noncharacters is to be expressed.

I suspect everyone can agree on the edge cases, that noncharacters are
harmless in internal processing, but probably should not appear in
random text shipped around on the web.

> This is necessary for the effective use of noncharacters, because
> anytime a Unicode string crosses an API boundary, it is in effect
> being "interchanged". Furthermore, for distributed software, it is
> often very difficult to determine what constitutes an "internal"
> versus an "external" context for any particular software process.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Unicode Regular Expressions, Surrogate Points and UTF-8

2014-06-02 Thread Philippe Verdy

Your example would have been better explained by just saying that in Java,
the regexp represented in source code as "\\uD808\\uDF45" means matching
two successive 16-bit code units, and "\\uD808" or "\\uDF45" just matches
one.

The "\\u" regexo notation (in source code, equivalentto "\u" in
string at runtime) does not designate necessarily a full code point.

Unlike the "\\x{}" and "." regexs which will necessarily match a full
code point in the target (even if it's an isolated surrogate).

But there's no way in Java to represent a target string that can store
arbitrary sequences of codepoints if you use the String type (this is not
specific to Java but applies as well to any language or runtime library
handling streams of 16-bit code units, including in C, C++, Python,
Javascript, PHP...).

The problem is then not in the way you write regexps, but in the way the
target string is encoded : it is not technically posible with 16-bit
streams to represent arbitrary sequences of codepoints, but only arbitrary
sequences of 16-bit code units (even if they aren't valid UTF-16 text). But
there's no problem at all to process valid UTF-16 streams.

Your "lead", "trail" and "one+" are representable in Java as arbitrary
16-bit streams but they do not represent not valid Unicode texts. On the
opposite all your "tests[]" strings are valid Unicode texts but their
interpretation as regexps are not necessarily valid regexps.

Each time you use single backslashes in a Java source-code string, there's
no warranty it will be a valid Unicode text even though it will compile
without problem as a valid 16-bit stream (and the same will be true in
other languages).

If you want to represent aribtrary sequences of codepoints in a target
text, you cannot use any UTF alone (it may be technically possible with
UTF-8 or UTF-32, but these are also invalid for these standard encodings),
without using an escaping mechanism such as the double backslashes like in
the notation of regexps. This escaping mechnims is then independant of the
actual runtime encoding used to transport the escaped streams within valid
Unicode texts.

In summary; arbitrary sequences of codepoints in a valid Unicode text
require escaping mechanism on top of the actual text encoding for the
storage or transport (there are other ways to escape arbitrary streams into
valid texts, including the U+NN notation, or Base64 or Hex or octal
representation of UTF-32, or Punycode. and many other technics used to
embed binary objets (UUCP, Postscript streams). In HTTP a few of them are
suported as standard "transport syntaxes". Terminal protocols (like VT220
and related, or Videotext) have since long used escape sequences (plus
controls like SI/SO encapsulation and isolated DLE escapes for transporting
8-bit data over a 7-bit stream)

Technically the Java strings at runtime are not plain text (unless they are
checked on input and the validaty conditions are not brokeb by some text
transforms like extraction ob substrings at arbitrary absolute positions,
or with error recovery with resynchronization after a failure or missing
data, where these errors are likely to occur because we have no warranty
that validity is kept during the exchange by matching preconditions and
postconditions), they are binary object (and this is also true for C/C++
standard strings, or PHP strings, or the content transported by an HTTP
session or a terminal protocol (defining also its own escaping mechanism
where needed).

If yuo develop a general purpose library in any language that can be reused
in arbitrary code, you cannot assume on input that all preconditions are
satisfied so you need to check the input. And you also have to be careful
about the design of your library to make sure that it respects the
postconditions (some library APIs are technically unsafe, notably
extracting substrings and almost blocked I/O using fixed size buffers such
as file I/O in filesystems that do not discritimate text files and binary
files (so that text files will use buffers with variable length only broken
at codepoint positions and not at arbitrary code unit positions.

As far as I know, there does not exist any filesystem that enforce code
point positions (unless it uses non-space efficient encodings with code
units wider than 20 bits (storage devices are optimized for code units wth
size that are a power of 2 in bytes, so you would finally use only files
whose sizes in bytes is a multiple of 4 and all random access file
positions also a multiple of 4 bytes.

You could also use 24-but storage code units with blocks limited to sectors
of 255 bytes with the extra byte only used as a filler or as a length
indicator in that sector (255 bytes would store 85 arbitrary code units of
24 bits but you will still need to check the value range of these code
units if you want to restrict the the U+.U+10 codepoint space,
unless your application code handles all of the extra code units like
non-character code

Re: Unicode Regular Expressions, Surrogate Points and UTF-8

2014-06-02 Thread Mark Davis ☕️

> \uD808\uDF45 specifies a sequence of two codepoints.

That is simply incorrect.

In Java (and similar environments), \u means a char (a UTF16 code
unit), not a code point. Here is the difference. If you are not used to
Java, string.replaceAll(x,y) uses Java's regex to replace the pattern x
with the replacement y in string. Backslashes in literals need escaping, so
\x needs to be written in literals as \\x.

String[] tests = {"\\x{12345}", "\\uD808\\uDF45", "\uD808\uDF45",
"«.»"};
String target =
 "one: «\uD808\uDF45»\t\t" +
"two: «\uD808\uDF45\uD808\uDF45»\t\t" +
"lead: «\uD808»\t\t" +
"trail: «\uDF45»\t\t" +
"one+: «\uD808\uDF45\uD808»";
System.out.println("pattern" + "\t→\t" + target + "\n");
for (String test : tests) {
  System.out.println(test + "\t→\t" + target.replaceAll(test, "§︎"));
}


*Output:*
pattern → one: «𒍅» two: «𒍅𒍅» lead: «?» trail: «?» one+: «𒍅?»

\x{12345} → one: «§︎» two: «§︎§︎» lead: «?» trail: «?» one+: «§︎?»
\uD808\uDF45 → one: «§︎» two: «§︎§︎» lead: «?» trail: «?» one+: «§︎?»
𒍅 → one: «§︎» two: «§︎§︎» lead: «?» trail: «?» one+: «§︎?»
«.» → one: §︎ two: «𒍅𒍅» lead: §︎ trail: §︎ one+: «𒍅?»

The target has various combinations of code units, to see what happens.
Notice that Java treats a pair of lead+trail as a single code point for
matching (eg .), but also an isolated surrogate char as a single code point
(last line of output). Note that Java's regex in addition allows \x{hex}
for specifying a code point explicitly. It also has the syntax \u (in a
literal the \ needs escaping) to specify a code unit; that is slightly
different than the Java preprocessing. Thus the first two are equivalent,
and replace "{" by "x". The last two are also equivalent—and fail—because a
single "{" is a broken regex pattern.

System.out.println("{".replaceAll("\\u007B", "x"));
System.out.println("{".replaceAll("\\x{7B}", "x"));

System.out.println("{".replaceAll("\u007B", "x"));
System.out.println("{".replaceAll("{", "x"));



Mark 

 *— Il meglio è l’inimico del bene —*


On Sun, Jun 1, 2014 at 7:04 PM, Richard Wordingham <
richard.wording...@ntlworld.com> wrote:

> On Sun, 1 Jun 2014 08:58:26 -0700
> Markus Scherer  wrote:
>
> > You misunderstand. In Java, \uD808\uDF45 is the only way to escape a
> > supplementary code point, but as long as you have a surrogate pair,
> > it is treated as a code point in APIs that support them.
>
> Wasn't obvious that in the following paragraph \uD808\uDF45 was a
> pattern?
>
> "Bear in mind that a pattern \uD808 shall not match anything in a
> well-formed Unicode string. \uD808\uDF45 specifies a sequence of two
> codepoints. This sequence can occur in an ill-formed UTF-32 Unicode
> string and before Unicode 5.2 could readily be taken to occur in an
> ill-formed UTF-8 Unicode string.  RL1.7 declares that for a regular
> expression engine, the codepoint sequence  cannot
> occur in a UTF-16 Unicode string; instead, the code unit sequence  DF45> is the codepoint sequence  KI>."
>
> (It might have been clearer to you if I'd said '8-bit' and '16-bit'
> instead of UTF-8 and UTF-16.  It does make me wonder what you'd call a
> 16-bit encoding of arbitrary *codepoint* sequences.)
>
> Richard.
> ___
> Unicode mailing list
> Unicode@unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Corrigendum #9

Re: Corrigendum #9

Re: Corrigendum #9

Re: Corrigendum #9

Re: Corrigendum #9

Re: Corrigendum #9

RE: Corrigendum #9

Re: Corrigendum #9

Re: Corrigendum #9

Re: Corrigendum #9

RE: Corrigendum #9

Re: Corrigendum #9

Re: Corrigendum #9

Re: Unicode Regular Expressions, Surrogate Points and UTF-8

Re: Corrigendum #9

RE: Corrigendum #9

Re: Corrigendum #9

RE: Corrigendum #9

RE: Corrigendum #9

Re: Corrigendum #9

Re: Corrigendum #9

RE: Corrigendum #9

RE: Corrigendum #9

Re: Corrigendum #9

Re: Corrigendum #9

RE: Corrigendum #9

Re: Corrigendum #9

RE: Corrigendum #9

Re: Corrigendum #9

RE: Corrigendum #9

Re: Corrigendum #9

Re: Corrigendum #9

Re: Unicode Regular Expressions, Surrogate Points and UTF-8

Re: Unicode Regular Expressions, Surrogate Points and UTF-8

34 matches

Site Navigation

Mail list logo

Footer information