Re: Unicode Regular Expressions, Surrogate Points and UTF-8

2014-06-02 Thread Mark Davis ☕️
 \uD808\uDF45 specifies a sequence of two codepoints.

​That is simply incorrect.​

In Java (and similar environments), \uXXXX means a char (a UTF-16 code
unit), not a code point. Here is the difference. If you are not used to
Java, string.replaceAll(x, y) uses Java's regex to replace the pattern x
with the replacement y in string. Backslashes in literals need escaping, so
\x needs to be written in literals as \\x.

String[] tests = {"\\x{12345}", "\\uD808\\uDF45", "\uD808\uDF45", "«.»"};
String target =
    "one: «\uD808\uDF45»\t\t" +
    "two: «\uD808\uDF45\uD808\uDF45»\t\t" +
    "lead: «\uD808»\t\t" +
    "trail: «\uDF45»\t\t" +
    "one+: «\uD808\uDF45\uD808»";
System.out.println("pattern" + "\t→\t" + target + "\n");
for (String test : tests) {
  System.out.println(test + "\t→\t" + target.replaceAll(test, "§︎"));
}


*Output:*
pattern → one: «⍅» two: «⍅⍅» lead: «?» trail: «?» one+: «⍅?»

\x{12345} → one: «§︎» two: «§︎§︎» lead: «?» trail: «?» one+: «§︎?»
\uD808\uDF45 → one: «§︎» two: «§︎§︎» lead: «?» trail: «?» one+: «§︎?»
⍅ → one: «§︎» two: «§︎§︎» lead: «?» trail: «?» one+: «§︎?»
«.» → one: §︎ two: «⍅⍅» lead: §︎ trail: §︎ one+: «⍅?»

The target has various combinations of code units, to see what happens.
Notice that Java treats a pair of lead+trail as a single code point for
matching (e.g. with .), but also treats an isolated surrogate char as a
single code point (last line of output). Note that Java's regex in addition
allows \x{hex} for specifying a code point explicitly. It also has the
syntax \uXXXX (in a literal the \ needs escaping) to specify a code unit;
that is slightly different from the Java preprocessing of \uXXXX in source
code. Thus the first two lines below are equivalent, and replace { by x.
The last two are also equivalent—and fail—because a
single { is a broken regex pattern.

System.out.println("{".replaceAll("\\u007B", "x"));
System.out.println("{".replaceAll("\\x{7B}", "x"));

System.out.println("{".replaceAll("\u007B", "x"));
System.out.println("{".replaceAll("{", "x"));



Mark https://google.com/+MarkDavis

 *— Il meglio è l’inimico del bene —*


On Sun, Jun 1, 2014 at 7:04 PM, Richard Wordingham 
richard.wording...@ntlworld.com wrote:

 On Sun, 1 Jun 2014 08:58:26 -0700
 Markus Scherer markus@gmail.com wrote:

  You misunderstand. In Java, \uD808\uDF45 is the only way to escape a
  supplementary code point, but as long as you have a surrogate pair,
  it is treated as a code point in APIs that support them.

 Wasn't it obvious that in the following paragraph \uD808\uDF45 was a
 pattern?

 Bear in mind that a pattern \uD808 shall not match anything in a
 well-formed Unicode string. \uD808\uDF45 specifies a sequence of two
 codepoints. This sequence can occur in an ill-formed UTF-32 Unicode
 string and before Unicode 5.2 could readily be taken to occur in an
 ill-formed UTF-8 Unicode string.  RL1.7 declares that for a regular
 expression engine, the codepoint sequence U+D808, U+DF45 cannot
 occur in a UTF-16 Unicode string; instead, the code unit sequence D808
 DF45 is the codepoint sequence U+12345 CUNEIFORM SIGN URU TIMES
 KI.

 (It might have been clearer to you if I'd said '8-bit' and '16-bit'
 instead of UTF-8 and UTF-16.  It does make me wonder what you'd call a
 16-bit encoding of arbitrary *codepoint* sequences.)

 Richard.


Re: Unicode Regular Expressions, Surrogate Points and UTF-8

2014-06-02 Thread Philippe Verdy
Your example would have been better explained by just saying that in Java,
the regexp represented in source code as \\uD808\\uDF45 means matching
two successive 16-bit code units, and \\uD808 or \\uDF45 just matches
one.

The \\uXXXX regex notation (in source code, equivalent to \uXXXX in a
string at runtime) does not necessarily designate a full code point.

Unlike the \\x{...} and . regexes, which will necessarily match a full
code point in the target (even if it's an isolated surrogate).
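
A quick way to check that on a given Java runtime (a throwaway test of my
own, not part of Mark's example; the target string is just a paired
surrogate followed by a lone lead surrogate) is to walk the matches of .
and print what each one contains:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotMatchTest {
    public static void main(String[] args) {
        // U+12345 (a surrogate pair) followed by a lone lead surrogate
        String target = "\uD808\uDF45" + "\uD808";
        Matcher m = Pattern.compile(".").matcher(target);
        while (m.find()) {
            System.out.println("match of " + m.group().length()
                + " code unit(s), " + m.group().codePoints().count()
                + " code point(s)");
        }
    }
}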

But there's no way in Java to represent a target string that can store
arbitrary sequences of codepoints if you use the String type (this is not
specific to Java but applies as well to any language or runtime library
handling streams of 16-bit code units, including C, C++, Python,
Javascript, PHP...).

The problem is then not in the way you write regexps, but in the way the
target string is encoded: it is not technically possible with 16-bit
streams to represent arbitrary sequences of codepoints, but only arbitrary
sequences of 16-bit code units (even if they aren't valid UTF-16 text). But
there's no problem at all in processing valid UTF-16 streams.

Your lead, trail and one+ are representable in Java as arbitrary
16-bit streams, but they do not represent valid Unicode texts. By contrast,
all your tests[] strings are valid Unicode texts, but their
interpretations as regexps are not necessarily valid regexps.

Each time you use single backslashes in a Java source-code string, there's
no guarantee it will be a valid Unicode text, even though it will compile
without problem as a valid 16-bit stream (and the same is true in
other languages).

If you want to represent arbitrary sequences of codepoints in a target
text, you cannot use any UTF alone (it may be technically possible with
UTF-8 or UTF-32, but the results are then invalid for these standard
encodings) without using an escaping mechanism such as the double
backslashes used in the notation of regexps. This escaping mechanism is
then independent of the actual runtime encoding used to transport the
escaped streams within valid Unicode texts.

In summary: arbitrary sequences of codepoints in a valid Unicode text
require an escaping mechanism on top of the actual text encoding used for
storage or transport. (There are other ways to escape arbitrary streams
into valid texts, including the U+NNNN notation, Base64, hex or octal
representations of UTF-32, or Punycode, and many other techniques used to
embed binary objects, such as UUCP or Postscript streams. In HTTP a few of
them are supported as standard transport syntaxes. Terminal protocols (like
VT220 and related, or Videotex) have long used escape sequences, plus
controls like SI/SO encapsulation and isolated DLE escapes, for
transporting 8-bit data over a 7-bit stream.)
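
For illustration only, here is one such ad-hoc escaping layer sketched in
Java (the \x{...} syntax, and the decision to escape lone surrogates,
noncharacters and the backslash itself, are arbitrary choices of mine, not
any standard mechanism):

// Sketch: write arbitrary code points as plain text, escaping the ones that
// cannot or should not travel as-is (lone surrogates, noncharacters, '\').
static String escapeCodePoints(int[] codePoints) {
    StringBuilder sb = new StringBuilder();
    for (int cp : codePoints) {
        boolean noncharacter = (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
        boolean loneSurrogate = cp >= 0xD800 && cp <= 0xDFFF;
        if (noncharacter || loneSurrogate || cp == '\\') {
            sb.append("\\x{").append(Integer.toHexString(cp).toUpperCase()).append('}');
        } else {
            sb.appendCodePoint(cp);
        }
    }
    return sb.toString();
}

A matching decoder just parses the \x{...} sequences back into code points;
the escaped form itself is ordinary, valid Unicode text in any UTF.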

Technically, Java strings at runtime are not plain text (unless they are
checked on input and the validity conditions are not broken by text
transforms like extraction of substrings at arbitrary absolute positions,
or by error recovery with resynchronization after a failure or missing
data, where these errors are likely to occur because we have no guarantee
that validity is kept during the exchange by matching preconditions and
postconditions); they are binary objects (and this is also true for C/C++
standard strings, or PHP strings, or the content transported by an HTTP
session or a terminal protocol, which also defines its own escaping
mechanism where needed).

If you develop a general-purpose library in any language that can be reused
in arbitrary code, you cannot assume on input that all preconditions are
satisfied, so you need to check the input. And you also have to be careful
about the design of your library to make sure that it respects the
postconditions (some library APIs are technically unsafe, notably
extracting substrings, and almost all block I/O using fixed-size buffers,
such as file I/O on filesystems that do not discriminate between text files
and binary files, so that text files would need buffers of variable length,
broken only at codepoint positions and not at arbitrary code unit
positions).

As far as I know, there does not exist any filesystem that enforces code
point positions (unless it uses non-space-efficient encodings with code
units wider than 20 bits; storage devices are optimized for code units
whose size is a power of 2 in bytes, so you would finally use only files
whose size in bytes is a multiple of 4, and all random-access file
positions would also be a multiple of 4 bytes).

You could also use 24-bit storage code units with blocks limited to sectors
of 255 bytes, with the extra byte only used as a filler or as a length
indicator in that sector (255 bytes would store 85 arbitrary code units of
24 bits, but you would still need to check the value range of these code
units if you want to restrict to the U+0000..U+10FFFF codepoint space,
unless your application code handles all of the extra code units like
non-character code points).

However the 

Re: Corrigendum #9

2014-06-02 Thread Doug Ewell
It seems that the broadening of the term "interchange" in this
corrigendum to mean almost any type of processing imaginable, below,
is what caused the trouble. This is the decision that would need to be
reconsidered if the real intent of noncharacters is to be expressed.

I suspect everyone can agree on the edge cases, that noncharacters are
harmless in internal processing, but probably should not appear in
random text shipped around on the web.

 This is necessary for the effective use of noncharacters, because
 anytime a Unicode string crosses an API boundary, it is in effect
 being interchanged. Furthermore, for distributed software, it is
 often very difficult to determine what constitutes an internal
 versus an external context for any particular software process.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell




Re: Corrigendum #9

2014-06-02 Thread Markus Scherer
On Mon, Jun 2, 2014 at 8:27 AM, Doug Ewell d...@ewellic.org wrote:

 I suspect everyone can agree on the edge cases, that noncharacters are
 harmless in internal processing, but probably should not appear in
 random text shipped around on the web.


Right, in principle. However, it should be ok to include noncharacters in
CLDR data files for processing by CLDR implementations, and it should be
possible to edit and diff and version-control and web-view those files etc.

It seems that trying to define "interchange" and "public" in ways that
satisfy everyone will not be successful.

The FAQ already gives some examples of where noncharacters might be used,
should be preserved, or could be stripped, starting with "Q: Are
noncharacters intended for interchange?"
http://www.unicode.org/faq/private_use.html#nonchar6

In my view, those Q/A pairs explain noncharacters quite well. If there are
further examples of where noncharacters might be used, should be preserved,
or could be stripped, and that would be particularly useful to add to the
examples already there, then we could add them.

markus


Re: Corrigendum #9

2014-06-02 Thread Mark Davis ☕️
The problem is where to draw the line. In today's world, what's an app? You
may have a cooperating system of apps, where it is perfectly reasonable
to interchange sentinel values (for example).

I agree with Markus; I think the FAQ is pretty clear. (And if not, that's
where we should make it clearer.)


Mark https://google.com/+MarkDavis

 *— Il meglio è l’inimico del bene —*


On Mon, Jun 2, 2014 at 6:02 PM, Shawn Steele shawn.ste...@microsoft.com
wrote:

  I also think that the verbiage swung too far the other way.  Sure, I
 might need to save or transmit a file to talk to myself later, but apps
 should be strongly discouraged for using these for interchange with other
 apps.



 Interchange bugs are why nearly any news web site ends up with at least a
 few articles with mangled apostrophes or whatever (because of encoding
 differences).  Should authors’ tools or feeds or databases or whatever
 start emitting non-characters from internal use, then we’re going to have
 ugly leak into text “everywhere”.



 So I’d prefer to see text that better permitted interchange with other
 components of an application’s internal system or partner system, yet
 discouraged use for interchange with “foreign” apps.



 -Shawn





RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
That’s what I think is exactly what should be clarified.  A cooperating system 
of apps should likely use some other markup; however, if they want to use a 
noncharacter to say “OK to insert ad here” (or whatever), that’s up to them.

I fear that the current wording says “Because you might have a cooperating 
system of apps that all agree a noncharacter is ‘OK to insert ad here’, you may 
as well emit it all the time just in case some other app happens to use the same 
sentinel”.

The “problem” is now that previously these characters were illegal, so my 
application didn’t have to explicitly remove them when importing external stuff 
because they weren’t allowed to be there.  With the wording of the corrigendum, 
the onus is on every app importing data to filter out these code points because 
they are “suddenly” legal in foreign data streams.

That is a breaking change for applications, and, worse, it isn’t in the control 
of the applications that take advantage of the newly laxer wording, but rather 
all the other applications on the planet, which may have been stable for years.

My interpretation of “interchanged” was “interchanged outside of a system that 
understood your private use of the noncharacters”.  I can see where that may 
not have been everyone’s interpretation, and maybe it should be updated.  My 
interpretation of what you’re saying below is “sentinel values with a private 
meaning can be exchanged between apps”, which is what the PUA is for.

I don’t mind at all if the definition is loosened somewhat, but if we’re 
turning them into PUA characters we should just turn them into PUA characters.

-Shawn

From: mark.edward.da...@gmail.com [mailto:mark.edward.da...@gmail.com] On 
Behalf Of Mark Davis ☕️
Sent: Monday, June 2, 2014 9:08 AM
To: Shawn Steele
Cc: Markus Scherer; Doug Ewell; Unicode Mailing List
Subject: Re: Corrigendum #9

The problem is where to draw the line. In today's world, what's an app? You may 
have a cooperating system of apps, where it is perfectly reasonable to 
interchange sentinel values (for example).

I agree with Markus; I think the FAQ is pretty clear. (And if not, that's where 
we should make it clearer.)


Mark https://google.com/+MarkDavis

— Il meglio è l’inimico del bene —

On Mon, Jun 2, 2014 at 6:02 PM, Shawn Steele 
shawn.ste...@microsoft.com wrote:
I also think that the verbiage swung too far the other way.  Sure, I might need 
to save or transmit a file to talk to myself later, but apps should be strongly 
discouraged from using these for interchange with other apps.

Interchange bugs are why nearly any news web site ends up with at least a few 
articles with mangled apostrophes or whatever (because of encoding 
differences).  Should authors’ tools or feeds or databases or whatever start 
emitting non-characters from internal use, then we’re going to have ugly leak 
into text “everywhere”.

So I’d prefer to see text that better permitted interchange with other 
components of an application’s internal system or partner system, yet 
discouraged use for interchange with “foreign” apps.

-Shawn




Re: Corrigendum #9

2014-06-02 Thread Mark Davis ☕️
On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele shawn.ste...@microsoft.com
wrote:

 The “problem” is now that previously these characters were illegal


The problem was that we were inconsistent in standard and related material
about just what the status was for these things.



Mark https://google.com/+MarkDavis

 *— Il meglio è l’inimico del bene —*


RE: Corrigendum #9

2014-06-02 Thread Doug Ewell
Shawn Steele Shawn dot Steele at microsoft dot com wrote:

 So I’d prefer to see text that better permitted interchange with other
 components of an application’s internal system or partner system, yet
 discouraged use for interchange with foreign apps.

If any wording is to be revised, while we're at it, I'd also like to see
a reaffirmation of the proper relationship between private-use
characters and noncharacters. I still hear arguments that private-use
characters are to be avoided in public interchange at all costs, as if
lack of knowledge of the private agreement, or conflicting
interpretations, will cause some kind of major security breach. At the
same time, the Corrigendum seems to imply that noncharacters in public
interchange are no big deal. That seems upside-down.

Mark Davis  mark at macchiato dot com replied:

 The problem is where to draw the line. In today's world, what's an
 app? You may have a cooperating system of apps, where it is
 perfectly reasonable to interchange sentinel values (for example).

Correct. Most people wouldn't consider a cooperating system like that
quite the same as true public interchange, like throwing this ���
into a message on a public mailing list.

Since the Corrigendum deals with recommendations rather than hard
requirements, SHOULDs rather than MUSTs, it doesn't seem that a bright
line is really needed.

 I agree with Markus; I think the FAQ is pretty clear. (And if not,
 that's where we should make it clearer.)

But the formal wording of the standard should reflect that clarity,
right?

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell




Re: Corrigendum #9

2014-06-02 Thread Asmus Freytag

On 6/2/2014 9:27 AM, Mark Davis ☕️ wrote:


On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele 
shawn.ste...@microsoft.com wrote:


The “problem” is now that previously these characters were illegal


The problem was that we were inconsistent in standard and related 
material about just what the status was for these things.




And threw the baby out to fix it.

A./


 Mark https://google.com/+MarkDavis

 — Il meglio è l’inimico del bene —




Re: Corrigendum #9

2014-06-02 Thread Asmus Freytag

On 6/2/2014 9:08 AM, Mark Davis ☕️ wrote:
The problem is where to draw the line. In today's world, what's an 
app? You may have a cooperating system of apps, where it is 
perfectly reasonable to interchange sentinel values (for example).


The way to draw the line is to insist on there being an agreement 
between sender and ultimate receiver, and a pass-through agreement (if 
you will) for any intermediate stage, so that the coast is clear.


What defines an implementation in this scenario is the existence of 
the agreement.


What got us into trouble is that the negative case (pass-through) was 
not well-defined, and led to people assuming that they had to filter 
any incoming noncharacters.


Because noncharacters can have any interpretation (not limited to 
interpretations as characters), it is much riskier to send them out 
oblivious of whether the intended recipient is part of the same agreement 
on their interpretation as the sender. In that sense, they are not mere 
PUA code points.


The other aspect of their original design was to allow code points that 
recipients were free not to honor or preserve if they were not part of 
the agreement (and hadn't made an explicit or implicit pass-through 
agreement). Otherwise, if anyone expects them to be preserved, no 
application like Word would be free to use these for purely internal use.


Word thus would not be a tool to handle CLDR data, which may be 
disappointing to some, but should be fine.


A./


I agree with Markus; I think the FAQ is pretty clear. (And if not, 
that's where we should make it clearer.)



Mark https://google.com/+MarkDavis

— Il meglio è l’inimico del bene —


On Mon, Jun 2, 2014 at 6:02 PM, Shawn Steele 
shawn.ste...@microsoft.com wrote:


I also think that the verbiage swung too far the other way.  Sure,
I might need to save or transmit a file to talk to myself later,
but apps should be strongly discouraged for using these for
interchange with other apps.

Interchange bugs are why nearly any news web site ends up with at
least a few articles with mangled apostrophes or whatever (because
of encoding differences).  Should authors’ tools or feeds or
databases or whatever start emitting non-characters from internal
use, then we’re going to have ugly leak into text “everywhere”.

So I’d prefer to see text that better permitted interchange with
other components of an application’s internal system or partner
system, yet discouraged use for interchange with “foreign” apps.

-Shawn




RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
  I agree with Markus; I think the FAQ is pretty clear. (And if not, 
  that's where we should make it clearer.)

 But the formal wording of the standard should reflect that clarity, right?

I don't tend to read the FAQ :)



RE: Corrigendum #9

2014-06-02 Thread Doug Ewell
I wrote, sort of:
 
 Correct. Most people wouldn't consider a cooperating system like that
 quite the same as true public interchange, like throwing this ���
 into a message on a public mailing list.

Oh, look. My mail system converted those nice noncharacters into U+FFFD.
Was that compliant? Did I deserve what I got? Are those two different
questions?

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell




Re: Corrigendum #9

2014-06-02 Thread Mark Davis ☕️
I disagree with that characterization, of course.

The recommendation for libraries and low-level tools to pass them through
rather than screw with them makes them usable. The recommendation to check
for noncharacters from unknown sources and fix them was good advice then,
and is good advice now. Any app where input of noncharacters causes
security problems or crashes is and was not a very good app.


Mark https://google.com/+MarkDavis

 *— Il meglio è l’inimico del bene —*


On Mon, Jun 2, 2014 at 6:37 PM, Asmus Freytag asm...@ix.netcom.com wrote:

  On 6/2/2014 9:27 AM, Mark Davis ☕️ wrote:


 On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele shawn.ste...@microsoft.com
 wrote:

 The “problem” is now that previously these characters were illegal


  The problem was that we were inconsistent in standard and related
 material about just what the status was for these things.


   And threw the baby out to fix it.

 A./


  Mark https://google.com/+MarkDavis

  *— Il meglio è l’inimico del bene —*




Re: Corrigendum #9

2014-06-02 Thread Asmus Freytag

On 6/2/2014 9:38 AM, Shawn Steele wrote:

I agree with Markus; I think the FAQ is pretty clear. (And if not,
that's where we should make it clearer.)

But the formal wording of the standard should reflect that clarity, right?

I don't tend to read the FAQ :)


FAQs are useful, but they are not binding. They are even less binding 
than the general explanation in the text of the core specification, which 
itself doesn't rise to the level of the conformance clauses and definitions...


Doug's unease about the upside-down nature of the wording regarding 
PUA and noncharacters is something that should be addressed in revised 
text in the core specification.


A./




RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
To further my understanding, can someone provide examples of how these are used 
in actual practice?  I can't think of any offhand and the closest I get is like 
the old escape characters to get a dot matrix printer to shift modes, or old 
word processor internal formatting sequences.



RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
 Oh, look. My mail system converted those nice noncharacters into U+FFFD.
 Was that compliant? Did I deserve what I got? Are those two different 
 questions?

I think I just got spaces.



Re: Corrigendum #9

2014-06-02 Thread Markus Scherer
On Mon, Jun 2, 2014 at 10:00 AM, Shawn Steele shawn.ste...@microsoft.com
wrote:

 To further my understanding, can someone provide examples of how these are
 used in actual practice?


CLDR collation data defines special contraction mappings that start with a
noncharacter, for CJK index markers; see
http://www.unicode.org/reports/tr35/tr35-collation.html#CJK_Index_Markers

In CLDR 23 and before (when we were still using XML collation syntax),
these were raw noncharacters in the .xml files.

As I said earlier:
it should be ok to include noncharacters in CLDR data files for processing
by CLDR implementations, and it should be possible to edit and diff and
version-control and web-view those files etc.

markus


RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
Hmm, I find that disconcerting.  I’d prefer a real Unicode character with 
special weights if that concept’s needed.  And I guess that goes a long way to 
explaining the interchange problem, since clearly the code editor’s going to 
need these ☹

From: Markus Scherer [mailto:markus@gmail.com]
Sent: Monday, June 2, 2014 10:17 AM
To: Shawn Steele
Cc: Asmus Freytag; Doug Ewell; Mark Davis ☕️; Unicode Mailing List
Subject: Re: Corrigendum #9

On Mon, Jun 2, 2014 at 10:00 AM, Shawn Steele 
shawn.ste...@microsoft.com wrote:
To further my understanding, can someone provide examples of how these are used 
in actual practice?

CLDR collation data defines special contraction mappings that start with a 
noncharacter, for 
http://www.unicode.org/reports/tr35/tr35-collation.html#CJK_Index_Markers

In CLDR 23 and before (when we were still using XML collation syntax), these 
were raw noncharacters in the .xml files.

As I said earlier:
it should be ok to include noncharacters in CLDR data files for processing by 
CLDR implementations, and it should be possible to edit and diff and 
version-control and web-view those files etc.

markus


Re: Corrigendum #9

2014-06-02 Thread Richard Wordingham
On Mon, 2 Jun 2014 10:17:04 -0700
Markus Scherer markus@gmail.com wrote:

 CLDR collation data defines special contraction mappings that start
 with a noncharacter, for
 http://www.unicode.org/reports/tr35/tr35-collation.html#CJK_Index_Markers

 In CLDR 23 and before (when we were still using XML collation syntax),
 these were raw noncharacters in the .xml files.

 As I said earlier:
 it should be ok to include noncharacters in CLDR data files for
 processing by CLDR implementations, and it should be possible to edit
 and diff and version-control and web-view those files etc.

They come as a nasty shock when someone thinks XML files are marked-up
text files.  I'm still surprised that the published human-readable form
of CLDR files should contain automatically applied non-Unicode copyright
claims.

Richard.


Re: Unicode Regular Expressions, Surrogate Points and UTF-8

2014-06-02 Thread Richard Wordingham
On Mon, 2 Jun 2014 11:29:09 +0200
Mark Davis ☕️ m...@macchiato.com wrote:

  \uD808\uDF45 specifies a sequence of two codepoints.
 
 ​That is simply incorrect.​

The above is in the sample notation of UTS #18 Version 17 Section 1.1.

From what I can make out, the corresponding Java notation would be
\x{D808}\x{DF45}.  I don't *know* what \x{D808} and \x{DF45} match in
Java, or whether they are even acceptable.  The only thing UTS #18
RL1.7 permits them to match in Java is lone surrogates, but I don't
know if Java complies.
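
A throwaway test (assuming the stock java.util.regex engine; the code units
are the same ones Mark used) would settle the Java question: it either
throws a PatternSyntaxException or prints whether \x{D808} finds anything
in a lone lead surrogate and in a well-formed pair.

import java.util.regex.Pattern;

public class LoneSurrogatePatternTest {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\x{D808}");              // is it even accepted?
        System.out.println(p.matcher("\uD808").find());        // lone lead surrogate
        System.out.println(p.matcher("\uD808\uDF45").find());  // well-formed pair, U+12345
    }
}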

All UTS #18 says for sure about regular expressions matching code units
is that they don't satisfy RL1.1, though Section 1.7 appears to ban
them when it says, "A fundamental requirement is that Unicode text be
interpreted semantically by code point, not code units."  Perhaps it's
a fundamental requirement of something other than UTS #18.  I thought
matching parts of characters in terms of their canonical equivalences
was awkward enough, without having the additional option of matching
some of the code units!

Richard.



Re: Corrigendum #9

2014-06-02 Thread David Starner
On Mon, Jun 2, 2014 at 8:48 AM, Markus Scherer markus@gmail.com wrote:
 Right, in principle. However, it should be ok to include noncharacters in
 CLDR data files for processing by CLDR implementations, and it should be
 possible to edit and diff and version-control and web-view those files etc.

Why? It seems you're changing the rules so some Unicode guys can get
oversmart in using Unicode in their systems. You could do the same
thing everyone else does and use special tags or symbols you have to
escape. I would especially discourage any web browser from handling
these; they're noncharacters used for unknown purposes that are
undisplayable and if used carelessly for their stated purpose, can
probably trigger serious bugs in some lamebrained utility.

-- 
Kie ekzistas vivo, ekzistas espero.


Re: Corrigendum #9

2014-06-02 Thread Markus Scherer
On Mon, Jun 2, 2014 at 1:32 PM, David Starner prosfil...@gmail.com wrote:

 I would especially discourage any web browser from handling
 these; they're noncharacters used for unknown purposes that are
 undisplayable and if used carelessly for their stated purpose, can
 probably trigger serious bugs in some lamebrained utility.


I don't expect handling these in web browsers and lamebrained utilities.
I expect them to be treated like unassigned code points.

markus


Re: Corrigendum #9

2014-06-02 Thread Asmus Freytag

On 6/2/2014 2:53 PM, Markus Scherer wrote:
On Mon, Jun 2, 2014 at 1:32 PM, David Starner prosfil...@gmail.com wrote:


I would especially discourage any web browser from handling
these; they're noncharacters used for unknown purposes that are
undisplayable and if used carelessly for their stated purpose, can
probably trigger serious bugs in some lamebrained utility.


I don't expect handling these in web browsers and lamebrained 
utilities. I expect treat like unassigned code points.




I can't shake the suspicion that Corrigendum #9 is not actually solving 
a general problem, but is a special favor to CLDR as being run by 
insiders, and in the process muddying the waters for everyone else.


A./


Re: Corrigendum #9

2014-06-02 Thread David Starner
On Mon, Jun 2, 2014 at 2:53 PM, Markus Scherer markus@gmail.com wrote:
 On Mon, Jun 2, 2014 at 1:32 PM, David Starner prosfil...@gmail.com wrote:

 I would especially discourage any web browser from handling
 these; they're noncharacters used for unknown purposes that are
 undisplayable and if used carelessly for their stated purpose, can
 probably trigger serious bugs in some lamebrained utility.


 I don't expect handling these in web browsers and lamebrained utilities. I
 expect treat like unassigned code points.

So certain programs can't use noncharacters internally because some
people want to interchange them? That doesn't seem like what
noncharacters should be used for.

Unix utilities shouldn't usually go to the trouble of messing with
them; limiting the number of changes needed for Unicode was the whole
point of UTF-8. Any program transferring them across the Internet as
text should filter them, IMO; either some lamebrained utility will
open a security hole by using them and not filtering first, or
something will filter them after security checks have been done, or
something. Unless it's a completely trusted system, text files with
these characters should be treated with extreme prejudice by the first
thing that receives them over the net.
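
Such a filter is only a few lines; a sketch of mine (replacing with U+FFFD
is just one possible policy, and the two tests are simply the standard
noncharacter ranges U+FDD0..U+FDEF plus the code points whose low 16 bits
are FFFE or FFFF):

// Sketch: map every noncharacter code point to U+FFFD, pass everything else through.
static String filterNoncharacters(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    s.codePoints().forEach(cp -> {
        boolean noncharacter = (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
        sb.appendCodePoint(noncharacter ? 0xFFFD : cp);
    });
    return sb.toString();
}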

-- 
Kie ekzistas vivo, ekzistas espero.


RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
 I can't shake the suspicion that Corrigendum #9 is not actually solving a 
general problem, but is a special favor to CLDR as being run by insiders, and 
in the process muddying the waters for everyone else

I think we could generalize to other scenarios so it wasn’t necessarily an 
insider scenario.  For example, I could have a string manipulation library that 
used FFFE to indicate the beginning of an identifier for a localizable 
sentence, terminated by another noncharacter.  Any system using tokens like 
FFFEid1234 would likely expect to be able to read the tokens in their favorite 
code editor.
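
To make that hypothetical concrete, a sketch of such a sentinel scheme;
U+FFFE as the opener follows the example above, and U+FFFF as the
terminator is just an arbitrary pick for illustration:

// Hypothetical sentinel scheme from the example above: U+FFFE opens an
// identifier for a localizable sentence, U+FFFF (an arbitrary pick) closes it.
static String wrapId(String id) {
    return "\uFFFE" + id + "\uFFFF";
}

static String extractId(String text) {
    int start = text.indexOf('\uFFFE');
    int end = text.indexOf('\uFFFF', start + 1);
    return (start >= 0 && end > start) ? text.substring(start + 1, end) : null;
}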

But I’m concerned that these “conflict” with each other, and embedding the 
behavior in major programming languages doesn’t smell to me like “internal” 
use.  Clearly if I wanted to use that library in a CLDR-aware app, there is a 
potential risk for a conflict.

In the CLDR case, there *IS* a special relationship with Unicode, and perhaps 
it would be warranted to explicitly encode character(s) with the necessary 
meaning(s) to handle edge-case collation scenarios.

-Shawn


Re: Corrigendum #9

2014-06-02 Thread Philippe Verdy
I would rather expect: treat them as you like; there will never be any
guarantee of interoperability, everyone is allowed to use them as they want
and even change that use at any time. The behavior is not defined in TUS,
and users cannot expect that TUS will define this behavior.
There's no clear solution about what to do if you encounter them in data
supposed to be text. For me they are not text, so the whole data could be
rejected, or the text remaining after some filtering may be falsely
interpreted. You need an external specification outside TUS.

I certainly do not consider non-characters like unassigned valid code
points, where applications are strongly encouraged not to apply any kind of
filter if they want to remain compatible with evolutions of the standard
that may assign them (the best you can do with unassigned code points is
treat them as symbols, with the minimal properties defined in the standard
(notably Bidi properties according to their range, where this direction is
defined in some ranges), or treat them as symbols with weak direction, even
if applications still cannot render them (renderers will find a way to show
them, generally using a .notdef glyph like empty boxes)). Normalizers will
also not mix them (the default combining class should be 0).

Only applications that want to ensure that the text conforms to a specific
version of the standard are allowed to filter out or signal as errors the
presence of unassigned code points. But all applications can do that kind
of thing with non-characters (or any code unit whose value falls outside the
valid range of a defined UTF). This is an important difference:
non-characters are not like unassigned code points, they are assigned to be
considered invalid and filterable by design by any Unicode-conforming
process for handling text.





2014-06-02 23:53 GMT+02:00 Markus Scherer markus@gmail.com:

 On Mon, Jun 2, 2014 at 1:32 PM, David Starner prosfil...@gmail.com
 wrote:

 I would especially discourage any web browser from handling
 these; they're noncharacters used for unknown purposes that are
 undisplayable and if used carelessly for their stated purpose, can
 probably trigger serious bugs in some lamebrained utility.


 I don't expect handling these in web browsers and lamebrained utilities.
 I expect treat like unassigned code points.

 markus



Re: Corrigendum #9

2014-06-02 Thread Philippe Verdy
"Reserved for CLDR" would be wrong in TUS; you have reached a borderline
where you are no longer handling plain text (a stream of scalar values
assigned to code points), but binary data via a binary interface outside
TUS (handling streams of collation elements, whose representation is not
even bound to the ICU implementation of CLDR for its own definitions and
syntax for its tailorings).

CLDR data defines its own interface and protocol, it can reserve these code
points only for itself but not in TUS and no other conforming plain-text
application is expected to accept these reservations, so they can
**freely** mark them in error, replace them, or filter them out, or
interpret them differently for their own usage, using their own
specification and encapsulation mechanisms and specific **non-plain-text**
data types.

CLDR data transmitted in binary form that would embed these code points is
not transporting plain-text; this is still a binary datatype specific to
this application. CLDR data must remain isolated in its scope without
forcing other protocols or TUS to follow its practices.

Other applications may develop gateway interfaces to convert them to be
interoperable with ICU but they are not required to do that. If they do,
they will follow the ICU specifications, not TUS and this should not
influence their own way to handle what TUS describe as plain-text.

To make it clear, it is preferable to just say in TUS that the behavior of
applications with non-characters is completely undefined and unpredictable
without an external specification, and these entities should not even be
considered as encodable in any standard UTF (which can freely be
replaced by another one without causing any loss or modification of the
represented plain-text). It should be possible to define other (non
standard) conforming UTFs which are completely unable to represent these
non-characters (as well as any unpaired surrogate). A conforming UTF just
needs to be able to represent streams of scalar values in their full
standard range (even without knowing if they are assigned or not or without
knowing their character properties).

You can/should even design CLDR to completely avoid the use of
non-characters: it's up to it to define an encapsulation/escaping mechanism
that clearly separates what is standard plain-text in the content and what
is not and is used for a specific purpose in CLDR or ICU implementations.




2014-06-03 0:07 GMT+02:00 Shawn Steele shawn.ste...@microsoft.com:

  Except that, particularly the max-weight ones, mean that developers can
 be expected to use this as sentinels in code using ICU, which would
 preclude their use for other things?



 Which makes them more like “reserved for use in CLDR” than “noncharacters”?



 -Shawn



 *From:* Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of *Markus
 Scherer
 *Sent:* Monday, June 2, 2014 2:53 PM
 *To:* David Starner
 *Cc:* Unicode Mailing List
 *Subject:* Re: Corrigendum #9



 On Mon, Jun 2, 2014 at 1:32 PM, David Starner prosfil...@gmail.com
 wrote:

  I would especially discourage any web browser from handling

 these; they're noncharacters used for unknown purposes that are
 undisplayable and if used carelessly for their stated purpose, can
 probably trigger serious bugs in some lamebrained utility.



 I don't expect handling these in web browsers and lamebrained utilities.
 I expect treat like unassigned code points.



 markus



Re: Corrigendum #9

2014-06-02 Thread Lisa Moore
I would like to point out to Asmus that this decision was reached 
unanimously at the UTC by Adobe, Apple, Google, IBM, Microsoft, SAP, UC 
Berkeley, and Yahoo!

One might disagree with the decision, but there were no special favors 
involved.

Lisa 

 
 
 I can't shake the suspicion that Corrigendum #9 is not actually 
 solving a general problem, but is a special favor to CLDR as being 
 run by insiders, and in the process muddying the waters for everyone 
else.
 
 A./


Re: Corrigendum #9

2014-06-02 Thread Richard Wordingham
On Mon, 2 Jun 2014 15:09:21 -0700
David Starner prosfil...@gmail.com wrote:

 So certain programs can't use noncharacters internally because some
 people want to interchange them? That doesn't seem like what
 noncharacters should be used for.

Much as I don't like their uninvited use, it is possible to pass them
and other undesirables through most applications by a slight bit of
recoding at the application's boundaries.  Using 99 = (3 + 32 + 64) PUA
characters, one can ape UTF-16 surrogates and encode:

32 × 64 pairs for lone surrogates
 1 × 64 pairs to replace some of the PUA characters
 1 × 35 pairs to replace the rest of the PUA characters
 1 ×  4 pairs for incoming FFFC to FFFF
 1 × 32 pairs for the other BMP non-characters
 1 × 32 pairs for the supplementary plane non-characters.

This then frees up non-characters for the application's use.
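
A sketch of just the lone-surrogate row of that table (the PUA base points,
U+E000 for the 32 "lead" characters and U+E040 for the 64 "trail"
characters, are arbitrary picks for illustration; the other rows, and the
re-encoding of any incoming PUA characters from those blocks, would be
handled the same way):

// Sketch: recode a lone UTF-16 surrogate (U+D800..U+DFFF) as a pair of PUA
// characters, aping the surrogate mechanism. Base points are arbitrary choices.
static final char PUA_LEAD_BASE  = '\uE000';  // 32 lead characters
static final char PUA_TRAIL_BASE = '\uE040';  // 64 trail characters

static String encodeLoneSurrogate(char loneSurrogate) {
    int offset = loneSurrogate - 0xD800;                     // 0..2047
    char lead  = (char) (PUA_LEAD_BASE  + (offset >> 6));    // offset / 64
    char trail = (char) (PUA_TRAIL_BASE + (offset & 0x3F));  // offset % 64
    return new String(new char[] { lead, trail });
}

static char decodeLoneSurrogate(char lead, char trail) {
    return (char) (0xD800 + ((lead - PUA_LEAD_BASE) << 6) + (trail - PUA_TRAIL_BASE));
}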

Richard.
