Re: Unicode Regular Expressions, Surrogate Points and UTF-8
\uD808\uDF45 specifies a sequence of two codepoints.

That is simply incorrect. In Java (and similar environments), \u means a char (a UTF-16 code unit), not a code point. Here is the difference. If you are not used to Java, string.replaceAll(x, y) uses Java's regex to replace the pattern x with the replacement y in string. Backslashes in literals need escaping, so \x needs to be written in literals as \\x.

    String[] tests = {"\\x{12345}", "\\uD808\\uDF45", "\uD808\uDF45", "«.»"};
    String target = "one: «\uD808\uDF45»\t\t"
        + "two: «\uD808\uDF45\uD808\uDF45»\t\t"
        + "lead: «\uD808»\t\t"
        + "trail: «\uDF45»\t\t"
        + "one+: «\uD808\uDF45\uD808»";
    System.out.println("pattern" + "\t→\t" + target + "\n");
    for (String test : tests) {
        System.out.println(test + "\t→\t" + target.replaceAll(test, "§︎"));
    }

Output:

    pattern      → one: «⍅»  two: «⍅⍅»  lead: «?»  trail: «?»  one+: «⍅?»
    \x{12345}    → one: «§︎»  two: «§︎§︎»  lead: «?»  trail: «?»  one+: «§︎?»
    \uD808\uDF45 → one: «§︎»  two: «§︎§︎»  lead: «?»  trail: «?»  one+: «§︎?»
    ⍅            → one: «§︎»  two: «§︎§︎»  lead: «?»  trail: «?»  one+: «§︎?»
    «.»          → one: §︎    two: «⍅⍅»  lead: §︎    trail: §︎    one+: «⍅?»

The target has various combinations of code units, to see what happens. Notice that Java treats a lead+trail pair as a single code point for matching (e.g. by .), but also treats an isolated surrogate char as a single code point (last line of output).

Note that Java's regex additionally allows \x{hex} for specifying a code point explicitly. It also has the syntax \u (in a literal the \ needs escaping) to specify a code unit; that is slightly different from the Java literal preprocessing. Thus the first two calls below are equivalent, and replace { by x. The last two are also equivalent, and fail, because a single { is a broken regex pattern.
    System.out.println("{".replaceAll("\\u007B", "x"));
    System.out.println("{".replaceAll("\\x{7B}", "x"));
    System.out.println("{".replaceAll("\u007B", "x"));
    System.out.println("{".replaceAll("{", "x"));

Mark
https://google.com/+MarkDavis
— Il meglio è l'inimico del bene —

On Sun, Jun 1, 2014 at 7:04 PM, Richard Wordingham <richard.wording...@ntlworld.com> wrote:

On Sun, 1 Jun 2014 08:58:26 -0700, Markus Scherer <markus@gmail.com> wrote:

You misunderstand. In Java, \uD808\uDF45 is the only way to escape a supplementary code point, but as long as you have a surrogate pair, it is treated as a code point in APIs that support them.

Wasn't it obvious that in the following paragraph \uD808\uDF45 was a pattern? Bear in mind that a pattern \uD808 shall not match anything in a well-formed Unicode string.

\uD808\uDF45 specifies a sequence of two codepoints. This sequence can occur in an ill-formed UTF-32 Unicode string, and before Unicode 5.2 could readily be taken to occur in an ill-formed UTF-8 Unicode string. RL1.7 declares that for a regular expression engine, the codepoint sequence U+D808, U+DF45 cannot occur in a UTF-16 Unicode string; instead, the code unit sequence D808 DF45 is the codepoint sequence U+12345 CUNEIFORM SIGN URU TIMES KI.

(It might have been clearer to you if I'd said '8-bit' and '16-bit' instead of UTF-8 and UTF-16. It does make me wonder what you'd call a 16-bit encoding of arbitrary *codepoint* sequences.)

Richard.

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode
Re: Unicode Regular Expressions, Surrogate Points and UTF-8
Your example would have been better explained by just saying that in Java, the regexp represented in source code as \\uD808\\uDF45 means matching two successive 16-bit code units, and \\uD808 or \\uDF45 just matches one. The \\u regex notation (in source code; equivalent to \u in a string at runtime) does not necessarily designate a full code point, unlike the \\x{} and . regexes, which will necessarily match a full code point in the target (even if it's an isolated surrogate).

But there's no way in Java to represent a target string that can store arbitrary sequences of codepoints if you use the String type (this is not specific to Java, but applies as well to any language or runtime library handling streams of 16-bit code units, including C, C++, Python, JavaScript, PHP...). The problem is then not in the way you write regexps, but in the way the target string is encoded: it is not technically possible with 16-bit streams to represent arbitrary sequences of codepoints, only arbitrary sequences of 16-bit code units (even if they aren't valid UTF-16 text). But there's no problem at all in processing valid UTF-16 streams.

Your lead, trail and one+ are representable in Java as arbitrary 16-bit streams, but they do not represent valid Unicode texts. On the opposite side, all your tests[] strings are valid Unicode texts, but their interpretations as regexps are not necessarily valid regexps. Each time you use single backslashes in a Java source-code string, there's no guarantee it will be a valid Unicode text, even though it will compile without problem as a valid 16-bit stream (and the same will be true in other languages).

If you want to represent arbitrary sequences of codepoints in a target text, you cannot use any UTF alone (it may be technically possible with UTF-8 or UTF-32, but such sequences are also invalid for these standard encodings) without using an escaping mechanism, such as the double backslashes in the notation of regexps.
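The distinction discussed above is easy to check directly. Here is a minimal sketch (the class name is mine, behavior as observed on a recent JDK): \x{...} and a regex surrogate-pair escape match by code point, while a lone \uD808 regex escape finds nothing inside a well-formed pair.

```java
import java.util.regex.Pattern;

public class SurrogateRegexDemo {
    // "\uD808\uDF45" is U+12345 written as a surrogate pair in a Java literal.
    static final String CUNEIFORM = "\uD808\uDF45";

    public static void main(String[] args) {
        // \x{12345} designates the code point U+12345 and matches the whole pair.
        System.out.println(Pattern.compile("\\x{12345}").matcher(CUNEIFORM).matches()); // true
        // The regex escapes \uD808\uDF45 are assembled into one code point by the
        // regex compiler, so they also match the pair as a unit.
        System.out.println(Pattern.compile("\\uD808\\uDF45").matcher(CUNEIFORM).matches()); // true
        // A lone \uD808 names a single code unit; it finds nothing inside a
        // well-formed pair, because the matcher advances by code point.
        System.out.println(Pattern.compile("\\uD808").matcher(CUNEIFORM).find()); // false
        // "." matches one code point, so the two-code-unit pair is a single match.
        System.out.println(Pattern.compile(".").matcher(CUNEIFORM).matches()); // true
    }
}
```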
This escaping mechanism is then independent of the actual runtime encoding used to transport the escaped streams within valid Unicode texts. In summary: arbitrary sequences of codepoints in a valid Unicode text require an escaping mechanism on top of the actual text encoding used for storage or transport. (There are other ways to escape arbitrary streams into valid texts, including the U+NNNN notation, Base64, hex or octal representations of UTF-32, Punycode, and many other techniques used to embed binary objects (UUCP, PostScript streams). In HTTP, a few of them are supported as standard transport syntaxes. Terminal protocols (like the VT220 and related, or Videotex) have long used escape sequences, plus controls like SI/SO encapsulation and isolated DLE escapes, for transporting 8-bit data over a 7-bit stream.)

Technically, Java strings at runtime are not plain text (unless they are checked on input, and the validity conditions are not broken by some text transforms, such as extraction of substrings at arbitrary absolute positions, or error recovery with resynchronization after a failure or missing data; these errors are likely to occur because we have no guarantee that validity is preserved during the exchange by matching preconditions and postconditions). They are binary objects (and this is also true for C/C++ standard strings, PHP strings, and the content transported by an HTTP session or a terminal protocol, which also defines its own escaping mechanism where needed). If you develop a general-purpose library, in any language, that can be reused in arbitrary code, you cannot assume on input that all preconditions are satisfied, so you need to check the input.
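One such escaping mechanism can be sketched in a few lines of Java (a hypothetical illustration, not a standard API): serialize the raw 16-bit code units to bytes yourself, without UTF-16 validation, then Base64 the bytes, so that even an ill-formed code unit stream travels inside plain ASCII text.

```java
import java.util.Base64;

public class EscapeDemo {
    public static void main(String[] args) {
        // A 16-bit code unit stream that is NOT valid UTF-16
        // (a pair for U+12345 followed by an isolated lead surrogate).
        char[] units = {0xD808, 0xDF45, 0xD808};
        // Serialize each code unit as two big-endian bytes, with no validation.
        byte[] raw = new byte[units.length * 2];
        for (int i = 0; i < units.length; i++) {
            raw[2 * i] = (byte) (units[i] >> 8);
            raw[2 * i + 1] = (byte) units[i];
        }
        // Base64 yields pure ASCII, safe to embed in any valid Unicode text.
        String escaped = Base64.getEncoder().encodeToString(raw);
        System.out.println(escaped); // 2AjfRdgI
    }
}
```

Decoding reverses the two steps; the point is that validity checking then applies only to the carrier text, not to the escaped payload.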
And you also have to be careful about the design of your library to make sure that it respects the postconditions. (Some library APIs are technically unsafe, notably those extracting substrings, and almost all block I/O using fixed-size buffers, such as file I/O in filesystems that do not discriminate between text files and binary files, so that text files would use variable-length buffers broken only at codepoint positions and not at arbitrary code unit positions.) As far as I know, there does not exist any filesystem that enforces codepoint positions, unless it uses non-space-efficient encodings with code units wider than 20 bits. Storage devices are optimized for code units whose size is a power of 2 in bytes, so you would finally use only files whose size in bytes is a multiple of 4, with all random-access file positions also a multiple of 4 bytes. You could also use 24-bit storage code units with blocks limited to sectors of 256 bytes, the extra byte used only as a filler or as a length indicator in that sector (the remaining 255 bytes would store 85 arbitrary 24-bit code units), but you would still need to check the value range of these code units if you want to restrict them to the U+0000..U+10FFFF codepoint space, unless your application code handles all of the extra code units like non-character code points.
Re: Corrigendum #9
It seems that the broadening of the term "interchange" in this corrigendum to mean almost any type of processing imaginable, below, is what caused the trouble. This is the decision that would need to be reconsidered if the real intent of noncharacters is to be expressed. I suspect everyone can agree on the edge cases: that noncharacters are harmless in internal processing, but probably should not appear in random text shipped around on the web.

This is necessary for the effective use of noncharacters, because anytime a Unicode string crosses an API boundary, it is in effect being interchanged. Furthermore, for distributed software, it is often very difficult to determine what constitutes an internal versus an external context for any particular software process.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell
Re: Corrigendum #9
On Mon, Jun 2, 2014 at 8:27 AM, Doug Ewell <d...@ewellic.org> wrote:

I suspect everyone can agree on the edge cases, that noncharacters are harmless in internal processing, but probably should not appear in random text shipped around on the web.

Right, in principle. However, it should be ok to include noncharacters in CLDR data files for processing by CLDR implementations, and it should be possible to edit and diff and version-control and web-view those files, etc.

It seems that trying to define "interchange" and "public" in ways that satisfy everyone will not be successful. The FAQ already gives some examples of where noncharacters might be used, should be preserved, or could be stripped, starting with "Q: Are noncharacters intended for interchange?" http://www.unicode.org/faq/private_use.html#nonchar6

In my view, those Q/A pairs explain noncharacters quite well. If there are further examples of where noncharacters might be used, should be preserved, or could be stripped, and that would be particularly useful to add to the examples already there, then we could add them.

markus
Re: Corrigendum #9
The problem is where to draw the line. In today's world, what's an app? You may have a cooperating system of apps, where it is perfectly reasonable to interchange sentinel values (for example).

I agree with Markus; I think the FAQ is pretty clear. (And if not, that's where we should make it clearer.)

Mark
https://google.com/+MarkDavis
— Il meglio è l'inimico del bene —

On Mon, Jun 2, 2014 at 6:02 PM, Shawn Steele <shawn.ste...@microsoft.com> wrote:

I also think that the verbiage swung too far the other way. Sure, I might need to save or transmit a file to talk to myself later, but apps should be strongly discouraged from using these for interchange with other apps. Interchange bugs are why nearly any news web site ends up with at least a few articles with mangled apostrophes or whatever (because of encoding differences). Should authors' tools or feeds or databases or whatever start emitting non-characters from internal use, then we're going to have ugly leaks into text "everywhere". So I'd prefer to see text that better permitted interchange with other components of an application's internal system or partner system, yet discouraged use for interchange with "foreign" apps.

-Shawn
RE: Corrigendum #9
That's exactly what I think should be clarified. A cooperating system of apps should likely use some other markup; however, if they want to use a noncharacter to say "OK to insert ad here" (or whatever), that's up to them. I fear that the current wording says "Because you might have a cooperating system of apps that all agree some sentinel means 'OK to insert ad here', you may as well emit it all the time, just in case some other app happens to use the same sentinel".

The "problem" is now that previously these characters were illegal, so my application didn't have to explicitly remove them when importing external stuff, because they weren't allowed to be there. With the wording of the corrigendum, the onus is on every app importing data to filter out these code points, because they are "suddenly" legal in foreign data streams. That is a breaking change for applications, and, worse, it isn't in the control of the applications that take advantage of the newly laxer wording, but rather of all the other applications on the planet, which may have been stable for years.

My interpretation of "interchanged" was "interchanged outside of a system that understood your private use of the noncharacters". I can see where that may not have been everyone's interpretation, and maybe it should be updated. My interpretation of what you're saying below is "sentinel values with a private meaning can be exchanged between apps", which is what the PUA is for. I don't mind at all if the definition is loosened somewhat, but if we're turning them into PUA characters, we should just turn them into PUA characters.

-Shawn

From: mark.edward.da...@gmail.com [mailto:mark.edward.da...@gmail.com] On Behalf Of Mark Davis ☕️
Sent: Monday, June 2, 2014 9:08 AM
To: Shawn Steele
Cc: Markus Scherer; Doug Ewell; Unicode Mailing List
Subject: Re: Corrigendum #9

The problem is where to draw the line. In today's world, what's an app?
You may have a cooperating system of apps, where it is perfectly reasonable to interchange sentinel values (for example).

I agree with Markus; I think the FAQ is pretty clear. (And if not, that's where we should make it clearer.)

Mark
https://google.com/+MarkDavis
— Il meglio è l'inimico del bene —
Re: Corrigendum #9
On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele <shawn.ste...@microsoft.com> wrote:

The "problem" is now that previously these characters were illegal

The problem was that we were inconsistent in the standard and related material about just what the status of these things was.

Mark
https://google.com/+MarkDavis
— Il meglio è l'inimico del bene —
RE: Corrigendum #9
Shawn Steele (Shawn dot Steele at microsoft dot com) wrote:

So I'd prefer to see text that better permitted interchange with other components of an application's internal system or partner system, yet discouraged use for interchange with "foreign" apps.

If any wording is to be revised, while we're at it, I'd also like to see a reaffirmation of the proper relationship between private-use characters and noncharacters. I still hear arguments that private-use characters are to be avoided in public interchange at all costs, as if lack of knowledge of the private agreement, or conflicting interpretations, will cause some kind of major security breach. At the same time, the Corrigendum seems to imply that noncharacters in public interchange are no big deal. That seems upside-down.

Mark Davis (mark at macchiato dot com) replied:

The problem is where to draw the line. In today's world, what's an app? You may have a cooperating system of apps, where it is perfectly reasonable to interchange sentinel values (for example).

Correct. Most people wouldn't consider a cooperating system like that quite the same as true public interchange, like throwing this ��� into a message on a public mailing list. Since the Corrigendum deals with recommendations rather than hard requirements, SHOULDs rather than MUSTs, it doesn't seem that a bright line is really needed.

I agree with Markus; I think the FAQ is pretty clear. (And if not, that's where we should make it clearer.)

But the formal wording of the standard should reflect that clarity, right?

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell
Re: Corrigendum #9
On 6/2/2014 9:27 AM, Mark Davis ☕️ wrote:

On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele <shawn.ste...@microsoft.com> wrote:

The "problem" is now that previously these characters were illegal

The problem was that we were inconsistent in standard and related material about just what the status was for these things.

And threw the baby out to fix it.

A./
Re: Corrigendum #9
On 6/2/2014 9:08 AM, Mark Davis ☕️ wrote:

The problem is where to draw the line. In today's world, what's an app? You may have a cooperating system of apps, where it is perfectly reasonable to interchange sentinel values (for example).

The way to draw the line is to insist on there being an agreement between sender and ultimate receiver, and a pass-through agreement (if you will) for any intermediate stage, so that the coast is clear. What defines an "implementation" in this scenario is the existence of the agreement. What got us into trouble is that the negative case (pass-through) was not well defined, and led to people assuming that they had to filter any incoming noncharacters.

Because noncharacters can have any interpretation (not limited to interpretations as characters), it is much riskier to send them out oblivious of whether the intended recipient is part of the same agreement on their interpretation as the sender. In that sense, they are not mere PUA code points.

The other aspect of their original design was to allow code points that recipients were free not to honor or preserve if they were not part of the agreement (and hadn't made an explicit or implicit pass-through agreement). Otherwise, if anyone expects them to be preserved, no application like Word would be free to use these for purely internal use. Word thus would not be a tool to handle CLDR data, which may be disappointing to some, but should be fine.

A./
RE: Corrigendum #9
I agree with Markus; I think the FAQ is pretty clear. (And if not, that's where we should make it clearer.)

But the formal wording of the standard should reflect that clarity, right?

I don't tend to read the FAQ :)
RE: Corrigendum #9
I wrote, sort of:

Correct. Most people wouldn't consider a cooperating system like that quite the same as true public interchange, like throwing this ��� into a message on a public mailing list.

Oh, look. My mail system converted those nice noncharacters into U+FFFD. Was that compliant? Did I deserve what I got? Are those two different questions?

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell
Re: Corrigendum #9
I disagree with that characterization, of course. The recommendation for libraries and low-level tools to pass them through rather than screw with them makes them usable. The recommendation to check for noncharacters from unknown sources and fix them was good advice then, and is good advice now. Any app where input of noncharacters causes security problems or crashes is, and was, not a very good app.

Mark
https://google.com/+MarkDavis
— Il meglio è l'inimico del bene —

On Mon, Jun 2, 2014 at 6:37 PM, Asmus Freytag <asm...@ix.netcom.com> wrote:

On 6/2/2014 9:27 AM, Mark Davis ☕️ wrote:

The problem was that we were inconsistent in standard and related material about just what the status was for these things.

And threw the baby out to fix it.

A./
Re: Corrigendum #9
On 6/2/2014 9:38 AM, Shawn Steele wrote:

I agree with Markus; I think the FAQ is pretty clear. (And if not, that's where we should make it clearer.) But the formal wording of the standard should reflect that clarity, right? I don't tend to read the FAQ :)

FAQs are useful, but they are not binding. They are even less binding than general explanation in the text of the core specification, which itself doesn't rise to the level of conformance clauses and definitions...

Doug's unease about the upside-down nature of the wording regarding PUA and noncharacters is something that should be addressed in revised text in the core specification.

A./
RE: Corrigendum #9
To further my understanding, can someone provide examples of how these are used in actual practice? I can't think of any offhand; the closest I get is something like the old escape characters used to get a dot-matrix printer to shift modes, or old word-processor internal formatting sequences.
RE: Corrigendum #9
Oh, look. My mail system converted those nice noncharacters into U+FFFD. Was that compliant? Did I deserve what I got? Are those two different questions?

I think I just got spaces.
Re: Corrigendum #9
On Mon, Jun 2, 2014 at 10:00 AM, Shawn Steele <shawn.ste...@microsoft.com> wrote:

To further my understanding, can someone provide examples of how these are used in actual practice?

CLDR collation data defines special contraction mappings that start with a noncharacter; see http://www.unicode.org/reports/tr35/tr35-collation.html#CJK_Index_Markers

In CLDR 23 and before (when we were still using the XML collation syntax), these were raw noncharacters in the .xml files. As I said earlier: it should be ok to include noncharacters in CLDR data files for processing by CLDR implementations, and it should be possible to edit and diff and version-control and web-view those files, etc.

markus
RE: Corrigendum #9
Hmm, I find that disconcerting. I'd prefer a real Unicode character with special weights if that concept is needed. And I guess that goes a long way toward explaining the interchange problem, since clearly the code editors are going to need to handle these ☹

From: Markus Scherer [mailto:markus@gmail.com]
Sent: Monday, June 2, 2014 10:17 AM
To: Shawn Steele
Cc: Asmus Freytag; Doug Ewell; Mark Davis ☕️; Unicode Mailing List
Subject: Re: Corrigendum #9

On Mon, Jun 2, 2014 at 10:00 AM, Shawn Steele <shawn.ste...@microsoft.com> wrote:

To further my understanding, can someone provide examples of how these are used in actual practice?

CLDR collation data defines special contraction mappings that start with a noncharacter; see http://www.unicode.org/reports/tr35/tr35-collation.html#CJK_Index_Markers

markus
Re: Corrigendum #9
On Mon, 2 Jun 2014 10:17:04 -0700, Markus Scherer <markus@gmail.com> wrote:

CLDR collation data defines special contraction mappings that start with a noncharacter; see http://www.unicode.org/reports/tr35/tr35-collation.html#CJK_Index_Markers In CLDR 23 and before (when we were still using XML collation syntax), these were raw noncharacters in the .xml files. As I said earlier: it should be ok to include noncharacters in CLDR data files for processing by CLDR implementations, and it should be possible to edit and diff and version-control and web-view those files etc.

They come as a nasty shock when someone thinks XML files are marked-up text files. I'm still surprised that the published human-readable form of CLDR files should contain automatically applied non-Unicode copyright claims.

Richard.
Re: Unicode Regular Expressions, Surrogate Points and UTF-8
On Mon, 2 Jun 2014 11:29:09 +0200, Mark Davis ☕️ <m...@macchiato.com> wrote:

\uD808\uDF45 specifies a sequence of two codepoints.

That is simply incorrect.

The above is in the sample notation of UTS #18 Version 17 Section 1.1. From what I can make out, the corresponding Java notation would be \x{D808}\x{DF45}. I don't *know* what \x{D808} and \x{DF45} match in Java, or whether they are even acceptable. The only thing UTS #18 RL1.7 permits them to match in Java is lone surrogates, but I don't know if Java complies.

All UTS #18 says for sure about regular expressions matching code units is that they don't satisfy RL1.1, though Section 1.7 appears to ban them when it says, "A fundamental requirement is that Unicode text be interpreted semantically by code point, not code units." Perhaps it's a fundamental requirement of something other than UTS #18. I thought matching parts of characters in terms of their canonical equivalences was awkward enough, without having the additional option of matching some of the code units!

Richard.
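For what it's worth, the question is easy to probe empirically. The sketch below (my own test, not normative; results as observed on a recent JDK) suggests Java does accept \x{D808} and matches it only against an isolated surrogate, not against the first half of a well-formed pair, which is what RL1.7 would require:

```java
import java.util.regex.Pattern;

public class LoneSurrogateDemo {
    public static void main(String[] args) {
        // Pattern accepts \x{D808}: any value up to 0x10FFFF is a valid
        // code point for \x{...}, including surrogate code points.
        Pattern lead = Pattern.compile("\\x{D808}");
        // It finds an isolated lead surrogate, which can only occur in an
        // ill-formed 16-bit string...
        System.out.println(lead.matcher("\uD808").find());
        // ...but finds nothing inside the well-formed pair for U+12345,
        // because the matcher advances code point by code point.
        System.out.println(lead.matcher("\uD808\uDF45").find());
    }
}
```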
Re: Corrigendum #9
On Mon, Jun 2, 2014 at 8:48 AM, Markus Scherer <markus@gmail.com> wrote:

Right, in principle. However, it should be ok to include noncharacters in CLDR data files for processing by CLDR implementations, and it should be possible to edit and diff and version-control and web-view those files etc.

Why? It seems you're changing the rules so some Unicode guys can get oversmart in using Unicode in their systems. You could do the same thing everyone else does and use special tags or symbols you have to escape. I would especially discourage any web browser from handling these; they're noncharacters used for unknown purposes that are undisplayable and, if used carelessly for their stated purpose, can probably trigger serious bugs in some lamebrained utility.

--
Kie ekzistas vivo, ekzistas espero.
Re: Corrigendum #9
On Mon, Jun 2, 2014 at 1:32 PM, David Starner <prosfil...@gmail.com> wrote:

I would especially discourage any web browser from handling these; they're noncharacters used for unknown purposes that are undisplayable and if used carelessly for their stated purpose, can probably trigger serious bugs in some lamebrained utility.

I don't expect handling of these in web browsers and lamebrained utilities. I expect them to be treated like unassigned code points.

markus
Re: Corrigendum #9
On 6/2/2014 2:53 PM, Markus Scherer wrote:

I don't expect handling of these in web browsers and lamebrained utilities. I expect them to be treated like unassigned code points.

I can't shake the suspicion that Corrigendum #9 is not actually solving a general problem, but is a special favor to CLDR as being run by insiders, and in the process muddying the waters for everyone else.

A./
Re: Corrigendum #9
On Mon, Jun 2, 2014 at 2:53 PM, Markus Scherer <markus@gmail.com> wrote:

I don't expect handling of these in web browsers and lamebrained utilities. I expect them to be treated like unassigned code points.

So certain programs can't use noncharacters internally because some people want to interchange them? That doesn't seem like what noncharacters should be used for. Unix utilities shouldn't usually go to the trouble of messing with them; limiting the number of changes needed for Unicode was the whole point of UTF-8. Any program transferring them across the Internet as text should filter them, IMO; either some lamebrained utility will open a security hole by using them and not filtering first, or something will filter them after security checks have been done, or something. Unless it's a completely trusted system, text files with these characters should be treated with extreme prejudice by the first thing that receives them over the net.

--
Kie ekzistas vivo, ekzistas espero.
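A filter of the kind described above is short to write. Here is a sketch in Java (class and method names are mine, chosen for illustration) that replaces each of the 66 noncharacters with U+FFFD, the way a cautious receiver at a trust boundary might sanitize incoming text:

```java
public class NoncharacterFilter {
    // True for the 66 noncharacters: U+FDD0..U+FDEF, plus the last two
    // code points of every plane (U+nFFFE and U+nFFFF).
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Walks the string by code point (so surrogate pairs stay intact)
    // and replaces each noncharacter with U+FFFD REPLACEMENT CHARACTER.
    static String sanitize(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(isNoncharacter(cp) ? 0xFFFD : cp);
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(sanitize("ok\uFDD0ok")); // U+FDD0 becomes U+FFFD
    }
}
```

Whether to replace, drop, or reject outright is a policy choice; the detection predicate is the same either way.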
RE: Corrigendum #9
> I can't shake the suspicion that Corrigendum #9 is not actually solving a general problem, but is a special favor to CLDR as being run by insiders, and in the process muddying the waters for everyone else.

I think we could generalize to other scenarios, so it isn't necessarily an insider scenario. For example, I could have a string manipulation library that used FFFE to indicate the beginning of an identifier for a localizable sentence, terminated by . Any system using FFFEid1234 would likely expect to be able to read the tokens in their favorite code editor. But I'm concerned that these "conflict" with each other, and embedding the behavior in major programming languages doesn't smell to me like "internal" use. Clearly if I wanted to use that library in a CLDR-aware app, there is a potential risk of a conflict.

In the CLDR case, there *IS* a special relationship with Unicode, and perhaps it would be warranted to explicitly encode character(s) with the necessary meaning(s) to handle edge-case collation scenarios.

-Shawn
Re: Corrigendum #9
I would rather expect: treat them as you like; there will never be any warranty of interoperability, and everyone is allowed to use them as they want and even to change that use at any time. The behavior is not defined in TUS, and users cannot expect that TUS will ever define it. There is no clear solution for what to do if you encounter them in data that is supposed to be text. For me they are not text, so the whole data could be rejected, or the text remaining after some filtering may be falsely interpreted. You need an external specification outside TUS.

I certainly do not consider noncharacters to be like unassigned valid code points, for which applications are strongly encouraged not to apply any kind of filter if they want to remain compatible with future versions of the standard that may assign them. The best you can do with unassigned code points is treat them as symbols, with the minimal properties defined in the standard (notably the Bidi properties implied by their range, where a direction is defined for some ranges, or otherwise a weak direction), even if applications cannot render them (renderers will find a way to show them, generally using a .notdef glyph such as an empty box). Normalizers will also not reorder them (their default combining class is 0). Only applications that want to ensure that text conforms to a specific version of the standard may filter out unassigned code points or signal them as errors.

But all applications may do that kind of thing with noncharacters (or with any code unit whose value falls outside the valid range of a defined UTF). This is the important difference: noncharacters are not like unassigned code points; they are assigned to be considered invalid, and filterable by design, by any Unicode-conforming process for handling text.
2014-06-02 23:53 GMT+02:00 Markus Scherer markus@gmail.com:

> On Mon, Jun 2, 2014 at 1:32 PM, David Starner prosfil...@gmail.com wrote:
>> I would especially discourage any web browser from handling these; they're noncharacters used for unknown purposes that are undisplayable and if used carelessly for their stated purpose, can probably trigger serious bugs in some lamebrained utility.
>
> I don't expect handling these in web browsers and lamebrained utilities. I expect treat like unassigned code points.
>
> markus
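Philippe's distinction is visible directly in Java: `Character.getType` reports noncharacters and unassigned code points identically (both as general category Cn), so a program that wants to treat them differently needs its own noncharacter test. A minimal illustration (the class name is mine; U+0378 is chosen as a code point that was unassigned as of Unicode 6.x):

```java
// Sketch: Java cannot distinguish noncharacters from unassigned code points
// via Character.getType -- both report UNASSIGNED -- so the distinction
// Philippe draws requires an explicit range check.
public class Classify {
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    public static void main(String[] args) {
        int nonchar = 0xFFFE;     // a BMP noncharacter
        int unassigned = 0x0378;  // unassigned (as of Unicode 6.x)
        // Both look the same to getType:
        System.out.println(Character.getType(nonchar) == Character.UNASSIGNED);    // true
        System.out.println(Character.getType(unassigned) == Character.UNASSIGNED); // true
        // Only the explicit test separates them:
        System.out.println(isNoncharacter(nonchar));    // true
        System.out.println(isNoncharacter(unassigned)); // false
    }
}
```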
Re: Corrigendum #9
"Reserved for CLDR" would be wrong in TUS: you have reached a borderline where you are no longer handling plain text (a stream of scalar values assigned to code points), but binary data via a binary interface outside TUS (streams of collation elements, whose representation is not even bound to the ICU implementation of CLDR, with its own definitions and syntax for tailorings). CLDR data defines its own interface and protocol; it can reserve these code points for itself, but not in TUS, and no other conforming plain-text application is expected to honor these reservations, so they can **freely** mark them as errors, replace them, filter them out, or interpret them differently for their own usage, using their own specification and encapsulation mechanisms and specific **non-plain-text** data types.

CLDR data transmitted in binary form that embeds these code points is not transporting plain text; it is still a binary data type specific to this application. CLDR data must remain isolated in its scope, without forcing other protocols or TUS to follow its practices. Other applications may develop gateway interfaces to be interoperable with ICU, but they are not required to do so. If they do, they will follow the ICU specifications, not TUS, and this should not influence their own way of handling what TUS describes as plain text.

To make it clear: it would be preferable to just say in TUS that the behavior of applications with noncharacters is completely undefined and unpredictable without an external specification, and that these entities should not even be considered encodable in any standard UTF (any of which can be freely replaced by another without causing any loss or modification of the represented plain text). It should be possible to define other (non-standard) conforming UTFs which are completely unable to represent these noncharacters (as well as any unpaired surrogate).
A conforming UTF just needs to be able to represent streams of scalar values across their full standard range (even without knowing whether they are assigned, and without knowing their character properties). You could, and should, even design CLDR to completely avoid the use of noncharacters: it is up to CLDR to define an encapsulation/escaping mechanism that clearly separates what is standard plain text in the content from what is not, and is used for specific purposes in CLDR or ICU implementations.

2014-06-03 0:07 GMT+02:00 Shawn Steele shawn.ste...@microsoft.com:

> Except that, particularly the max-weight ones, mean that developers can be expected to use these as sentinels in code using ICU, which would preclude their use for other things? Which makes them more like "reserved for use in CLDR" than "noncharacters"?
>
> -Shawn
>
> *From:* Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of* Markus Scherer
> *Sent:* Monday, June 2, 2014 2:53 PM
> *To:* David Starner
> *Cc:* Unicode Mailing List
> *Subject:* Re: Corrigendum #9
>
> On Mon, Jun 2, 2014 at 1:32 PM, David Starner prosfil...@gmail.com wrote:
>> I would especially discourage any web browser from handling these; they're noncharacters used for unknown purposes that are undisplayable and if used carelessly for their stated purpose, can probably trigger serious bugs in some lamebrained utility.
>
> I don't expect handling these in web browsers and lamebrained utilities. I expect treat like unassigned code points.
>
> markus
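The encapsulation/escaping mechanism suggested at the top of this message could look like the sketch below. The `\x{...}` text form is borrowed from Java's regex code-point escape purely for illustration; the class name and the choice to escape the backslash itself (so the mapping stays reversible) are my own assumptions, not CLDR syntax:

```java
// Sketch: serialize text so that noncharacters never appear literally;
// they are written as a visible ASCII escape instead. The backslash is
// escaped too, so escaped output can be unambiguously decoded.
public class Escape {
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isNoncharacter(cp) || cp == '\\') {
                sb.append(String.format("\\x{%X}", cp)); // visible ASCII form
            } else {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a\uFFFEb")); // prints a\x{FFFE}b
    }
}
```

The escaped form is plain text containing only ordinary characters, so it can travel through any conforming UTF untouched; only an application that knows this convention reconstitutes the noncharacters.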
Re: Corrigendum #9
I would like to point out to Asmus that this decision was reached unanimously at the UTC by Adobe, Apple, Google, IBM, Microsoft, SAP, UC Berkeley, and Yahoo! One might disagree with the decision, but there were no special favors involved.

Lisa

> I can't shake the suspicion that Corrigendum #9 is not actually solving a general problem, but is a special favor to CLDR as being run by insiders, and in the process muddying the waters for everyone else.
>
> A./
Re: Corrigendum #9
On Mon, 2 Jun 2014 15:09:21 -0700 David Starner prosfil...@gmail.com wrote:

> So certain programs can't use noncharacters internally because some people want to interchange them? That doesn't seem like what noncharacters should be used for.

Much as I don't like their uninvited use, it is possible to pass them and other undesirables through most applications by a slight bit of recoding at the application's boundaries. Using 99 = (3 + 32 + 64) PUA characters, one can ape UTF-16 surrogates and encode:

- 32 × 64 pairs for lone surrogates
- 1 × 64 pairs to replace some of the PUA characters
- 1 × 35 pairs to replace the rest of the PUA characters
- 1 × 4 pairs for incoming FFFC to
- 1 × 32 pairs for the other BMP non-characters
- 1 × 32 pairs for the supplementary plane non-characters

This then frees up the non-characters for the application's use.

Richard.
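A greatly reduced version of Richard's boundary recoding can be sketched as follows. This toy handles only the 66 noncharacters (not lone surrogates, and without the full PUA self-escaping bookkeeping his 99-character allocation provides), and the lead/trail base points U+E000 and U+E100 are arbitrary illustrative choices, not his allocation:

```java
// Toy sketch of boundary recoding: on input, each noncharacter is replaced
// by a PUA lead+trail pair; on output, the pair is turned back into the
// original noncharacter. A real version must also escape input text that
// already uses these PUA ranges, which this toy deliberately omits.
public class BoundaryRecode {
    static final int LEAD = 0xE000;   // leads: U+E000..U+E001 (illustrative)
    static final int TRAIL = 0xE100;  // trails: U+E100..U+E13F, 64 of them (illustrative)

    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Give each of the 66 noncharacters a small index: U+FDD0..U+FDEF -> 0..31,
    // then plane-end noncharacters U+FFFE/U+FFFF, U+1FFFE/U+1FFFF, ... -> 32..65.
    static int index(int cp) {
        if (cp <= 0xFDEF) return cp - 0xFDD0;
        return 32 + (cp >>> 16) * 2 + (cp & 1);
    }

    static int codePoint(int index) {
        if (index < 32) return 0xFDD0 + index;
        int j = index - 32;
        return (j / 2) * 0x10000 + 0xFFFE + (j % 2);
    }

    static String encode(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isNoncharacter(cp)) {
                int k = index(cp);
                sb.append((char) (LEAD + k / 64)).append((char) (TRAIL + k % 64));
            } else {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    static String decode(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= LEAD && c <= LEAD + 1 && i + 1 < s.length()) {
                sb.appendCodePoint(codePoint((c - LEAD) * 64 + (s.charAt(i + 1) - TRAIL)));
                i++; // consume the trail character as well
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

After `encode`, the stream contains no noncharacters at all, so the application is free to use them internally; `decode` restores the original stream at the outgoing boundary.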