Your example would have been better explained by just saying that in Java, the regex written in source code as "\\uD808\\uDF45" matches two successive 16-bit code units, while "\\uD808" or "\\uDF45" alone matches just one.
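The distinction is easy to demonstrate directly (a minimal sketch; the class name is mine, and the expected outputs match the behavior shown in your own output table):

```java
public class SurrogateUnits {
    public static void main(String[] args) {
        // A Java String is a sequence of 16-bit code units, not code points:
        // both literals below compile, but only the first is well-formed UTF-16.
        String pair = "\uD808\uDF45"; // surrogate pair = U+12345
        String lone = "\uD808";       // isolated lead surrogate: not valid text

        // "." and "\\x{NNNN}" match a full code point, so the pair is one match...
        System.out.println(pair.replaceAll(".", "#"));          // prints "#"
        System.out.println(pair.replaceAll("\\x{12345}", "#")); // prints "#"

        // ...and Java's regex also lets "." match an isolated surrogate
        // as if it were a single code point.
        System.out.println(lone.replaceAll(".", "#"));          // prints "#"
    }
}
```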
The "\\uNNNN" regex notation (in source code; equivalent to "\uNNNN" in the string at runtime) does not necessarily designate a full code point, unlike the "\\x{NNNN}" and "." regexes, which will always match a full code point in the target (even an isolated surrogate).

But there is no way in Java to represent a target string that can store arbitrary sequences of code points if you use the String type (this is not specific to Java; it applies equally to any language or runtime library handling streams of 16-bit code units, including C, C++, Python, JavaScript, PHP...). The problem is then not in the way you write regexes, but in the way the target string is encoded: it is not technically possible with 16-bit streams to represent arbitrary sequences of code points, only arbitrary sequences of 16-bit code units (even when they are not valid UTF-16 text). There is no problem at all, however, in processing valid UTF-16 streams.

Your "lead", "trail" and "one+" are representable in Java as arbitrary 16-bit streams, but they do not represent valid Unicode texts. Conversely, all your "tests[]" strings are valid Unicode texts, but their interpretations as regexes are not necessarily valid regexes. Each time you use single backslashes in a Java source-code string, there is no guarantee that it will be a valid Unicode text, even though it will compile without problem as a valid 16-bit stream (and the same is true in other languages).

If you want to represent arbitrary sequences of code points in a target text, you cannot use any UTF alone (it may be technically possible with UTF-8 or UTF-32, but such sequences are also invalid under those standard encodings) without an escaping mechanism such as the double backslashes used in the notation of regexes. This escaping mechanism is then independent of the actual runtime encoding used to transport the escaped streams within valid Unicode texts.
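As a sketch of such an escaping mechanism (all names here are hypothetical, not a standard API): pack the arbitrary 32-bit units, including isolated surrogates or values beyond U+10FFFF that no UTF could carry, and Base64-encode them, so the result is plain ASCII text that survives any Unicode transport:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Base64;

public class CodePointEscape {
    // Escape an arbitrary int[] of "code points" -- even values that are
    // not valid Unicode scalar values -- into a valid ASCII string.
    public static String escape(int[] units) {
        ByteBuffer buf = ByteBuffer.allocate(units.length * 4);
        for (int u : units) buf.putInt(u); // raw 32-bit storage units
        return Base64.getEncoder().encodeToString(buf.array());
    }

    public static int[] unescape(String text) {
        ByteBuffer buf = ByteBuffer.wrap(Base64.getDecoder().decode(text));
        int[] units = new int[buf.remaining() / 4];
        for (int i = 0; i < units.length; i++) units[i] = buf.getInt();
        return units;
    }

    public static void main(String[] args) {
        // An isolated surrogate and a value beyond U+10FFFF: not valid text,
        // but trivially transportable once escaped.
        int[] arbitrary = {0xD808, 0x12345, 0x110000};
        String wire = escape(arbitrary);
        System.out.println(wire + " -> "
                + Arrays.equals(unescape(wire), arbitrary));
    }
}
```

The escaped form is ordinary text, so it can be stored or transported by anything that handles valid Unicode, regardless of the runtime encoding.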
In summary: arbitrary sequences of code points in a valid Unicode text require an escaping mechanism on top of the actual text encoding used for storage or transport. There are other ways to escape arbitrary streams into valid texts, including the U+NNNNNN notation, Base64, hex or octal representations of UTF-32, or Punycode, and many other techniques used to embed binary objects (uuencode, PostScript streams). In HTTP, a few of them are supported as standard "transport syntaxes". Terminal protocols (like the VT220 and related, or Videotex) have long used escape sequences (plus controls like SI/SO encapsulation and isolated DLE escapes for transporting 8-bit data over a 7-bit stream).

Technically, Java strings at runtime are not plain text but binary objects, unless they are checked on input and the validity conditions are not broken by some text transforms (such as extracting substrings at arbitrary absolute positions, or error recovery with resynchronization after a failure or missing data; such errors are likely because we have no guarantee that validity is preserved during the exchange by matching preconditions and postconditions). The same is true of C/C++ standard strings, PHP strings, or the content transported by an HTTP session or a terminal protocol (each defining its own escaping mechanism where needed). If you develop a general-purpose library, in any language, that can be reused by arbitrary code, you cannot assume on input that all preconditions are satisfied, so you need to check the input.
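Such an input check is cheap to write for 16-bit streams; a minimal sketch (class and method names are mine) that verifies a Java String is well-formed UTF-16, i.e. that every surrogate code unit is part of a lead+trail pair:

```java
public class Utf16Validator {
    // Returns true iff the sequence is well-formed UTF-16: every lead
    // surrogate is immediately followed by a trail surrogate, and no
    // trail surrogate appears unpaired.
    public static boolean isWellFormed(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length()
                        || !Character.isLowSurrogate(s.charAt(i + 1)))
                    return false; // truncated or broken pair
                i++; // skip the trail unit of the valid pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // unpaired trail surrogate
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("\uD808\uDF45")); // true  (U+12345)
        System.out.println(isWellFormed("\uD808"));       // false (lone lead)
        System.out.println(isWellFormed("\uDF45"));       // false (lone trail)
    }
}
```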
And you also have to be careful about the design of your library to make sure that it respects the postconditions. Some library APIs are technically unsafe, notably those extracting substrings, and almost all blocking I/O using fixed-size buffers, such as file I/O in filesystems that do not discriminate between text files and binary files (a text-aware filesystem would use variable-length buffers broken only at code point positions, never at arbitrary code unit positions).

As far as I know, no filesystem enforces code point positions, unless it uses space-inefficient encodings with code units wider than 20 bits. Storage devices are optimized for code units whose size is a power of 2 in bytes, so you would end up using only files whose size in bytes is a multiple of 4, with all random-access file positions also multiples of 4 bytes. You could also use 24-bit storage code units with blocks limited to 256-byte sectors, the spare byte used only as a filler or as a length indicator for that sector (255 bytes would store 85 arbitrary 24-bit code units, but you would still need to check the value range of these code units if you want to restrict them to the U+0000..U+10FFFF code point space, unless your application code handles all the extra code units like non-character code points).

However, the filesystem could perform this check when writing text files, so that it could mark files that are valid Unicode strings by updating some metadata (that metadata could be stored in the spare byte of the first 256-byte sector, or in a separate indexing database of compatible files). You could do the same for in-memory temporary buffers by keeping this information around (this would allow discriminating very early those streams that are not plain text, without having to process them to the end). A relational database could also perform this check when creating indexes on table keys, so that it would know it can only return valid text for any subselection in a table.
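The 24-bit sector layout described above can be sketched as follows (a hypothetical packing, not an existing filesystem API; all names are mine): 85 three-byte units fit in the first 255 bytes of a 256-byte sector, and the spare 256th byte records the unit count.

```java
public class Sector24 {
    static final int UNITS_PER_SECTOR = 255 / 3; // = 85 three-byte units

    // Pack up to 85 arbitrary 24-bit units into one 256-byte sector;
    // the spare last byte stores how many units are actually used.
    public static byte[] pack(int[] units) {
        if (units.length > UNITS_PER_SECTOR)
            throw new IllegalArgumentException("does not fit in one sector");
        byte[] sector = new byte[256];
        for (int i = 0; i < units.length; i++) {
            int u = units[i];
            if (u < 0 || u > 0xFFFFFF)
                throw new IllegalArgumentException("not a 24-bit unit: " + u);
            sector[3 * i]     = (byte) (u >>> 16);
            sector[3 * i + 1] = (byte) (u >>> 8);
            sector[3 * i + 2] = (byte) u;
        }
        sector[255] = (byte) units.length; // spare byte = length indicator
        return sector;
    }

    public static int[] unpack(byte[] sector) {
        int n = sector[255] & 0xFF;
        int[] units = new int[n];
        for (int i = 0; i < n; i++) {
            units[i] = ((sector[3 * i] & 0xFF) << 16)
                     | ((sector[3 * i + 1] & 0xFF) << 8)
                     |  (sector[3 * i + 2] & 0xFF);
        }
        return units;
    }
}
```

Note that the range check against U+0000..U+10FFFF remains a separate step: a 24-bit unit can still hold values above 0x10FFFF or surrogate code points.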
In all cases, this will still be more efficient for small volumes than using any transport syntax or escaping; but for moderate and large volumes, the transport syntax or escaping mechanism generally wins in terms of performance by minimizing the volume of I/O, notably if this I/O is very costly compared to data in working memory or even in CPU data caches (but only if this data is reaccessed very frequently in that cache).

However, if your I/O is very slow compared to the CPU and the data volume is sufficiently large, it is always better to use UTF-32 in memory but store that data in a compressed stream (you can safely use a generic binary compressor, which will generally work better with UTF-32 than with UTF-8 or UTF-16 for moderate and large volumes). Simple algorithms like deflate, or even basic Huffman coding, will deliver excellent throughput at very modest CPU cost compared to the huge I/O costs that such compression saves. And in that case, even a range check on input will have insignificant cost, thanks to branch prediction: your code uses the fast, pipelined path for valid texts, and the slower, non-pipelined path only for exceptions and error handling (most processors and CPU caches now have excellent branch predictors, even if compilers can help them).

In summary, range checking should no longer be only a debugging option (even for production code, even for internal libraries): its cost rapidly becomes insignificant for large data volumes. Just design your algorithms to minimize the number of state variables and minimize table lookups in order to improve data locality, rather than minimizing the local data size of just one or a few code points or code units: select runtime code units that fit in a single CPU register (almost all processors today have registers at least 32 bits wide, so UTF-32 is not a problem for local processing in native code).
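A minimal sketch of that combination (range-checked UTF-32 in memory, a generic deflate stream for storage; class and method names are mine, using only the standard java.util.zip API):

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class Utf32Stream {
    // A 32-bit unit is a Unicode scalar value iff it lies in U+0000..U+10FFFF
    // and is not a surrogate code point (U+D800..U+DFFF).
    public static boolean isScalarValue(int u) {
        return u >= 0 && u <= 0x10FFFF && (u < 0xD800 || u > 0xDFFF);
    }

    // Range-check on the way in, then compress the UTF-32 units for storage.
    public static byte[] deflate(int[] codePoints) {
        ByteBuffer buf = ByteBuffer.allocate(codePoints.length * 4);
        for (int cp : codePoints) {
            if (!isScalarValue(cp))
                throw new IllegalArgumentException("not a scalar value: " + cp);
            buf.putInt(cp);
        }
        Deflater def = new Deflater();
        def.setInput(buf.array());
        def.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] tmp = new byte[4096];
        while (!def.finished()) out.write(tmp, 0, def.deflate(tmp));
        def.end();
        return out.toByteArray();
    }

    public static int[] inflate(byte[] compressed) {
        try {
            Inflater inf = new Inflater();
            inf.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] tmp = new byte[4096];
            while (!inf.finished()) out.write(tmp, 0, inf.inflate(tmp));
            inf.end();
            ByteBuffer buf = ByteBuffer.wrap(out.toByteArray());
            int[] cps = new int[buf.remaining() / 4];
            for (int i = 0; i < cps.length; i++) cps[i] = buf.getInt();
            return cps;
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt deflate stream", e);
        }
    }

    public static void main(String[] args) {
        int[] text = {0x12345, 0x0041, 0x10FFFF};
        byte[] stored = deflate(text); // what would go to slow I/O
        System.out.println(java.util.Arrays.equals(inflate(stored), text));
    }
}
```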
2014-06-02 11:29 GMT+02:00 Mark Davis ☕️ <m...@macchiato.com>:

> > \uD808\uDF45 specifies a sequence of two codepoints.
>
> That is simply incorrect.
>
> In Java (and similar environments), \uXXXX means a char (a UTF16 code
> unit), not a code point. Here is the difference. If you are not used to
> Java, string.replaceAll(x,y) uses Java's regex to replace the pattern x
> with the replacement y in string. Backslashes in literals need escaping, so
> \x needs to be written in literals as \\x.
>
>     String[] tests = {"\\x{12345}", "\\uD808\\uDF45", "\uD808\uDF45", "«.»"};
>     String target =
>         "one: «\uD808\uDF45»\t\t" +
>         "two: «\uD808\uDF45\uD808\uDF45»\t\t" +
>         "lead: «\uD808»\t\t" +
>         "trail: «\uDF45»\t\t" +
>         "one+: «\uD808\uDF45\uD808»";
>     System.out.println("pattern" + "\t→\t" + target + "\n");
>     for (String test : tests) {
>         System.out.println(test + "\t→\t" + target.replaceAll(test, "§︎"));
>     }
>
> *Output:*
>
>     pattern → one: «𒍅» two: «𒍅𒍅» lead: «?» trail: «?» one+: «𒍅?»
>
>     \x{12345} → one: «§︎» two: «§︎§︎» lead: «?» trail: «?» one+: «§︎?»
>     \uD808\uDF45 → one: «§︎» two: «§︎§︎» lead: «?» trail: «?» one+: «§︎?»
>     𒍅 → one: «§︎» two: «§︎§︎» lead: «?» trail: «?» one+: «§︎?»
>     «.» → one: §︎ two: «𒍅𒍅» lead: §︎ trail: §︎ one+: «𒍅?»
>
> The target has various combinations of code units, to see what happens.
> Notice that Java treats a pair of lead+trail as a single code point for
> matching (eg .), but also an isolated surrogate char as a single code point
> (last line of output). Note that Java's regex in addition allows \x{hex}
> for specifying a code point explicitly. It also has the syntax \uXXXX (in a
> literal the \ needs escaping) to specify a code unit; that is slightly
> different than the Java preprocessing. Thus the first two are equivalent,
> and replace "{" by "x". The last two are also equivalent—and fail—because a
> single "{" is a broken regex pattern.
>     System.out.println("{".replaceAll("\\u007B", "x"));
>     System.out.println("{".replaceAll("\\x{7B}", "x"));
>
>     System.out.println("{".replaceAll("\u007B", "x"));
>     System.out.println("{".replaceAll("{", "x"));
>
> Mark <https://google.com/+MarkDavis>
>
> *— Il meglio è l’inimico del bene —*
>
> On Sun, Jun 1, 2014 at 7:04 PM, Richard Wordingham <
> richard.wording...@ntlworld.com> wrote:
>
>> On Sun, 1 Jun 2014 08:58:26 -0700
>> Markus Scherer <markus....@gmail.com> wrote:
>>
>> > You misunderstand. In Java, \uD808\uDF45 is the only way to escape a
>> > supplementary code point, but as long as you have a surrogate pair,
>> > it is treated as a code point in APIs that support them.
>>
>> Wasn't obvious that in the following paragraph \uD808\uDF45 was a
>> pattern?
>>
>> "Bear in mind that a pattern \uD808 shall not match anything in a
>> well-formed Unicode string. \uD808\uDF45 specifies a sequence of two
>> codepoints. This sequence can occur in an ill-formed UTF-32 Unicode
>> string and before Unicode 5.2 could readily be taken to occur in an
>> ill-formed UTF-8 Unicode string. RL1.7 declares that for a regular
>> expression engine, the codepoint sequence <U+D808, U+DF45> cannot
>> occur in a UTF-16 Unicode string; instead, the code unit sequence <D808
>> DF45> is the codepoint sequence <U+12345 CUNEIFORM SIGN URU TIMES
>> KI>."
>>
>> (It might have been clearer to you if I'd said '8-bit' and '16-bit'
>> instead of UTF-8 and UTF-16. It does make me wonder what you'd call a
>> 16-bit encoding of arbitrary *codepoint* sequences.)
>>
>> Richard.
_______________________________________________
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode