Tom Christiansen <tchr...@perl.com> added the comment:

"Terry J. Reedy" <rep...@bugs.python.org> wrote
   on Fri, 12 Aug 2011 22:21:59 -0000:

> Does the regex module handle these particular issues better?

No, it currently does not. One would have to ask Matthew directly, but I believe it is because he was trying to stay compatible with re, sometimes apparently even when that means being bug-compatible. I have brought it to his attention, though, and at last report he was pondering the matter.

In contrast to how Python behaves on narrow builds, Java's Pattern class is quite adamant about dealing in logical code points alone, even though Java uses UTF-16 for its internal representation of strings. Doing otherwise not only runs afoul of tr18, it is senseless.

A dot is one Unicode code point, no matter whether you have 8-bit code units, 16-bit code units, or 32-bit code units. Similarly, character classes and their negations match only entire code points, never pieces of one.

ICU's regexes work the same way the normal Java Pattern library does. So too do Perl, Ruby, and Go. Python is really the odd man out here.

Almost. One interesting counterexample is the vim editor. It has dot match a complete grapheme no matter how many code points that requires, because we're dealing with user-visible characters now, not programmer-visible ones.

It is an unreasonable burden to make the programmer deal with the fine-grained details of low-level serialization schemes instead of working at least(*) at the code point level, which is the minimum for getting real work done.

(*Note that tr18 admits that accessing text at the code point level meets only programmer expectations, not those of the user. To meet user expectations, much more elaborate patterns must be constructed than would be needed if logical groups of coarser granularity than code points were supported.)

Python should not be subject to changing its behavior from one build to the next. This astonishing narrow-vs-wide build behavior makes it virtually impossible to write portable code that works on arbitrary Unicode text.
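To make the code unit vs. code point distinction concrete, here is a minimal sketch. It assumes a Python whose str is indexed by code point (3.4 or later, so that re.fullmatch is available), and it only simulates what a 16-bit build would see by encoding to UTF-16. The helper name utf16_units is hypothetical and used purely for illustration.

```python
import re


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints.

    Hypothetical helper: it approximates the units a build with
    16-bit code units would index by.
    """
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


# U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane.
clef = "\U0001D11E"

# Logical view: number of code points in the string.
code_points = len(clef)

# 16-bit view: the same character is a surrogate pair.
code_units = len(utf16_units(clef))

# At the code point level, a single dot consumes the whole character.
dot_matches = re.fullmatch(r".", clef) is not None

print(code_points, code_units, dot_matches)
```

Under the stated assumption, the first count is what a code point API reports, while the second is what indexing by 16-bit code units would expose to the programmer.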
You cannot even know whether you need to match one dot or two to get a single code point, and the same goes for character indexing and the like. Even identifiers come into play. Surrogates should be utterly nonexistent/invisible at this, the normal level of operation.

An API that minimally but uniformly deals with logical code points and nothing finer in granularity is the only way to go here. Please trust me on this one. Graphemes (tr18 Level 2) and collation elements (Level 3) will someday build on that, but one must first support code points properly. That's why it's a Level 1 requirement.

--tom

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________