[issue12568] Add functions to get the width in columns of a character

2012-03-10 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: I would encourage you to look at the Perl CPAN module Unicode::LineBreak, which fully implements tr11. It includes Unicode::GCString, a class that has a columns() method to determine the print columns. This is very fancy in the case of Asian

[issue12568] Add functions to get the width in columns of a character

2012-03-10 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Martin v. Löwis mar...@v.loewis.de added the comment: Martin, I think you meant to write if w == 'A':. Some very common characters have ambiguous widths though (e.g. the Greek alphabet), so you can't just raise an error for them
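The point about ambiguous widths can be illustrated with a minimal sketch built on the standard unicodedata module. The helper names char_columns and text_columns, and the ambiguous_wide flag, are hypothetical and not part of any proposal in this thread. The sketch treats East Asian Width classes W and F as two columns, lets the caller decide about class A, and counts characters with a nonzero combining class as zero columns. It is not a full tr11 implementation.

```python
import unicodedata

def char_columns(ch, ambiguous_wide=False):
    """Rough column count for one code point (hypothetical helper)."""
    if unicodedata.combining(ch):
        # Nonzero combining class: drawn over the preceding character.
        return 0
    width = unicodedata.east_asian_width(ch)
    if width in ('W', 'F'):
        return 2
    if width == 'A':
        # Ambiguous (e.g. Greek letters): the caller decides, no error raised.
        return 2 if ambiguous_wide else 1
    return 1

def text_columns(text, ambiguous_wide=False):
    """Sum the per-character estimates for a whole string."""
    return sum(char_columns(ch, ambiguous_wide) for ch in text)
```

Passing ambiguous_wide=True approximates a legacy East Asian terminal; the default approximates a Western one.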

[issue12568] Add functions to get the width in columns of a character

2012-03-10 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Martin v. Löwis mar...@v.loewis.de added the comment: I would encourage you to look at the Perl CPAN module Unicode::LineBreak, which fully implements tr11. Thanks for the pointer! If you'd like, I can show you a program that uses

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-10-20 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Yes, it looks good. Thank you very much. -tom -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12753

[issue12568] Add functions to get the width in columns of a character

2011-10-14 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Martin v. Löwis mar...@v.loewis.de added the comment: I think the WideCharToMultibyte approach is just incorrect. I'm -1 on using wcswidth, though. Like you, I too seriously question using wcswidth() for this at all: The wcswidth

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-10-09 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Ezio Melotti rep...@bugs.python.org wrote on Sun, 09 Oct 2011 13:21:00 -: Here is a new patch that stores the names of aliases and named sequences in the Private Use Area. Looks good! Thanks! --tom -- title: \N

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-10-03 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Ezio Melotti rep...@bugs.python.org wrote on Mon, 03 Oct 2011 04:15:51 -: But it still has to happen at compile time, of course, so I don't know what you could do in Python. Is there any way to change how the compiler behaves even

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-10-02 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Ezio Melotti rep...@bugs.python.org wrote on Sun, 02 Oct 2011 06:46:26 -: Actually Python doesn't seem to support \N{LINE FEED (LF)}, most likely because that's a Unicode 1 name, and nowadays these codepoints are simply marked

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-10-02 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Really? White space makes things harder to read? I thought Pythonistas believed the opposite of that. I was surprised at that too ;-). One person's opinion in a specific context. Don't generalize. The example I initially showed

[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-10-01 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Martin v. Löwis rep...@bugs.python.org wrote on Sat, 01 Oct 2011 10:59:48 -: * Word characters are Alphabetic + Mn+Mc+Me + Nd + Pc. Where did you get that definition from? UTS#18 defines word_character, which is Alphabetic + U

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-10-01 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Perl does not provide the old 1.0 names at all. We don't have a Unicode 1.0 legacy to support, which makes this cleaner. However, we do provide for the names of the C0 and C1 Control Codes, because apart from Unicode 1.0, they don't

[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-09-30 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Martin v. Löwis mar...@v.loewis.de added the comment: Split S into words. Change the first letter in a word to upper-case, Except that I think you actually mean that the first letter is changed into titlecase not uppercase. One might

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-09-30 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Ezio Melotti ezio.melo...@gmail.com added the comment: Leaving named sequences for unicodedata.lookup() only (and not for \N{}) makes sense. There are certainly advantages to that strategy: you don't have to deal with [\N{sequence}] issues

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-19 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Ezio Melotti rep...@bugs.python.org wrote on Mon, 19 Sep 2011 11:11:48 -: We could also look at what other languages do and/or ask the Unicode Consortium. I will look at what Java does a bit later on this morning, which

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-19 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: No good news on the Java front. They do all kinds of things wrong. For example, they allow intermixed CESU-8 and UTF-8 in a real UTF-8 input stream, which is illegal. There's more they do wrong, including in their documentation, but I

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-19 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: It appears that I'm right about surrogates, but wrong about noncharacters. I'm seeking a clarification there. --tom

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-18 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Terry J. Reedy rep...@bugs.python.org wrote on Thu, 08 Sep 2011 18:56:11 -: On 9/8/2011 4:32 AM, Ezio Melotti wrote: So to summarize a bit, there are different possible levels of strictness: 1) all the possible encodable values

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-07 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Ezio Melotti rep...@bugs.python.org wrote on Sat, 03 Sep 2011 00:28:03 -: Ezio Melotti ezio.melo...@gmail.com added the comment: Or they are still called UTF-8 but used in combination with different error handlers, like

[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-29 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Antoine Pitrou rep...@bugs.python.org wrote on Mon, 29 Aug 2011 13:21:06 -: It's not only typographically speaking, it's really a spelling error, even in hand-written text :-) Sure, and so too is omitting an accent mark

[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-28 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Antoine Pitrou rep...@bugs.python.org wrote on Sat, 27 Aug 2011 20:04:56 -: Neither am I. Even in old-style English with ae and oe, one wrote ÆGYPT and ÆSIR all caps but Ægypt and Æsir in titlecase, not *Aegypt or *Aesir. Similarly

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-27 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Guido van Rossum rep...@bugs.python.org wrote on Sat, 27 Aug 2011 03:26:21 -: To me, making (default) iteration deviate from indexing is anathema. So long as there's a way to iterate through a string some other way than by code

[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-27 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Guido van Rossum rep...@bugs.python.org wrote on Fri, 26 Aug 2011 21:11:24 -: Would this also affect .islower() and friends? SHORT VERSION: (7 lines) I don't believe so, but the relationship between lower() and islower

[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-27 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Guido van Rossum rep...@bugs.python.org wrote on Sat, 27 Aug 2011 16:15:33 -: Although personally I don't have much of an intuition for what titlecase means (and why it's important), perhaps because I'm not familiar with any

[issue12735] request full Unicode collation support in std python library

2011-08-26 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Sounds like a fair feature request for Python 3.3, as long as the intention is that users must import some module from the standard library and use functions defined in that module. The operations and methods defined for str instances

[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-08-26 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Guido van Rossum rep...@bugs.python.org wrote on Fri, 26 Aug 2011 21:16:57 -: Yeah, this should be fixed in 3.3 and probably backported to 3.2 and 2.7. (There is already no guarantee that len(s) == len(s.title()), right?) Well

[issue12735] request full Unicode collation support in std python library

2011-08-26 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Raymond Hettinger raymond.hettin...@gmail.com added the comment: I would like to be involved in the design of the API for a UCA module and its routines for loading Unicode Collation Element Tables (not making the mistake of using global

[issue12735] request full Unicode collation support in std python library

2011-08-26 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: I should probably mention the importance in the design of a UCA module of being able to specify which UCA version number you want it to behave like in case you plan to override some of the DUCET entries. That way if you run under a later UCA

[issue12735] request full Unicode collation support in std python library

2011-08-26 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Guido van Rossum rep...@bugs.python.org wrote on Fri, 26 Aug 2011 21:55:03 -: I know I sound like NIH, but I'm always reluctant to add a big 3rd party lib like ICU to the permanent dependencies of all future Python distros

[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-26 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Guido van Rossum rep...@bugs.python.org wrote on Fri, 26 Aug 2011 21:11:24 -: Guido van Rossum gu...@python.org added the comment: I presume this applies to builtin str methods like .lower(), right? I think it is a good thing

[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-26 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Here’s my casing test suite; I thought I sent it in but the mux file here isn’t the full thing. It does several things, including letting you run it with regex vs re. It also checks for the islower, etc functions. It has both simple

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-08-19 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Terry J. Reedy rep...@bugs.python.org wrote on Fri, 19 Aug 2011 22:50:58 -: My current opinion is that adding the aliases might be done in current releases. It certainly would serve any user who does not know to misspell

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-08-19 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Matthew Barnett rep...@bugs.python.org wrote on Fri, 19 Aug 2011 23:36:45 -: For the Line_Break property, one of the possible values is Inseparable, with 2 permitted aliases, the shorter IN (which is reasonable) and Inseperable

[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Ezio Melotti ezio.melo...@gmail.com added the comment: I think the 4 macros: #define _Py_UNICODE_ISSURROGATE #define _Py_UNICODE_ISHIGHSURROGATE #define _Py_UNICODE_ISLOWSURROGATE #define _Py_UNICODE_JOIN_SURROGATES are quite
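The four macros under discussion correspond to standard UTF-16 arithmetic. As a rough Python sketch of what they compute (the function names below are hypothetical stand-ins, not the C macros themselves, and the argument checking is an added assumption):

```python
def is_high_surrogate(unit):
    """True for a lead (high) surrogate code unit, U+D800..U+DBFF."""
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    """True for a trail (low) surrogate code unit, U+DC00..U+DFFF."""
    return 0xDC00 <= unit <= 0xDFFF

def is_surrogate(unit):
    """True for any surrogate code unit."""
    return 0xD800 <= unit <= 0xDFFF

def join_surrogates(high, low):
    """Combine a high/low surrogate pair into one code point."""
    if not (is_high_surrogate(high) and is_low_surrogate(low)):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

The two extremes of the pair space map to U+10000 and U+10FFFF, which is why the pair mechanism covers exactly the sixteen supplementary planes.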

[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: I now see there are lots of good things in the BOM FAQ that have come up lately regarding surrogates and other illegal characters, and about what can go in data streams. I quote a few of these from http://unicode.org/faq/utf_bom.html below

[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Antoine Pitrou rep...@bugs.python.org wrote on Tue, 16 Aug 2011 09:18:46 -: I think the 4 macros: #define _Py_UNICODE_ISSURROGATE #define _Py_UNICODE_ISHIGHSURROGATE #define _Py_UNICODE_ISLOWSURROGATE #define

[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Ezio Melotti rep...@bugs.python.org wrote on Tue, 16 Aug 2011 09:23:50 -: All the other macros[0] follow the same convention, e.g. Py_UNICODE_ISLOWER and Py_UNICODE_TOLOWER. I agree that keeping the words separate makes them more

[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Marc-Andre Lemburg rep...@bugs.python.org wrote on Tue, 16 Aug 2011 12:11:22 -: The reasoning behind e.g. ISSURROGATE is that those names originate from and are consistent with the already existing ISLOWER/ISUPPER/ISTITLE macros

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-15 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Ezio Melotti rep...@bugs.python.org wrote on Mon, 15 Aug 2011 04:56:55 -: Another thing I noticed is that (at least on wide builds) surrogate pairs are not joined on the fly: >>> p = '\ud800\udc00' >>> len(p) 2 >>> p.encode('utf-16').decode
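The behaviour described above can be sketched for a modern interpreter. This assumes Python 3.4 or later, where the UTF-16 codecs reject lone surrogates unless the surrogatepass error handler is given; the 2011 interpreters in this thread behaved differently. The helper name join_via_utf16 is hypothetical.

```python
def join_via_utf16(text):
    """Round-trip through UTF-16 so adjacent surrogates become one code point."""
    # 'surrogatepass' lets the encoder emit the surrogate code units as-is
    # (an assumption that holds for Python 3.4+).
    raw = text.encode('utf-16-le', 'surrogatepass')
    return raw.decode('utf-16-le')

pair = '\ud800\udc00'          # two separate surrogate code points
joined = join_via_utf16(pair)  # the decoder pairs them up
```

A string literal written as two escapes stays two code points; only the encode/decode round trip joins them.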

[issue12746] normalization is affected by unicode width

2011-08-15 Thread Tom Christiansen
Changes by Tom Christiansen tchr...@perl.com: nosy: +tchrist

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-08-15 Thread Tom Christiansen
New submission from Tom Christiansen tchr...@perl.com: Unicode character names share a common namespace with formal aliases and with named sequences, but Python recognizes only the original name. That means not everything in the namespace is accessible from Python. (If this is construed
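For reference, a small sketch of how the three kinds of names in that namespace can be looked up today. It assumes Python 3.3 or later, where unicodedata.lookup() is documented to accept name aliases and named sequences; the specific names used are taken from the Unicode name alias and named sequence data, not from this message.

```python
import unicodedata

# Original character name.
oi = unicodedata.lookup('LATIN SMALL LETTER OI')

# Formal alias (a correction) for the same code point, U+01A3.
gha = unicodedata.lookup('LATIN SMALL LETTER GHA')

# Named sequence: the result is a string of two code points.
r_tilde = unicodedata.lookup('LATIN SMALL LETTER R WITH TILDE')
```

Named sequences are available through lookup() only; using one inside a \N{...} escape is still a syntax error, as far as I know.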

[issue12730] Python's casemapping functions are untrustworthy due to narrow/wide build issues

2011-08-15 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Terry J. Reedy tjre...@udel.edu added the comment: My Firefox is already set at utf-8. More likely a font limitation. I will look again after installing one of the fonts Tom suggested. Symbola is best for exotic glyphs, especially astral

[issue12730] Python's casemapping functions are untrustworthy due to narrow/wide build issues

2011-08-15 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Terry J. Reedy tjre...@udel.edu added the comment: You are right, FF switched on me without notice. Bad FF. Thank you! What I now see makes much more sense. [ мЯхШщЯл, мЯхШщЯл, ДЯхШщЯл, ДЇНЀСЇГ ], and I now know to check on other

[issue12734] Request for property support in Python re lib

2011-08-15 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Sorry I didn't include a test case. Hope this makes up for it. If not, please tell me how to write better test cases. :( Yeah ok, so I'm a bit persnickety or even unorthodox about my vertical alignment, but it really helps to make what

[issue12734] Request for property support in Python re lib

2011-08-15 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Oh whoops, that was the wrong ticket. Shall I reupload to the right number?

[issue12730] Python's casemapping functions are untrustworthy due to narrow/wide build issues

2011-08-15 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Terry J. Reedy tjre...@udel.edu added the comment: Adding Symbola filled in the symbols and emoticons lines. The gothic chars are still missing even with Alfios. That's too bad, as the Gothic paternoster is kinda cute. :) Hm, I wonder where

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-08-15 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Here’s the right test file for the right ticket. Added file: http://bugs.python.org/file22903/nametests.py

[issue12734] Request for property support in Python re lib

2011-08-15 Thread Tom Christiansen
Changes by Tom Christiansen tchr...@perl.com: Removed file: http://bugs.python.org/file22902/nametests.py

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Ezio Melotti ezio.melo...@gmail.com added the comment: It is simply a design error to pretend that the number of characters is the number of code units instead of code points. A terrible and ugly one, but it does not mean you are UCS-2

[issue12749] lib re cannot match non-BMP ranges (all versions, all builds)

2011-08-14 Thread Tom Christiansen
New submission from Tom Christiansen tchr...@perl.com: On neither narrow nor wide builds does this UTF8-encoded bit run without raising an exception: if re.search("[𝒜-𝒵]", "𝒞", re.UNICODE): print("match 1 passed") else: print("match 2 failed") The best you can possibly do
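A runnable form of that test is sketched here, assuming the pattern was meant to be the non-BMP range U+1D49C through U+1D4B5 (MATHEMATICAL SCRIPT CAPITAL A through Z) and the subject U+1D49E (MATHEMATICAL SCRIPT CAPITAL C). Escapes are used so the code does not depend on the file's encoding. The helper name in_script_capitals is hypothetical. On Python 3.3 and later, which no longer has narrow builds, this should match.

```python
import re

# Character class spanning a range of supplementary-plane code points.
SCRIPT_CAPITALS = re.compile('[\U0001D49C-\U0001D4B5]')

def in_script_capitals(text):
    """True if text contains a code point in the range above."""
    return SCRIPT_CAPITALS.search(text) is not None

if in_script_capitals('\U0001D49E'):
    print("match passed")
else:
    print("match failed")
```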

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Ezio Melotti rep...@bugs.python.org wrote on Sun, 14 Aug 2011 07:15:09 -: Unicode says you can't put surrogates or noncharacters in a UTF-anything stream. It's a bug to do so and pretend it's a UTF-whatever. The UTF-8 codec

[issue12749] lib re cannot match non-BMP ranges (all versions, all builds)

2011-08-14 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Ezio Melotti ezio.melo...@gmail.com added the comment: On wide 3.2 it passes too, so the failure is limited to narrow builds (are you sure that it fails on wide builds for you?). You're right: my wide build is not Python3, just Python2

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Ezio Melotti rep...@bugs.python.org wrote on Sun, 14 Aug 2011 07:15:09 -: For example I don't think removing the 0x10FFFF upper limit is going to happen -- even if it might be useful for other things. I agree entirely. That's why

[issue12749] lib re cannot match non-BMP ranges (all versions, all builds)

2011-08-14 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Ezio Melotti rep...@bugs.python.org wrote on Sun, 14 Aug 2011 17:15:52 -: You're right: my wide build is not Python3, just Python2. And is it failing? Here the tests pass on the wide builds, on both Python 2 and 3. Perhaps I am

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Ezio Melotti rep...@bugs.python.org wrote on Sun, 14 Aug 2011 17:46:55 -: I'm a bit confused on this. You no longer fix bugs in Python 2? We do, but it's unlikely that we will introduce major changes in behavior. Even if we had

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Terry J. Reedy rep...@bugs.python.org wrote on Mon, 15 Aug 2011 00:26:53 -: PS: The OSCON link in msg142036 currently gives me 404 not found Sorry, I wrote http://training.perl.com/OSCON/index.html but meant http

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: I wrote: Python's narrow builds are, in a sense, 'between' UCS-2 and UTF-16. So I'm finding. Perhaps that's why I keep getting confused. I do have a pretty firm notion of what UCS-2 and UTF-16 are, and so I get sometimes self

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: David Murray rep...@bugs.python.org wrote: Tom, note that nobody is arguing that what you are requesting is a bad thing :) There looked to be some minor resistance, based on absolute backwards compatibility even if wrong, regarding

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Matthew Barnett rep...@bugs.python.org wrote on Sat, 13 Aug 2011 20:57:40 -: There are occasions when you want to do string slicing, often of the form: pos = my_str.index(x) endpos = my_str.index(y) substring = my_str[pos

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Antoine Pitrou rep...@bugs.python.org wrote on Sat, 13 Aug 2011 21:09:52 -: And/or a lookup table giving the byte offset of, say, every 16th character. It gives you a O(1) lookup with a relatively reasonable constant cost (you have

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Here's why I say that Python uses UTF-16 not UCS-2 on its narrow builds. Perhaps someone could tell me why the Python documentation says it uses UCS-2 on a narrow build. There's a disagreement on that point between several developers

[issue11230] Full unicode import system not in 3.2

2011-08-12 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Whoops, I meant that it appears that Python runs its identifiers through NFC. How that gets along with a filesystem that has quasi-NFD filenames I'm not sure, but it seems like it might be a variant of the case-insensitivity issue

[issue12728] Python re lib fails case insensitive matches on Unicode data

2011-08-12 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Terry J. Reedy tjre...@udel.edu added the comment: I am not sure that everyone will agree that this is a bug, rather than a feature request, or that if a bug, that it should be changed in existing releases and possibly break running

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-12 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Terry J. Reedy rep...@bugs.python.org wrote on Fri, 12 Aug 2011 22:21:59 -: Does the regex module handle these particular issues better? No, it currently does not. One would have to ask Matthew directly, but I believe

[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-08-12 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Terry J. Reedy tjre...@udel.edu added the comment: However desireable it would be, I do not believe there is any claim in the manual that the re module follows the evolving Unicode consortium r.e. stan= My from the hip thought

[issue12732] Can't portably use Unicode in Python identifiers

2011-08-12 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Terry J. Reedy rep...@bugs.python.org wrote on Fri, 12 Aug 2011 23:05:27 -: Ouch! Do the rejected characters qualify as identifier characters as defined in Reference 2.3 Identifiers and keywords? http://docs.python.org/py3k

[issue12728] Python re lib fails case insensitive matches on Unicode data

2011-08-11 Thread Tom Christiansen
New submission from Tom Christiansen tchr...@perl.com: The Python re library is broken in its approach to case-insensitive matches. It erroneously attempts to compare lowercase mappings. This is wrong. You must compare the Unicode casefolds, not the Unicode casemaps. Otherwise you get wrong
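The difference between comparing lowercase mappings and comparing casefolds can be sketched with plain strings, assuming Python 3.3 or later where str.casefold() exists. The helper caseless_equal is hypothetical, and this says nothing about what the re module itself does internally.

```python
def caseless_equal(a, b):
    """Compare two strings by their Unicode casefolds (hypothetical helper)."""
    return a.casefold() == b.casefold()

# Lowercasing leaves these pairs different:
lower_mismatch_sharp_s = 'STRASSE'.lower() != 'stra\u00dfe'.lower()
lower_mismatch_sigma = '\u03c2'.lower() != '\u03a3'.lower()

# Casefolding makes them equal:
fold_match_sharp_s = caseless_equal('STRASSE', 'stra\u00dfe')
fold_match_sigma = caseless_equal('\u03c2', '\u03a3')
```

Here U+00DF is LATIN SMALL LETTER SHARP S, U+03C2 is GREEK SMALL LETTER FINAL SIGMA, and U+03A3 is GREEK CAPITAL LETTER SIGMA.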

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-11 Thread Tom Christiansen
New submission from Tom Christiansen tchr...@perl.com: Python is in flagrant violation of the very most basic premises of Unicode Technical Report #18 on Regular Expressions, which requires that a regex engine support Unicode characters as basic logical units independent of serialization like

[issue12730] Python's casemapping functions are untrustworthy due to narrow/wide build issues

2011-08-11 Thread Tom Christiansen
New submission from Tom Christiansen tchr...@perl.com: You cannot use Python's casemapping functions on Unicode data because they fail on narrow builds. This makes it impossible to write portable code in Python that can cope with full Unicode. I've tried several times to submit this bug

[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-08-11 Thread Tom Christiansen
New submission from Tom Christiansen tchr...@perl.com: You cannot use Python's lib re for handling Unicode regular expressions because it violates the standard set out for the same in UTS#18 on Unicode Regular Expressions in RL1.2a on compatibility properties. What \w is allowed to match

[issue12732] Can't portably use Unicode in Python identifiers

2011-08-11 Thread Tom Christiansen
New submission from Tom Christiansen tchr...@perl.com: You cannot reliably use Unicode in Python identifiers because of the narrow/wide build issue. The enclosed file is fine on wide builds but gets compiler errors on narrow ones during compilation. Go, Ruby, Java, and Perl all handle

[issue12728] Python re lib fails case insensitive matches on Unicode data

2011-08-11 Thread Tom Christiansen
Changes by Tom Christiansen tchr...@perl.com: components: +Regular Expressions -Library (Lib) type: - behavior

[issue12733] Request for grapheme support in Python re lib

2011-08-11 Thread Tom Christiansen
New submission from Tom Christiansen tchr...@perl.com: Without proper grapheme support in the regular expression library, it is impossible to correctly process Unicode. At the very least, one needs the \X escape supported, which is an extended grapheme cluster per UTS#18. This escape

[issue12734] Request for property support in Python re lib

2011-08-11 Thread Tom Christiansen
New submission from Tom Christiansen tchr...@perl.com: Python supports no Unicode properties in its re library, making it unsuitable for work with Unicode. This is therefore a formal request for the Python re library to support Unicode properties. The eleven properties required by Unicode

[issue12735] request full Unicode collation support in std python library

2011-08-11 Thread Tom Christiansen
New submission from Tom Christiansen tchr...@perl.com: Python has no standard support for the Unicode Collation Algorithm as explained in UTS #10. This is a request that a UCA library be added to the standard Python distribution. Collation underlies virtually everything we do with text, not just

[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-11 Thread Tom Christiansen
New submission from Tom Christiansen tchr...@perl.com: Python's casemapping functions only use what Unicode calls simple casemaps. These are only appropriate for functions that operate on single characters alone, not for those that operate on strings. The reason for this is that you get much
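To show what a full (string-level) casemap looks like compared with a simple (one-to-one) one, here is a short sketch. It assumes Python 3.3 or later, where str.upper() applies the full mappings; the interpreters discussed in this report did not. The variable names are illustrative only.

```python
# U+00DF LATIN SMALL LETTER SHARP S has no single-character uppercase,
# so a full casemap expands it to two letters.
street = 'stra\u00dfe'
street_upper = street.upper()

# U+FB03 LATIN SMALL LIGATURE FFI expands to three letters.
ligature_upper = '\ufb03'.upper()

# The result may be longer than the input, which a one-to-one
# (simple) mapping can never produce.
grew = len(street_upper) > len(street)
```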

[issue12737] string.title() is overzealous by upcasing combining marks inappropriately

2011-08-11 Thread Tom Christiansen
New submission from Tom Christiansen tchr...@perl.com: Python's string.title() function claims it titlecases the first letter in each word and lowercases the rest. However, this is not true. It is not using either of the two word detection algorithms that Unicode provides. One allows you
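One way to keep combining marks from starting a new word is sketched below. This is a simplified illustration, not either of the Unicode word-detection algorithms and not a proposed patch. The function name title_keep_marks is hypothetical, and anything that is neither a letter nor a mark (digits, apostrophes, spaces) is treated as a word break, which is cruder than real word segmentation.

```python
import unicodedata

def title_keep_marks(text):
    """Titlecase each word, letting combining marks stay inside the word."""
    out = []
    in_word = False
    for ch in text:
        if unicodedata.category(ch).startswith('M'):
            # Mn/Mc/Me: attach to the current word, leave word state alone.
            out.append(ch)
        elif ch.isalpha():
            out.append(ch.lower() if in_word else ch.title())
            in_word = True
        else:
            out.append(ch)
            in_word = False
    return ''.join(out)
```

With a decomposed input such as 'de\u0301me' (e followed by COMBINING ACUTE ACCENT), the letter after the mark stays lowercase instead of being upcased.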

[issue12734] Request for property support in Python re lib

2011-08-11 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: I've been doing a lot of testing of Matthew's regex library against UTS#18 issues, but only somewhat incidentally testing re. To use regex, one has to accept that certain things will work differently than they work in re, because he is following

[issue12568] Add functions to get the width in columns of a character

2011-08-11 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: I can attest that being able to get the columns of a grapheme cluster is very important for printing, because you need this to do correct linebreaking. There might be something you can steal from http://search.cpan.org/perldoc?Unicode

[issue11230] Full unicode import system not in 3.2

2011-08-11 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: How does this work for modules that have filesystem names different from the one used for import? The issue I'm thinking about is that the Mac HFS+ filesystem keeps its Unicode filenames in (close to) NFD form. That means that a module
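The mismatch being described can be reproduced without a filesystem. This sketch only shows that the same visible name has two different code point sequences and that normalizing both sides makes them compare equal. The helper same_name is hypothetical, and whether the import system does anything like this is exactly the open question in the message.

```python
import unicodedata

def same_name(a, b):
    """Compare two names after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)

nfc_name = 'caf\u00e9'    # precomposed e-acute, as typically typed in source
nfd_name = 'cafe\u0301'   # e + COMBINING ACUTE ACCENT, as an NFD-style filename

raw_equal = nfc_name == nfd_name            # plain comparison
normalized_equal = same_name(nfc_name, nfd_name)
```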

[issue2857] add codec for java modified utf-8

2011-08-11 Thread Tom Christiansen
Tom Christiansen tchr...@perl.com added the comment: Please do not call this utf-8-java. It is called cesu-8 per UTR#26 at: http://unicode.org/reports/tr26/ CESU-8 is *not* a valid Unicode Transformation Format and should not be called UTF-8. It is a real pain in the butt, caused by people who
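To make the difference from UTF-8 concrete, here is a minimal encoder sketch. Python has no built-in cesu-8 codec as far as I know, so the function encode_cesu8 is a hypothetical illustration: it takes the UTF-16 code units of the text and encodes each one separately in UTF-8 style, which is how CESU-8 is defined. It assumes Python 3, where surrogatepass is accepted by the utf-8 codec.

```python
import struct

def encode_cesu8(text):
    """Encode text as CESU-8 (hypothetical helper, not a stdlib codec)."""
    data = text.encode('utf-16-be')
    # Split the UTF-16 stream into 16-bit code units.
    units = struct.unpack('>%dH' % (len(data) // 2), data)
    # Encode every code unit on its own; surrogates become 3 bytes each.
    return b''.join(chr(u).encode('utf-8', 'surrogatepass') for u in units)

deseret = '\U00010400'             # a supplementary-plane character
as_utf8 = deseret.encode('utf-8')  # one 4-byte sequence
as_cesu8 = encode_cesu8(deseret)   # two 3-byte sequences
```

For BMP-only text the two encodings are byte-for-byte identical, which is why mixed streams are easy to produce by accident.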