[issue12741] Add function similar to shutil.move that does not overwrite
David Townshend aquavita...@gmail.com added the comment: A bit of research has shown that the proposed implementation will not work either, so my next suggestion is something along the lines of

    def move2(src, dst):
        try:
            os.link(src, dst)
        except OSError as err:
            # handle error appropriately: raise shutil.Error if dst exists,
            # or use shutil.copy2 if dst is on a different filesystem.
            pass
        else:
            os.unlink(src)

-- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12741 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
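David's sketch can be fleshed out into something runnable. This is a hedged sketch only: the name move2 is his, but the specific errno checks below are my guess at what "handle error appropriately" would mean (refuse on EEXIST, fall back to copy2 on EXDEV):

```python
import errno
import os
import shutil

def move2(src, dst):
    """Move src to dst, refusing to overwrite an existing dst (sketch)."""
    try:
        # os.link is atomic on the same filesystem and fails if dst exists
        os.link(src, dst)
    except OSError as err:
        if err.errno == errno.EEXIST:
            raise shutil.Error("Destination path %r already exists" % dst)
        elif err.errno == errno.EXDEV:
            # dst is on a different filesystem: fall back to copy + remove,
            # still refusing to clobber an existing destination
            if os.path.exists(dst):
                raise shutil.Error("Destination path %r already exists" % dst)
            shutil.copy2(src, dst)
        else:
            raise
    # Either the link or the copy succeeded; remove the source.
    os.unlink(src)
```

Note the race inherent in the EXDEV branch (the exists-check and the copy are not atomic); only the hard-link path gives the no-overwrite guarantee atomically.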
[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Tom Christiansen tchr...@perl.com added the comment:

Ezio Melotti ezio.melo...@gmail.com added the comment:

It is simply a design error to pretend that the number of characters is the number of code units instead of code points. A terrible and ugly one, but it does not mean you are UCS-2.

If you are referring to the value returned by len(unicode_string), it is the number of code units. This is a matter of practicality beats purity. Returning the number of code units is O(1) (num_of_bytes/2). To calculate the number of characters it's instead necessary to scan the whole string looking for surrogates and then count any surrogate pair as 1 character. It was therefore decided that it was not worth slowing down the common case just to be 100% accurate in the uncommon case.

If speed is more important than correctness, I can make any algorithm infinitely fast. Given the choice between correct and quick, I will take correct every single time. Plus your strings are immutable! You know how long they are and they never change. Correctness comes at a negligible cost. It was a bad choice to return the wrong answer.

That said, it would be nice to have an API (maybe in unicodedata or as new str methods?) able to return the number of code units, code points, graphemes, etc., but I'm not sure that it should be the default behavior of len().

Always code points, never code units. I even use a class whose length method returns the grapheme count, because even code points aren't good enough. Yes, of course graphemes have to be counted. Big deal. How would you like it if you said to move three to the left in vim and it *didn't* count each grapheme as one position? Madness. The ugly terrible design error is disgusting and wrong, just as much in Python as in Java, and perhaps more so because of the idiocy of narrow builds even existing.

Again, wide builds use twice as much space as narrow ones, but on the other hand you can have fast and correct behavior with e.g. len().
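To make the code-unit vs. code-point distinction concrete, here is a minimal sketch (the function name is mine, for illustration) of the O(n) scan Ezio describes: counting code points in a sequence of UTF-16 code units by folding each surrogate pair into one:

```python
def count_code_points(utf16_units):
    """Count code points in a sequence of UTF-16 code unit values.

    A high surrogate (0xD800-0xDBFF) followed by a low surrogate
    (0xDC00-0xDFFF) encodes a single non-BMP code point, so the pair
    counts as one; every other unit counts as one by itself.
    """
    units = list(utf16_units)
    count = 0
    i = 0
    while i < len(units):
        if (0xD800 <= units[i] <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            i += 2  # surrogate pair: one code point, two code units
        else:
            i += 1
        count += 1
    return count

# "a" + U+1D49E (MATHEMATICAL SCRIPT CAPITAL C) + "b" is three code
# points but four UTF-16 code units, since U+1D49E needs the surrogate
# pair 0xD835 0xDC9E on a narrow build.
units = [0x0061, 0xD835, 0xDC9E, 0x0062]
```

This is the scan a narrow build would need to make len() return code points; a wide build gets the same answer for free, which is exactly the trade-off being argued over.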
If people don't care about/don't need to use non-BMP chars and would rather use less space, they can do so. Until we agree that the difference in space used/speed is no longer relevant and/or that non-BMP characters become common enough to prefer the correct behavior over the fast-but-inaccurate one, we will probably keep both.

Which is why I always put loud warnings in my Unicode-related Python programs that they do not work right on Unicode if running under a narrow build. I almost feel I should just exit. I haven't checked its UTF-16 codecs, but Python's UTF-8 codec is broken in a bunch of ways. You should be raising an exception in all kinds of places and you aren't.

I am aware of some problems of the UTF-8 codec on Python 2. It used to follow RFC 2279 until last year and now it's been updated to follow RFC 3629.

Unicode says you can't put surrogates or noncharacters in a UTF-anything stream. It's a bug to do so and pretend it's a UTF-whatever. Perl has an encoding form, which it does not call UTF-8, that you can use the UTF-8 algorithm on for any code point, including noncharacters and surrogates and even non-Unicode code points far above 0x10_FFFF, up to in fact 0xFFFF_FFFF_FFFF_FFFF on 64-bit machines. It's the internal format we use in memory. But we don't call it real UTF-8, either.

It sounds like this is the kind of thing that would be useful to you.

However, for backward compatibility, it still encodes/decodes surrogate pairs. This broken behavior has been kept because on Python 2, you can encode every code point with UTF-8, and decode it back without errors:

No, that's not UTF-8 then. By definition. See the Unicode Standard.

    x = [unichr(c).encode('utf-8') for c in range(0x110000)]

and breaking this invariant would probably do more harm than good.

Why? Create something called utf8-extended or utf8-lax or utf8-nonstrict or something. But you really can't call it UTF-8 and do that. We actually equate UTF-8 and utf8-strict. Our internal extended UTF-8 is something else.
It seems like you're still doing the old relaxed version we used to have until 2003 or so. It seems useful to be able to have both flavors, the strict and the relaxed one, and to call them different things.

Perl defaults to the relaxed one, which gives warnings not exceptions, if you do things like setting PERL_UNICODE to S or SD and such for the default I/O encoding. If you actually use UTF-8 as the encoding on the stream, though, you get the version that gives exceptions instead.

    UTF-8 = utf8-strict    strictly by the standard, raises exceptions otherwise
    utf8                   loosely only, emits warnings on encoding illegal things

We currently only emit warnings or raise exceptions on I/O, not on chr operations and such. We used to raise exceptions on things like chr(0xD800), but that was a mistake caused by misunderstanding the in-memory requirements being
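For what it's worth, modern Python 3 ended up with exactly this strict/relaxed split, spelled as error handlers rather than as separate codec names. A sketch of standard Python 3 behavior (not of what Python 2 did at the time of this exchange):

```python
# The default (strict) UTF-8 codec refuses lone surrogates, per the standard...
try:
    "\ud800".encode("utf-8")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True

# ...while the 'surrogatepass' handler is the relaxed, internal-use flavor
# that applies the UTF-8 bit pattern to surrogate code points anyway.
lax = "\ud800".encode("utf-8", "surrogatepass")      # b'\xed\xa0\x80'
roundtrip = lax.decode("utf-8", "surrogatepass")
```

So the strict flavor is what gets the name "utf-8", and the relaxed one must be asked for explicitly, which is roughly the naming discipline Tom argues for here.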
[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Ezio Melotti ezio.melo...@gmail.com added the comment:

If speed is more important than correctness, I can make any algorithm infinitely fast. Given the choice between correct and quick, I will take correct every single time.

It's a trade-off. Using non-BMP chars is fairly unusual (many real-world applications hardly use non-ASCII chars). Slowing everything down just to allow non-BMP chars on narrow builds is not a good idea IMHO. Wide builds can be used if one really wants len() and other methods to work properly with non-BMP chars.

Plus your strings are immutable! You know how long they are and they never change. Correctness comes at a negligible cost.

Sure, we can cache the len, but we still have to compute it at least once. Also it's not just len(): many other operations, like slicing, are affected.

Unicode says you can't put surrogates or noncharacters in a UTF-anything stream. It's a bug to do so and pretend it's a UTF-whatever.

The UTF-8 codec described by RFC 2279 didn't say so; since our codec was following RFC 2279, it was producing valid UTF-8. With RFC 3629 a number of things changed in a non-backward-compatible way. Therefore we couldn't just change the behavior of the UTF-8 codec nor rename it to something else in Python 2. We had to wait till Python 3 in order to fix it.

Perl has an encoding form, which it does not call UTF-8, that you can use the UTF-8 algorithm on for any code point, including noncharacters and surrogates and even non-Unicode code points far above 0x10_FFFF, up to in fact 0xFFFF_FFFF_FFFF_FFFF on 64-bit machines. It's the internal format we use in memory. But we don't call it real UTF-8, either.

This sounds like RFC 2279 UTF-8. It allowed up to 6 bytes (following the same encoding scheme) and had no restrictions about surrogates (at the time I think only BMP chars existed, so there were no surrogates, and the Unicode consortium hadn't yet decided that the limit was 0x10FFFF).

It sounds like this is the kind of thing that would be useful to you.
I believe this is what the surrogateescape error handler does (up to 0x10FFFF).

Why? Create something called utf8-extended or utf8-lax or utf8-nonstrict or something. But you really can't call it UTF-8 and do that.

That's what we did in Python 3, but on Python 2 it's too late to fix it, especially in a point release. (Just to clarify, I don't think any of these things will be fixed in 2.7. There won't be any 2.8, and major changes (especially backwards-incompatible ones) are unlikely to happen in a point release (e.g. 2.7.3), so it's better to focus on Python 3. Minor bug fixes can still be done even in 2.7, though.)

Perl defaults to the relaxed one, which gives warnings not exceptions, if you do things like setting PERL_UNICODE to S or SD and such for the default I/O encoding. If you actually use UTF-8 as the encoding on the stream, though, you get the version that gives exceptions instead.

In Python we don't usually use warnings for this kind of thing (also, we don't have things like use strict).

I don't imagine most of the Python devel team knows Perl very well, and maybe not even Java or ICU. So I get the idea that there isn't as much awareness of Unicode in your team as there tends to be in those others.

I would say there are at least 5-10 Unicode experts in our team. It might be true, though, that we don't always follow closely what other languages and the Unicode consortium do, but if people report problems we are willing to fix them (so thanks for reporting them!).

From my point of view, learning from other people's mistakes is a way to get ahead without incurring all the learning-bumps oneself, so if there's a way to do that for you, that could be to your benefit, and I'm very happy to share some of our blunders so you can avoid them yourselves.

While I really appreciate the fact that you are sharing with us your experience, the solution found and applied in Perl might not always be the best one for Python (but it's still good to learn from others' mistakes).
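As a concrete aside on what surrogateescape actually does in Python 3: it smuggles undecodable *bytes* through str by mapping byte 0xXY to the lone surrogate U+DCXY, so a decode/encode round trip preserves even invalid input (this is standard Python 3 behavior, shown here for clarity):

```python
# Bytes that are invalid UTF-8 survive a decode/encode round trip:
# each undecodable byte 0xXY becomes the lone surrogate U+DCXY on
# decoding and is turned back into the original byte on encoding.
raw = b"ok \xff\xfe end"
text = raw.decode("utf-8", "surrogateescape")   # 0xFF -> U+DCFF, 0xFE -> U+DCFE
back = text.encode("utf-8", "surrogateescape")
```

This is a different tool from surrogatepass (which encodes surrogate code points you already have); surrogateescape is about preserving raw bytes, e.g. undecodable filenames from the OS.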
For example, I don't think removing the 0x10FFFF upper limit is going to happen -- even if it might be useful for other things. Also, regular expressions are not part of the core and are not used that often, so I consider problems with narrow/wide builds, codecs, and the unicode type much more important than problems with the re/regex module (they should be fixed too, but have lower priority IMHO).
[issue12743] C API marshalling doc contains XXX
Martin v. Löwis mar...@v.loewis.de added the comment: Would you just remove the XXX string, or the entire comment? XXX is typically used to indicate that something needs to be done, and the comment makes a clear statement as to what it is that needs to be done. -- nosy: +loewis
[issue12748] IDLE halts on osx when copy and paste
Ned Deily n...@acm.org added the comment: Chances are that you used the python.org 2.7.2 64-bit/32-bit installer but you did not install the latest ActiveState Tcl, currently 8.5.10, as documented here: http://www.python.org/download/mac/tcltk/ On OS X 10.6, there should have been a warning message about this in the IDLE shell window. The Apple-supplied Tcl/Tk 8.5 in both Mac OS X 10.6 and 10.7 has known problems as described in the web page above. Please try with the latest ActiveState Tcl installed and reopen this issue if that does not resolve the problems you see. -- assignee: ronaldoussoren -> ned.deily resolution: -> works for me stage: -> committed/rejected status: open -> pending
[issue12748] IDLE halts on osx when copy and paste
hy hoyeung...@gmail.com added the comment: Thanks, but the problem is not completely solved. I followed your instruction and I can now use the mouse to click the menu to copy and paste without problems. But it still halts when using the keyboard to do so. Is there a complete solution? -- resolution: works for me -> wont fix status: pending -> open
[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Changes by Jeremy Kloth jeremy.kl...@gmail.com: -- nosy: +jkloth
[issue12749] lib re cannot match non-BMP ranges (all versions, all builds)
New submission from Tom Christiansen tchr...@perl.com:

On neither narrow nor wide builds does this UTF-8-encoded bit run without raising an exception:

    if re.search("[𝒜-𝒵]", "𝒞", re.UNICODE):
        print("match 1 passed")
    else:
        print("match 2 failed")

The best you can possibly do is to use both a wide build *and* symbolic literals, in which case it will pass. But remove either or both of those conditions and you fail. This is too restrictive for full Unicode use. There should never be any situation where [a-z] fails to match c when a ≤ c ≤ z and neither a nor z is something special in a character class. There is, or perhaps should be, no difference at all between [a-z] and [𝒜-𝒵], just as there is, or at least should be, no difference between c and 𝒞. You can't have second-class citizens like this that can't be used.

And no, this one is *not* fixed by Matthew Barnett's regex library. There is some dumb UCS-2 assumption lurking deep in Python somewhere that makes this break, even on wide builds, which is incomprehensible to me.

-- components: Regular Expressions files: bigrange.py messages: 142058 nosy: Arfrever, ezio.melotti, jkloth, mrabarnett, pitrou, r.david.murray, tchrist, terry.reedy priority: normal severity: normal status: open title: lib re cannot match non-BMP ranges (all versions, all builds) type: behavior versions: Python 3.2 Added file: http://bugs.python.org/file22897/bigrange.py
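For the record, on any build that stores strings as code points (wide builds then, and every build since Python 3.3's PEP 393 flexible string representation) the range behaves exactly as Tom expects; a quick check using escapes instead of literals:

```python
import re

# U+1D49C (𝒜) .. U+1D4B5 (𝒵) is a non-BMP range; U+1D49E (𝒞) lies inside it.
assert re.search("[\U0001d49c-\U0001d4b5]", "\U0001d49e")
# An ordinary BMP character outside the range is correctly rejected.
assert not re.search("[\U0001d49c-\U0001d4b5]", "c")
```

The narrow-build failure reported here came from the pattern being seen as UTF-16 code units, so the "range" ran between surrogate halves rather than between the two code points.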
[issue12749] lib re cannot match non-BMP ranges (all versions, all builds)
Ezio Melotti ezio.melo...@gmail.com added the comment:

On a wide 2.7 and 3.3 all the 3 tests pass. On a narrow 3.2 I get:

    match 1 passed
    Traceback (most recent call last):
      File "/home/wolf/dev/py/3.2/Lib/functools.py", line 176, in wrapper
        result = cache[key]
    KeyError: (<class 'str'>, '[𝒜-𝒵]', 32)

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "bigrange.py", line 16, in <module>
        if re.search("[𝒜-𝒵]", "𝒞", flags):
      File "/home/wolf/dev/py/3.2/Lib/re.py", line 158, in search
        return _compile(pattern, flags).search(string)
      File "/home/wolf/dev/py/3.2/Lib/re.py", line 255, in _compile
        return _compile_typed(type(pattern), pattern, flags)
      File "/home/wolf/dev/py/3.2/Lib/functools.py", line 180, in wrapper
        result = user_function(*args, **kwds)
      File "/home/wolf/dev/py/3.2/Lib/re.py", line 267, in _compile_typed
        return sre_compile.compile(pattern, flags)
      File "/home/wolf/dev/py/3.2/Lib/sre_compile.py", line 491, in compile
        p = sre_parse.parse(p, flags)
      File "/home/wolf/dev/py/3.2/Lib/sre_parse.py", line 692, in parse
        p = _parse_sub(source, pattern, 0)
      File "/home/wolf/dev/py/3.2/Lib/sre_parse.py", line 315, in _parse_sub
        itemsappend(_parse(source, state))
      File "/home/wolf/dev/py/3.2/Lib/sre_parse.py", line 461, in _parse
        raise error("bad character range")
    sre_constants.error: bad character range
[issue12749] lib re cannot match non-BMP ranges (all versions, all builds)
Ezio Melotti ezio.melo...@gmail.com added the comment:

On wide 3.2 it passes too, so the failure is limited to narrow builds (are you sure that it fails on wide builds for you?). On a narrow 2.7 I get a slightly different error though:

    match 1 passed
    Traceback (most recent call last):
      File "bigrange.py", line 16, in <module>
        if re.search("[𝒜-𝒵]", "𝒞", flags):
      File "/home/wolf/dev/py/2.7/Lib/re.py", line 142, in search
        return _compile(pattern, flags).search(string)
      File "/home/wolf/dev/py/2.7/Lib/re.py", line 244, in _compile
        raise error, v # invalid expression
    sre_constants.error: bad character range
[issue12749] lib re cannot match non-BMP ranges (all versions, all builds)
Ezio Melotti ezio.melo...@gmail.com added the comment: I haven't looked at the code, but I think that the re module is just trying to calculate the range between the low surrogate of 𝒜 and the high surrogate of 𝒵. If this is the case, this is the usual bug that narrow builds have. Also note that

    re.search(u"[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]".encode('utf-8'),
              u"\N{MATHEMATICAL SCRIPT CAPITAL C}".encode('utf-8'), re.UNICODE)

works, but it returns a wrong result.
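The "works, but it returns a wrong result" part can be shown directly, even on a modern Python 3: once both pattern and string are encoded, re matches *bytes*, so the character class becomes a range between individual UTF-8 bytes of the endpoints rather than between the two code points (a sketch; the byte values in the comments are just the UTF-8 encodings of 𝒜, 𝒵, and 𝒞):

```python
import re

# UTF-8: 𝒜 = f0 9d 92 9c, 𝒵 = f0 9d 92 b5, 𝒞 = f0 9d 92 9e
pat = "[\U0001d49c-\U0001d4b5]".encode("utf-8")  # b'[\xf0\x9d\x92\x9c-\xf0\x9d\x92\xb5]'

# It appears to match the encoded 𝒞 ...
assert re.search(pat, "\U0001d49e".encode("utf-8"))
# ... but only because individual bytes of the endpoints match: a bare
# continuation byte "matches" too, which no code-point-aware match should.
assert re.search(pat, b"\x9d")
```

So the encoded form "works" only by accident, matching any input containing one of the endpoint bytes, which is the wrong result Ezio alludes to.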
[issue12748] IDLE halts on osx when copy and paste
Ned Deily n...@acm.org added the comment: That is encouraging. This is almost certainly a problem with Tk. The Cocoa Tcl/Tk 8.5 used by Apple and ActiveState has been known to have issues with composite characters.

There are a couple of IDLE things to ask about first. Have you made any Custom Key Bindings for IDLE? Or added any IDLE extensions? Both of these would show up in your ~/.idlerc directory. On to Tk-related questions: Which OS X keyboard layout are you using? Are you using any Input Methods? (Both of these options are shown in System Preferences.) What keystrokes are used for the menu shortcuts that cause the hang? And, by hang, do you mean that the menu item changes color, indicating that it is selected, but IDLE freezes at that point?

If you have the time and feel comfortable doing so, it would be helpful to know if the same problems are displayed using the older Carbon Tcl/Tk 8.4. You could temporarily move your current 2.7 installation out of the way by doing this in a Terminal shell:

    cd /Library/Frameworks/Python.framework/Versions
    sudo mv 2.7 2.7-SAVED
    cd /Applications
    sudo mv Python\ 2.7 Python\ 2.7-SAVED

and then downloading and installing the 32-bit-only (10.3+) 2.7.2 installer from python.org. It is not necessary to install an ActiveState Tcl/Tk 8.4 for this. Note that if you have migrated to OS X 10.7 already, you probably will not want to stay with this version because it is not easy with Xcode 4 to install third-party Python packages that require building C extension modules. You can restore your previous Python by:

    cd /Library/Frameworks/Python.framework/Versions
    sudo mv 2.7-SAVED 2.7
    cd /Applications
    sudo mv Python\ 2.7-SAVED Python\ 2.7

-- resolution: wont fix -> stage: committed/rejected ->
[issue12749] lib re cannot match non-BMP ranges (all versions, all builds)
Ezio Melotti ezio.melo...@gmail.com added the comment:

The error on 3.2 comes from the lru_cache; here's a minimal testcase to reproduce it:

    >>> from functools import lru_cache
    >>> @lru_cache()
    ... def func(arg):
    ...     raise ValueError()
    ...
    >>> func(3)
    Traceback (most recent call last):
      File "/home/wolf/dev/py/3.2/Lib/functools.py", line 176, in wrapper
        result = cache[key]
    KeyError: (3,)

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/wolf/dev/py/3.2/Lib/functools.py", line 180, in wrapper
        result = user_function(*args, **kwds)
      File "<stdin>", line 2, in func
    ValueError

Raymond, is this expected or should I open another issue?

-- nosy: +rhettinger
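The double traceback itself is ordinary PEP 3134 implicit exception chaining: the 3.2 pure-Python wrapper's failed cache[key] lookup becomes the __context__ of the ValueError raised while handling it. A minimal stand-alone version (note: on current CPython the C-accelerated lru_cache probes the cache without raising KeyError, so only the ValueError appears; also, exceptions are never cached, so every call re-raises):

```python
from functools import lru_cache

@lru_cache()
def func(arg):
    raise ValueError("boom")

def call():
    """Return True if func(3) raises ValueError, False otherwise."""
    try:
        func(3)
    except ValueError:
        return True
    return False

# A failed call never populates the cache, so the exception is raised
# on the first call and again on the second.
first, second = call(), call()
```

The cosmetic question Ezio raises is whether the internal KeyError should leak into the user-visible traceback at all; suppressing it (e.g. via a sentinel lookup instead of exception handling) is how later implementations avoided it.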
[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Tom Christiansen tchr...@perl.com added the comment:

Ezio Melotti rep...@bugs.python.org wrote on Sun, 14 Aug 2011 07:15:09 -0000:

Unicode says you can't put surrogates or noncharacters in a UTF-anything stream. It's a bug to do so and pretend it's a UTF-whatever.

The UTF-8 codec described by RFC 2279 didn't say so, so, since our codec was following RFC 2279, it was producing valid UTF-8. With RFC 3629 a number of things changed in a non-backward compatible way. Therefore we couldn't just change the behavior of the UTF-8 codec nor rename it to something else in Python 2. We had to wait till Python 3 in order to fix it.

I'm a bit confused on this. You no longer fix bugs in Python 2?

I've dug out the references that state that you are not allowed to do things the way you are doing them. This is from the published Unicode Standard version 6.0.0, chapter 3, Conformance. It is a very important chapter.

    http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

Python is in violation of that published Standard by interpreting noncharacter code points as abstract characters and tolerating them in character encoding forms like UTF-8 or UTF-16. This explains that conformant processes are forbidden from doing this:

    Code Points Unassigned to Abstract Characters

    C1  A process shall not interpret a high-surrogate code point or a
        low-surrogate code point as an abstract character.

        · The high-surrogate and low-surrogate code points are designated for
          surrogate code units in the UTF-16 character encoding form. They are
          unassigned to any abstract character.

==> C2  A process shall not interpret a noncharacter code point as an abstract
        character.

        · The noncharacter code points may be used internally, such as for
          sentinel values or delimiters, but should not be exchanged publicly.

    C3  A process shall not interpret an unassigned code point as an abstract
        character.

        · This clause does not preclude the assignment of certain generic
          semantics to unassigned code points (for example, rendering with a
          glyph to indicate the position within a character block) that allow
          for graceful behavior in the presence of code points that are
          outside a supported subset.

        · Unassigned code points may have default property values. (See D26.)

        · Code points whose use has not yet been designated may be assigned to
          abstract characters in future versions of the standard. Because of
          this fact, due care in the handling of generic semantics for such
          code points is likely to provide better robustness for
          implementations that may encounter data based on future versions of
          the standard.

Next we have exactly how something you call UTF-{8,16,32} must be formed. *This* is the Standard against which these things are measured; it is not the RFC. You are of course perfectly free to say you conform to this and that RFC, but you must not say you conform to the Unicode Standard when you don't. These are different things. I feel it does users a grave disservice to ignore the Unicode Standard in this, and sheer casuistry to rely on an RFC definition while ignoring the Unicode Standard whence it originated, because this borders on being intentionally misleading.

    Character Encoding Forms

    C8  When a process interprets a code unit sequence which purports to be in
        a Unicode character encoding form, it shall interpret that code unit
        sequence according to the corresponding code point sequence.

==>     · The specification of the code unit sequences for UTF-8 is given in D92.

        · The specification of the code unit sequences for UTF-16 is given in D91.

        · The specification of the code unit sequences for UTF-32 is given in D90.

    C9  When a process generates a code unit sequence which purports to be in
        a Unicode character encoding form, it shall not emit ill-formed code
        unit sequences.

        · The definition of each Unicode character encoding form specifies the
          ill-formed code unit sequences in the character encoding form. For
          example, the definition of UTF-8 (D92) specifies that code unit
          sequences such as C0 AF are ill-formed.

==> C10 When a process interprets a code unit sequence which purports to be in
        a Unicode character encoding form, it shall treat ill-formed code unit
        sequences as an error condition and shall not interpret such sequences
        as characters.

        · For example, in UTF-8 every code unit of the form 110xxxxx₂ must be
          followed by a code unit of the form 10xxxxxx₂. A sequence such as
          110xxxxx₂ 0xxxxxxx₂ is ill-formed and must never be generated. When
          faced with this ill-formed code unit sequence while transforming or
          interpreting text, a conformant process must treat the first code
          unit 110xxxxx₂ as an
[issue12749] lib re cannot match non-BMP ranges (all versions, all builds)
Tom Christiansen tchr...@perl.com added the comment:

Ezio Melotti ezio.melo...@gmail.com added the comment:

On wide 3.2 it passes too, so the failure is limited to narrow builds (are you sure that it fails on wide builds for you?).

You're right: my wide build is not Python3, just Python2. In fact, it's even worse, because it's the stock build on Linux, which seems on this machine to be 2.6 not 2.7. I have private builds that are 2.7 and 3.2, but those are both narrow. I do not have a 3.3 build. Should I?

I'm remembering why I removed Python2 from my Unicode talk, because of how it made me pull my hair out. People at the talk wanted to know what I meant, but I didn't have time to go into it. I think this gets added to the hairpulling list.

--tom
[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Tom Christiansen tchr...@perl.com added the comment:

Ezio Melotti rep...@bugs.python.org wrote on Sun, 14 Aug 2011 07:15:09 -0000:

For example I don't think removing the 0x10FFFF upper limit is going to happen -- even if it might be useful for other things.

I agree entirely. That's why I appended a triple exclamation point to where I said I certainly do not expect this. It can only work fully on UTF-8ish systems and up to 32 bits on UTF-32, and it is most emphatically *not* Unicode. Yes, there are things you can do with it, but it risks serious misunderstanding and even nonconformance if not done very carefully. The Standard does not forbid such things internally, but you are not allowed to pass them around in noninternal streams claiming they are real UTF streams.

Also regular expressions are not part of the core and are not used that often, so I consider problems with narrow/wide builds, codecs and the unicode type much more important than problems with the re/regex module (they should be fixed too, but have lower priority IMHO).

One advantage of having an external library is the ability to update it asynchronously. Another is the possibility of swapping it out altogether. Perl only gained that ability, which Python has always had, some four years ago with its 5.10 release. To my knowledge, the only thing people tend to use this for is to get Russ Cox's re2 library, which has very different performance characteristics and guarantees that allow it to be used in potential starvation denial-of-service situations where the normal Perl, Python, Java, etc. regex engine cannot safely be used.

-tom
[issue12749] lib re cannot match non-BMP ranges (all versions, all builds)
Ezio Melotti ezio.melo...@gmail.com added the comment:

You're right: my wide build is not Python3, just Python2.

And is it failing? Here the tests pass on the wide builds, on both Python 2 and 3.

In fact, it's even worse, because it's the stock build on Linux, which seems on this machine to be 2.6 not 2.7.

What is worse? FWIW on my system the default `python` is a 2.7 wide, and `python3` is a 3.2 wide.

I have private builds that are 2.7 and 3.2, but those are both narrow. I do not have a 3.3 build. Should I?

3.3 is the version in development, not released yet. If you have an HG clone of Python you can make a wide build of 3.x with ./configure --with-wide-unicode and of 2.7 using ./configure --enable-unicode=ucs4.

I'm remembering why I removed Python2 from my Unicode talk, because of how it made me pull my hair out. People at the talk wanted to know what I meant, but I didn't have time to go into it. I think this gets added to the hairpulling list.

I'm not sure what you are referring to here.
[issue12749] lib re cannot match non-BMP ranges (all versions, all builds)
Antoine Pitrou pit...@free.fr added the comment:

I have private builds that are 2.7 and 3.2, but those are both narrow. I do not have a 3.3 build. Should I?

I don't know if you *should*. But you can make one easily by passing --with-wide-unicode to ./configure.
[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Antoine Pitrou pit...@free.fr added the comment:

The UTF-8 codec described by RFC 2279 didn't say so, so, since our codec was following RFC 2279, it was producing valid UTF-8. With RFC 3629 a number of things changed in a non-backward compatible way. Therefore we couldn't just change the behavior of the UTF-8 codec nor rename it to something else in Python 2. We had to wait till Python 3 in order to fix it.

I'm a bit confused on this. You no longer fix bugs in Python 2?

In general, we try not to introduce changes that have a high probability of breaking existing code, especially when what is being fixed is a minor issue which almost nobody complains about. This is even truer for stable branches, and Python 2 is very much a stable branch now (no more feature releases after 2.7).

That's why I say that you are out of conformance by having encoders and decoders of UTF streams tolerate noncharacters. You are not allowed to call something a UTF and do non-UTF things with it, because this is in violation of conformance requirement C2.

Perhaps, but it is not Python's fault if the IETF and the Unicode consortium have disagreed on what UTF-8 should be. I'm not sure what people called UTF-8 when support for it was first introduced in Python, but you can't blame us for maintaining a consistent behaviour across releases.
[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Ezio Melotti ezio.melo...@gmail.com added the comment:

I'm a bit confused on this. You no longer fix bugs in Python 2?

We do, but it's unlikely that we will introduce major changes in behavior. Even if we had to get rid of narrow builds and/or fix len(), we would probably only do it in the next 3.x version (i.e. 3.3), and not in the next bug fix release of 3.2 (i.e. 3.2.2).

That's why I say that you are out of conformance by having encoders and decoders of UTF streams tolerate noncharacters. You are not allowed to call something a UTF and do non-UTF things with it, because this is in violation of conformance requirement C2.

This IMHO should be fixed, but it's another issue.

If you have not reread its Chapter 3 of late in its entirety, you probably want to do so. There is quite a bit of material there that is fundamental to any process that claims to be conformant with the Unicode Standard.

I am familiar with Chapter 3, but admittedly I only read the parts that were relevant to the bugs I was fixing. I never went through it checking that everything in Python matches the described behavior. Thanks for pointing out the parts where Python doesn't follow the specs.
[issue12266] str.capitalize contradicts oneself
Ezio Melotti ezio.melo...@gmail.com added the comment: Attached patch + tests. -- keywords: +patch Added file: http://bugs.python.org/file22898/issue12266.diff ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12266 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12611] 2to3 crashes when converting doctest using reduce()
Catalin Iacob iacobcata...@gmail.com added the comment: I looked at this and understood why it's happening. I don't know exactly how to fix it though, so here's what I found out. When a doctest appears in a docstring at line n in a file, RefactoringTool.parse_block will return a tree corresponding to n - 1 newline characters followed by the code in the doctest. That tree is refactored by RefactoringTool.refactor_tree which usually returns n - 1 newline characters followed by the refactored doctest. However, for the reduce fixer, the tree returned by refactor_tree starts with "from functools import reduce" followed by n - 1 newline characters and then the doctest reduce line. The failing assert happens when stripping those newlines because they are expected to be at the beginning of the output while in reality they're after the import line. So the problem is a mismatch between the expectations of the doctest machinery (refactoring code that starts with some newlines results in code that starts with the same number of newlines) and the reduce fixer, which adds an import; imports are added at the beginning of the file, therefore something appears before the newlines. Other fixers could exhibit the same problem. -- nosy: +catalin.iacob ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12611 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12740] Add struct.Struct.nmemb
Stefan Krah stefan-use...@bytereef.org added the comment: I like random tests in the stdlib, otherwise the same thing gets tested over and over again. `make buildbottest` prints the seed, and you can do it for a single test as well:

$ ./python -m test -r test_heapq
Using random seed 5857004
[1/1] test_heapq
1 test OK.

It looks like the choice is between s.nmembers and len(s). I thought about len(s), but since Struct.pack() returns a bytes object, this might be confusing. Struct.arity may be another option. This also reflects that pack() will be an n-ary function for the given format string (and that Struct is a packing object, not really a struct itself). Still, probably I'm +0.5 on 'nmembers' compared to the other options. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12740 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12740] Add struct.Struct.nmemb
Antoine Pitrou pit...@free.fr added the comment: It looks like the choice is between s.nmembers and len(s). I thought about len(s), but since Struct.pack() returns a bytes object, this might be confusing. I agree there's a risk of confusion between len()-number-of-elements and size()-number-of-bytes. We have a similar confusion with the memoryview object and in retrospect it's often quite misleading. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12740 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12749] lib re cannot match non-BMP ranges (all versions, all builds)
Tom Christiansen tchr...@perl.com added the comment: Ezio Melotti rep...@bugs.python.org wrote on Sun, 14 Aug 2011 17:15:52 -: You're right: my wide build is not Python3, just Python2. And is it failing? Here the tests pass on the wide builds, on both Python 2 and 3. Perhaps I am doing something wrong?

linux% python --version
Python 2.6.2
linux% python -c 'import sys; print sys.maxunicode'
1114111
linux% cat -n bigrange.py
     1  #!/usr/bin/env python
     2  # -*- coding: UTF-8 -*-
     3
     4  from __future__ import print_function
     5  from __future__ import unicode_literals
     6
     7  import re
     8
     9  flags = re.UNICODE
    10
    11  if re.search("[a-z]", "c", flags):
    12      print("match 1 passed")
    13  else:
    14      print("match 1 failed")
    15
    16  if re.search("[𝒜-𝒵]", "𝒞", flags):
    17      print("match 2 passed")
    18  else:
    19      print("match 2 failed")
    20
    21  if re.search("[\U0001D49C-\U0001D4B5]", "\U0001D49E", flags):
    22      print("match 3 passed")
    23  else:
    24      print("match 3 failed")
    25
    26  if re.search("[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]",
    27               "\N{MATHEMATICAL SCRIPT CAPITAL C}", flags):
    28      print("match 4 passed")
    29  else:
    30      print("match 4 failed")
linux% python bigrange.py
match 1 passed
Traceback (most recent call last):
  File "bigrange.py", line 16, in <module>
    if re.search("[𝒜-𝒵]", "𝒞", flags):
  File "/usr/lib64/python2.6/re.py", line 142, in search
    return _compile(pattern, flags).search(string)
  File "/usr/lib64/python2.6/re.py", line 245, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range

In fact, it's even worse, because it's the stock build on Linux, which seems on this machine to be 2.6 not 2.7. What is worse? FWIW on my system the default `python` is a 2.7 wide. `python3` is a 3.2 wide. I meant that it was running 2.6 not 2.7. I have private builds that are 2.7 and 3.2, but those are both narrow. I do not have a 3.3 build. Should I? 3.3 is the version in development, not released yet.
If you have an HG clone of Python you can make a wide build of 3.x with ./configure --with-wide-unicode and of 2.7 using ./configure --enable-unicode=ucs4. And Antoine Pitrou pit...@free.fr wrote: I have private builds that are 2.7 and 3.2, but those are both narrow. I do not have a 3.3 build. Should I? I don't know if you *should*. But you can make one easily by passing --with-wide-unicode to ./configure. Oh good. I need to read configure --help more carefully next time. I have to do some Lucene work this afternoon, so I can let several builds chug along. Is there a way to easily have these co-exist on the same system? I'm sure I have to rebuild all C extensions for the new builds, but I wonder what to do about (for example) /usr/local/lib/python3.2 being able to be only one of narrow or wide. Probably I just need to go read the configure stuff better for alternate paths. Unsure. Variant Perl builds can coexist on the same system with some directories shared and others not, but I often find other systems aren't quite that flexible, usually requiring their own dedicated trees. Manpaths can get tricky, too. I'm remembering why I removed Python2 from my Unicode talk, because of how it made me pull my hair out. People at the talk wanted to know what I meant, but I didn't have time to go into it. I think this gets added to the hairpulling list. I'm not sure what you are referring to here. There seem to be many more things to get wrong with Unicode in v2 than in v3. I don't know how much of this is just my slowness at ramping up the learning curve, how much is due to historical defaults that don't work well for Unicode, and how much is

Python2:
re.search(u"[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]".encode('utf-8'), u"\N{MATHEMATICAL SCRIPT CAPITAL C}".encode('utf-8'), re.UNICODE)
Python3:
re.search("[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]", "\N{MATHEMATICAL SCRIPT CAPITAL C}", re.UNICODE)

The Python2 version is *much* noisier.
(1) You have to keep remembering to u"..." everything because neither # -*- coding: UTF-8 -*- nor even from __future__ import unicode_literals suffices. (2) You have to manually encode every string, which is utterly bizarre to me. (3) Plus you then have to turn around and tell re, "Hey, by the way, you know those Unicode strings I just passed you? Those are Unicode strings, you know." Like it couldn't tell that already by realizing it got Unicode, not byte strings. So weird. It's a very awkward model. Compare Perl's "\N{MATHEMATICAL SCRIPT CAPITAL C}" =~ /[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]/ That's the kind of thing I'm used
[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Tom Christiansen tchr...@perl.com added the comment: Ezio Melotti rep...@bugs.python.org wrote on Sun, 14 Aug 2011 17:46:55 -: I'm a bit confused on this. You no longer fix bugs in Python 2? We do, but it's unlikely that we will introduce major changes in behavior. Even if we had to get rid of narrow builds and/or fix len(), we would probably only do it in the next 3.x version (i.e. 3.3), and not in the next bug fix release of 3.2 (i.e. 3.2.2). Antoine Pitrou rep...@bugs.python.org wrote on Sun, 14 Aug 2011 17:36:42 -: This is even truer for stable branches, and Python 2 is very much a stable branch now (no more feature releases after 2.7). Does that mean you now go to 2.7.1, 2.7.2, etc? I had thought that 2.6 was going to be the last, but then 2.7 came out. I think I remember Guido said something about there never being a 2.10, so I wasn't too surprised to see 2.7. --tom -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12729 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12749] lib re cannot match non-BMP ranges (all versions, all builds)
Matthew Barnett pyt...@mrabarnett.plus.com added the comment: On a narrow build, \N{MATHEMATICAL SCRIPT CAPITAL A} is stored as 2 code units, and neither re nor regex recombine them when compiling a regex or looking for a match. regex supports \xNN, \u and \U and \N{XYZ} itself, so they can be used in a raw string literal, but it doesn't recombine code units. I could add recombination to regex at some point if time has passed and no further progress has been made in the language's support for Unicode. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12749 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
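[Editor's note: the recombination Matthew describes is purely arithmetic; a sketch of it (an illustration, not regex's actual code):]

```python
def combine(hi, lo):
    """Combine a UTF-16 surrogate pair into the astral code point it encodes."""
    assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

# On a narrow build, U+1D49C is stored as the two code units D835 DC9C:
print(hex(combine(0xD835, 0xDC9C)))  # 0x1d49c
```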
[issue12749] lib re cannot match non-BMP ranges (all versions, all builds)
Ezio Melotti ezio.melo...@gmail.com added the comment: Perhaps I am doing something wrong? That's weird, I tried on a wide Python 2.6.6 too and it works even there. Maybe a bug that got fixed between 2.6.2 and 2.6.6? Or maybe something else? Is there a way to easily have these co-exist on the same system? Here I have different HG clones, one for each release (2.7, 3.2, 3.3), and I run ./configure (--with-wide-unicode) and make -j2. Then I just run ./python from there without installing it in the system. You might do the same or look at make altinstall. If you run make install it will install it as the default Python, so that's probably not what you want. Another option is to use virtualenv. The Python2 version is *much* noisier. Yes, Python 3 fixed many of these things and it's a much cleaner language. (1) You have to keep remembering to u"..." everything because neither # -*- coding: UTF-8 -*- nor even from __future__ import unicode_literals suffices. Before Unicode, Python only had plain (byte)strings; when Unicode strings were introduced, the u"..." syntax was chosen to distinguish them. On Python 3, "..." is a Unicode string, whereas b"..." is used for bytes. # -*- coding: UTF-8 -*- is only about the encoding used to save the file, and doesn't affect other things. Also this is the default on Python 3 so it's not necessary anymore (it's ASCII (or iso-8859-1?) on Python2). from __future__ import unicode_literals allows you to use "..." and b"..." instead of u"..." and "..." on Python 2. In my example I used u"..." to be explicit and because I was running from the terminal without using unicode_literals. (2) You have to manually encode every string, which is utterly bizarre to me. re works with both bytes and Unicode strings, on both Python 2 and Python 3. I was encoding them to see if it was able to handle the range when it was in a UTF-8 encoded string, rather than a Unicode string. Even if it didn't fail with an exception, it failed with a wrong result (and that's even worse). 
(3) Plus you then have to turn around and tell re, "Hey, by the way, you know those Unicode strings I just passed you? Those are Unicode strings, you know." Like it couldn't tell that already by realizing it got Unicode, not byte strings. So weird. The re.UNICODE flag affects the behavior of e.g. \w and \d; it's not telling re that we are passing Unicode strings rather than bytes. By default on Python 2 those only match ASCII letters and digits. This is also fixed on Python 3, where by default they match non-ASCII letters and digits (unless you pass re.ASCII). * Requiring explicitly coded callouts to a library is at best tedious and annoying. ICU4J's UCharacter and JDK7's Character classes both have String getName(int codePoint) FWIW we have unicodedata.lookup('SNOWMAN') One question: If one really must use code point numbers in strings, does Python have any clean uniform way to enter them besides having to choose the clunky \u vs \U thing? Nope. OTOH it doesn't happen too often that you use those (especially the \U version), so I'm not sure that it's worth adding something else just to save a few chars (also \x{12345} is only one char less than \U00012345). -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12749 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
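[Editor's note: the \w difference Ezio describes is directly observable on Python 3, where the inline (?a) flag is the re.ASCII behavior; a small sketch:]

```python
import re
import unicodedata

# Python 3: \w matches non-ASCII word characters by default ...
print(re.findall(r'\w+', 'café'))      # ['café']
# ... unless re.ASCII / (?a) restores the ASCII-only (Python 2 style) behavior
print(re.findall(r'(?a)\w+', 'café'))  # ['caf']

# And the by-name lookup mentioned above:
print(unicodedata.lookup('SNOWMAN'))   # ☃
```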
[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Ezio Melotti ezio.melo...@gmail.com added the comment: 2.7 is the last 2.x. There won't be any 2.8 (also I never heard that 2.6 was supposed to be the last). We already have 2.7.2, and we will continue with 2.7.3, 2.7.4, etc for a few more years. Eventually 2.7 will only get security fixes and the development will be focused on 3.x only. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12729 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12749] lib re cannot match non-BMP ranges (all versions, all builds)
Ezio Melotti ezio.melo...@gmail.com added the comment: BTW, you can find more information about the one-dir-per-clone setup (and other useful info) here: http://docs.python.org/devguide/committing.html#using-several-working-copies -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12749 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue10744] ctypes arrays have incorrect buffer information (PEP-3118)
Stefan Krah stefan-use...@bytereef.org added the comment: Thanks for the patch. I agree with the interpretation of the format string. One thing is unclear though: Using this interpretation the multi-dimensional array notation in format strings only seems useful for pointers to arrays. The PEP isn't so clear on that, would you agree? I'm not done reviewing the patch, just a couple of nitpicks: - We need a function declaration of _ctypes_alloc_format_string_with_shape() in ctypes.h. - prefix_len = 32*(ndim+1) + 3: This is surely sufficient, but (ndim+1) is not obvious to me. I think we need (20 + 1) * ndim + 3. - I'd use %zd for Py_ssize_t (I know that in other parts of the code %ld is used, too). -- assignee: theller - stage: - patch review ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue10744 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
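[Editor's note: the buffer information under discussion can be inspected from Python via memoryview; this sketch shows the values a current CPython (where this issue has since been fixed) reports for a multi-dimensional ctypes array:]

```python
import ctypes

# A 3x2 array of c_int: (c_int * 2) is the inner dimension, * 3 the outer.
arr = (ctypes.c_int * 2 * 3)()
m = memoryview(arr)

# PEP 3118 shape/ndim for the exported buffer
print(m.ndim, m.shape, m.itemsize)  # 2 (3, 2) 4 on typical platforms
```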
[issue12740] Add struct.Struct.nmemb
Stefan Krah stefan-use...@bytereef.org added the comment: Just to throw in a new name: Struct.nitems would also be possible. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12740 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue11835] python (x64) ctypes incorrectly pass structures parameter
Vlad Riscutia riscutiav...@gmail.com added the comment: Attached patch for this issue. This only happens on MSVC x64 (I actually tried to repro on Arch Linux x64 before starting work on it and it didn't repro). What happens is that MSVC on x64 always passes structures larger than 8 bytes by reference. See here: http://msdn.microsoft.com/en-us/library/ms235286(v=vs.90).aspx Now this was accounted for in callproc.c, line 1143 in the development branch, with this:

if (atypes[i]->type == FFI_TYPE_STRUCT
#ifdef _WIN64
    && atypes[i]->size <= sizeof(void *)
#endif
    )
    avalues[i] = (void *)args[i].value.p;
else
    avalues[i] = (void *)&args[i].value;

This fix wasn't made in libffi_msvc/ffi.c though. Here, regardless of whether we have an x64 or x86 build, if z > sizeof(int) we will hit the else branch in libffi_msvc/ffi.c at line 114 and do:

else
{
    memcpy(argp, *p_argv, z);
}
p_argv++;
argp += z;

In our case, we copy 28 bytes as arguments (the size of our structure) but in fact for x64 we only need 8, as the structure is passed by reference so the argument is just a pointer. My patch will adjust z before hitting the if statement on x64 and it will cause a correct copy as a pointer. -- nosy: +vladris Added file: http://bugs.python.org/file22899/issue11835_patch.diff ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue11835 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
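[Editor's note: a 28-byte structure like the one in the report is easy to construct; the layout below is hypothetical and serves only to show a size above the 8-byte by-value limit of the MSVC x64 ABI:]

```python
import ctypes

class S(ctypes.Structure):
    # 7 x c_int = 28 bytes: larger than 8, so MSVC x64 passes it by reference
    _fields_ = [('vals', ctypes.c_int * 7)]

print(ctypes.sizeof(S))  # 28
```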
[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Terry J. Reedy tjre...@udel.edu added the comment: Tom, I appreciate your taking the time to help us improve our Unicode story. I agree that the compromises made a decade ago need to be revisited and revised. I think it will help if you better understand our development process. Our current *intent* is that 'Python x.y' be a fixed language and that 'CPython x.y.0', '.1', '.2', etc be increasingly (and strictly -- no regressions) better implementations of Python x.y. (Of course, the distribution and installation names and up-to-now dropping of '.0' may confuse the distinction, but it is real.) As a consequence, correct Python x.y code that runs correctly on the CPython x.y.z implementation should run correctly on x.y.(z+1). For the purpose of this tracker, a behavior issue ('bug') is a discrepancy between the documented intent of a supported Python x.y and the behavior of the most recent CPython x.y.z implementation thereof. A feature request is a design issue, a request for a change in the language definition (and in the corresponding .0 implementation). Most people (including you, obviously) that file feature requests regard them as addressing design bugs. But still, language definition bugs are different from implementation bugs. Of course, this all assumes that the documents are correct and unambiguous. But accomplishing that can be as difficult as correct code. Obvious mistakes are quickly corrected. Ambiguities in relation to uncontroversial behavior are resolved by more exactly specifying the actual behavior. But ambiguities about behavior that some consider wrong, are messy. We can consult the original author if available, consult relevant tests if present, take votes, but some fairly arbitrary decision may be needed. A typical response may be to clarify behavior in the docs for the current x.y release and consider behavior changes for the next x.(y+1) release. 
So the answer to your question, Do we fix bugs?, is that we fix doc and implementation behavior bugs in the next micro x.y.z behavior bug-fix release and language design bugs in the next minor x.y language release. But note that language changes merely have to be improvements for Python in the future without necessarily worrying about whether a design decision made years ago was or is a 'bug'. The purpose of me discussing or questioning the 'type' of some of your issues is to *facilitate* change by getting the issue on the right track, in relation to our development process, as soon as possible. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12729 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue11835] python (x64) ctypes incorrectly pass structures parameter
Changes by Stefan Krah stefan-use...@bytereef.org: -- nosy: +amaury.forgeotdarc, belopolsky ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue11835 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Terry J. Reedy tjre...@udel.edu added the comment: This is off-topic, but there was discussion on whether or not to have a 2.7. The decision was to focus on back-porting things that would make the eventual transition to 3.x easier. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12729 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12740] Add struct.Struct.nmemb
Raymond Hettinger raymond.hettin...@gmail.com added the comment: In general, I think we can prevent confusion about the meaning of __len__ by sticking to the general rule: len(obj) == len(list(obj)) for anything that produces an iterable result. In the case of struct, that would be the length of the tuple returned by struct.unpack() or the number of values consumed by struct.pack(). This choice is similar to what was done for collections.Counter where len(Counter(a=10, b=20)) returns 2 (the number of dict keys) rather than 30 (the number of elements in the Bag-like container). A similar choice was made for structseq objects, where len(ss) == len(tuple(ss)) despite there being other non-positional names that are retrievable. It's true that we get greater clarity by spelling out the specific meaning in the context of structs, as in s.num_members or some such, but we start to lose the advantages of polymorphism and ease of learning/remembering that comes with having consistent API choices. For any one API such as structs, it probably makes sense to use s.num_members, but for the standard library as a whole, it is probably better to try to make len(obj) have a consistent meaning rather than having many different names for the size of the returned tuple. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12740 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
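[Editor's note: Raymond's rule can be checked against Counter, and against the unpack() tuple length that the proposed len(s)/nmemb would mirror:]

```python
import struct
from collections import Counter

c = Counter(a=10, b=20)
assert len(c) == len(list(c)) == 2  # number of keys, not 30 elements

s = struct.Struct('hhl')
values = s.unpack(s.pack(1, 2, 3))
assert len(values) == 3             # what the proposed len(s) / s.nmemb would report
print('ok')
```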
[issue12748] IDLE halts on osx when copy and paste
hy hoyeung...@gmail.com added the comment: Thank you. I kinda know what happens now. First, I didn't make any change to IDLE after it was installed. Second, I'm using Dvorak-QWERTY. Normally the keyboard layout changes to QWERTY when I press the Cmd key so that I can type in Dvorak and use the shortcuts in QWERTY. But in IDLE it's not the same case. I find that the halt problem only occurs when I copy. So I tried cut and paste. It happens that I can use both Cmd+x and Cmd+b (x in the Dvorak layout) to cut, and both Cmd+v and Cmd+. (v in the Dvorak layout) to paste. So if I press Cmd+c, I'm inputting both Cmd+c and Cmd+j at the same time. And I think that's the reason why it halts. By hang, it's exactly what you described. Also, I tried Tcl/Tk 8.4, and the same problem happens. It's weird since I don't have this problem on Windows when I use a third-party Dvorak-QWERTY input method. I temporarily changed to Dvorak now to avoid this problem, although it is a little bit inconvenient since all the shortcuts have changed. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12748 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12672] Some problems in documentation extending/newtypes.html
Terry J. Reedy tjre...@udel.edu added the comment: I agree that the sentence is a bit confusing and the 'object method' ambiguous. I suspect that the sentence was written years ago. In current Python, [].append is a bound method of class 'builtin_function_or_method'. I *suspect* that the intended contrast, and certainly the important one, is that between C functions, which get added to PyTypeObject structures, and their Python object wrappers that are visible from Python, but which must not be put in the type structure. The varieties of wrappers are irrelevant in this context and for the purpose of avoiding that mistake. So I would rewrite the sentence as: These C functions are called “type methods” to distinguish them from Python wrapper objects, such as ``list.append`` or ``[].append``, visible in Python code. Looking further down, Now if you go and look up the definition of PyTypeObject in object.h you’ll see that it has many more fields that the definition above., needs 'that' changed to 'than' and I would insert following tp_doc after 'fields'. -- nosy: +terry.reedy ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12672 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Terry J. Reedy tjre...@udel.edu added the comment: Python's narrow builds are, in a sense, 'between' UCS-2 and UTF-16. They support non-BMP chars but only partially, because, BY DESIGN*, indexing and len are by code units, not codepoints. They are documented as being UCS-2 because that is what M-A Lemburg, the original designer and writer of Python's unicode type and the unicode-capable re module, wants them to be called. The link to msg142037, which is one of 50+ in the thread (and many or most others disagree), pretty well explains his viewpoint. The positive side is that we deliver more than we promise. The negative side is that by not promising what perhaps we should allows us not to deliver what perhaps we should. *While I think this design decision may have been OK a decade ago for a first implementation of an *optional* text type, I do not think it so for the future for revised implementations of what is now *the* text type. I think narrow builds can and should be revised and upgraded to index, slice, and measure by codepoints. Here is my current idea: If the code unit stream contains any non-BMP characters (i.e., surrogate pairs of 16-bit code units), construct a sequence of *indexes* of such characters (pairs). The fixed length of the string in codepoints is n-k, where n is the number of code units (the current length) and k is the length of the auxiliary sequence and the number of pairs. For indexing, look up the character index in the list of indexes by binary search and increment the codepoint index by the index of the index found to get the corresponding code unit index. (I have omitted the details needed to avoid off-by-1 errors.) This would make indexing O(log(k)) when there are surrogates. If that is really a problem because k is a substantial fraction of a 'large' n, then one should use a wide build. By using a separate internal class, there would be no time or space penalty for all-BMP text. I will work on a prototype in Python. 
PS: The OSCON link in msg142036 currently gives me 404 not found -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12729 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
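[Editor's note: Terry's binary-search scheme can be sketched in a few lines; this is an illustration of the proposal with made-up names, not actual CPython code:]

```python
from bisect import bisect_left

class CodepointIndex:
    """Store UTF-16 code units plus an auxiliary list of surrogate-pair
    positions, so len() and indexing count code points in O(log k)."""
    def __init__(self, units):
        self.units = units  # 16-bit code units
        lead = [i for i, u in enumerate(units) if 0xD800 <= u <= 0xDBFF]
        # code-point index of each surrogate pair (unit index minus pairs before it)
        self.cp_pos = [u - n for n, u in enumerate(lead)]

    def __len__(self):
        return len(self.units) - len(self.cp_pos)  # n - k code points

    def __getitem__(self, i):
        k = bisect_left(self.cp_pos, i)  # pairs strictly before code point i
        j = i + k                        # corresponding code-unit index
        u = self.units[j]
        if 0xD800 <= u <= 0xDBFF:        # recombine a surrogate pair
            return 0x10000 + ((u - 0xD800) << 10) + (self.units[j + 1] - 0xDC00)
        return u

s = CodepointIndex([0x41, 0xD835, 0xDC9C, 0x42])  # "A", U+1D49C, "B"
print(len(s), hex(s[1]))  # 3 0x1d49c
```

All-BMP strings pay nothing here: cp_pos is empty, len() is the unit count, and __getitem__ degenerates to plain indexing, which is the "separate internal class" point above.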
[issue12738] Bug in multiprocessing.JoinableQueue() implementation on Ubuntu 11.04
Michael Hall michaelhal...@gmail.com added the comment: I tried switching from joining on the work_queue to just joining on the individual child processes, and it seems to work now. Weird. Anyway, it'd be nice to see the JoinableQueue fixed, but it's not pressing any more. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12738 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Matthew Barnett pyt...@mrabarnett.plus.com added the comment: Have a look here: http://98.245.80.27/tcpc/OSCON2011/gbu/index.html -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12729 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue11835] python (x64) ctypes incorrectly pass structures parameter
Vlad Riscutia riscutiav...@gmail.com added the comment: Changing type to behavior as it doesn't crash on 3.3. I believe issue was opened against 2.6 and Santoso changed it to 2.7 and up where there is no crash. Another data point: there is similar fix in current version of libffi here: https://github.com/atgreen/libffi/blob/master/.pc/win64-struct-args/src/x86/ffi.c Since at the moment we are not integrating new libffi, I believe my fix should do (libffi fix is slightly different but I'm matching what we have in callproc.c which is not part of libffi). -- type: crash - behavior ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue11835 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Tom Christiansen tchr...@perl.com added the comment: Terry J. Reedy rep...@bugs.python.org wrote on Mon, 15 Aug 2011 00:26:53 -: PS: The OSCON link in msg142036 currently gives me 404 not found Sorry, I wrote http://training.perl.com/OSCON/index.html but meant http://training.perl.com/OSCON2011/index.html I'll fix it on the server in a short spell. I am trying to keep the document up to date as I learn more, so it isn't precisely the talk I gave in Portland. Python's narrow builds are, in a sense, 'between' UCS-2 and UTF-16. So I'm finding. Perhaps that's why I keep getting confused. I do have a pretty firm notion of what UCS-2 and UTF-16 are, and so I get sometimes self-contradictory results. Can you think of anywhere that Python acts like UCS-2 and not UTF-16? I'm not sure I have found one, although the regex thing might count. Thank you guys for being so helpful and understanding. They support non-BMP chars but only partially, because, BY DESIGN*, indexing and len are by code units, not codepoints. That's what Java did, too, and for the same reason. Because they had a UCS-2 implementation for Unicode 1.1, so when Unicode 2.0 came out and they learned that they would need more than 16 bits, they piggybacked UTF-16 onto the top of it instead of going for UTF-8 or UTF-32, and they're still paying that price, and to my mind, heavily and continually. Do you use Java? It is very like Python in many of its 16-bit character issues. Most of the length and indexing type functions address things by code unit only, not codepoint. But they would never claim to be UCS-2. Oh, I realize why they did it. For one thing, they had bytecode out there that they had to support. For another, they had some pretty low-level APIs that didn't have enough flexibility of abstraction, so old source had to keep working as before, even though this penalized the future. Forever, kinda. While I wish they had done better, and kinda think they could have, it isn't my place to say. 
I wasn't there (well, not paying attention) when this was all happening, because I was so underwhelmed by how annoyingly overhyped it was. A billion dollars of marketing can't be wrong, you know? I know that smart people looked at it, seriously. I just find the cure they devised to be more in the problem set than the solution set. I like how Python works on wide builds, especially with Python3. I was pretty surprised that the symbolic names weren't working right on the earlier version of the 2.6 wide build I tried them on. I now have both wide and narrow builds installed of both 2.7 and 3.2, so that shouldn't happen again. They are documented as being UCS-2 because that is what M-A Lemburg, the original designer and writer of Python's unicode type and the unicode-capable re module, wants them to be called. The link to msg142037, which is one of 50+ in the thread (and many or most others disagree), pretty well explains his viewpoint. Count me as one of those many/most others who disagree. :) The positive side is that we deliver more than we promise. The negative side is that by not promising what perhaps we should allows us not to deliver what perhaps we should. It is always better to deliver more than you say than to deliver less. * While I think this design decision may have been OK a decade ago for a first implementation of an *optional* text type, I do not think it so for the future for revised implementations of what is now *the* text type. I think narrow builds can and should be revised and upgraded to index, slice, and measure by codepoints. Yes, I think so, too. If you look at the growth curve of UTF-8 alone, it has followed a mathematically exponential growth curve in the first decade of this century. I suspect that will turn into an S curve with asymptotic shoulders any time now. I haven't looked at it lately, so maybe it already has. I know that huge corpora I work with at work are all absolutely 100% Unicode now. Thank XML for that. 
Here is my current idea: If the code unit stream contains any non-BMP characters (i.e., surrogate pairs of 16-bit code units), construct a sequence of *indexes* of such pairs. The fixed length of the string in codepoints is n - k, where n is the number of code units (the current length) and k is the length of the auxiliary sequence, i.e. the number of pairs. For indexing, look up the requested codepoint index in the list of pair indexes by binary search, and increment it by the number of pairs found before it to get the corresponding code unit index. (I have omitted the details needed to avoid off-by-1 errors.) This would make indexing O(log(k)) when there are surrogates. If that is really a problem because k is a substantial fraction of a 'large' n, then one should use a wide build. By using a separate internal class, there would be no time or space penalty for all-BMP text. I will work on a prototype in
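The auxiliary-index scheme described above can be made concrete in a few lines. The class and helper below (`CodepointView`, `to_units`) are illustrative assumptions for this sketch, not the prototype later attached to the issue; only the `cpdex` name echoes the thread.

```python
# Sketch of the scheme above: index by codepoint over 16-bit code units,
# using a binary-searched auxiliary array of surrogate-pair positions.
import struct
from bisect import bisect_left

class CodepointView:
    def __init__(self, units):
        self.units = units  # sequence of 16-bit code units (ints)
        # cpdex[j] = codepoint index of the j-th surrogate pair
        lead_at = [i for i, u in enumerate(units) if 0xD800 <= u <= 0xDBFF]
        self.cpdex = [i - j for j, i in enumerate(lead_at)]

    def __len__(self):
        # n code units minus k pairs = number of codepoints
        return len(self.units) - len(self.cpdex)

    def __getitem__(self, cp):
        if not 0 <= cp < len(self):
            raise IndexError(cp)
        # Each pair beginning strictly before this codepoint adds one
        # extra code unit, so shift by the count found via binary search.
        i = cp + bisect_left(self.cpdex, cp)
        u = self.units[i]
        if 0xD800 <= u <= 0xDBFF:  # lead surrogate: decode the pair
            lo = self.units[i + 1]
            return chr(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00))
        return chr(u)

def to_units(s):
    # Helper (hypothetical): a str rendered as a list of UTF-16 code units.
    b = s.encode('utf-16-le')
    return list(struct.unpack('<%dH' % (len(b) // 2), b))
```

With `to_units('A\U0001043cBC')` the view reports length 4 (5 code units minus 1 pair), matching the n - k arithmetic above, and indexing stays O(1) for all-BMP text since `cpdex` is then empty.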
[issue12693] test.support.transient_internet prints to stderr when verbose is false
Brett Cannon br...@python.org added the comment: The line from the source I am talking about is http://hg.python.org/cpython/file/49e9e34da512/Lib/test/support.py#l943. And as for the output:

./python.exe -m test -uall test_ssl
[1/1] test_ssl
Resource 'ipv6.google.com' is not available
1 test OK.

-- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12693 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12750] datetime.datetime timezone problems
New submission from Daniel O'Connor dar...@dons.net.au: It isn't possible to add a timezone to a naive datetime object, which means that if you are getting them from some place you can't directly control, there is no way to set the TZ. E.g. pywws' DataStore returns naive datetimes which are in UTC. There is no way to set this, and hence strftime seems to think they are in local time. I can sort of see why you would disallow changing a TZ once set, but it doesn't make sense to prevent this for naive DTs. Also, utcnow() returns a naive DT, whereas it would seem more sensible to return it with a UTC TZ. -- components: Library (Lib) messages: 142095 nosy: Daniel.O'Connor priority: normal severity: normal status: open title: datetime.datetime timezone problems type: feature request versions: Python 2.7 ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12750 ___
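For what it's worth, a zone can be stamped onto a naive datetime with `datetime.replace`; the sketch below uses `datetime.timezone.utc`, which only arrived in Python 3.2 and so is no help on the 2.7 version this report targets.

```python
from datetime import datetime, timezone

naive = datetime(2011, 8, 15, 12, 0, 0)  # e.g. a UTC value from a data store
assert naive.tzinfo is None              # naive: no zone attached

# replace() stamps a zone on without converting the wall-clock fields,
# which is exactly what you want for a value already known to be UTC.
aware = naive.replace(tzinfo=timezone.utc)
assert aware.utcoffset().total_seconds() == 0

# utcnow() still returns a naive value; now(timezone.utc) is aware.
assert datetime.utcnow().tzinfo is None
assert datetime.now(timezone.utc).tzinfo is timezone.utc
```

This also illustrates the report's second point: `utcnow()` stayed naive, and the aware alternative is `datetime.now(timezone.utc)`.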
[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Tom Christiansen tchr...@perl.com added the comment:

I wrote:

> Python's narrow builds are, in a sense, 'between' UCS-2 and UTF-16. So I'm finding. Perhaps that's why I keep getting confused. I do have a pretty firm notion of what UCS-2 and UTF-16 are, and so I sometimes get self-contradictory results. Can you think of anywhere that Python acts like UCS-2 and not UTF-16? I'm not sure I have found one, although the regex thing might count.

I just thought of one. The casemapping functions don't work right on Deseret, which is a non-BMP case-changing script. That's one I submitted as a bug, because I figure if the UTF-8 decoder can decode the non-BMP code points into paired UTF-16 surrogates, then the casing functions had jolly well better be able to deal with it. If the UTF-8 decoder knows it is only going to UCS-2, then it should have raised an exception on my non-BMP source. Since it went to UTF-16, the rest of the language should have behaved accordingly. Java does do this right, BTW, despite its UTF-16ness. --tom -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12729 ___
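The Deseret complaint is easy to check, since Deseret is a bicameral script entirely outside the BMP: DESERET SMALL LETTER LONG I (U+10428) should uppercase to DESERET CAPITAL LETTER LONG I (U+10400). On a modern CPython (3.3+, after PEP 393 removed narrow builds) this sketch passes:

```python
# Deseret casemapping exercises codepoints above U+FFFF, which is
# exactly what narrow builds got wrong.
small = '\U00010428'    # DESERET SMALL LETTER LONG I
capital = '\U00010400'  # DESERET CAPITAL LETTER LONG I

assert small.upper() == capital
assert capital.lower() == small
assert len(small) == 1  # one codepoint, not a surrogate pair, post-PEP-393
```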
[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Terry J. Reedy tjre...@udel.edu added the comment:

> It is always better to deliver more than you say than to deliver less.

Except when promising too little is a copout. Everyone always talks about how important they're sure O(1) access must be; I thought that too, until your challenge. But now that you mention it, indexing is probably not the bottleneck in most document processing. We are optimizing without measuring! We all know that is bad. If done transparently, non-O(1) indexing should only be done when it is *needed*.

And if it is a bottleneck, switch to a wide build -- or get a newer, faster machine. I first used Python 1.3 on a 10 megahertz DOS machine. I just got a multicore 3.+ gigahertz machine. Tradeoffs have changed, and just as we use cycles (and space) for nice graphical interfaces, we should use some for global text support. In the same pair of machines, core memory jumped from 2 megabytes to 24 gigabytes. (And the new machine cost perhaps as much in adjusted dollars.) Of course, better unicode support should come standard with the OS and not have to be re-invented by every language and app.

Having promised to actually 'work on a prototype in Python', I decided to do so before playing. I wrote the following test:

tucs2 = 'A\U0001043cBC\U0001042f\U00010445DE\U00010428H'
tutf16 = UTF16(tucs2)
tlist = ['A', '\U0001043c', 'B', 'C', '\U0001042f', '\U00010445', 'D', 'E', '\U00010428', 'H']
tlis2 = [tutf16[i] for i in range(len(tlist))]
assert tlist == tlis2

and in a couple of hours wrote and debugged the class to make it pass (and added a couple of length tests). See the uploaded file. Adding an __iter__ method to iterate by characters (with hi chars returned as wrapped length-1 surrogate pairs) instead of code units would be trivial. Adding the code to __getitem__ to handle slices should not be too hard. Slices containing hi characters should be wrapped. The cpdex array would make that possible without looking at the whole slice.
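As the remark above suggests, iterating by codepoint over 16-bit code units really is trivial. A minimal sketch (the function name is mine, and this is not the attached utf16.py):

```python
def iter_codepoints(units):
    # Walk 16-bit code units, joining lead/trail surrogate pairs into
    # single codepoints as we go.
    it = iter(units)
    for u in it:
        if 0xD800 <= u <= 0xDBFF:      # lead surrogate: consume its trail
            lo = next(it)
            yield chr(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00))
        else:
            yield chr(u)

units = [0x41, 0xD801, 0xDC3C, 0x42]   # 'A', U+1043C as a pair, 'B'
assert list(iter_codepoints(units)) == ['A', '\U0001043c', 'B']
```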
The same idea could be used to index by graphemes. For European text that uses codepoints for pre-combined (accented) characters as much as possible, the overhead should not be too much. This may not be the best issue to attach this to, but I believe that improving the narrow build would allow fixing of the re/regex problems reported here. -- Added file: http://bugs.python.org/file22900/utf16.py ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12729 ___
[issue12672] Some problems in documentation extending/newtypes.html
Changes by Terry J. Reedy tjre...@udel.edu: -- stage: - patch review ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12672 ___
[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Ezio Melotti ezio.melo...@gmail.com added the comment:

Keep in mind that we should be able to access and use lone surrogates too, therefore:

s = '\ud800'   # should be valid
len(s)         # should this raise an error? (or return 0.5 ;)?
s[0]           # error here too?
list(s)        # here too?
p = s + '\udc00'
len(p)         # 1?
p[0]           # '\U00010000' ?
p[1]           # IndexError?
list(p + 'a')  # ['\ud800\udc00', 'a']?

We can still decide that strings with lone surrogates work only with a limited number of methods/functions, but: 1) it's not backward compatible; 2) it's not very consistent.

Another thing I noticed is that (at least on wide builds) surrogate pairs are not joined on the fly:

>>> p
'\ud800\udc00'
>>> len(p)
2
>>> p.encode('utf-16').decode('utf-16')
'\U00010000'
>>> len(_)
1

-- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12729 ___
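For comparison, on current CPython 3.x these questions come out as follows: lone surrogates are ordinary length-1 strings, adjacent surrogates are not joined on the fly, and it is the codecs that reject them unless told otherwise. A sketch of that observed behavior, not part of the original discussion:

```python
s = '\ud800'      # a lone surrogate is a valid length-1 str
assert len(s) == 1 and s[0] == s and list(s) == ['\ud800']

p = s + '\udc00'  # adjacent surrogates are NOT joined on the fly
assert len(p) == 2
assert list(p + 'a') == ['\ud800', '\udc00', 'a']

# The codecs are where surrogates get rejected...
raised = False
try:
    p.encode('utf-16')
except UnicodeEncodeError:
    raised = True
assert raised

# ...unless passed through explicitly, after which the decoder
# reassembles the pair into a single codepoint.
round_trip = p.encode('utf-16', 'surrogatepass').decode('utf-16')
assert round_trip == '\U00010000' and len(round_trip) == 1
```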