[issue12819] PEP 393 - Flexible Unicode String Representation
New submission from Torsten Becker torsten.bec...@gmail.com:

I have started an implementation of PEP 393 -- Flexible String Representation [1] on bitbucket [2]. Not all code is ported to use the new API yet, but the interpreter starts with the new unicode representation, all unit tests pass, and some micro benchmarks show potential. Please see the related wiki page [3] for details of my implementation.

[1]: http://www.python.org/dev/peps/pep-0393/
[2]: https://bitbucket.org/t0rsten/pep-393
[3]: http://wiki.python.org/moin/SummerOfCode/2011/PEP393

--
components: Unicode
files: pep-393-aug22.diff
keywords: patch
messages: 142741
nosy: torsten.becker
priority: normal
severity: normal
status: open
title: PEP 393 - Flexible Unicode String Representation
type: feature request
versions: Python 3.3
Added file: http://bugs.python.org/file23004/pep-393-aug22.diff

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12819
___
___
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue11828] startswith and endswith don't accept None as slice index
Torsten Becker torsten.bec...@gmail.com added the comment:

Hi, Jesús, I merged the patch forward into the branches startswith-slices-issue11828-3.2 [1] and startswith-slices-issue11828-3.3 [2] in my hg repository.

[1]: https://bitbucket.org/t0rsten/cpython/changeset/49028581e43a
[2]: https://bitbucket.org/t0rsten/cpython/changeset/eafafe258362
[issue11828] startswith and endswith don't accept None as slice index
Torsten Becker torsten.bec...@gmail.com added the comment:

I pushed my changes to an hg repository; they are in the two branches startswith-slices-issue11828-2.7 and startswith-slices-issue11828-3.1.

--
hgrepos: +21
[issue11828] startswith and endswith don't accept None as slice index
Changes by Torsten Becker torsten.bec...@gmail.com:

Added file: http://bugs.python.org/file21706/2b48fd451c85.diff
[issue11828] startswith and endswith don't accept None as slice index
Changes by Torsten Becker torsten.bec...@gmail.com:

--
hgrepos: +22
[issue11828] startswith and endswith don't accept None as slice index
Changes by Torsten Becker torsten.bec...@gmail.com:

Removed file: http://bugs.python.org/file21706/2b48fd451c85.diff
[issue11783] email parseaddr and formataddr should be IDNA aware
Torsten Becker torsten.bec...@gmail.com added the comment:

Hi, here is my revised patch with email.utils.getaddresses() also decoding IDNs. I decided to integrate IDN decoding in AddrlistClass.getaddress() instead of AddrlistClass.getaddrlist(), since getaddress() is one level lower; if somebody should ever call it directly, the conversion would otherwise not happen.

I also fixed a glitch in the docs: versionchanged seems to need two colons to end up in the generated HTML.

As a follow-up, wouldn't it be helpful if email.Message did the conversions directly? So when you parse a mail into a Message and access the To field, you get a list of tuples which are decoded properly? For example, the following test currently still fails because the quoted header value is decoded neither by email.feedparser.FeedParser nor by email.Message:

    def test_email_decodes_idns_and_unicode(self):
        text = '''\
    To: =?utf-8?b?SMOkbnMgV8O8cnN0?= h...@xn--dm-fka.ain

    Hello World!'''
        msg = Parser().parsestr(text)
        self.assertEqual(utils.getaddresses(msg.get_all('To')),
                         [('H\xe4ns W\xfcrst', 'hans@d\xf6m.ain')])

Am I using the package wrong here or is this actually missing? email.header.decode_header seems to be able to do this already, but it is not used. Would it be safe to integrate this into the email.message._sanitize_header helper?

--
Added file: http://bugs.python.org/file21698/issue-11783-v4.patch
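For reference, a minimal sketch of what email.header.decode_header does with the encoded display name from the test above. It covers only the encoded-word (not the address part), and the variable names are illustrative assumptions, not part of the patch:

```python
from email.header import decode_header, make_header

# Encoded-word copied from the test; it carries the display name only.
raw_name = '=?utf-8?b?SMOkbnMgV8O8cnN0?='

# decode_header() splits the value into (chunk, charset) pairs;
# make_header() reassembles them into a Header whose str() is decoded text.
decoded_name = str(make_header(decode_header(raw_name)))
print(decoded_name)
```

This shows only that the decoding machinery exists in the package; whether it is safe to call it from _sanitize_header is the open question asked above.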
[issue11783] email parseaddr and formataddr should be IDNA aware
Torsten Becker torsten.bec...@gmail.com added the comment:

> (The word "anybody" made me think. But "fix properly" ... i'm sure you
> cannot refer to myself. :))

"fix properly" referred to my inferior implementation, and "anybody" should probably have been worded "Steffen or David". So sure .. go ahead. :)
[issue11828] startswith and endswith don't accept None as slice index
Torsten Becker torsten.bec...@gmail.com added the comment:

> Some comments posted in the review.

I'm not sure if my review reply got mailed, as I did not get a copy and nothing showed up here. I added some responses/follow-up questions in the review.

> Could you possibly post a patch for 2.7 too?

Sure, I'll write the next version against 3.3 and 2.7.
[issue11828] startswith and endswith don't accept None as slice index
Torsten Becker torsten.bec...@gmail.com added the comment:

> I got your comments, Torsten. I find it funny too that the tracker is
> not notified. I wrote new comments too, but not using the right way,
> so now I am the one not sure you got them... :-)

That time I actually got a separate mail. :)

> Better to have a 3.1/2.7 patch. The current workflow requires patching
> the old version first (3.1) and up-porting the change to 3.2 and 3.3.
> So, 2.7 and 3.1 would be more useful. At least if the patch applies to
> 3.2 and 3.3 easily. If major surgery is needed, let me know.

I uploaded an improved v4 patch against 2.7 and 3.1. The patch does not apply cleanly in the 3.2 and 3.3 branches, though. This is mostly because Objects/stringlib/find.h has changed too much and the #define STRINGLIB_IS_UNICODE (3.3, 3.2) is called FROM_UNICODE in 3.1. The other files work fine. It should be no problem to merge this up by hand, though.

> PS: If you use mercurial, try to upload the patch directly from it.
> See the "Remote hg repo" box.

I'm using Mercurial, but unfortunately hg hangs forever when trying to push to bitbucket, so I am just sticking with patches for now.

--
Added file: http://bugs.python.org/file21655/issue-11828-v4-3.1.patch
[issue11828] startswith and endswith don't accept None as slice index
Changes by Torsten Becker torsten.bec...@gmail.com:

Added file: http://bugs.python.org/file21656/issue-11828-v4-2.7.patch
[issue11783] email parseaddr and formataddr should be IDNA aware
Torsten Becker torsten.bec...@gmail.com added the comment:

> OK, so when I went to apply this, I figured out that the patch isn't
> quite right. I've redone the doc updates, and am attaching a version
> of the patch containing them. The issue is that the place that the
> IDNA decode support needs to be added isn't in parseaddr, it's in
> _parseaddr.py's AddrlistClass. Tests are then needed to make sure
> that the IDNA decoding gets done both when parseaddr and
> getaddresslist are used. Do you want to tackle this, Torsten?

I would like to, but I probably will not get to it before Monday. So if anybody wants to work on this before that time, please feel free to fix it properly. :)

Just two questions for the implementation:

1. Would it be fine to move the helper _encode_decode_addr() into _parseaddr.py and then import it in utils.py, so it can be shared between the two?
2. Would line 232 in _parseaddr.py (AddrlistClass.getaddrlist) be a good place to integrate it?
[issue11828] startswith and endswith don't accept None as slice index
Torsten Becker torsten.bec...@gmail.com added the comment:

I just realized that part of my v1 patch did not conform to PEP 7; I hope I fixed that in v2. Please also excuse the wrong name of the error message patch; it was supposed to be named issue-11828-error-msg-tests.patch.

--
Added file: http://bugs.python.org/file21626/issue-11828-v2.patch
[issue11828] startswith and endswith don't accept None as slice index
Changes by Torsten Becker torsten.bec...@gmail.com:

Added file: http://bugs.python.org/file21627/issue-11828-error-msg-tests.patch
[issue11828] startswith and endswith don't accept None as slice index
Changes by Torsten Becker torsten.bec...@gmail.com:

Removed file: http://bugs.python.org/file21623/issue-8282-error-message-tests.patch
[issue11828] startswith and endswith don't accept None as slice index
Torsten Becker torsten.bec...@gmail.com added the comment:

Hi, since nobody stopped me by complaining about the approach or the first patch, I have now fixed this for bytes and bytearray as well. :)

I renamed the old _ParseTupleFinds function to stringlib_parse_tuple_finds, added a parameter for the function name, and another that controls whether it does the unicode conversion. I now use this helper function throughout all 3 files.

I am new to writing C code for Python, so any comments on how to improve the patch are welcome.

--
Added file: http://bugs.python.org/file21629/issue-11828-v3.patch
[issue11783] email parseaddr and formataddr should be IDNA aware
Torsten Becker torsten.bec...@gmail.com added the comment:

> modulo some English wording that I'll fix up when I commit it.

Yeah, sorry for that, I seem to have trouble with writing good documentation. :) I'll have a look at the documents referenced by [1] to improve my writing.

> The issue with the '@' is that it might not be there.

I added a fix and a test for this in v2. However, when reading through the RFC [2] and Wikipedia [3], it seems like this is not actually allowed.

Is there a way to internationalize the local-part as well? That is the only part which is missing now that domain and real name are covered.

[1]: http://docs.python.org/devguide/docquality.html
[2]: http://tools.ietf.org/html/rfc5322#section-3.4
[3]: http://en.wikipedia.org/wiki/Email_address#Invalid_email_addresses

--
Added file: http://bugs.python.org/file21614/issue-11783-v2.patch
[issue11828] startswith and endswith don't accept None as slice index
Torsten Becker torsten.bec...@gmail.com added the comment:

Hi, I started working on a first patch for this. A function _ParseTupleFinds() exists in unicodeobject.c which does the proper parsing for this kind of argument; I adapted it to be usable for startswith() and endswith() besides find() and friends.

In issue-11828-v1.patch I fixed this for startswith() and endswith(). count() suffered from the same behavior, and I updated it there as well.

--
keywords: +patch
nosy: +torsten.becker
Added file: http://bugs.python.org/file21620/issue-11828-v1.patch
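A minimal sketch of the behavior this issue asks for, assuming an interpreter that already contains the fix. None in a start or end position is treated like an omitted slice bound; the variable names are illustrative only:

```python
text = 'hello world'

# None behaves like a missing slice bound for all three methods.
starts = text.startswith('hello', None, None)
ends = text.endswith('world', None)
count = text.count('o', None, None)

# Mixing a concrete start with an open end also works.
starts_from_6 = text.startswith('world', 6, None)

# The same applies to bytes after the follow-up patch.
bytes_starts = b'hello world'.startswith(b'hello', None)

print(starts, ends, count, starts_from_6, bytes_starts)
```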
[issue11828] startswith and endswith don't accept None as slice index
Torsten Becker torsten.bec...@gmail.com added the comment:

While working on this, I discovered another problem. find(), etc. all use the same parsing function (_ParseTupleFinds()). So when an error occurs, the exception message will always start with find() even though index() or rfind() might have caused the error:

    'asd'.index('x', None, None, None)
    TypeError: find() takes at most 3 arguments (4 given)

I attached a patch (issue-8282-error-message-tests.patch) which adds test cases for the wrong error messages.

I was thinking about fixing this as well but wanted to make sure my approach is correct first:

- I would like to add another argument to _ParseTupleFinds(): const char *function_name
- in _ParseTupleFinds(): allocate a buffer of 50 chars on the stack to hold "O|OO:" + function name
- copy "O|OO:" into the buffer
- copy min(strlen(function_name), 44) chars from function_name into the buffer
- use the buffer as the format argument of PyArg_ParseTuple()
- change all calls of _ParseTupleFinds() to include the function name as first argument

Would that approach work with Python's C style, or are there any Python-specific helper functions I could use?

--
Added file: http://bugs.python.org/file21623/issue-8282-error-message-tests.patch
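The symptom can be checked from Python with a small probe. This is only a sketch: arg_error_message is a hypothetical helper, and the exact wording of the message differs between interpreter versions, so the snippet prints it rather than assuming it:

```python
def arg_error_message(method_name):
    """Call a str search method with one argument too many and
    return the resulting TypeError text (hypothetical helper)."""
    try:
        getattr('asd', method_name)('x', None, None, None)
    except TypeError as exc:
        return str(exc)
    return None  # no error raised; not expected for these methods


# On an interpreter without the fix every message would name find();
# with the fix each message is expected to name the method called.
for name in ('find', 'index', 'rfind', 'rindex'):
    print(name, '->', arg_error_message(name))
```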
[issue11783] email parseaddr and formataddr should be IDNA aware
Torsten Becker torsten.bec...@gmail.com added the comment:

> Have a nice weekend!

Thank you for the wishes, I hope yours is going well, too!

I added IDNA awareness to formataddr() and parseaddr(), updated the docs, and wrote 2 tests for it.

I wasn't sure if the IDNA awareness should be optional via an argument or always automatically enabled; I favored the latter. Also, is it safe to split at @ and encode/decode the last component? I am not familiar with all the weird variants an email address can take under a strict reading of the RFCs.

--
keywords: +patch
Added file: http://bugs.python.org/file21595/issue-11783-v1.patch
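A minimal sketch of the split-at-@ idea using Python's built-in idna codec. The helper names encode_domain and decode_domain are hypothetical and not the ones used in the patch; the sketch leaves the local part untouched and passes through addresses that contain no '@':

```python
def encode_domain(address):
    """IDNA-encode the part after the last '@' (hypothetical helper)."""
    local, at, domain = address.rpartition('@')
    if not at:
        # No '@' present: nothing that can be treated as a domain.
        return address
    return local + at + domain.encode('idna').decode('ascii')


def decode_domain(address):
    """Reverse of encode_domain (hypothetical helper)."""
    local, at, domain = address.rpartition('@')
    if not at:
        return address
    return local + at + domain.encode('ascii').decode('idna')


encoded = encode_domain('hans@d\xf6m.ain')
print(encoded)
print(decode_domain(encoded))
```

Splitting at the last '@' rather than the first is a design choice in this sketch: a quoted local part may itself contain '@', while a domain may not.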
[issue11783] email parseaddr and formataddr should be IDNA aware
Torsten Becker torsten.bec...@gmail.com added the comment:

I was about to look into this over the weekend, but of course I don't want to steal your fun, Steffen. :)
[issue1690608] email.utils.formataddr() should be rfc2047 aware
Torsten Becker torsten.bec...@gmail.com added the comment:

Hi David, thank you for polishing up the patch and committing it. :)

I am glad I could help and I was actually about to ask you if you knew any follow-up issues. I'll definitely continue contributing as time allows. I did not submit the agreement yet, but I'll look into that ASAP.
[issue8269] Missing return values for PyUnicode C/API functions
Torsten Becker torsten.bec...@gmail.com added the comment:

Hi, I read through unicodeobject.c and added the (IMO) proper reference counts for the missing functions. I attached a first patch which adds them to Doc/data/refcounts.dat.

The patch also fixes 2 minor glitches in Doc/c-api/unicode.rst: PyUnicode_DecodeMBCSStateful stated int instead of Py_ssize_t for its arguments, and PyUnicode_FromString had its return value wrongly formatted.

--
keywords: +patch
nosy: +torsten.becker
Added file: http://bugs.python.org/file21514/issue-8269-v1.patch
[issue1690608] email.utils.formataddr() should be rfc2047 aware
Torsten Becker torsten.bec...@gmail.com added the comment:

I incorporated that change as well. My rationale behind the previous version was to be consistent with how Lib/email/header.py handled this; unfortunately, I did not look around in the other classes and didn't think about that kind of compatibility.

When formataddr() is called with an object which is not a string and which does not have a header_encode method, it will now raise the following exception:

    AttributeError: 'CharsetMock' object has no attribute 'header_encode'

Thank you for your patience. I am sorry that four iterations of this patch probably took more of your time than implementing it yourself would have.

--
Added file: http://bugs.python.org/file21436/issue-1690608-v4.patch
[issue1690608] email.utils.formataddr() should be rfc2047 aware
Torsten Becker torsten.bec...@gmail.com added the comment:

I implemented a basic test for the issue and an attempt at a fix. I am not entirely sure about my implementation; specifically, I would like to get comments concerning the following points:

- Is it OK that formataddr() will now check if the address is ASCII-safe and, if not, raise a UnicodeEncodeError?
- I was not sure about the style for appending new tests to test_email.py. I just put it into the same spot where all the other formataddr() tests were; shall I put it at the end instead?

I am submitting this patch as part of my preparation for the Google Summer of Code to familiarize myself with the contribution process; any feedback on what I should do differently is very welcome.

--
keywords: +patch
nosy: +torsten.becker
Added file: http://bugs.python.org/file21429/issue-1690608.patch
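A minimal sketch of the ASCII check described in the first point, assuming a Python version whose email.utils.formataddr() contains this change. The sample addresses are made up for illustration:

```python
from email.utils import formataddr

# An all-ASCII pair is formatted as before.
ok = formataddr(('Hans Wurst', 'hans@example.com'))
print(ok)

# A non-ASCII address is rejected rather than silently encoded.
try:
    formataddr(('Hans Wurst', 'hans@d\xf6m.ain'))
    raised = False
except UnicodeEncodeError:
    raised = True
print(raised)
```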
[issue1690608] email.utils.formataddr() should be rfc2047 aware
Torsten Becker torsten.bec...@gmail.com added the comment:

> However, there should be a test for that, and I'm curious to know
> what happens if you use such an address in an address field in the
> unmodified email package.

I added a test to check that the exceptions get thrown when an address is invalid. I also added a small test to check how a resulting message should look. It looks good to me, but I am not a specialist with email. Do you have any other ideas for checking that it does not have a negative impact on other parts of the module?

> Instead of directly calling bencode, you should use the charset
> module and its header_encode method. Note that you need to turn the
> charset into a Charset instance first. The advantage of doing this
> is that it will choose the best encoding to use based on the charset
> and the contents of the string.

The code also uses email.charset.Charset now.

--
Added file: http://bugs.python.org/file21431/issue-1690608-v2.patch
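A minimal sketch of the Charset-based encoding suggested in the review, assuming a Python version where formataddr() accepts a charset argument. The display name and address are illustrative values, not taken from the patch:

```python
from email.charset import Charset
from email.header import decode_header, make_header
from email.utils import formataddr

name = 'H\xe4ns W\xfcrst'

# Charset picks the header encoding (base64 or quoted-printable)
# that is registered for the given character set.
encoded_name = Charset('utf-8').header_encode(name)
print(encoded_name)

# formataddr() applies the same encoding to a non-ASCII display name.
formatted = formataddr((name, 'hans@example.com'), 'utf-8')
print(formatted)

# Decoding the encoded-word gives the original name back.
round_trip = str(make_header(decode_header(encoded_name)))
```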
[issue1690608] email.utils.formataddr() should be rfc2047 aware
Torsten Becker torsten.bec...@gmail.com added the comment:

I incorporated the changes as you suggested and added the text to the docs. Just out of curiosity, why are the docs repeated in email.util.rst when they are already in the docstrings?

--
Added file: http://bugs.python.org/file21434/issue-1690608-v3.patch