[issue33104] Documentation for EXTENDED_ARG in dis module is incorrect for >=3.6
Eric Appelt <eric.app...@gmail.com> added the comment: Yes, thanks. I failed to see the duplicate searching for open issues, closing. -- resolution: -> duplicate stage: -> resolved status: open -> closed ___ Python tracker <rep...@bugs.python.org> <https://bugs.python.org/issue33104> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue32625] Update the dis module documentation to reflect switch to wordcode
Change by Eric Appelt <eric.app...@gmail.com>: -- nosy: +Eric Appelt ___ Python tracker <rep...@bugs.python.org> <https://bugs.python.org/issue32625> ___
[issue33104] Documentation for EXTENDED_ARG in dis module is incorrect for >=3.6
New submission from Eric Appelt <eric.app...@gmail.com>: The documentation for the EXTENDED_ARG instruction in the dis module documentation refers to the way the opcode worked before 3.6: https://docs.python.org/3.6/library/dis.html#opcode-EXTENDED_ARG As I understand it, since the move to 2-byte wordcode in 3.6, each EXTENDED_ARG effectively adds a byte to the argument of the next instruction, and they can be chained to allow up to a 32-bit argument. The current documentation refers to the 2-byte arguments from the older bytecode used in 3.5 and below. I'm trying to think of a clear and concise wording for how it works now and will add a PR to fix this issue unless someone gets to it before me. -- assignee: docs@python components: Documentation messages: 314100 nosy: Eric Appelt, docs@python priority: normal severity: normal status: open title: Documentation for EXTENDED_ARG in dis module is incorrect for >=3.6 versions: Python 3.6, Python 3.7, Python 3.8 ___ Python tracker <rep...@bugs.python.org> <https://bugs.python.org/issue33104> ___
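[Editor's note: as a sketch of the chaining described above — effective_arg is an illustrative helper, not part of dis or CPython — each EXTENDED_ARG prefix shifts the accumulated argument left by one byte before the final instruction's own byte is ORed in:]

```python
# Hypothetical model of how 3.6+ wordcode combines EXTENDED_ARG
# prefixes with the next instruction's 1-byte argument.
def effective_arg(extended_arg_bytes, instruction_byte):
    arg = 0
    for b in extended_arg_bytes:
        arg = (arg << 8) | b   # each EXTENDED_ARG contributes one byte
    return (arg << 8) | instruction_byte

# Up to three EXTENDED_ARGs allow a full 32-bit argument:
assert effective_arg([0x01, 0x02, 0x03], 0x04) == 0x01020304
```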
[issue26701] Documentation for int constructor mentions __int__ but not __trunc__
Change by Eric Appelt <eric.app...@gmail.com>: -- keywords: +patch pull_requests: +5788 stage: -> patch review ___ Python tracker <rep...@bugs.python.org> <https://bugs.python.org/issue26701> ___
[issue30145] Create a How to or Tutorial documentation for asyncio
Eric Appelt added the comment: > 2b - I propose to take a simple protocol like Memcache or > Redis and simply implement it using the streams API. We'll only > need two methods: set and get; and in the end we'll teach the user > how things really work and how to design async APIs.

I decided to make myself a test subject for this idea using Redis: although I use Redis a lot at work, I've never looked at the underlying Redis protocol. I also haven't used the streams API in a long time, and never for anything other than simple exploration. I found the Redis protocol simple enough to understand and was able to write a simple, somewhat clumsy client the other night with a bit of work: https://gist.github.com/appeltel/09d77eb489494ae1e2703933108cb60a

One thing that might be good about the Redis protocol for this purpose is that the parsing isn't completely trivial, but it isn't overly complex either. For example, a response string can be given in the simple format, which begins with a "+" byte and terminates with "\r\n", or in a binary-safe format where a token of the form "$123\r\n" is sent first, signaling that the next 123 bytes are the result, followed by another "\r\n" termination. There is also the sentinel value "$-1\r\n" which may signal an unset key. Thus the string "foo" might be sent as "+foo\r\n" or "$3\r\nfoo\r\n". I think this can be reasonably described in a few brief paragraphs, and it is much more interesting (to me) than an echo client/server example.

The tutorial should probably also include a link to a rudimentary server implementation of an in-memory key/value store implementing the Redis protocol and supporting just GET and SET, as the reader may not be able to easily stand up Redis for a variety of reasons. I can experiment with memcached as well, but I think this proposal is a good idea and would work well with Redis.
-- ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue30145> ___
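[Editor's note: the two reply formats described in the message above can be sketched in a few lines; parse_reply is an illustrative name and this is not code from the linked gist:]

```python
def parse_reply(data: bytes):
    # Simple string: "+foo\r\n"
    if data.startswith(b"+"):
        return data[1:data.index(b"\r\n")].decode()
    # Binary-safe bulk string: "$3\r\nfoo\r\n"; "$-1\r\n" marks an unset key
    if data.startswith(b"$"):
        header_end = data.index(b"\r\n")
        length = int(data[1:header_end])
        if length == -1:
            return None
        start = header_end + 2
        return data[start:start + length].decode()
    raise ValueError("unsupported reply type")

assert parse_reply(b"+foo\r\n") == "foo"
assert parse_reply(b"$3\r\nfoo\r\n") == "foo"
assert parse_reply(b"$-1\r\n") is None
```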
[issue30145] Create a How to or Tutorial documentation for asyncio
Changes by Eric Appelt <eric.app...@gmail.com>: -- nosy: +Eric Appelt ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue30145> ___
[issue30117] test_lib2to3.test_parser.test_all_project_files() fails
Eric Appelt added the comment: I added a PR to fix these two (in my opinion) spurious failure conditions in the lib2to3.tests.test_parser.TestParserIdempotency test_parser test with the following changes to the test:

1. Use the same encoding found in the initial file to write a temp file for a diff. This retains the BOM if the encoding was initially utf-8-sig.

2. If the file cannot be parsed using the normal grammar, try again with no print statement, which should succeed for valid files using future print_function.

For case (1), the driver was correctly handling a BOM in a utf-8 file, but the test was not writing a comparison file using 'utf-8-sig' to diff against, so the BOM got removed. I don't think that is the fault of the parser, and lib2to3 will retain the BOM.

For case (2), lib2to3 pre-detects the use of "from __future__ import print_function" (or allows the user to force this interpretation with a -p flag) and then selects a different grammar with the print statement removed. That makes the test cases unfair to this test, as the driver itself doesn't know which grammar to use. As a minimal fix, the test will try using a grammar with the print statement, and if that fails, fall back on a grammar without it. A more thorough handling of the idempotency test would be to parse all files using both grammars and ignore a failure of one of the two but otherwise check both. I didn't think this was necessary but can change. -- nosy: +Eric Appelt ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue30117> ___
[issue30117] test_lib2to3.test_parser.test_all_project_files() fails
Changes by Eric Appelt <eric.app...@gmail.com>: -- pull_requests: +1360 ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue30117> ___
[issue30096] Update examples in abc documentation to use abc.ABC
Eric Appelt added the comment: I created a PR to update the documentation to use this pattern and follow Raymond's suggestion of showing both ways to define an abc. In order to make the examples comprehensible when read from beginning to end, I reordered the classes so that abc.ABC is described first. -- nosy: +Eric Appelt ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue30096> ___
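[Editor's note: the "both ways to define an abc" mentioned above can be sketched as follows; the class names are illustrative, not taken from the PR:]

```python
import abc

class Loader(abc.ABC):                 # the newer, simpler spelling
    @abc.abstractmethod
    def load(self):
        ...

class Dumper(metaclass=abc.ABCMeta):   # the equivalent metaclass spelling
    @abc.abstractmethod
    def dump(self):
        ...

# Both refuse instantiation until the abstract methods are overridden:
try:
    Loader()
except TypeError:
    print("Loader is abstract")
```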
[issue30096] Update examples in abc documentation to use abc.ABC
Changes by Eric Appelt <eric.app...@gmail.com>: -- pull_requests: +1343 ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue30096> ___
[issue29514] Add a test case that prevents magic number changes in minor releases
Eric Appelt added the comment: Thank you for the review and suggestions Nick, Serhiy, and all. I believe that I have implemented the suggestions here and on GitHub into a new commit on my pull request, which moves the test into an existing module and removes the notion of a table in favor of a single expected value. -- ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue29514> ___
[issue29026] time.time() documentation should mention UTC timezone
Changes by Eric Appelt <eric.app...@gmail.com>: -- pull_requests: +91 ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue29026> ___
[issue29514] Add a test case that prevents magic number changes in minor releases
Eric Appelt added the comment: I added PR #54 with a test, as Nick describes, for checking magic number changes and explaining to the developer the procedure to follow if a change is required in a maintenance release. -- nosy: +Eric Appelt pull_requests: +49 ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue29514> ___
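[Editor's note: the idea behind such a test can be sketched as below; current_magic is an illustrative helper, and a real regression test would compare against a hard-coded expected value for the release being maintained:]

```python
import importlib.util

# The first two bytes of importlib.util.MAGIC_NUMBER encode the
# bytecode magic number as a little-endian integer.
def current_magic():
    return int.from_bytes(importlib.util.MAGIC_NUMBER[:2], "little")

# A pinned test would assert equality against an EXPECTED_MAGIC constant
# and explain the update procedure in its failure message.
print(current_magic())
```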
[issue29026] time.time() documentation should mention UTC timezone
Eric Appelt added the comment: As we have moved to GitHub and mandatory reviews with Pull Requests, I have created a new patch in PR#34 which incorporates Victor's suggestions. -- ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue29026> ___
[issue29026] time.time() documentation should mention UTC timezone
Eric Appelt added the comment: I had some checks performed on a Windows platform using the following snippet:

    # valid for 2016
    import time

    def check():
        t = time.gmtime()
        print(46*86400*365 + 11*86400 + (t.tm_yday-1)*86400 +
              t.tm_hour*3600 + t.tm_min*60 + t.tm_sec)
        print(time.time())
        print(time.gmtime(0))

    check()

If the first two lines of output are approximately the same (to the nearest second), the time since the epoch is counted excluding leap seconds; the third line confirms that the epoch is the Unix epoch. On Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 07:18:10) [MSC v.1900 32 bit (Intel)] on win32 I see:

    1482502941
    1482502941.9609053
    time.struct_time(tm_year=1970, tm_mon=1, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=1, tm_isdst=0)

Unless there is major variation among Windows versions in how FILETIMEs are calculated and the results of basic system calls, I feel fairly confident now that the calculation of time since the epoch in CPython is the same on Windows as it is on POSIX-compliant platforms. -- ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue29026> ___
[issue29026] time.time() documentation should mention UTC timezone
Eric Appelt added the comment: I looked at the documentation and implementation and made a patch to hopefully clarify the issue. I also have some comments:

- The term "epoch" is actually defined at the top of the page with instructions for how to determine it, i.e. run "gmtime(0)".

- I added a definition for the term "seconds since the epoch" since this typically means the number of seconds that have elapsed since the epoch excluding leap seconds, although it may vary by platform.

- I'm a bit uncomfortable with potential confusion regarding Windows, since the Windows epoch begins in 1601, although we adjust it to 1970: https://github.com/python/cpython/blob/d739274e7b69f63263054cc24684e7e637264350/Python/pytime.c#L536-L539 I didn't add a note in the patch as I am not a Windows developer, but I wonder if there could be some confusion.

- Even though it's redundant with the top of the page, I added specific clarification to time.time(), as it appears that most readers are going to navigate to that anchor specifically and not read the discussion of what we mean by "epoch" above.

- I need to test my assertion that Windows does not count leap seconds. I'm pretty sure, but not 100% confident, that I am correct from what I read, and I need to find someone with a Windows machine tomorrow and actually check it. The Windows documentation for the FILETIME structure suggests that leap seconds might be counted - https://msdn.microsoft.com/en-us/library/windows/desktop/ms724284(v=vs.85).aspx - but I think this is just an oversight in the documentation.

- Just out of personal curiosity, does anyone have an example of a legacy UNIX system that counts leap seconds in the epoch? This seems tricky to do, as I would presume that the OS would need some table updated whenever the astronomers declare a new leap second.
-- keywords: +patch Added file: http://bugs.python.org/file46005/29026_unixtime_v1.patch ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue29026> ___
[issue29026] time.time() documentation should mention UTC timezone
Eric Appelt added the comment: I would be happy to work on a documentation patch for this - I'll try to have something up by tomorrow evening if no one beats me to it. -- nosy: +Eric Appelt ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue29026> ___
[issue28091] Document PEP 525
Eric Appelt added the comment: Thanks for the explanations and example code in the review! Working on this issue is really helping to improve my understanding of a number of tricky things like finalization. I *think* that I have all the comments implemented in the attached patch. I also removed a spurious 2-line fragment at the top of the "Asynchronous generator-iterator methods" section that I had neglected to delete and which had escaped review. -- Added file: http://bugs.python.org/file45903/pep525_pep530_docs_v2.patch ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28091> ___
[issue28091] Document PEP 525
Eric Appelt added the comment: I believe that I addressed all the comments in the previous review (although it's always possible I missed something), and have a new patch with the improvements and fixes. I also noticed that in asyncio, loop.shutdown_asyncgens() is a coroutine method, and fixed the markup to reflect that, as I had previously labeled it a regular method. The example I added here is as suggested, but it differs from the example in PEP 525 in that shutdown_asyncgens() is run in a finally clause, which makes sense to me. When applying the comments to sys.set_asyncgen_hooks() I verified that the arguments are optional but apparently not keyword-only. I couldn't find many good examples of referencing source, so you may want to closely check how I linked to the shutdown_asyncgens() implementation. -- Added file: http://bugs.python.org/file45868/pep525_pep530_docs_v1.patch ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28091> ___
[issue28091] Document PEP 525
Eric Appelt added the comment: Yes - I'll work on the patch tonight. -- ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28091> ___
[issue28091] Document PEP 525
Eric Appelt added the comment: I think this needs considerable checking and polishing, but I am submitting this "work in progress" patch for the PEP 525 and PEP 530 documentation so that I can get some general feedback, since I am new to this, and to ensure that if this is not generally what was needed or expected I don't delay the effort. I tried to start where possible from the wording in the corresponding generator sections, so this may err on the side of redundant phrasing, which I can change. I had some difficulty with the following points, and am not sure if they merit opening other tickets:

1. In PEP 525 the documentation for aclose() is a bit terse and unclear to me. It appeared to suggest that you could catch GeneratorExit and yield, but I found this to result in a RuntimeError just as with a normal generator. I tried to document this as it actually behaves.

2. One thing that I noticed documented about normal generators is that they raise a ValueError if you try to run send() while another send() call is currently running. I verified this using threads. I looked into corresponding behavior for asynchronous generators, calling asend(), running the awaitable halfway through, and then calling asend() again to get a second awaitable before the first one finished. Asyncio seems to prevent more than one awaitable from a single async generator running at the same time, but I couldn't figure out how. Running some coroutines "by hand" calling asend() and send(), I was permitted to run multiple awaitables concurrently, which produced odd results.

I did not attempt to document the behavior in (2) in the submitted patch. -- keywords: +patch Added file: http://bugs.python.org/file45809/pep525_pep530_docs_WIP.patch ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28091> ___
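[Editor's note: point (1) above — that yielding after catching GeneratorExit raises RuntimeError — can be sketched with a regular generator, which the message says behaves the same as the async case:]

```python
def stubborn():
    try:
        yield 1
    except GeneratorExit:
        # Yielding here instead of returning is an error:
        yield 2

g = stubborn()
next(g)           # advance to the first yield
try:
    g.close()     # delivers GeneratorExit at the paused yield
except RuntimeError:
    print("generator ignored GeneratorExit")
```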
[issue28910] Async generator does not raise RuntimeError if finalizer not set
New submission from Eric Appelt: While testing my understanding in order to document PEP 525 asynchronous generators in the Language Reference (issue 28091), I noticed that the implemented behavior deviates from PEP 525. Specifically, the PEP states: "When an asynchronous generator is iterated for the first time, it stores a reference to the current finalizer. If there is none, a RuntimeError is raised. This provides a strong guarantee that every asynchronous generator object will always have a finalizer installed by the correct event loop."

I created an asynchronous generator and called __anext__ interactively, without an event loop, to check the behavior, and was surprised when I didn't get a RuntimeError even though I had not called sys.set_asyncgen_hooks() to set a finalizer function. I looked at the function async_gen_init_hooks defined in Objects/genobject.c, and it appears that if the thread state async_gen_finalizer is NULL, it just returns zero and allows the calling function to continue.

I'm not sure if this is a bug, or if this is intentional and the finalizer mechanism should be optional. If it is the latter, should PEP 525 get updated? -- components: Interpreter Core messages: 282737 nosy: Eric Appelt, yselivanov priority: normal severity: normal status: open title: Async generator does not raise RuntimeError if finalizer not set type: behavior versions: Python 3.6, Python 3.7 ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28910> ___
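[Editor's note: the hooks in question are installed with sys.set_asyncgen_hooks(); a minimal sketch, under the assumption that the firstiter hook fires on the first __anext__ call, of driving an async generator by hand with no event loop:]

```python
import sys

async def agen():
    yield 1

seen = []
old_hooks = sys.get_asyncgen_hooks()
sys.set_asyncgen_hooks(firstiter=seen.append)
try:
    a = agen()
    awaitable = a.__anext__()   # first iteration fires the firstiter hook
    try:
        awaitable.send(None)    # drive the awaitable by hand
    except StopIteration as stop:
        print(stop.value)       # the yielded value
finally:
    sys.set_asyncgen_hooks(*old_hooks)
```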
[issue28091] Document PEP 525
Eric Appelt added the comment: Hi Yury - Yes, I am excited about doing this and getting to work on it. Today was pretty busy but I have some time blocked out tomorrow to really dig in. I was planning to give myself a deadline of Friday afternoon if that works for the review/release schedule. If needed I can try to get it up earlier. -- ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28091> ___
[issue28091] Document PEP 525
Eric Appelt added the comment: Hi, I'm a new contributor trying to get started with documentation and testing. I have been following and testing PEP 525 and PEP 530 and feel familiar enough to try to work on adding this to the Python Language Reference, if it would not be redundant with work already being completed. I think this makes sense to tackle along with issue 28090, as it appears that documentation of asynchronous generator expressions would necessarily refer to terminology from asynchronous generator objects.

Specifically, for PEP 525: async generator functions would be added first in section 3.2 after coroutines. Most of the work would be in expanding 6.2.9 (yield expressions) to discuss the asynchronous case, probably adding 6.2.9.3(4) to document the methods for asynchronous generator objects. In section 8.8.1 I would need to remove the statement "it is a syntax error to use yield expressions in async def coroutines" and add a statement that "yield from" is a syntax error in this context. Finally, the new concepts should be added to the Glossary.

Then for PEP 530: Section 6.2.4 (Displays for Lists,...) needs to be updated for the grammar change that adds the optional ASYNC, i.e.:

    comp_for: [ASYNC] 'for' exprlist 'in' or_test [comp_iter]

Then both sections 6.2.4 and 6.2.8 (Generator expressions) need to be expanded to include the async case.

If writing this up is not already underway, and if I'm generally on the right track for what needs to be done, I think I can have a reasonable patch for this by the end of the week. Also, please feel free to contact me if there is any other documentation/testing/test-coverage work that I could help with. The new native coroutines are an area of particular interest for me. -- nosy: +Eric Appelt ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28091> ___
[issue26163] FAIL: test_hash_effectiveness (test.test_set.TestFrozenSet)
Eric Appelt added the comment: Here are my test results - I'll re-describe the issue and test so people don't have to read through all the previous text, and for completeness.

The basic test case ("letter test"): Consider the set of the first seven letters - {'a', 'b', 'c', 'd', 'e', 'f', 'g'}. If one constructs frozensets of all 128 possible subsets of this set and computes their hashes, in principle they should have good collision statistics. To check, the hashes of all 128 frozensets are computed and the last 7 bits of each hash are compared. The quality of the hashing algorithm is indicated by how many unique values of the last 7 bits are present in the set of 128 hashes: too few unique values and the collision statistics are poor. In the Python testing suite, this particular test is run, and if the number of unique values is 32 or less the test fails. Since the hash of a string depends on the value of PYTHONHASHSEED, which is random by default, and the hashes of frozensets are dependent on their entries, this test is not deterministic. My earlier tests show that this will fail for one out of every few thousand seeds.

The implementation issue and proposed fix: The hash of a frozenset is computed by taking the hash of each element, shuffling the bits in a deterministic way, and then XORing it into a hash for the frozenset (starting with zero). The size of the frozenset is then also shuffled and XORed into that result. Finally, in order to try to remove patterns incurred by XORing similar combinations of elements, as with nested sets, the resulting hash is sent through a simple LCG to arrive at a final value:

    hash = hash * 69069U + 907133923UL;

The hypothesis is that this LCG is not effective in improving collision rates for this particular test case. As an alternative, the proposal is to take the result of the XORs and run it through the hashing algorithm set in the compiled Python executable for hashing bytes (i.e. FNV or SipHash24). This rehashing may do a better job of removing patterns set up by combinations of common elements.

The experiment: To more robustly check the properties of the frozenset hashing algorithm, the letter test is run many times, setting different PYTHONHASHSEED values from 0 to 10,000. This produces a distribution of unique sequences (u) of the last 7 hash bits in the set of 128 frozensets. To compare against the optimal case, the same distribution (u) is produced from a set of pseudorandom integers generated with "random.randint(0, 2**64)", which is found to be a normal distribution with a mean of ~81 and standard deviation of ~3.5. Six different test cases are considered and the results are shown in the attached figure (fig1.png):

- Compile using the FNV algorithm. For control, do not shuffle the XORed result to compute the frozenset hash. (Fig 1-a)
- Compile using the FNV algorithm. For the current implementation, shuffle the XORed result using the LCG to compute the frozenset hash. (Fig 1-b)
- Compile using the FNV algorithm. For the proposed approach, shuffle the XORed result using the FNV algorithm to compute the frozenset hash. (Fig 1-c)
- Compile using the SipHash24 algorithm. For control, do not shuffle the XORed result to compute the frozenset hash. (Fig 1-d)
- Compile using the SipHash24 algorithm. For the current implementation, shuffle the XORed result using the LCG to compute the frozenset hash. (Fig 1-e)
- Compile using the SipHash24 algorithm. For the proposed approach, shuffle the XORed result using the SipHash24 algorithm to compute the frozenset hash. (Fig 1-f)

Results: Using the LCG to shuffle the XORed result of the entry and size hashes to finish computing the frozenset hash did not improve the results of the letter test, and appeared to have no effect on the distribution at all. Rehashing with the configured algorithm to shuffle the XORed result of the entry and size hashes to finish computing the frozenset hash did not improve the results of the letter test, and appeared to have no effect on the distribution at all. The FNV results were odd in that specific outlying values of u were often repeated for different seeds, such as 45 and 104. There was no apparent periodic behavior in these repeated outlying results. -- Added file: http://bugs.python.org/file45306/fig1.png ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue26163> ___
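[Editor's note: the "letter test" described above can be sketched directly in Python; letter_test is an illustrative helper, not the test-suite code, which fails when the count is 32 or less:]

```python
from itertools import combinations

def letter_test(letters="abcdefg"):
    # Hash every subset of the letters as a frozenset and count how
    # many distinct values the last 7 bits of those hashes take.
    last7 = set()
    for r in range(len(letters) + 1):
        for combo in combinations(letters, r):
            last7.add(hash(frozenset(combo)) & 0x7F)
    return len(last7)

print(letter_test())  # ideally close to 81 on average, per the text
```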
[issue26163] FAIL: test_hash_effectiveness (test.test_set.TestFrozenSet)
Eric Appelt added the comment: I made a copy/paste error in the second-to-last paragraph of the previous comment; it should be: "Rehashing with the configured algorithm to shuffle the XORed result of the entry and size hashes to finish computing the frozenset hash resulted in an ideal distribution matching that of the pseudorandom numbers." -- ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue26163> ___
[issue26163] FAIL: test_hash_effectiveness (test.test_set.TestFrozenSet)
Eric Appelt added the comment: If I understand the reporting properly, all tests so far have used SipHash24:

    Python 3.7.0a0 (default:5b33829badcc+, Oct 30 2016, 17:29:47)
    [GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sysconfig
    >>> sysconfig.get_config_var("Py_HASH_ALGORITHM")
    0
    >>> import sys
    >>> sys.hash_info.algorithm
    'siphash24'

It sounds like it is worth it for me to be more rigorous and perform a battery of tests using FNV and then SipHash24 to compare:

- Performing no dispersion after the frozenset hash is initially computed from XORing entry hashes (control)
- Performing dispersion using an LCG after the frozenset hash is initially computed from XORing entry hashes (current approach)
- Performing dispersion using the selected hash algorithm (FNV/SipHash24) after the frozenset hash is initially computed from XORing entry hashes (proposed approach)

I'll take the six plots and merge them into a single PNG, and also post my (short) testing and plotting scripts for reproducibility and checking of the results. I can also write a regression test if you think that would be good to have in the test suite (perhaps skipped by default for time), where instead of using the same seven letters a-g as test strings and varying PYTHONHASHSEED, I could perform the letter test for n=7 with 10,000 different sets of short random strings to see if any fell below threshold. -- ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue26163> ___
[issue26163] FAIL: test_hash_effectiveness (test.test_set.TestFrozenSet)
Eric Appelt added the comment: Ugh... so I think I made a mental error (i.e. confused myself) in the last comment. The result looked a bit "too good to be true" and I think that there was at least one error in my thinking about the problem. I tried testing with the width set to 2 and then 1 as a check, and noticed that even without "widening" the problem was still fixed and the hash distributions matched that of pseudorandom numbers. It turns out that just running the XORed result of the shuffled entry hashes through the hashing algorithm gets rid of any bit patterns picked up through the process. Currently there is an LCG that is used to disperse patterns, but I don't think it really helps the problems arising in these tests:

    hash = hash * 69069U + 907133923UL;

I'm not attaching any more plots as the result can be described in words: when the LCG applied to the hash after XORing all entries is replaced with a hashing of the result using the standard Python hashing algorithm, the test problem goes away. Specifically, when 128 distinct sets are hashed, the distribution of unique values across the last 7 bits of the resulting hashes matches the distribution from pseudorandom numbers. The fix is implemented in a small patch that I have attached. -- keywords: +patch Added file: http://bugs.python.org/file45283/setobject.c.2.patch ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue26163> ___
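[Editor's note: a toy model, in Python with hypothetical names (the real code is C in Objects/setobject.c), of the two finalization strategies being compared — the current LCG versus rehashing the XORed result:]

```python
MASK = (1 << 64) - 1  # model 64-bit unsigned arithmetic

def lcg_finalize(h):
    # Current approach: hash = hash * 69069U + 907133923UL
    return (h * 69069 + 907133923) & MASK

def rehash_finalize(h):
    # Proposed approach: run the XORed result back through the
    # interpreter's own hash algorithm (modeled here as a bytes hash).
    return hash(h.to_bytes(8, "little"))

combined = 0
for entry in ("a", "b", "c"):
    combined ^= hash(entry) & MASK   # order-independent XOR of entry hashes

print(lcg_finalize(combined), rehash_finalize(combined))
```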
[issue26163] FAIL: test_hash_effectiveness (test.test_set.TestFrozenSet)
Eric Appelt added the comment:

I spent some time looking at the Objects/setobject.c code, where the hashing scheme for frozensets essentially works by XOR-ing together the hashes of all the entries. Significant care is taken to shuffle bits at various stages, and the algorithm seems to be well thought out. My own attempts to change constants or add in reshufflings either didn't help the collision statistics or just made things worse.

I think there is something of a fundamental limitation here due to information loss in XOR, and the other fast, commutative bitwise operations (i.e. AND, OR) are well known to be much worse.

I did stumble on a 2006 paper by Boneh and Boyen [1], which looked at a potentially related problem of trying to combine two different hashing functions to improve collision resistance, and found that there was no way to do this with fewer bits than just concatenating the hashes. The present ticket is more a problem of combining hashes from the same function in an order-independent way and may be completely unrelated. However, I imagine that concatenation followed by rehashing the result would remove the loss due to XORing unlucky choices of hashes.

Concatenating everything seemed obviously too slow and/or unacceptable in terms of memory use, so I thought of a compromise: construct an array of 4 hash values, initialized to zero, and for each entry select an array index based on a reshuffling of its bits, then XOR that entry into the chosen index. This results in a hash that is 4x wider than the standard size, reducing the information loss incurred by XOR. The wide hash can then be hashed down to a normal-sized hash.

I implemented this algorithm and tested it using the same procedure as before. Specifically, the frozensets for every possible combination (128) of the letters abcdefg as single-character strings are hashed, and the last 7 bits of each of their hashes are compared to see how many unique 7-bit patterns are produced. This is done for PYTHONHASHSEED values from 1 to 10,000 to build a distribution, which is compared to a similar distribution constructed from pseudorandom numbers. Unlike the current hashing algorithm for frozensets, and much like the result from hashes of strings, the result of this new "wide4" algorithm appears to be nearly ideal. The results are plotted in the attached "frozenset_string_n7_10k_wide4.png".

I will follow this up with a patch for the algorithm as soon as I run the complete test suite and clean up.

Another option, if the current algorithm is considered good enough, is to alter the current test to retry on failure by shifting one letter in the power set of 'abcdefg...' and trying again, perhaps ensuring that 4/5 tests pass. This ought to greatly reduce the sporadic build failures.

[1] http://ai.stanford.edu/~xb/crypto06b/blackboxhash.pdf

--
Added file: http://bugs.python.org/file45281/frozenset_string_n7_10k_wide4.png

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26163>
___
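The "wide4" idea can be sketched roughly like this. This is a hypothetical illustration, not the attached patch: the slot-selection shuffle and the final hash-down step are placeholders chosen for the sketch.

```python
# Rough sketch of the "wide4" scheme described above; the bit shuffle
# and final hash are placeholders, not the ones from the actual patch.
WIDTH = 4                  # number of slots in the "wide" hash
MASK = (1 << 64) - 1

def wide4_hash(entry_hashes):
    # XOR each entry hash into one of WIDTH slots, with the slot index
    # derived from the entry's own bits, then hash the concatenated
    # 4-slot result down to a normal-sized hash.
    slots = [0] * WIDTH
    for h in entry_hashes:
        h &= MASK
        idx = ((h >> 32) ^ (h >> 13) ^ h) & (WIDTH - 1)  # placeholder shuffle
        slots[idx] ^= h
    wide = b''.join(s.to_bytes(8, 'little') for s in slots)
    return hash(wide)
```

Because each entry always lands in the same slot regardless of iteration order, the result is still order-independent, but two entry hashes can only cancel each other via XOR if they also happen to pick the same slot.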
[issue26163] FAIL: test_hash_effectiveness (test.test_set.TestFrozenSet)
Eric Appelt added the comment:

I also looked at hashes of the strings themselves, rather than frozensets, to check the hashing of strings directly. For example, for n=3:

['', 'a', 'b', 'c', 'ab', 'ac', 'bc', 'abc']

rather than:

[frozenset(), frozenset({'a'}), frozenset({'b'}), frozenset({'c'}), frozenset({'b', 'a'}), frozenset({'c', 'a'}), frozenset({'b', 'c'}), frozenset({'b', 'a', 'c'})]

I made a distribution as in the last comment, but now using the number of unique last-7-bit sequences in a set of 128 such strings (n=7), and compared it to pseudorandom integers, just as was done before with the frozensets of letter combinations. This is shown in the attached file "str_string_n7_10k.png". The last 7 bits of the small string hashes produce a distribution much like regular pseudorandom integers.

So if there is a problem with the hash algorithm, it appears to be related to the frozenset hashing and not to strings.

--
Added file: http://bugs.python.org/file45270/str_string_n7_10k.png

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26163>
___
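The experiment above can be reproduced with a short script. The helper names letter_powerset and unique_low7 are invented for this sketch; to vary PYTHONHASHSEED you would launch one interpreter per seed and collect the results.

```python
# Sketch of the measurement described above: count distinct values of the
# last 7 bits of hash() over all 2**7 = 128 letter combinations.
from itertools import combinations

def letter_powerset(letters):
    # All 2**n combinations of the letters as strings: '', 'a', ..., 'abcdefg'.
    return [''.join(c)
            for r in range(len(letters) + 1)
            for c in combinations(letters, r)]

def unique_low7(objs):
    # Number of distinct values of the last 7 bits of hash() over the inputs.
    return len({hash(o) & 127 for o in objs})

strings = letter_powerset('abcdefg')         # 128 strings
fsets = [frozenset(s) for s in strings]      # the corresponding frozensets
```

Histogramming unique_low7(strings) and unique_low7(fsets) across many PYTHONHASHSEED values gives the two distributions compared in the attached plot.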
[issue26163] FAIL: test_hash_effectiveness (test.test_set.TestFrozenSet)
Eric Appelt added the comment:

I dug into this failure a little bit and verified that it is specifically the "letter_range" portion of the test that sporadically fails. The hash of any frozenset constructed from floats, ints, or the empty frozenset, as well as frozensets recursively containing any of the previous, is deterministic and doesn't vary with the seed.

I isolated the letter_range test for various values of n to see how often this failure generally happens. I scanned the first 10,000 integers as values of PYTHONHASHSEED and got the following failures:

n=2   -
n=3   -
n=4   300, 1308, 2453, 4196, 5693, 8280, 8353
n=5   4846, 5693
n=6   3974
n=7   36, 1722, 5064, 8562, 8729
n=8   2889, 5916, 5986
n=9   -
n=10  -

I checked the behavior of pseudorandom integers in the range 0 to 2**64-1 by taking a large sample of values of "len({random.randint(0,2**64) & 127 for _ in range(128)})", and found that if the hashes really are effectively randomly distributed, the value of "u" in the test for n=7 follows a Gaussian distribution with a mean of ~81 and a deviation of ~3.5. A value of 31 would therefore be nearly 14 deviations from the mean, which seems quite unreasonable.

I then took the distribution of the set sizes from the letter_range test for n=7 with 10,000 different seeds and plotted it alongside the distribution of set sizes from the last 7 bits of pseudorandom numbers in the attached file "frozenset_string_n7_10k.png". The hashes of the frozensets of single letters follow a very different distribution.

Either this test is inappropriate and will cause sporadic build failures, or there is a problem with the hash algorithm.

--
nosy: +Eric Appelt
Added file: http://bugs.python.org/file45269/frozenset_string_n7_10k.png

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26163>
___
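The pseudorandom baseline quoted above (mean ~81, deviation ~3.5) can be checked with a quick simulation; the helper name, sample count, and seed here are arbitrary choices for the sketch.

```python
# Simulate u = len({r & 127 for 128 random 64-bit ints}): how many of 128
# buckets get hit when 128 uniformly random values are thrown at them.
import random
import statistics

def baseline_u(samples=1000, seed=12345):
    rng = random.Random(seed)
    vals = [len({rng.randint(0, 2**64 - 1) & 127 for _ in range(128)})
            for _ in range(samples)]
    return statistics.mean(vals), statistics.pstdev(vals)
```

Analytically the mean is 128 * (1 - (127/128)**128), about 80.9, which matches the simulated value; an observed u of 31 is indeed far out in the tail of this distribution.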
[issue28541] Improve test coverage for json library - loading bytes
Eric Appelt added the comment:

I was able to inspect the review and implemented your suggestions; this is attached as a new patch. Please let me know if this is not the correct workflow. Thank you for the prompt review and help as I learn the Python tracking and review process!

--
Added file: http://bugs.python.org/file45255/mywork4.patch

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28541>
___
[issue28541] Improve test coverage for json library - loading bytes
Eric Appelt added the comment:

Thanks for the feedback. I agree that the comment is incorrect for several iterations of the loop that really don't need to be tested at all for '5'. I read the previous issue 17909 more carefully, along with RFC 4627, RFC 7159, and ECMA 404, to properly understand the context here.

I notice that although the issue 17909 patch was written to implement the patterns in RFC 4627, it is also valid for the additional pattern introduced for utf-16-le in RFC 7159, as described in [1]. Specifically, in RFC 4627 the only valid pattern for utf-16-le begins with XX 00 XX 00, since the first two characters must represent ASCII codepoints. With strings, the second character can be a higher codepoint, allowing for XX 00 XX XX or XX 00 00 XX. In the implementation from issue 17909, the pattern XX 00 XX is first detected and results in utf-16-le, and then the 4th byte is checked for the pattern XX 00 00 XX or XX 00 00 00, resulting in utf-16-le or utf-32 respectively. The issue 17909 patch also specifically accounts for the special case of a single-character JSON document (i.e. '5'); this case is not mentioned in [1].

So for everything I can try (or think of), this implementation can correctly determine the encoding of any valid JSON document according to RFC 7159. This is good, since the documentation [2] makes no stipulation that json.loads() only accepts JSON documents in bytes if they adhere to RFC 4627. The encoding failure mode described in issue 17909 is still an invalid JSON document according to RFC 7159, as the control character \u is not escaped. So I think overall the implementation is perfectly valid for RFC 7159 / ECMA 404, as suggested by the standard library documentation [2].

To me it seems reasonable to have a unit test that specifically checks that the pattern XX 00 00 XX works. For the goal of hitting 100% test coverage as measured by line, I believe that the 2-byte case still needs to be tested.

I have attached a patch that covers these cases, along with (maybe too) verbose comments explaining the edge cases. I also changed the comments in the json/__init__.py module itself to clarify the detection patterns as implemented. I'm not sure how helpful this is to the improvement of the standard library, but it has been very educational for me so far.

[1] http://www.ietf.org/mail-archive/web/json/current/msg01959.html
[2] https://docs.python.org/3.7/library/json.html

--
Added file: http://bugs.python.org/file45246/mywork3.patch

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28541>
___
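The byte patterns discussed above can be demonstrated directly with json.loads(), which accepts bytes as of Python 3.6 (the only assumption in this sketch):

```python
import json

# RFC 4627 style: the first two characters are ASCII, giving the
# UTF-16-LE pattern XX 00 XX 00.
b1 = '[1]'.encode('utf-16-le')
assert b1[:4] == b'\x5b\x00\x31\x00'
assert json.loads(b1) == [1]

# RFC 7159 also allows a bare string, whose second character may be
# non-ASCII. U+0100 has a zero low byte, producing the newer pattern
# XX 00 00 XX, which must still be detected as utf-16-le:
b2 = '"\u0100"'.encode('utf-16-le')
assert b2[:4] == b'\x22\x00\x00\x01'
assert json.loads(b2) == '\u0100'
```

If the detector treated XX 00 00 XX as utf-32 (which only matches XX 00 00 00), the second document would fail to decode, which is exactly why that pattern deserves its own unit test.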
[issue28541] Improve test coverage for json library - loading bytes
Eric Appelt added the comment:

I looked back, and something is clearly wrong with my coverage reporting setup, sorry :( When I move the test introduced in issue 17909, 'test_bytes_decode', from the module Lib/test/test_json/test_unicode.py to Lib/test/test_json/test_decode.py, that particular test is properly traced and I see that the function is mostly covered, so I need to figure out what is going wrong in my setup. The test already present is more comprehensive as it includes BOMs, so I believe what I wrote is generally not helpful.

I did notice that the existing test does not cover the edge case of a 2-byte utf-16 sequence, which can be generated for a valid JSON object represented by a single codepoint, for example '5' or '7'. I created a new patch that augments the existing test to cover this special case rather than adding a separate test.

--
Added file: http://bugs.python.org/file45240/mywork2.patch

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28541>
___
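The 2-byte edge case mentioned here is easy to see concretely; this assumes Python 3.6+, where json.loads() accepts bytes:

```python
import json

# A one-character JSON document encoded as UTF-16 is only two bytes long,
# so the encoding detector cannot look at a fourth byte and has to
# special-case the 2-byte input.
doc_le = '5'.encode('utf-16-le')   # b'5\x00' -> pattern XX 00
doc_be = '5'.encode('utf-16-be')   # b'\x005' -> pattern 00 XX
assert doc_le == b'5\x00'
assert json.loads(doc_le) == 5
assert json.loads(doc_be) == 5
```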
[issue28541] Improve test coverage for json library - loading bytes
New submission from Eric Appelt:

Increase test coverage of the json library, specifically the detect_encoding() function in the __init__ module, which is used to handle the autodetection of the encoding of a bytes object passed to json.loads(). This function was added in issue 17909, which extended json.loads() to accept bytes.

Note that this is a small patch, as I am following section 5 of the developer guide and trying to acquaint myself with the process as a first-time contributor. I found this missing coverage just by semi-randomly looking at individual modules of interest. Please let me know if I have made any general mistakes in constructing this ticket. Thanks!

--
components: Tests
files: mywork.patch
keywords: patch
messages: 279521
nosy: Eric Appelt
priority: normal
severity: normal
status: open
title: Improve test coverage for json library - loading bytes
versions: Python 3.6
Added file: http://bugs.python.org/file45234/mywork.patch

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28541>
___
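For context, what the function under test enables: json.loads() autodetecting the encoding of bytes input. This sketch assumes Python 3.6+ (the payload and encoding list are arbitrary examples):

```python
import json

payload = {"name": "test", "values": [1, 2, 3]}
text = json.dumps(payload)

# The same document decodes correctly from each encoding the detector
# recognizes, including the BOM-prefixed utf-16/utf-32 variants.
for enc in ('utf-8', 'utf-16', 'utf-16-le', 'utf-16-be', 'utf-32'):
    assert json.loads(text.encode(enc)) == payload
```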