[issue33104] Documentation for EXTENDED_ARG in dis module is incorrect for >=3.6

2018-03-19 Thread Eric Appelt

Eric Appelt <eric.app...@gmail.com> added the comment:

Yes, thanks. I failed to see the duplicate when searching for open issues; closing.

--
resolution:  -> duplicate
stage:  -> resolved
status: open -> closed

___
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue33104>
___



[issue32625] Update the dis module documentation to reflect switch to wordcode

2018-03-19 Thread Eric Appelt

Change by Eric Appelt <eric.app...@gmail.com>:


--
nosy: +Eric Appelt

___
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue32625>
___



[issue33104] Documentation for EXTENDED_ARG in dis module is incorrect for >=3.6

2018-03-19 Thread Eric Appelt

New submission from Eric Appelt <eric.app...@gmail.com>:

The documentation for the EXTENDED_ARG instruction in the dis module refers to 
the way the opcode worked before 3.6: 
https://docs.python.org/3.6/library/dis.html#opcode-EXTENDED_ARG

As I understand it, since the move to 2-byte wordcode in 3.6, each EXTENDED_ARG 
effectively adds a byte to the argument of the next instruction, and they can be 
chained to allow up to a 32-bit argument. The current documentation refers to 
the 2-byte arguments used by the older bytecode in 3.5 and below.
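
For illustration, here is a minimal decoder sketch (a helper of my own, not the 
dis module's internals) showing how EXTENDED_ARG prefixes accumulate under 
wordcode:

import dis

def show_wordcode(code):
    # Every instruction is exactly two bytes: opcode, argument byte.
    ext = 0
    for i in range(0, len(code.co_code), 2):
        op, byte = code.co_code[i], code.co_code[i + 1]
        arg = ext | byte
        if op == dis.EXTENDED_ARG:
            # Each prefix shifts the accumulated argument left by one byte;
            # up to three prefixes allow a full 32-bit argument.
            ext = arg << 8
            continue
        ext = 0
        print(dis.opname[op], arg if op >= dis.HAVE_ARGUMENT else '')

show_wordcode(compile("x = 1", "<s>", "exec"))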

I'm trying to think of a clear and concise wording for how it works now and 
will add a PR to fix this issue unless someone gets to it before me.

--
assignee: docs@python
components: Documentation
messages: 314100
nosy: Eric Appelt, docs@python
priority: normal
severity: normal
status: open
title: Documentation for EXTENDED_ARG in dis module is incorrect for >=3.6
versions: Python 3.6, Python 3.7, Python 3.8

___
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue33104>
___



[issue26701] Documentation for int constructor mentions __int__ but not __trunc__

2018-03-07 Thread Eric Appelt

Change by Eric Appelt <eric.app...@gmail.com>:


--
keywords: +patch
pull_requests: +5788
stage:  -> patch review

___
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue26701>
___



[issue30145] Create a How to or Tutorial documentation for asyncio

2017-05-31 Thread Eric Appelt

Eric Appelt added the comment:

> 2b - I propose to take a simple protocol like Memcache or 
> Redis and simply implement it using the streams API.  We’ll only
> need two methods: set and get; and in the end we’ll teach the user
> how things really work and how to design async APIs.

I decided to make myself a test subject for this idea using Redis since, 
although I use Redis a lot at work, I've never looked at the underlying 
protocol. I also haven't used the streams API in a long time, and never for 
anything other than simple exploration.

I worked on this the other night and found the Redis protocol simple enough to 
understand; with a bit of work I was able to write a simple and somewhat 
clumsy client: 
https://gist.github.com/appeltel/09d77eb489494ae1e2703933108cb60a

One thing that might be good about the Redis protocol for this purpose is that 
the parsing isn't completely trivial but it isn't overly complex either. For 
example, a response string can be given in either the simple format, which 
begins with a "+" byte and then terminates with "\r\n", or it can be in a 
binary safe format where a token of the form "$123\r\n" is sent first, 
signaling that the next 123 bytes are the result to be followed by another 
"\r\n" termination. There is also the sentinel value "$-1\r\n" which may signal 
an unset key. Thus the string "foo" might be sent as "+foo\r\n" or 
"$3\r\nfoo\r\n".

So I think this can be reasonably described in a few brief paragraphs and it is 
much more interesting (to me) than an echo client/server example.

The tutorial should probably also include a link to a rudimentary server 
implementation of an in-memory key/value store implementing the Redis protocol 
and supporting just GET and SET, as the reader may not be able to easily stand 
up Redis, whether for practical reasons or simple lack of familiarity.

I can experiment with memcached as well but I think this proposal is a good 
idea and would work well with Redis.

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30145>
___



[issue30145] Create a How to or Tutorial documentation for asyncio

2017-05-26 Thread Eric Appelt

Changes by Eric Appelt <eric.app...@gmail.com>:


--
nosy: +Eric Appelt

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30145>
___



[issue30117] test_lib2to3.test_parser.test_all_project_files() fails

2017-04-21 Thread Eric Appelt

Eric Appelt added the comment:

I added a PR to fix these two (in my opinion) spurious failure conditions in 
the lib2to3.tests.test_parser.TestParserIdempotency test with the following 
changes:

1. Use the same encoding found in the initial file to write a temp file for a 
diff. This retains the BOM if the encoding was initially utf-8-sig.

2. If the file cannot be parsed using the normal grammar, try again with the 
grammar that has no print statement, which should succeed for valid files that 
use the future print_function import.

For case (1), the driver was correctly handling a BOM in a utf-8 file, but then 
the test was not writing a comparison file using 'utf-8-sig' to diff against, 
so the BOM got removed. I don't think that is the fault of the parser, and 
lib2to3 will retain the BOM.
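
A sketch of the idea behind fix (1), assuming lib2to3's tokenizer reports 
'utf-8-sig' for BOM-prefixed files the way the stdlib tokenize module does:

from lib2to3.pgen2 import tokenize

def write_comparison_file(src_path, tmp_path, text):
    # Detect the original encoding; writing the temp file with the same
    # codec re-emits the BOM, so the diff no longer shows a spurious change.
    with open(src_path, 'rb') as f:
        encoding, _ = tokenize.detect_encoding(f.readline)
    with open(tmp_path, 'w', encoding=encoding) as f:
        f.write(text)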

For case (2), lib2to3 pre-detects the use of from __future__ import 
print_function, or allows the user to force this interpretation with a -p flag, 
and then selects a different grammar with the print statement removed. That 
makes the test cases unfair to this test, as the driver itself doesn't know 
which grammar to use. As a minimal fix, the test will try using a grammar with 
the print statement, and if that fails fall back on a grammar without it (see 
the sketch below). A more thorough handling of the idempotency test would be to 
parse all files using both grammars and ignore if one of the two failed but 
otherwise check both. I didn't think this was necessary but can change.
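
The minimal fallback looks roughly like this (a sketch, not the exact code in 
the PR):

from lib2to3 import pygram
from lib2to3.pgen2 import driver
from lib2to3.pgen2.parse import ParseError

def parse_source(source):
    # First try the default grammar, in which print is a statement...
    try:
        d = driver.Driver(pygram.python_grammar, convert=None)
        return d.parse_string(source)
    except ParseError:
        # ...then fall back on the grammar lib2to3 selects when it detects
        # "from __future__ import print_function" (or is passed -p).
        d = driver.Driver(pygram.python_grammar_no_print_statement,
                          convert=None)
        return d.parse_string(source)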

--
nosy: +Eric Appelt

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30117>
___



[issue30117] test_lib2to3.test_parser.test_all_project_files() fails

2017-04-21 Thread Eric Appelt

Changes by Eric Appelt <eric.app...@gmail.com>:


--
pull_requests: +1360

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30117>
___



[issue30096] Update examples in abc documentation to use abc.ABC

2017-04-20 Thread Eric Appelt

Eric Appelt added the comment:

I created a PR to update the documentation to use this pattern and follow 
Raymond's suggestion of showing both ways to define an abc.

In order to make the examples comprehensible when read from beginning to end, I 
reordered the classes so that abc.ABC is described first.
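
For reference, the two spellings in question (class names here are 
illustrative, not taken from the PR):

from abc import ABC, ABCMeta, abstractmethod

class Fetcher(ABC):                  # the simpler abc.ABC form, shown first
    @abstractmethod
    def fetch(self): ...

class Storer(metaclass=ABCMeta):     # the equivalent explicit-metaclass form
    @abstractmethod
    def store(self): ...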

--
nosy: +Eric Appelt

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30096>
___



[issue30096] Update examples in abc documentation to use abc.ABC

2017-04-20 Thread Eric Appelt

Changes by Eric Appelt <eric.app...@gmail.com>:


--
pull_requests: +1343

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30096>
___



[issue29514] Add a test case that prevents magic number changes in minor releases

2017-03-11 Thread Eric Appelt

Eric Appelt added the comment:

Thank you for the review and suggestions, Nick, Serhiy, and all. I believe 
that I have implemented the suggestions from here and from GitHub in a new 
commit on my pull request, which moves the test into an existing module and 
removes the notion of a table in favor of a single expected value.
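
The shape of the resulting check is roughly the following (the pinned value 
below is hypothetical; the real test hardcodes whatever the branch shipped 
with):

import importlib.util

EXPECTED = 3390   # hypothetical pinned value for the release
actual = int.from_bytes(importlib.util.MAGIC_NUMBER[:2], 'little')
assert actual == EXPECTED, (
    'Bytecode magic number changed; this must not happen in a maintenance '
    'release - see the procedure described in the test.')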

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue29514>
___



[issue29026] time.time() documentation should mention UTC timezone

2017-02-16 Thread Eric Appelt

Changes by Eric Appelt <eric.app...@gmail.com>:


--
pull_requests: +91

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue29026>
___



[issue29514] Add a test case that prevents magic number changes in minor releases

2017-02-12 Thread Eric Appelt

Eric Appelt added the comment:

I added PR #54 with a test, as Nick describes, that checks for magic number 
changes and explains to the developer the procedure to follow if a change is 
required in a maintenance release.

--
nosy: +Eric Appelt
pull_requests: +49

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue29514>
___



[issue29026] time.time() documentation should mention UTC timezone

2017-02-11 Thread Eric Appelt

Eric Appelt added the comment:

Now that we have moved to GitHub and mandatory review via pull requests, I 
have created a new patch in PR#34 which incorporates Victor's suggestions.

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue29026>
___



[issue29026] time.time() documentation should mention UTC timezone

2016-12-23 Thread Eric Appelt

Eric Appelt added the comment:

I had some checks performed on a Windows platform using the following snippet:

# valid for 2016
import time

def check():
    t = time.gmtime()
    # 46 years of 365 days plus 11 leap days since 1970, then the elapsed
    # part of the current year
    print(46*86400*365 + 11*86400 + (t.tm_yday - 1)*86400
          + t.tm_hour*3600 + t.tm_min*60 + t.tm_sec)
    print(time.time())
    print(time.gmtime(0))

check()

This ensures that the time since the epoch is counted excluding leap seconds 
if the first two lines of output are approximately the same (to the nearest 
second), and finally that the epoch is the Unix epoch.

On Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 07:18:10) [MSC v.1900 32 bit 
(Intel)] on win32 I see:

1482502941
1482502941.9609053
time.struct_time(tm_year=1970, tm_mon=1, tm_mday=1, tm_hour=0, tm_min=0, 
tm_sec=0, tm_wday=3, tm_yday=1, tm_isdst=0)

Unless there is major variation among Windows versions in how FILETIMEs are 
calculated and in the results of basic system calls, I feel fairly confident 
now that the calculation of time since the epoch in CPython is the same on 
Windows as it is on POSIX-compliant platforms.

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue29026>
___



[issue29026] time.time() documentation should mention UTC timezone

2016-12-22 Thread Eric Appelt

Eric Appelt added the comment:

I looked at the documentation and implementation and made a patch to hopefully 
clarify the issue. I also have some comments:

- The term "epoch" is actually defined at the top of the page with instructions 
for how to determine it, i.e. run "gmtime(0)".

- I added a definition for the term "seconds since the epoch" since this 
typically means the number of seconds that have elapsed since the epoch 
excluding leap seconds, although it may vary by platform.

- I'm a bit uncomfortable with potential confusion regarding Windows, since 
the Windows epoch begins in 1601, although we adjust it to 1970: 
https://github.com/python/cpython/blob/d739274e7b69f63263054cc24684e7e637264350/Python/pytime.c#L536-L539
I didn't add a note in the patch as I am not a Windows developer, but I wonder 
if there could be some confusion.

- Even though it's redundant with the top of the page, I added specific 
clarification to time.time(), as it appears that most readers are going to 
navigate to that anchor directly rather than read the discussion of what we 
mean by "epoch" above.

- I need to test my assertion that Windows does not count leap seconds. I'm 
pretty sure but not 100% confident that I am correct from what I read, and I 
need to find someone with a Windows machine tomorrow to actually check it. The 
Windows documentation for the FILETIME structure suggests that leap seconds 
might be counted - 
https://msdn.microsoft.com/en-us/library/windows/desktop/ms724284(v=vs.85).aspx 
- but I think this is just an oversight in the documentation.

- Just out of personal curiosity, does anyone have an example of a legacy UNIX 
system that counts leap seconds in the epoch? This seems tricky to do, as I 
would presume the OS would need some table that is updated whenever the 
astronomers declare a new leap second.

--
keywords: +patch
Added file: http://bugs.python.org/file46005/29026_unixtime_v1.patch

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue29026>
___



[issue29026] time.time() documentation should mention UTC timezone

2016-12-22 Thread Eric Appelt

Eric Appelt added the comment:

I would be happy to work on a documentation patch for this - I'll try to have 
something up by tomorrow evening if no one beats me to it.

--
nosy: +Eric Appelt

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue29026>
___



[issue28091] Document PEP 525

2016-12-14 Thread Eric Appelt

Eric Appelt added the comment:

Thanks for the explanations and example code in the review! Working on this 
issue is really helping to improve my understanding of a number of tricky 
things like finalization.

I *think* that I have all the comments implemented in the attached patch.

I also removed a spurious 2-line fragment at the top of the Asynchronous 
generator-iterator methods section that I had neglected to delete and that 
escaped review.

--
Added file: http://bugs.python.org/file45903/pep525_pep530_docs_v2.patch

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28091>
___



[issue28091] Document PEP 525

2016-12-12 Thread Eric Appelt

Eric Appelt added the comment:

I believe that I addressed all the comments in the previous review (although 
it's always possible I missed something), and have a new patch with the 
improvements and fixes.

I also noticed that in asyncio, loop.shutdown_asyncgens() is a coroutine 
method, and fixed the markup to reflect that, as I had previously labeled it a 
plain method. The example I added here is as suggested, but it differs from the 
example in PEP 525 in that shutdown_asyncgens() is run in a finally clause, 
which makes sense to me.

When applying the comments to sys.set_asyncgen_hooks() I verified that the 
keywords are optional, but the arguments are apparently not keyword-only. I 
couldn't find many good examples of referencing source, so you may want to 
closely check how I linked to the shutdown_asyncgens() implementation.

--
Added file: http://bugs.python.org/file45868/pep525_pep530_docs_v1.patch

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28091>
___



[issue28091] Document PEP 525

2016-12-12 Thread Eric Appelt

Eric Appelt added the comment:

Yes - I'll work on the patch tonight.

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28091>
___



[issue28091] Document PEP 525

2016-12-08 Thread Eric Appelt

Eric Appelt added the comment:

I think this needs considerable checking and polishing, but I am submitting 
this "Work In Progress" patch for the PEP 525 and PEP 530 documentation so that 
I can get some general feedback, since I am new to this, and to ensure that I 
don't delay the effort if this is not generally what was needed or expected.

I tried to start where possible by using the wording in corresponding generator 
sections, so this may err on the side of redundant phrasing, which I can change.

I had some difficulty with the following points, and am not sure if they merit 
opening other tickets:

1. In PEP 525 the documentation for aclose() is a bit terse and unclear to me. 
It appeared to suggest that you could catch GeneratorExit and yield, but I 
found this to result in a RuntimeError, just as with a normal generator (see 
the snippet after this list). I tried to document this as it actually behaves.

2. One thing that I noticed documented about normal generators is that they 
raise a ValueError if you try to run send() while another send() call is 
currently running. I verified this using threads. I looked into corresponding 
behavior for asynchronous generators, calling asend(), running the awaitable 
halfway through, and then calling asend() again to get a second awaitable 
before the first one finished. Asyncio seems to prevent more than one awaitable 
from a single async generator running at the same time, but I couldn't figure 
out how. Running some coroutines "by hand" calling asend() and send(), I was 
permitted to run multiple awaitables concurrently which produced odd results.

I did not attempt to document the behavior in (2) in the submitted patch.
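
The snippet I used to check the behavior in (1), a minimal example of my own 
rather than anything from the patch:

import asyncio

async def agen():
    try:
        yield 1
    except GeneratorExit:
        yield 2   # try to yield instead of finishing cleanup

async def main():
    g = agen()
    print(await g.asend(None))   # 1
    try:
        await g.aclose()
    except RuntimeError as e:
        print(e)                 # the GeneratorExit was "ignored"

asyncio.get_event_loop().run_until_complete(main())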

--
keywords: +patch
Added file: http://bugs.python.org/file45809/pep525_pep530_docs_WIP.patch

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28091>
___



[issue28910] Async generator does not raise RuntimeError if finalizer not set

2016-12-08 Thread Eric Appelt

New submission from Eric Appelt:

While testing my understanding in order to document PEP 525 asynchronous 
generators in the Language Reference (issue 28091), I noticed that the 
implemented behavior deviates from PEP 525. Specifically, the PEP states that:

"When an asynchronous generator is iterated for the first time, it stores a 
reference to the current finalizer. If there is none, a RuntimeError is raised. 
This provides a strong guarantee that every asynchronous generator object will 
always have a finalizer installed by the correct event loop."

I created an asynchronous generator and tried calling __anext__ interactively, 
without an event loop, to check the behavior, and was surprised when I didn't 
get a RuntimeError even though I had not called sys.set_asyncgen_hooks() to set 
a finalizer function.
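
A minimal reproduction, driving the generator by hand with no event loop 
running and no hooks set:

async def agen():
    yield 1

g = agen()
try:
    g.__anext__().send(None)   # PEP 525 says this should raise RuntimeError
except StopIteration as exc:
    print(exc.value)           # instead iteration proceeds and yields 1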

I looked at the function async_gen_init_hooks defined in Objects/genobject.c 
and it appears that if the thread state async_gen_finalizer is NULL, then it 
just returns zero and allows the calling function to continue.

I'm not sure if this is a bug, or if this is intentional and the finalizer 
mechanism is meant to be optional. If it is the latter, should PEP 525 get 
updated?

--
components: Interpreter Core
messages: 282737
nosy: Eric Appelt, yselivanov
priority: normal
severity: normal
status: open
title: Async generator does not raise RuntimeError if finalizer not set
type: behavior
versions: Python 3.6, Python 3.7

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28910>
___



[issue28091] Document PEP 525

2016-12-06 Thread Eric Appelt

Eric Appelt added the comment:

Hi Yury - Yes, I am excited about doing this and getting to work on it. Today 
was pretty busy but I have some time blocked out tomorrow to really dig in. I 
was planning to give myself a deadline of Friday afternoon if that works for 
the review/release schedule. If needed I can try to get it up earlier.

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28091>
___



[issue28091] Document PEP 525

2016-12-05 Thread Eric Appelt

Eric Appelt added the comment:

Hi, I'm a new contributor trying to get started with documentation and testing.

I have been following and testing PEP525 and PEP530 and feel familiar enough to 
try to work on adding this to the Python Language Reference if it would not be 
redundant with work already being completed. I think this makes sense to tackle 
along with issue 28090 as it appears that documentation of asynchronous 
generator expressions would necessarily refer to terminology from asynchronous 
generator objects. 

Specifically, for PEP525:

async generator functions would be added first in section 3.2 after coroutines.

Most of the work would be in expanding 6.2.9 (yield expressions) to discuss the 
asynchronous case, probably adding 6.2.9.3(4) to document the methods for 
asynchronous generator objects.

In section 8.8.1 I would need to remove the statement “it is a syntax error to 
use yield expressions in async def coroutines”, and add a statement that “yield 
from” is a syntax error in this context.

Finally, the new concepts should be added to the Glossary.

Then for PEP530:

Section 6.2.4 (Displays for Lists,...) needs to be updated for the grammar 
change that adds the optional ASYNC, i.e.:

comp_for: [ASYNC] 'for' exprlist 'in' or_test [comp_iter]

Then both sections 6.2.4 and 6.2.8 (Generator expressions) need to be expanded 
to include the async case.

If writing this up is not already underway, and if I’m generally on the right 
track for what needs to be done, I think I can have a reasonable patch for this 
by the end of the week. Also, please feel free to contact me if there is any 
other documentation/testing/test coverage work that I could help with. The new 
native coroutines are an area of particular interest for me.

--
nosy: +Eric Appelt

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28091>
___



[issue26163] FAIL: test_hash_effectiveness (test.test_set.TestFrozenSet)

2016-11-01 Thread Eric Appelt

Eric Appelt added the comment:

Here are my test results. For completeness, I'll re-describe the issue and the 
test so people don't have to read through all the previous text.
-

The basic test case ("letter test"):

Consider the set of the first seven letters - {'a', 'b', 'c', 'd', 'e', 'f', 
'g'}. If one constructs frozensets of all 128 possible subsets of this set, and 
computes their hashes, in principle they should have good collision statistics.

To check, the hashes of all 128 frozensets are computed and the last 7 bits of 
each hash are compared. The quality of the hashing algorithm is indicated by 
how many unique values of the last 7 bits are present in the set of 128 hashes. 
Too few unique values and the collision statistics are poor.

In the Python test suite, this particular test is run, and if the number of 
unique values is 32 or less the test fails. Since the hash of a string depends 
on the value of PYTHONHASHSEED, which is random by default, and the hashes of 
frozensets are dependent on their entries, this test is not deterministic. My 
earlier tests show that it will fail for one out of every few thousand seeds.
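
In code, the check is essentially the following (a condensed sketch, not the 
exact test from the suite):

from itertools import combinations

letters = 'abcdefg'
subsets = [frozenset(c) for n in range(len(letters) + 1)
           for c in combinations(letters, n)]
u = len({hash(s) & 127 for s in subsets})   # unique low-7-bit values
print(u)   # the suite fails when u <= 32; random hashes give ~81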

--

The implementation issue and proposed fix:

The hash of a frozen set is computed by taking the hash of each element, 
shuffling the bits in a deterministic way, and then XORing it into a hash for 
the frozenset (starting with zero). The size of the frozenset is then also 
shuffled and XORed into that result.

Finally, in order to try to remove patterns incurred by XORing similar 
combinations of elements as with nested sets, the resulting hash is sent 
through a simple LCG to arrive at a final value:

hash = hash * 69069U + 907133923UL;

The hypothesis is that this LCG is not effective in improving collision rates 
for this particular test case. As an alternative, the proposal is to take the 
result of the XORs and run it through the algorithm the compiled Python 
executable uses for hashing bytes (i.e. FNV or SipHash24). This rehashing may 
do a better job of removing patterns set up by combinations of common elements.

-

The experiment:

To more robustly check the properties of the frozenset hashing algorithm, the 
letter test is run many times, setting different PYTHONHASHSEED values from 0 
to 10,000. This produces a distribution of the number of unique low-7-bit 
sequences (u) in the set of 128 frozensets.

To compare against the optimal case, the same distribution of u is produced 
from sets of pseudorandom integers generated with "random.randint(0, 2**64)", 
which is found to be a normal distribution with a mean of ~81 and a standard 
deviation of ~3.5.

Six different test cases are considered and the results are shown in the 
attached figure (fig1.png):

- Compile using the FNV algorithm. For control, do not shuffle the XORed result 
to compute the frozenset hash. (Fig 1-a)
- Compile using the FNV algorithm. For the current implementation, shuffle the 
XORed result using the LCG to compute the frozenset hash. (Fig 1-b)
- Compile using the FNV algorithm. For the proposed approach, shuffle the XORed 
result using the FNV algorithm to compute the frozenset hash. (Fig 1-c)
- Compile using the SipHash24 algorithm. For control, do not shuffle the XORed 
result to compute the frozenset hash. (Fig 1-d)
- Compile using the SipHash24 algorithm. For the current implementation, 
shuffle the XORed result using the LCG to compute the frozenset hash. (Fig 1-e)
- Compile using the SipHash24 algorithm. For the proposed approach, shuffle the 
XORed result using the SipHash24 algorithm to compute the frozenset hash. 
(Fig 1-f)



Results:

Using the LCG to shuffle the XORed result of the entry and size hashes to 
finish computing the frozenset hash did not improve the results of the letter 
test, and appeared to have no effect on the distribution at all.

Rehashing with the configured algorithm to shuffle the XORed result of the 
entry and size hashes to finish computing the frozenset hash did not improve 
the results of the letter test, and appeared to have no effect on the 
distribution at all.

The FNV results were odd in that specific outlying values of u were often 
repeated for different seeds, such as 45 and 104. There was no apparent 
periodic behavior in these repeated outlying results.

--
Added file: http://bugs.python.org/file45306/fig1.png

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26163>
___



[issue26163] FAIL: test_hash_effectiveness (test.test_set.TestFrozenSet)

2016-11-01 Thread Eric Appelt

Eric Appelt added the comment:

I made a copy/paste error in the second-to-last paragraph of the previous 
comment; it should be:

Rehashing with the configured algorithm to shuffle the XORed result of the 
entry and size hashes to finish computing the frozenset hash resulted in an 
ideal distribution matching that of the pseudorandom numbers.

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26163>
___



[issue26163] FAIL: test_hash_effectiveness (test.test_set.TestFrozenSet)

2016-10-31 Thread Eric Appelt

Eric Appelt added the comment:

If I understand the reporting properly, all tests so far have used SipHash24:

Python 3.7.0a0 (default:5b33829badcc+, Oct 30 2016, 17:29:47) 
[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sysconfig
>>> sysconfig.get_config_var("Py_HASH_ALGORITHM")
0
>>> import sys
>>> sys.hash_info.algorithm
'siphash24'

It sounds like it is worth it for me to be more rigorous and perform a battery 
of tests using FNV and then SipHash24 to compare:

- Performing no dispersion after the frozenset hash is initially computed from 
XORing entry hashes (control)
- Performing dispersion using an LCG after the frozenset hash is initially 
computed from XORing entry hashes (current approach)
- Performing dispersion using the selected hash algorithm (FNV/SipHash24) after 
the frozenset hash is initially computed from XORing entry hashes (proposed 
approach)

I'll take the six plots and merge them into a single PNG, and also post my 
(short) testing and plotting scripts for reproducibility and checking of the 
results.

I can also write a regression test if you think that would be good to have in 
the test suite (perhaps skipped by default for time), where instead of using 
the same seven letters a-g as test strings and varying PYTHONHASHSEED, I could 
perform the letter test for n=7 with 10,000 different sets of short random 
strings to see if any fell below the threshold.

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26163>
___



[issue26163] FAIL: test_hash_effectiveness (test.test_set.TestFrozenSet)

2016-10-30 Thread Eric Appelt

Eric Appelt added the comment:

Ugh... so I think I made a mental error (i.e. confused myself) in the last 
comment. The result looked a bit too good to be true, and I think that there is 
at least one error in my thinking about the problem.

I tried testing with the width set to 2 and then 1 as a check, and noticed 
that even without "widening" the problem was still fixed and the hash 
distributions matched those of pseudorandom numbers.

It turns out that just running the XORed result of the shuffled entry hashes 
through the hashing algorithm gets rid of any bit patterns picked up through 
the process. Currently there is an LCG that is used to disperse patterns but I 
don't think it really helps the problems arising in these tests:

hash = hash * 69069U + 907133923UL;

I'm not attaching any more plots as the result can be described in words: when 
the LCG applied to the hash after XORing all entries is replaced with a hashing 
of the result using the standard Python hashing algorithm, the test problem 
goes away. Specifically, when 128 distinct sets are hashed, the resulting 
hashes have a distribution of unique values across their last 7 bits that 
matches the distribution from pseudorandom numbers.
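
A pure-Python model of the two dispersal steps (a model of the idea only, not 
the C patch; h stands for the 64-bit value obtained by XORing the shuffled 
entry hashes):

M = 2**64

def disperse_lcg(h):
    # current approach: the linear congruential shuffle
    return (h * 69069 + 907133923) % M

def disperse_rehash(h):
    # patched approach: rehash the bytes with the interpreter's configured
    # bytes-hashing algorithm (FNV or SipHash24, depending on the build)
    return hash(h.to_bytes(8, 'little')) % M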

The fix is implemented in a small patch that I have attached.

--
keywords: +patch
Added file: http://bugs.python.org/file45283/setobject.c.2.patch

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26163>
___



[issue26163] FAIL: test_hash_effectiveness (test.test_set.TestFrozenSet)

2016-10-30 Thread Eric Appelt

Eric Appelt added the comment:

I spent some time looking at the Objects/setobject.c code where the hashing 
scheme for frozensets, which essentially works by XOR-ing together the hashes 
from all its entries. Significant care is taken to shuffle bits at various 
stages, and the algorithm seems to be well thought out. My own attempts to 
change constants or add in reshufflings either didn't help the collision 
statistics or just made things worse. I think that there is something of a 
fundamental limitation here due to information loss in XOR, and other fast, 
commutative bitwise operations (i.e. AND, OR) are well known to be much worse.

I did stumble on a 2006 paper by Boneh and Boyen [1] which looked at a 
potentially related problem of trying to combine two different hashing 
functions to improve collision resistance and found that there was no way to do 
this with fewer bits than just concatenating the hashes. The present ticket is 
more a problem of combining hashes from the same function in an 
order-independent way and may be completely unrelated. However, I imagine that 
concatenation followed by rehashing the result would remove the loss due to 
XORing unlucky choices of hashes.

Concatenating everything seemed obviously too slow and/or unacceptable in terms 
of memory use, but I thought of a compromise where I would construct an array 
of 4 hash values, initialized to zero, and for each entry I would select an 
array index based on a reshuffling of the bits, and XOR that particular entry 
into the chosen index. This results in a hash that is 4x wider than the 
standard size that reduces the information loss incurred from XOR. This wide 
hash can then be hashed down to a normal sized hash.
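
In pure Python the idea looks roughly like this (a model with an assumed 
lane-selection shuffle, not the actual C patch):

def wide4_hash(entries):
    lanes = [0, 0, 0, 0]
    for e in entries:
        h = hash(e) % 2**64
        # pick one of the 4 lanes from a reshuffling of the entry's bits
        lanes[(h >> 3) & 3] ^= h
    # hash the 4x-wide result back down to a normal-sized hash
    return hash(b''.join(l.to_bytes(8, 'little') for l in lanes))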

I implemented this algorithm and tested it using the same procedure as before. 
Specifically, all frozensets for every possible combination (128) of the 
letters abcdefg as single character strings are hashed, and the last 7 bits of 
each of their hashes are compared to see how many unique 7-bit patterns are 
produced. This is done for PYTHONHASHSEED values from 1 to 10,000 to build a 
distribution, which is then compared to a similar distribution constructed 
from pseudorandom numbers.

Unlike the current hashing algorithm for frozensets, and much like the result 
from hashes of strings, the result of this new "wide4" algorithm appears to be 
nearly ideal. The results are plotted in "frozenset_string_n7_10k_wide4.png" as 
attached.

I will follow this up with a patch for the algorithm as soon as I run the 
complete test suite and clean up.

Another option if the current algorithm is considered good enough is to alter 
the current test to retry on failure if the power set of letters 'abcdefg...' 
fails by shifting one letter and trying again, perhaps ensuring that 4/5 tests 
pass. This ought to greatly reduce the sporadic build failures.

[1] http://ai.stanford.edu/~xb/crypto06b/blackboxhash.pdf

--
Added file: http://bugs.python.org/file45281/frozenset_string_n7_10k_wide4.png

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26163>
___



[issue26163] FAIL: test_hash_effectiveness (test.test_set.TestFrozenSet)

2016-10-29 Thread Eric Appelt

Eric Appelt added the comment:

I also looked at hashes of strings themselves rather than frozensets to check 
the hashing of strings directly.

For example, n=3:

['', 'a', 'b', 'c', 'ab', 'ac', 'bc', 'abc']

rather than:

[frozenset(), frozenset({'a'}), frozenset({'b'}), frozenset({'c'}), 
frozenset({'b', 'a'}), frozenset({'c', 'a'}), frozenset({'b', 'c'}), 
frozenset({'b', 'a', 'c'})]

I made a distribution as with the last comment but now using the # of unique 
last-7 bit sequences in a set of 128 such strings (n=7) and compared to 
pseudorandom integers, just as was done before with frozensets of the letter 
combinations. This is shown in the file "str_string_n7_10k.png".

The last 7 bits of the small string hashes produce a distribution much like 
that of regular pseudorandom integers.

So if there is a problem with the hash algorithm, it appears to be related to 
the frozenset hashing and not strings.

--
Added file: http://bugs.python.org/file45270/str_string_n7_10k.png

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26163>
___



[issue26163] FAIL: test_hash_effectiveness (test.test_set.TestFrozenSet)

2016-10-29 Thread Eric Appelt

Eric Appelt added the comment:

I dug into this failure a little bit and verified that it is specifically the 
"letter_range" portion of the test that sporadically fails. The hash of any 
frozenset constructed from floats, ints, or the empty frozenset, as well as 
frozensets recursively containing any of the previous, is deterministic and 
doesn't vary with the seed.

I isolated the letter_range test for various values of n to see how often this 
failure generally happened. I scanned the first 10,000 integers set as 
PYTHONHASHSEED and got the following failures:

n=2   -
n=3   -
n=4   300, 1308, 2453, 4196, 5693, 8280, 8353
n=5   4846, 5693
n=6   3974
n=7   36, 1722, 5064, 8562, 8729
n=8   2889, 5916, 5986
n=9   -
n=10  -

I checked the behavior of pseudorandom integers in the range 0 to 2**64-1 by 
taking a large sample of values of "len({random.randint(0,2**64) & 127 for _ 
in range(128)})", and found that if the hashes really were effectively 
randomly distributed, the value of "u" in the test for n=7 would follow a 
Gaussian distribution with a mean of ~81 and a deviation of ~3.5. A value of 
31 would thus be nearly 14 deviations from the mean, which seems quite 
unreasonable.
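
The baseline sample can be reproduced with a few lines:

import random
import statistics

samples = [len({random.randint(0, 2**64) & 127 for _ in range(128)})
           for _ in range(10000)]
print(statistics.mean(samples), statistics.stdev(samples))   # ~81, ~3.5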

I then took the distribution of the set sizes from the letter_range test for 
n=7 with 10,000 different seeds and plotted it alongside the distribution of 
set sizes from the last 7 bits of pseudorandom numbers in the attached file 
"frozenset_string_n7_10k.png".

The hashes of the frozensets of single letters follows a very different 
distribution. Either this test is inappropriate and will cause sporadic build 
failures, or there is a problem with the hash algorithm.

--
nosy: +Eric Appelt
Added file: http://bugs.python.org/file45269/frozenset_string_n7_10k.png

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26163>
___



[issue28541] Improve test coverage for json library - loading bytes

2016-10-28 Thread Eric Appelt

Eric Appelt added the comment:

I was able to inspect the review and implemented your suggestions; this is 
attached as a new patch. Please let me know if this is not the correct 
workflow.

Thank you for the prompt review and help as I learn the Python tracking and 
review process!

--
Added file: http://bugs.python.org/file45255/mywork4.patch

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28541>
___



[issue28541] Improve test coverage for json library - loading bytes

2016-10-27 Thread Eric Appelt

Eric Appelt added the comment:

Thanks for the feedback. I agree that the comment is incorrect, since several 
iterations of the loop really don't need to be tested at all for '5'. I read 
the previous issue 17909 more carefully, along with RFC 4627, RFC 7159, and 
ECMA-404, to properly understand the context here.

I notice that although the issue 17909 patch was written in order to implement 
the patterns in RFC 4627, it is also valid for the additional pattern 
introduced for utf-16-le in RFC 7159 as described in [1].

Specifically, in RFC 4627 the only valid pattern for utf-16-le begins with 
XX 00 XX 00, since the first two characters must represent ASCII codepoints. 
With strings, the second character can be a higher codepoint, allowing for 
XX 00 XX XX or XX 00 00 XX. In the implementation from issue 17909, the 
pattern XX 00 XX is first detected and results in utf-16-le, and then the 4th 
byte is checked for the pattern XX 00 00 XX or XX 00 00 00 to result in 
utf-16-le or utf-32-le respectively.

In the issue 17909 patch the special case of a single character JSON document 
(i.e. '5') is also specifically accounted for. This case is not mentioned in 
[1].

So for everything I can try (or think of), this implementation can correctly 
determine the encoding of any valid JSON document according to RFC 7159. This 
is good, since the documentation [2] nowhere suggests that json.loads() only 
accepts JSON documents in bytes if they adhere to RFC 4627.
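
A few spot checks consistent with this (examples of my own, not from the test 
suite):

import json

# new RFC 7159 utf-16-le pattern XX 00 XX XX (second char above U+00FF)
assert json.loads('"\u263a"'.encode('utf-16-le')) == '\u263a'
# utf-32-le pattern XX 00 00 00
assert json.loads('5'.encode('utf-32-le')) == 5
# the special-cased 2-byte single-character document
assert json.loads('5'.encode('utf-16-le')) == 5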

The encoding failure mode described in issue 17909 is still an invalid JSON 
document according to RFC 7159 as the control character \u is not escaped.

So I think overall the implementation is perfectly valid for RFC 7159 / 
ECMA-404 as suggested by the standard library documentation [2].

To me it seems reasonable to have a unit test to specifically check that the 
pattern XX 00 00 XX works. For the goal of hitting 100% test coverage as 
measured by line, I believe that the 2-byte case still needs to be tested.

I have attached a patch that covers these cases along with (maybe too) verbose 
comments explaining the edge cases. I also changed the comments in the 
json/__init__.py module itself to clarify the detection patterns as implemented.

I'm not sure how helpful this is to the improvement of the standard library, 
but it has been very educational for me so far.

[1] http://www.ietf.org/mail-archive/web/json/current/msg01959.html
[2] https://docs.python.org/3.7/library/json.html

--
Added file: http://bugs.python.org/file45246/mywork3.patch

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28541>
___



[issue28541] Improve test coverage for json library - loading bytes

2016-10-27 Thread Eric Appelt

Eric Appelt added the comment:

I looked back and something is clearly wrong with my coverage reporting setup, 
sorry :(

When I move the test introduced in issue 17909, 'test_bytes_decode', from the 
module Lib/test/test_json/test_unicode.py to Lib/test/test_json/test_decode.py, 
that particular test is properly traced and I see that the function is mostly 
covered, so I need to figure out what is going wrong in my setup.

The test already present is more comprehensive as it includes BOM, so I believe 
what I wrote is generally not helpful.

I did notice that the existing test does not cover the edge case of a 2-byte 
utf-16 sequence that can be generated for a valid JSON document represented by 
a single codepoint, for example '5' or '7'. I created a new patch to augment 
the existing test to cover this special case rather than adding a separate 
test.

--
Added file: http://bugs.python.org/file45240/mywork2.patch

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28541>
___



[issue28541] Improve test coverage for json library - loading bytes

2016-10-26 Thread Eric Appelt

New submission from Eric Appelt:

Increase test coverage of the json library, specifically the detect_encoding() 
function in the __init__ module, which is used to handle the autodetection of 
the encoding of a bytes object passed to json.loads().

This function was added in issue 17909 which extended the json.loads() function 
to accept bytes.

Note that this is a small patch, as I am following section 5 of the developer 
guide and trying to acquaint myself with the process as a first-time 
contributor. I found this missing coverage just by semi-randomly looking at 
individual modules of interest. Please let me know if I have made any general 
mistakes in constructing this ticket. Thanks!

--
components: Tests
files: mywork.patch
keywords: patch
messages: 279521
nosy: Eric Appelt
priority: normal
severity: normal
status: open
title: Improve test coverage for json library - loading bytes
versions: Python 3.6
Added file: http://bugs.python.org/file45234/mywork.patch

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28541>
___