[issue33338] [lib2to3] Synchronize token.py and tokenize.py with the standard library

2021-10-20 Thread Irit Katriel


Change by Irit Katriel :


--
resolution:  -> wont fix
stage: patch review -> resolved
status: open -> closed
superseder:  -> Close 2to3 issues and list them here


[issue33338] [lib2to3] Synchronize token.py and tokenize.py with the standard library

2018-09-15 Thread monson


Change by monson :


--
pull_requests: +8757


[issue33338] [lib2to3] Synchronize token.py and tokenize.py with the standard library

2018-04-26 Thread Łukasz Langa

Łukasz Langa  added the comment:

I agree with you, Serhiy; there's a number of things I want to make faster.
But first I'd like to merge the implementations so there is a clear one-way
diff ("this is what we updated in lib2to3 to make it consistent with
Lib/tokenize.py").  Then I want to optimize.

--


[issue33338] [lib2to3] Synchronize token.py and tokenize.py with the standard library

2018-04-26 Thread Serhiy Storchaka

Serhiy Storchaka  added the comment:

It seems to me that the regular expressions used in the lib2to3 version are
more efficient, but also more complex.

$ ./python -m timeit -s 'import re; p = re.compile(r"0[bB](?:_?[01])+"); s = "0b"+"_0101"*16' 'p.match(s)'
10 loops, best of 5: 2.45 usec per loop

$ ./python -m timeit -s 'import re; p = re.compile(r"0[bB]_?[01]+(?:_[01]+)*"); s = "0b"+"_0101"*16' 'p.match(s)'
20 loops, best of 5: 1.08 usec per loop

$ ./python -m timeit -s 'import re; p = re.compile(r"0[xX](?:_?[0-9a-fA-F])+[lL]?"); s = "0x_0123_4567_89ab_cdef"' 'p.match(s)'
50 loops, best of 5: 815 nsec per loop

$ ./python -m timeit -s 'import re; p = re.compile(r"0[xX]_?[\da-fA-F]+(?:_[\da-fA-F]+)*[lL]?"); s = "0x_0123_4567_89ab_cdef"' 'p.match(s)'
50 loops, best of 5: 542 nsec per loop

Since the performance of lib2to3 is important, it is better to keep the
current regexps.

But using \d in Python 3 is a bug; it should be replaced with [0-9]. This
also speeds up the regex:

$ ./python -m timeit -s 'import re; p = re.compile(r"0[xX]_?[0-9a-fA-F]+(?:_[0-9a-fA-F]+)*[lL]?"); s = "0x_0123_4567_89ab_cdef"' 'p.match(s)'
50 loops, best of 5: 471 nsec per loop
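
To make the \d point concrete, here is a small illustration (a hedged
sketch; '٣' is ARABIC-INDIC DIGIT THREE, picked only as an example of a
non-ASCII decimal digit):

import re

# For str patterns in Python 3, \d matches any Unicode decimal digit...
print(re.match(r"\d", "٣"))     # <re.Match object; span=(0, 1), match='٣'>
# ...while [0-9] only matches the ASCII digits, which is what a number token wants.
print(re.match(r"[0-9]", "٣"))  # None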

--
nosy: +serhiy.storchaka


[issue33338] [lib2to3] Synchronize token.py and tokenize.py with the standard library

2018-04-23 Thread Łukasz Langa

Łukasz Langa  added the comment:


New changeset c2d384dbd7c6ed9bdfaac45f05b463263c743ee7 by Łukasz Langa in branch 'master':
bpo-33338: [tokenize] Minor code cleanup (#6573)
https://github.com/python/cpython/commit/c2d384dbd7c6ed9bdfaac45f05b463263c743ee7


--


[issue33338] [lib2to3] Synchronize token.py and tokenize.py with the standard library

2018-04-22 Thread Łukasz Langa

Change by Łukasz Langa :


--
pull_requests: +6274


[issue33338] [lib2to3] Synchronize token.py and tokenize.py with the standard library

2018-04-22 Thread Łukasz Langa

Change by Łukasz Langa :


--
keywords: +patch
pull_requests: +6269
stage:  -> patch review


[issue33338] [lib2to3] Synchronize token.py and tokenize.py with the standard library

2018-04-22 Thread Łukasz Langa

New submission from Łukasz Langa :

lib2to3's token.py and tokenize.py were initially copies of the respective
files from the standard library.  They were copied to allow Python 3 to read
Python 2's grammar.

Since 2006, lib2to3 has grown to be widely used as a Concrete Syntax Tree
library, also for parsing Python 3 code.  Support for the Python 3 grammar
was added but, sadly, the main token.py and tokenize.py diverged.

This change brings them back together, minimizing the differences to the bare
minimum that is in fact required by lib2to3.  Before this change, almost every
line in lib2to3/pgen2/tokenize.py was different from tokenize.py.  After this
change, the diff between the two files is only 175 lines long and is entirely
filled with relevant Python 2 compatibility bits.

Merging the implementations brings numerous fixes to the lib2to3 tokenizer (a short usage sketch follows the list):

+ docstrings made as similar as possible
+ ported `TokenInfo`
+ ported `tokenize.tokenize()` and `tokenize.open()`
+ removed Python 2-only implementation cruft
+ fixed Unicode identifier handling
+ fixed string prefix handling
+ fixed Ellipsis handling
+ backported Untokenizer bugfixes:
  - 5e6db313686c200da425a54d2e0c95fa40107b1d
  - 9dc3a36c849c15c227a8af218cfb215abe7b3c48
  - 5b8d2c3af76e704926cf5915ad0e6af59a232e61
  - e411b6629fb5f7bc01bec89df75737875ce6d8f5
  - BPO-2495
+ the tokenizer no longer crashes on a missing newline at the end of the
  stream (added \Z (end of string) to PseudoExtras) - BPO-16152
+ `find_cookie` includes the file name in error messages, if available
+ `find_cookie` raises SyntaxError on invalid encodings: BPO-14990
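
For readers unfamiliar with the lib2to3 tokenizer, here is a minimal usage
sketch (assuming a Python build that still ships lib2to3; the source line is
illustrative only). generate_tokens() yields 5-tuples of
(type, string, start, end, line):

import io
from lib2to3.pgen2 import token, tokenize

source = "x = 0x_dead_beef\n"            # illustrative input only
readline = io.StringIO(source).readline

for tok_type, tok_str, start, end, line in tokenize.generate_tokens(readline):
    # tok_name maps the numeric token type back to its symbolic name.
    print(token.tok_name[tok_type], repr(tok_str), start, end)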

Improvements to lib2to3/pgen2/token.py (a renumbering check is sketched after the list):

+ taken from the current Lib/token.py
+ tokens renumbered to match Lib/token.py
+ `__all__` properly defined
+ ASYNC, AWAIT and BACKQUOTE exist under different numbers (100 + old number)
+ ELLIPSIS added
+ ENCODING added
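
A small, hedged sketch of how the renumbering could be spot-checked against
the standard library's token module (purely illustrative; names that
intentionally keep different numbers, such as the 100 + old number cases
above, would show up as mismatches):

import token as std_token
from lib2to3.pgen2 import token as l2t_token

# tok_name maps numeric value -> symbolic name in both modules.
shared = sorted(set(std_token.tok_name.values())
                & set(l2t_token.tok_name.values()))

# Collect shared names whose numeric values differ between the two modules.
mismatched = {name: (getattr(std_token, name), getattr(l2t_token, name))
              for name in shared
              if getattr(std_token, name) != getattr(l2t_token, name)}
print(len(shared), "shared names,", len(mismatched), "numbered differently:", mismatched)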

--
components: 2to3 (2.x to 3.x conversion tool), Library (Lib)
messages: 315639
nosy: lukasz.langa
priority: normal
severity: normal
status: open
title: [lib2to3] Synchronize token.py and tokenize.py with the standard library
versions: Python 3.8


[issue33338] [lib2to3] Synchronize token.py and tokenize.py with the standard library

2018-04-22 Thread Łukasz Langa

Łukasz Langa  added the comment:

### Diff between files

The unified diff between tokenize implementations is here:
https://gist.github.com/ambv/679018041d85dd1a7497e6d89c45fb86

It clocks in at 275 lines, but that's because it includes context. The actual
diff is 175 lines long.

To make it that small, I needed to move some insignificant bits in
Lib/tokenize.py.  This is what the other PR on this issue is about.

--
