[issue43014] tokenize spends a lot of time in `re.compile(...)`

2021-01-24 Thread Pablo Galindo Salgado


Change by Pablo Galindo Salgado:


--
nosy: +pablogsal
nosy_count: 4 -> 5
pull_requests: +23132
pull_request: https://github.com/python/cpython/pull/24313

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43014] tokenize spends a lot of time in `re.compile(...)`

2021-01-24 Thread Serhiy Storchaka


Serhiy Storchaka added the comment:

re.compile() already uses caching, but that cache is less efficient for some reason.

To Steven: the time is *reduced* by 28%, but the speed is *increased* by 39%.
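
To make the distinction concrete, here is a minimal sketch of the two calculations. The function names are illustrative only, and the inputs are the rounded ~6300ms and ~4500ms figures from the report, so the results land near, not exactly on, the 28% and 39% quoted above.

```python
def time_reduction(before_ms, after_ms):
    """Fraction of the original running time that was saved."""
    return (before_ms - after_ms) / before_ms


def speed_increase(before_ms, after_ms):
    """Fractional gain in throughput (work done per unit of time)."""
    return before_ms / after_ms - 1


# Rounded timings from the original report.
before_ms, after_ms = 6300, 4500

print(f"time reduced by    {time_reduction(before_ms, after_ms):.1%}")
print(f"speed increased by {speed_increase(before_ms, after_ms):.1%}")
```

Both numbers describe the same change; they differ only in which quantity (elapsed time or throughput) is being compared.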

--
nosy: +serhiy.storchaka




[issue43014] tokenize spends a lot of time in `re.compile(...)`

2021-01-24 Thread Steven D'Aprano

Steven D'Aprano added the comment:

Just for the record:

> The optimization takes the execution from ~6300ms to ~4500ms on my machine 
> (representing a 28% - 39% improvement depending on how you calculate it)

The correct answer is 28%, which uses the initial value as the base: 
(6300-4500)/6300 ≈ 28%. You are starting at 6300ms and speeding it up by 28%:

>>> 6300 - 28/100*6300
4536.0

Using 4500 as the base would only make sense if you were calculating a slowdown 
from 4500ms to 6300ms: we started at 4500 and *increase* the time by 39%:

>>> 4500 + 39/100*4500
6255.0


Hope this helps.

--
nosy: +steven.daprano




[issue43014] tokenize spends a lot of time in `re.compile(...)`

2021-01-24 Thread Batuhan Taskaya


Change by Batuhan Taskaya:


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed




[issue43014] tokenize spends a lot of time in `re.compile(...)`

2021-01-24 Thread Batuhan Taskaya


Batuhan Taskaya added the comment:


New changeset 15bd9efd01e44087664e78bf766865a6d2e06626 by Anthony Sottile in 
branch 'master':
bpo-43014: Improve performance of tokenize.tokenize by 20-30%
https://github.com/python/cpython/commit/15bd9efd01e44087664e78bf766865a6d2e06626


--
nosy: +BTaskaya




[issue43014] tokenize spends a lot of time in `re.compile(...)`

2021-01-24 Thread Anthony Sottile


Anthony Sottile added the comment:

Attached out3.pstats / out3.svg, which profile the optimization implemented 
with lru_cache instead.
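
For context, a minimal sketch of what memoizing pattern compilation with functools.lru_cache can look like. The helper name `cached_compile`, the sample pattern, and the iteration count are assumptions made for illustration; this is not a copy of the merged patch, and the timings will differ by machine.

```python
import functools
import re
import timeit


@functools.lru_cache(maxsize=None)
def cached_compile(pattern):
    # Hypothetical helper: memoizes compiled patterns keyed on the
    # pattern string, so repeated calls skip re.compile() entirely.
    return re.compile(pattern, re.UNICODE)


PATTERN = r"\w+|\s+"  # arbitrary sample pattern
CALLS = 20_000

# re.compile() consults its own internal cache on every call.
via_re = timeit.timeit(lambda: re.compile(PATTERN, re.UNICODE), number=CALLS)
# The lru_cache wrapper returns the memoized pattern object directly.
via_lru = timeit.timeit(lambda: cached_compile(PATTERN), number=CALLS)

print(f"re.compile:     {via_re:.4f}s for {CALLS} calls")
print(f"cached_compile: {via_lru:.4f}s for {CALLS} calls")
```

Repeated calls with the same pattern return the same compiled object, which is what makes the wrapper safe to use in a hot loop.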

--
Added file: https://bugs.python.org/file49764/out3.svg




[issue43014] tokenize spends a lot of time in `re.compile(...)`

2021-01-24 Thread Anthony Sottile


Change by Anthony Sottile:


Added file: https://bugs.python.org/file49763/out3.pstats




[issue43014] tokenize spends a lot of time in `re.compile(...)`

2021-01-24 Thread Anthony Sottile


Anthony Sottile added the comment:

Admittedly anecdotal, but here's another data point in addition to the 
attached profiles:

test.test_tokenize suite before:

$ ./python -m test.test_tokenize
..
--
Ran 78 tests in 77.148s

OK


test.test_tokenize suite after:

$ ./python -m test.test_tokenize
..
--
Ran 78 tests in 61.269s

OK

--




[issue43014] tokenize spends a lot of time in `re.compile(...)`

2021-01-24 Thread Anthony Sottile


Change by Anthony Sottile:


--
keywords: +patch
pull_requests: +23130
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/24311




[issue43014] tokenize spends a lot of time in `re.compile(...)`

2021-01-24 Thread Anthony Sottile


Change by Anthony Sottile:


Added file: https://bugs.python.org/file49762/out2.svg




[issue43014] tokenize spends a lot of time in `re.compile(...)`

2021-01-24 Thread Anthony Sottile


Change by Anthony Sottile:


Added file: https://bugs.python.org/file49761/out2.pstats




[issue43014] tokenize spends a lot of time in `re.compile(...)`

2021-01-24 Thread Anthony Sottile


Change by Anthony Sottile:


Added file: https://bugs.python.org/file49760/out.svg




[issue43014] tokenize spends a lot of time in `re.compile(...)`

2021-01-24 Thread Anthony Sottile


New submission from Anthony Sottile:

I did some profiling (attached a few files here with svgs) of running this 
script:

```python
import io
import tokenize

# picked as the second longest file in cpython
with open('Lib/test/test_socket.py', 'rb') as f:
    bio = io.BytesIO(f.read())


def main():
    for _ in range(10):
        bio.seek(0)
        for _ in tokenize.tokenize(bio.readline):
            pass

if __name__ == '__main__':
    exit(main())
```


The first profile is from before the optimization; the second is from after 
it.

The optimization takes the execution from ~6300ms to ~4500ms on my machine 
(representing a 28% - 39% improvement depending on how you calculate it)

(I'll attach the pstats and svgs after creation, seems I can only attach one 
file at once)
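
A profile like the attached pstats files can be produced with the standard cProfile and pstats modules. The sketch below is an assumption about the workflow, not the exact commands used for this report: it tokenizes a small in-memory stand-in source instead of Lib/test/test_socket.py so that it runs anywhere, and the name `tokenize_all` is illustrative.

```python
import cProfile
import io
import pstats
import tokenize

# Stand-in input; the report above used Lib/test/test_socket.py instead.
SOURCE = b"x = 1\nprint(x + 2)\n" * 200


def tokenize_all(data):
    """Tokenize a bytes buffer and return the number of tokens produced."""
    bio = io.BytesIO(data)
    return sum(1 for _ in tokenize.tokenize(bio.readline))


# Collect profiling data only around the tokenizing call.
profiler = cProfile.Profile()
profiler.enable()
token_count = tokenize_all(SOURCE)
profiler.disable()

# Render the ten most expensive entries by cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)
print(f"{token_count} tokens")
print(stream.getvalue())
```

Calling `profiler.dump_stats("out.pstats")` afterwards would save the raw data to a file that `pstats.Stats` can load again later.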

--
components: Library (Lib)
files: out.pstats
messages: 385572
nosy: Anthony Sottile
priority: normal
severity: normal
status: open
title: tokenize spends a lot of time in `re.compile(...)`
type: performance
versions: Python 3.10, Python 3.9
Added file: https://bugs.python.org/file49759/out.pstats
