[issue47256] re: limit the maximum capturing group to 1,073,741,823, reduce sizeof(match_context).

2022-04-08 Thread Ma Lin


Change by Ma Lin:


--
keywords: +patch
pull_requests: +30437
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/32411

[issue47256] re: limit the maximum capturing group to 1,073,741,823, reduce sizeof(match_context).

2022-04-08 Thread Ma Lin

New submission from Ma Lin:

These changes reduce sizeof(match_context):
- 32-bit build: 36 bytes, no change.
- 64-bit build: 72 bytes -> 56 bytes.

sre uses a stack of `match_context` structs to simulate recursive calls; a 
smaller struct therefore brings:
- deeper recursion before the stack limit is hit
- less memory consumption
- fewer memory reallocations
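
To make the size arithmetic concrete, here is a hypothetical layout (the field 
names are made up for illustration; this is not the real SRE(match_context), 
and ptrdiff_t stands in for Py_ssize_t). Narrowing the pointer-sized 
bookkeeping fields to 4-byte ints is what yields 72 -> 56 bytes on 64-bit 
while leaving 32-bit builds at 36:

/* Illustration only -- not the actual sre struct. */
#include <stdio.h>
#include <stddef.h>

typedef struct {      /* "wide": 4 pointers + 5 pointer-sized ints */
    void *pattern, *ptr, *mark, *repeat;
    ptrdiff_t count, lastmark, lastindex, jump, toplevel;
} wide_ctx;           /* 64-bit: 4*8 + 5*8 = 72;  32-bit: 4*4 + 5*4 = 36 */

typedef struct {      /* "narrow": count stays pointer-sized, the rest become int */
    void *pattern, *ptr, *mark, *repeat;
    ptrdiff_t count;
    int lastmark, lastindex, jump, toplevel;
} narrow_ctx;         /* 64-bit: 32 + 8 + 4*4 = 56;  32-bit: still 36 */

int main(void)
{
    printf("wide:   %zu bytes\n", sizeof(wide_ctx));
    printf("narrow: %zu bytes\n", sizeof(narrow_ctx));
    return 0;
}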

Here is a test: if the stack size is limited to 1 GiB, the maximum workable 
value of n is:

re.match(r'(ab)*', n * 'ab')   # need to save MARKs
72 bytes: n = 11,184,808
64 bytes: n = 12,201,609
56 bytes: n = 13,421,770

re.match(r'(?:ab)*', n * 'ab') # no need to save MARKs
72 bytes: n = 13,421,770
64 bytes: n = 14,913,078
56 bytes: n = 16,777,213
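
These figures are consistent with a per-iteration stack cost of 
sizeof(match_context) plus a small fixed overhead: about 24 extra bytes when 
MARKs must be saved, about 8 when not. A back-of-the-envelope check (the 
overhead constants are inferred from the numbers above, not read out of the 
engine source):

#include <stdio.h>

int main(void)
{
    const long long stack = 1073741824LL;     /* 1 GiB stack limit */
    const long long sizes[] = { 72, 64, 56 }; /* candidate context sizes */
    for (int i = 0; i < 3; i++)
        printf("ctx=%lld: with MARKs n ~ %lld, without n ~ %lld\n",
               sizes[i],
               stack / (sizes[i] + 24),       /* e.g. 2**30 / 96 ~ 11,184,810 */
               stack / (sizes[i] + 8));       /* e.g. 2**30 / 64 ~ 16,777,216 */
    return 0;
}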

1,073,741,823 (2**30 - 1) capturing groups should be enough for almost all 
users. Limiting it to 16,383 (which fits in a 2-byte integer) could shrink 
the context further, but some machine-generated patterns may contain more 
capturing groups than that.

1️⃣ Performance:

Before
regex_dna: Mean +- std dev: 149 ms +- 1 ms
regex_effbot: Mean +- std dev: 2.22 ms +- 0.02 ms
regex_v8: Mean +- std dev: 22.3 ms +- 0.1 ms
my benchmark[1]: 13.9 sec +- 0.0 sec

Commit 1. limit the maximum capturing group to 1,073,741,823
regex_dna: Mean +- std dev: 150 ms +- 1 ms
regex_effbot: Mean +- std dev: 2.16 ms +- 0.02 ms
regex_v8: Mean +- std dev: 22.3 ms +- 0.1 ms
my benchmark: 13.8 sec +- 0.0 sec

Commit 2. further reduce sizeof(SRE(match_context))
regex_dna: Mean +- std dev: 150 ms +- 1 ms
regex_effbot: Mean +- std dev: 2.16 ms +- 0.02 ms
regex_v8: Mean +- std dev: 22.2 ms +- 0.1 ms
my benchmark: 13.8 sec +- 0.1 sec

Further changing the types of toplevel/jump from int to char would reduce 
sizeof(match_context) from 36 to 32 in 32-bit builds (in 64-bit builds it 
stays 56 due to tail padding; see the sketch after the numbers below). But it 
is slower on 64-bit builds, so I didn't adopt it:
regex_dna: Mean +- std dev: 150 ms +- 1 ms
regex_effbot: Mean +- std dev: 2.18 ms +- 0.01 ms
regex_v8: Mean +- std dev: 22.4 ms +- 0.1 ms
my benchmark: 14.1 sec +- 0.0 sec
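
The lack of a 64-bit saving is ordinary tail padding: continuing the 
hypothetical layout from above, the struct's alignment is that of a pointer 
(8 bytes on 64-bit), so the bytes shaved off the fields come straight back as 
padding:

#include <stdio.h>
#include <stddef.h>

typedef struct {      /* jump/toplevel narrowed from int to char */
    void *pattern, *ptr, *mark, *repeat;
    ptrdiff_t count;
    int lastmark, lastindex;
    char jump, toplevel;
} char_ctx;           /* 64-bit: 50 data bytes padded to 56;  32-bit: 30 padded to 32 */

int main(void)
{
    printf("char_ctx: %zu bytes\n", sizeof(char_ctx));
    return 0;
}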

2️⃣ The type of match_context.count is Py_ssize_t:
- Changing it to a 4-byte integer would require modifying some engine code.
- Keeping it as Py_ssize_t lets SRE_MAXREPEAT grow to 4 GiB or more in future 
  versions (currently SRE_MAXREPEAT cannot reach 4 GiB).
So the type of match_context.count is unchanged.
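
For scale, plain C stand-ins for the two candidate widths (illustrative 
only): a 4-byte unsigned counter tops out just below 4 GiB, while a 
pointer-sized one leaves room for SRE_MAXREPEAT to grow on 64-bit builds:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    printf("4-byte count max: %" PRIu32 "\n", UINT32_MAX); /* 4,294,967,295 (< 4 GiB) */
    printf("8-byte count max: %" PRId64 "\n", INT64_MAX);  /* ample headroom */
    return 0;
}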

[1] My re benchmark; it uses 16 patterns to process 100 MiB of text data:
https://github.com/animalize/re_benchmarks

--
components: Library (Lib)
messages: 416960
nosy: ezio.melotti, malin, mrabarnett, serhiy.storchaka
priority: normal
severity: normal
status: open
title: re: limit the maximum capturing group to 1,073,741,823, reduce sizeof(match_context).
type: resource usage
versions: Python 3.11
