Re: Regarding Regex timeout behavior to minimize CPU consumption
On Sun, Dec 6, 2020 at 2:37 PM Barry wrote:
> > On 5 Dec 2020, at 23:44, Peter J. Holzer wrote:
> >
> > On 2020-12-05 23:42:11 +0100, sjeik_ap...@hotmail.com wrote:
> >> Timeout: no idea. But check out re.compile and re.finditer as they
> >> might speed things up.
> >
> > I doubt that compiling regular expressions helps the OP much. Compiled
> > regular expressions are cached, but more importantly, if a match takes
> > long enough that specifying a timeout is useful, the time is almost
> > certainly not spent compiling, but matching - most likely backtracking
> > from lots of promising but ultimately unsuccessful partial matches.
> >
> >> regex = r'data-stid="section-room-list"[\s\S]*?>\s*([\s\S]*?)\s*' \
> >>         r'(?:class\s*=\s*"\s*sticky-book-now\s*"|\s*|id\s*=\s*"Location")'
> >> rooms_blocks_to_be_replace = re.findall(regex, html_template)
> >
> > This part:
> >
> >     \s*([\s\S]*?)\s*'
> >
> > looks dangerous from a performance point of view. If that can be
> > rewritten with less potential for backtracking, it might help.
> >
> > Generally, it should be possible to implement a timeout for any
> > operation by either scheduling an alarm with signal.alarm or by
> > executing the operation in a separate thread and killing the thread if
> > it takes too long.
>
> I think that Python ignores signals until the ceval loop is entered.
> And since the re.match will block, that is not going to happen.
>
> Killing threads is not safe, and if your OS allows it then you end up
> with the internal state of Python messed up.
>
> To implement this, I think, requires the re code to implement the
> timeout.
>
> Better is for the OP to fix the re to not backtrack so much, or to work
> on the input string in chunks.

If the regex is expensive enough to warrant it, you could use a
subprocess - they are killable.

-- 
https://mail.python.org/mailman/listinfo/python-list
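A minimal sketch of the subprocess idea, assuming the pattern and the
text can be handed to a child interpreter as JSON on stdin. The name
findall_with_timeout and the child script are made up for illustration
and do not come from the thread. subprocess.run kills the child when its
timeout expires, which is what makes this approach "killable".

```python
import json
import subprocess
import sys

# Child program: reads {"pattern": ..., "text": ...} as JSON on stdin and
# writes the re.findall() result as JSON on stdout.
_CHILD = (
    "import json, re, sys\n"
    "job = json.load(sys.stdin)\n"
    "json.dump(re.findall(job['pattern'], job['text']), sys.stdout)\n"
)


def findall_with_timeout(pattern, text, timeout):
    """Run re.findall in a child interpreter; return None on timeout.

    Hypothetical helper. Results come back through JSON, so tuples
    (patterns with several groups) are returned as lists.
    """
    try:
        done = subprocess.run(
            [sys.executable, "-c", _CHILD],
            input=json.dumps({"pattern": pattern, "text": text}),
            capture_output=True,
            text=True,
            timeout=timeout,  # the child is killed when this expires
            check=True,
        )
    except subprocess.TimeoutExpired:
        return None
    return json.loads(done.stdout)
```

Starting an interpreter per call is expensive, so this probably only
pays off when a single match can run for seconds, as in the case
described here.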
Re: Regarding Regex timeout behavior to minimize CPU consumption
> On 5 Dec 2020, at 23:44, Peter J. Holzer wrote:
>
> On 2020-12-05 23:42:11 +0100, sjeik_ap...@hotmail.com wrote:
>> Timeout: no idea. But check out re.compile and re.finditer as they
>> might speed things up.
>
> I doubt that compiling regular expressions helps the OP much. Compiled
> regular expressions are cached, but more importantly, if a match takes
> long enough that specifying a timeout is useful, the time is almost
> certainly not spent compiling, but matching - most likely backtracking
> from lots of promising but ultimately unsuccessful partial matches.
>
>> regex = r'data-stid="section-room-list"[\s\S]*?>\s*([\s\S]*?)\s*' \
>>         r'(?:class\s*=\s*"\s*sticky-book-now\s*"|\s*|id\s*=\s*"Location")'
>> rooms_blocks_to_be_replace = re.findall(regex, html_template)
>
> This part:
>
>     \s*([\s\S]*?)\s*'
>
> looks dangerous from a performance point of view. If that can be
> rewritten with less potential for backtracking, it might help.
>
> Generally, it should be possible to implement a timeout for any
> operation by either scheduling an alarm with signal.alarm or by
> executing the operation in a separate thread and killing the thread if
> it takes too long.

I think that Python ignores signals until the ceval loop is entered.
And since the re.match will block, that is not going to happen.

Killing threads is not safe, and if your OS allows it then you end up
with the internal state of Python messed up.

To implement this, I think, requires the re code to implement the
timeout.

Better is for the OP to fix the re to not backtrack so much, or to work
on the input string in chunks.

Barry
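One possible reading of "work on the input string in chunks", as a rough
sketch: locate the section with str.find, which scans linearly and
cannot backtrack, and only then run any further matching on the much
smaller slice. The marker strings are taken from the regex posted in
this thread, but treating them as fixed literals is a simplification
(the posted regex tolerates whitespace around "="), and room_list_chunk
is a hypothetical name.

```python
START = 'data-stid="section-room-list"'
# Simplified, literal forms of the end alternatives in the posted regex.
END_MARKERS = ("sticky-book-now", 'id="Location"')


def room_list_chunk(html):
    """Return the text from START up to the nearest end marker, or None."""
    start = html.find(START)
    if start == -1:
        return None  # no section at all: fail fast, nothing to backtrack
    search_from = start + len(START)
    ends = [
        pos
        for pos in (html.find(marker, search_from) for marker in END_MARKERS)
        if pos != -1
    ]
    if not ends:
        return None  # section never closes with a known marker
    return html[start:min(ends)]
```

When the start marker or every end marker is missing, this returns after
at most a few linear scans, which is the no-match case the original
question reports as slow.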
Re: Regarding Regex timeout behavior to minimize CPU consumption
On 2020-12-05 23:42:11 +0100, sjeik_ap...@hotmail.com wrote:
> Timeout: no idea. But check out re.compile and re.finditer as they
> might speed things up.

I doubt that compiling regular expressions helps the OP much. Compiled
regular expressions are cached, but more importantly, if a match takes
long enough that specifying a timeout is useful, the time is almost
certainly not spent compiling, but matching - most likely backtracking
from lots of promising but ultimately unsuccessful partial matches.

> regex = r'data-stid="section-room-list"[\s\S]*?>\s*([\s\S]*?)\s*' \
>         r'(?:class\s*=\s*"\s*sticky-book-now\s*"|\s*|id\s*=\s*"Location")'
> rooms_blocks_to_be_replace = re.findall(regex, html_template)

This part:

    \s*([\s\S]*?)\s*'

looks dangerous from a performance point of view. If that can be
rewritten with less potential for backtracking, it might help.

Generally, it should be possible to implement a timeout for any
operation by either scheduling an alarm with signal.alarm or by
executing the operation in a separate thread and killing the thread if
it takes too long.

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | h...@hjp.at        |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
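A sketch of the signal.alarm variant, under several assumptions: it is
Unix-only, it must run in the main thread, and the handler only fires
once the interpreter gets a chance to process pending signals. Whether
that happens in the middle of a long-running match depends on the
interpreter and is not assumed here. RegexTimeout and findall_with_alarm
are made-up names.

```python
import re
import signal


class RegexTimeout(Exception):
    """Raised by the SIGALRM handler (hypothetical name)."""


def _on_alarm(signum, frame):
    raise RegexTimeout()


def findall_with_alarm(pattern, text, seconds):
    """re.findall guarded by a SIGALRM alarm; seconds must be an int."""
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        return re.findall(pattern, text)
    finally:
        signal.alarm(0)  # cancel the alarm if it has not fired yet
        signal.signal(signal.SIGALRM, previous)  # restore the old handler
```

A caller would wrap the call in try/except RegexTimeout and treat the
exception as "no result".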
Re: Regarding Regex timeout behavior to minimize CPU consumption
Hi,

Timeout: no idea. But check out re.compile and re.finditer, as they
might speed things up. I often compile a regex once upon import, then
use it in functions.

On 27 Nov 2020 13:33, Shahique Khan wrote:

> Hi Team,
>
> I have noticed that when our regex does not find a match, it takes
> much longer to return (with an empty result). My question is: can we
> set a timeout parameter (in seconds or milliseconds) on re.findall, or
> anywhere else in the code, to avoid CPU consumption when a regex takes
> too long to execute?
>
> Below is an example that takes a long time to execute (in this case,
> can we set a timeout to kill the execution and avoid the CPU
> consumption?):
>
> regex = r'data-stid="section-room-list"[\s\S]*?>\s*([\s\S]*?)\s*' \
>         r'(?:class\s*=\s*"\s*sticky-book-now\s*"|\s*|id\s*=\s*"Location")'
> rooms_blocks_to_be_replace = re.findall(regex, html_template)
>
> Please help me, I will be very thankful for this.
>
> Thanks,
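For illustration, the "compile once upon import, then use it in
functions" habit might look like the sketch below. The pattern and the
names NUMBER and numbers_in are invented for the example. As noted
elsewhere in this thread, this only saves the compile step; it does not
reduce time spent backtracking.

```python
import re

# Compiled once, when the module is imported.
NUMBER = re.compile(r"\d+")


def numbers_in(text):
    """Return every run of digits in text, in order of appearance."""
    # finditer yields match objects lazily instead of building a list
    # of all matches up front.
    return [match.group() for match in NUMBER.finditer(text)]
```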
Regarding Regex timeout behavior to minimize CPU consumption
Hi Team,

I have noticed that when our regex does not find a match, it takes much
longer to return (with an empty result). My question is: can we set a
timeout parameter (in seconds or milliseconds) on re.findall, or
anywhere else in the code, to avoid CPU consumption when a regex takes
too long to execute?

Below is an example that takes a long time to execute (in this case, can
we set a timeout to kill the execution and avoid the CPU consumption?):

regex = r'data-stid="section-room-list"[\s\S]*?>\s*([\s\S]*?)\s*' \
        r'(?:class\s*=\s*"\s*sticky-book-now\s*"|\s*|id\s*=\s*"Location")'
rooms_blocks_to_be_replace = re.findall(regex, html_template)

Please help me, I will be very thankful for this.

Thanks,