https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93502
Bug ID: 93502 Summary: std::regex_match uses stack space proportional to input string Product: gcc Version: 9.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: nyh at math dot technion.ac.il Target Milestone: --- Created attachment 47736 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47736&action=edit Simple test code demonstrating the bug As we all know from Automata Theory course, the finite automaton used to match a regular expression takes a constant amount of memory, unrelated to the size of the input string. The attached test program creates a string with N (the first command-line parameter) spaces, and then checks whether it matches the trivial regular expression " *" using std::regex_match(). Obviously it should match, and take constant amount of memory unrelated to the input string length (N). But it turns out that std::regex_match() does take more memory on the stack the bigger N is (the longer the input string is). For example: $ c++ -fsanitize=address z.cc $ ./a.out 10000 Matched! $ ./a.out 100000 ... SUMMARY: AddressSanitizer: stack-overflow (/tmp/a.out+0x42a88b) in std::function<bool (char)>::operator()(char) const We have seen this problem in a real application - which crashed with a stack overflow because it tried to match a simple regular expression to a long input string. The attached program is of course only a simplified example. Trying to figure out where this excessive stack use comes from, I discovered that the boolean std::regex_match() actually calls a different overload, which returns not just the boolean, but also the match itself: regex_match(_Bi_iter __first, _Bi_iter __last, const basic_regex<_Ch_type, _Rx_traits>& __re, regex_constants::match_flag_type __flags = regex_constants::match_default) { match_results<_Bi_iter> __what; return regex_match(__first, __last, __what, __re, __flags); } However, I'm not sure if or why this explains the excessive stack suage - shouldn't match_results contain only the beginning and end pointers of the match - and not a full copy of the match? But whether or not this implementation explains the bug, it's definitely a bug. Even if for some reason (?) the regex_match overload with match_results cannot be implemented with constant memory usage, the one I am complaining about - the one which does not need to return the match results, just a boolean - definitely shouldn't take more stack memory the bigger the input string is.