------- You are receiving this mail because: ------- You are on the CC list for the bug.
http://bugs.exim.org/show_bug.cgi?id=1347 --- Comment #3 from [email protected] 2013-04-16 18:42:55 --- -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi Phil, and thanks for this (even if canne) response - I feared as much ;-) >> For a website using Syntax Highlighting based on GeSHi 1.0.X there's a >> reproduceable crash for certain inputs that causes a Segfault while the >> highlighting is performed. > I haven't time to look at this in detail just at the moment, but as you > mention the word "recursion", it may be just normal PCRE behaviour ... > using a lot of stack. Actually I already had several occasions (or rather did the users of GeSHi) have spurious crashes based on this behaviour. So let's have a look at the various points. > This is not a bug, though it is frequently mis-reported as one, so much > so that I keep this standard response on file: Unfortunately that's the default behaviour for most distributions causing quite some hard to debug crashes as (see stacktrace of my original report) there's not much relation between e.g. the executed PHP code and the actual crash down in libpcre - thus I'm getting quite regularly reports on crashes where there's not much debugging that can be done. In this case I got quite lucky as there's a clear sample causing the trouble where even the code to reproduce is quite minimal. > ------------------------------------------------------------------------ > 1. Matching a regular expression is like finding your way through a > forest with many branching paths. As PCRE passes each junction, it has > to remember data so that it can backtrack to that point if necessary. By > default, it uses recursion to store this data on the process stack, > because that is fast. However, it can alternatively be compiled to use > the heap instead (run ./configure with --disable-stack-for-recursion), > but that slows performance. Unfortunately not an option when writing a library that's dependent on the system-installed versions and configurations of things. But even than I think here's the best way to workaround the issue. How much work would it be to allow for libpcre to be compiled ina kinda "hybrid mode" where stack-based matching is tried first and if reaching the (stack) recursion limit a retry using the heap-based implementation is performed with much higher limits? Given (as you say so yourself below) that usually you can rewrite regular expressions to avoid recursion as much as possible this could be a fallback that solved two issues: - - libpcre causing random crashes due to complex queries - - if your task is complex enough to exceed "softlimits" usually performance won't matter anymore Having a slow solution is better than having none at all. > 2. It is very easy to write a regular expression that has a very large > number of branches (unlimited repetition of a group, for example). When > PCRE goes deep into such a tree, it may use a lot of memory. I know. And I guess I'm hitting something like this in the current situation. Is there a way to get some "stats" for a regexp like max recursion depth, taken branches, steps needed for solution, ...? > 3. Even in these days of gigabyte main memories, some operating system > environments set small default limits on the maximum size of the process > stack, for example, 8Mb. Thus, it is often the case that there is more > heap than stack available (by default). A matching operation that needs > a lot of memory may succeed if the heap is used, but run out of memory > if the stack is used. See suggestion above. > 4. Running out of stack often causes a segfault. Because of this, PCRE > contains the facility to limit the depth of recursion so as to return an > error code instead. However, the default value is large, so it does not > normally come into play unless you explicitly set a smaller value. Unfortunately PHP has a horrible interface in regards to this: I don't really see that error code and usually I just get some empty return string instead, so I can't even check for the recursion stack overflow condition. And even than, due to the way GeSHi works there are quite some regular expressions to be evaluated and thus checking for each and every return value would be quite a hit on performance. > 5. If you are running into a problem of stack overflow, you have the > following choices: > > (a) Work on your regular expression pattern so that it uses less > memory. Sometimes using atomic groups can help with this. Will need some debugging aid for this. > (b) Increase the size of your process stack. Can't influence this for PHP. Also GeSHi is a library --> no chance. > (c) Compile PCRE to use the heap instead of the stack. Can't influence this. Would have to be changed in the distribution. -> No chance. > (d) Set PCRE's recursion limit small enough so that it gives an error > before the stack overflows. Well, this should avoid the crash, sure. But won't help with the actual issue of the expression complexity. > 6. There is more discussion of some of these ideas in the pcrestack.3 > man page. Some questions on that: - - Do you know if PHP uses the pcrejit option? And if yes, since around which version? - - Given something like ^\d+(?:(?:A|B|BC|BD|C|DB)+\d+)*$ What would be the best way to ask libpcre to try various combinations of the letters while avoiding backtracking once one such consecutive group has been found? > Philip I think the issue itself might reside on both sides, thus some assistance with debugging and creating a suiteable solution would be nice. There aren't any real time constraints, thus if finding time to look at things in more detail takes a while it shouldn't matter. Best regards, BenBE. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with undefined - http://www.enigmail.net/ iQIcBAEBAgAGBQJRbY17AAoJEPHTXLno4S6tptUP/A318JVBTH8gXdPpKMgNuZ9u Y29Vg3r3hhe/CEWWufkuKRFNSon9/B8g9nsX0p865bhe5fZWVEVJwv+32yvyNmNR bknD3QpRy0WB9i/IQx6WUF0SicWrGUPDuDJ0KncAHMB0Y6SMzKHefVRv7G6gPJsd C+u9q+XWyK1ofHFD4EkVJZD0AEaewfo4/Km1HcA9ezAHL/QFMgGIdUytwH6GJCFH 9Sq8/vIBYpLKa9e33N8Ajryqd8j8EEQRv8n+DFR9HnQPtyZEW8YJBWoO/Wz2Hm7u CvQQaf5t9VOmKLLKLscTGZOp72tXEe/EQTNO5VxjcXljTon6GNFw3nbStz3iHkdS C6MU9GXwesryqNNHlWekOhZ6TkN1niM6+J/9eZpR+0QESNEGI60GhW64jENfI381 /Z5Vj4syEsVoYEeh2OdeD4zppr9MfWPxh3RP09I5mOFZ5NHtgxcDvI0sJGZAoMc4 vEgZ6s85Xde05zb8V2a3dYQolUDRKEDnmgs2x8IqXo28svw3eGIjGX14TtRURB9+ +UZIBVAJdFTHvGW+ouL4crk82ceZxTGsjNWzkauyZRQ103LYJpNCXhspIPf2r3Y2 O42tlEjF2rh94j+lPKBJySOl3GahCbKLUGcJMnMOtdoY2pOdhrYrZUiBq6V5aLJn /J6SdLAoxWtir5ExwsKY =7Kv9 -----END PGP SIGNATURE----- -- Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
