Edit report at https://bugs.php.net/bug.php?id=60423&edit=1

 ID:                 60423
 Updated by:         fel...@php.net
 Reported by:        amal dot samally at gmail dot com
 Summary:            Segmentation fault with the UTF-8 check regexp in some cases
-Status:             Open
+Status:             Bogus
 Type:               Bug
 Package:            PCRE related
 Operating System:   Linux
 PHP Version:        5.3.8
 Block user comment: N
 Private report:     N

 New Comment:

Sorry, but your problem does not imply a bug in PHP itself. For a
list of more appropriate places to ask for help using PHP, please
visit http://www.php.net/support.php as this bug system is not the
appropriate forum for asking support questions. Due to the volume
of reports we cannot explain in detail here why your report is not
a bug. The support channels will be able to provide an explanation
for you.

Thank you for your interest in PHP.

This is known behavior in PCRE. Check the other bug reports or the PCRE documentation.


Previous Comments:
------------------------------------------------------------------------
[2011-12-02 12:29:48] amal dot samally at gmail dot com

gdb output:

(gdb) run test.php
Starting program: /usr/local/bin/php test.php
[Thread debugging using libthread_db enabled]

Program received signal SIGSEGV, Segmentation fault.
0x0000000000498948 in match (
    eptr=0x139a566 "1{font-weight:bold}#gbg6.gbgt-hvr,#gbg6.gbgt:focus{background-color:transparent;background-image:none}.gbg4a{font-size:0;line-height:0}.gbg4a .gbts{padding:27px 5px 0;*padding:25px 5px 0}.gbto .gbg4a "..., ecode=0x13d9525 "^",
    mstart=0x13990f0 "<!doctype html> <head> <title>docs.pravo.ru - Поиск в Google</title> <script>window.google={kEI:\"hTnXTp-POZDqOabAzMYO\",getEI:function(a){var b;while(a&&!(a.getAttribute&&(b=a.getAttribute"..., markptr=0x0, offset_top=2, md=0x7fffffffb340, ims=0,
    eptrb=Cannot access memory at address 0x7fffff7feff8
) at /tmp/php_build/php-5.3.8/ext/pcre/pcrelib/pcre_exec.c:471
471	{
(gdb) bt
#0  0x0000000000498948 in match (
    eptr=0x139a566 "1{font-weight:bold}#gbg6.gbgt-hvr,#gbg6.gbgt:focus{background-color:transparent;background-image:none}.gbg4a{font-size:0;line-height:0}.gbg4a .gbts{padding:27px 5px 0;*padding:25px 5px 0}.gbto .gbg4a "..., ecode=0x13d9525 "^",
    mstart=0x13990f0 "<!doctype html> <head> <title>docs.pravo.ru - Поиск в Google</title> <script>window.google={kEI:\"hTnXTp-POZDqOabAzMYO\",getEI:function(a){var b;while(a&&!(a.getAttribute&&(b=a.getAttribute"..., markptr=0x0, offset_top=2, md=0x7fffffffb340, ims=0,
    eptrb=Cannot access memory at address 0x7fffff7feff8
) at /tmp/php_build/php-5.3.8/ext/pcre/pcrelib/pcre_exec.c:471
#1  0x000000000049b352 in match (
    eptr=0x139a566 "1{font-weight:bold}#gbg6.gbgt-hvr,#gbg6.gbgt:focus{background-color:transparent;background-image:none}.gbg4a{font-size:0;line-height:0}.gbg4a .gbts{padding:27px 5px 0;*padding:25px 5px 0}.gbto .gbg4a "..., ecode=0x13d9748 "V\002#\033U\002,",
    mstart=0x13990f0 "<!doctype html> <head> <title>docs.pravo.ru - Поиск в Google</title> <script>window.google={kEI:\"hTnXTp-POZDqOabAzMYO\",getEI:function(a){var b;while(a&&!(a.getAttribute&&(b=a.getAttribute"..., markptr=0x0, offset_top=2, md=0x7fffffffb340, ims=0, eptrb=0x0, flags=0, rdepth=10464)
    at /tmp/php_build/php-5.3.8/ext/pcre/pcrelib/pcre_exec.c:1654
#2  0x00000000004994e0 in match (
    eptr=0x139a565 "s1{font-weight:bold}#gbg6.gbgt-hvr,#gbg6.gbgt:focus{background-color:transparent;background-image:none}.gbg4a{font-size:0;line-height:0}.gbg4a .gbts{padding:27px 5px 0;*padding:25px 5px 0}.gbto .gbg4a"..., ecode=0x13d9525 "^",
    mstart=0x13990f0 "<!doctype html> <head> <title>docs.pravo.ru - Поиск в Google</title> <script>window.google={kEI:\"hTnXTp-POZDqOabAzMYO\",getEI:function(a){var b;while(a&&!(a.getAttribute&&(b=a.getAttribute"..., markptr=0x0, offset_top=2, md=0x7fffffffb340, ims=0, eptrb=0x0, flags=0, rdepth=10463)
    at /tmp/php_build/php-5.3.8/ext/pcre/pcrelib/pcre_exec.c:885
#3  0x000000000049b352 in match (
    eptr=0x139a565 "s1{font-weight:bold}#gbg6.gbgt-hvr,#gbg6.gbgt:focus{background-color:transparent;background-image:none}.gbg4a{font-size:0;line-height:0}.gbg4a .gbts{padding:27px 5px 0;*padding:25px 5px 0}.gbto .gbg4a"..., ecode=0x13d9748 "V\002#\033U\002,",
    mstart=0x13990f0 "<!doctype html> <head> <title>docs.pravo.ru - Поиск в Google</title> <s---Type <return> to continue, or q <return> to quit---

------------------------------------------------------------------------
[2011-12-01 14:46:10] larue...@php.net

Thank you for this bug report. To properly diagnose the problem, we
need a backtrace to see what is happening behind the scenes. To
find out how to generate a backtrace, please read
http://bugs.php.net/bugs-generating-backtrace.php for *NIX and
http://bugs.php.net/bugs-generating-backtrace-win32.php for Win32

Once you have generated a backtrace, please submit it to this bug
report and change the status back to "Open". Thank you for helping
us make PHP better.

------------------------------------------------------------------------
[2011-12-01 10:57:33] amal dot samally at gmail dot com

I don't think so. Also, changing pcre.backtrack_limit / pcre.recursion_limit makes no difference.

------------------------------------------------------------------------
[2011-12-01 10:10:52] larue...@php.net

See #41638; it may be the same issue.

------------------------------------------------------------------------
[2011-12-01 09:04:37] amal dot samally at gmail dot com

Description:
------------
I'm using this regexp to test whether a string is a valid UTF-8 encoded string, but in some cases it causes a segmentation fault.

Examples of strings that cause the error:
http://samally.ru/php_pcre_segmentation_fault/test1.txt
http://samally.ru/php_pcre_segmentation_fault/test2.txt

Test script:
---------------
$string = file_get_contents('http://samally.ru/php_pcre_segmentation_fault/test1.txt');
// $string = file_get_contents('http://samally.ru/php_pcre_segmentation_fault/test2.txt');

// Tests whether a string is a valid UTF-8 encoded string.
// @link http://w3.org/International/questions/qa-forms-utf-8.html
$r = preg_match('~^(?:
      [\x09\x0A\x0D\x20-\x7E]            # ASCII without control characters
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$~DSXx', $string);

------------------------------------------------------------------------
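
For anyone who finds this report later: the rdepth=10464 in the backtrace suggests that match() recurses once per iteration of the repeated group, so a long subject can exhaust the C stack. A minimal sketch of a check that avoids the repeated group is shown here. The function name is_valid_utf8_text is hypothetical, and it relies on PCRE's own UTF-8 validation (triggered by the /u modifier) instead of the W3C pattern, so the two may disagree on edge cases.

```php
<?php
// Hypothetical helper, not taken from the original report.
function is_valid_utf8_text($string)
{
    // With the /u modifier PCRE validates the whole subject as UTF-8
    // before matching. On invalid input preg_match() returns false;
    // the empty pattern matches any valid subject and returns 1.
    if (preg_match('//u', $string) !== 1) {
        return false;
    }

    // Like the original pattern, reject ASCII control characters other
    // than tab, LF and CR. A single character class has no repeated
    // group, so it does not recurse per input character.
    return preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', $string) === 0;
}

var_dump(is_valid_utf8_text("caf\xC3\xA9")); // valid 2-byte sequence
var_dump(is_valid_utf8_text("\xC0\xAF"));    // overlong encoding of "/"
```

If the mbstring extension is available, mb_check_encoding($string, 'UTF-8') is another option for the encoding check alone.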