mbstring does not decode hexadecimal numeric entities in HTML code. For example:

echo urlencode( mb_convert_encoding("&#x0415;", "UTF-8", "HTML-ENTITIES") );

displays %F2%AF%B8%9F rather than the expected %D0%95.
This bug was detected by Nick Wedd <[EMAIL PROTECTED]> and reported in the
newsgroup comp.lang.php, Message-ID: <[EMAIL PROTECTED]>.
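
The unexpected bytes can be reproduced by hand. The sketch below is only an illustration: it assumes the input was the hexadecimal entity &#x0415; and that the decimal-only loop removed by the patch runs over the buffer "&#x0415", treating the 'x' as if it were a digit. The names old_decimal_decode and utf8_percent_encode are hypothetical helpers, not part of libmbfl, and the second one percent-encodes every byte (unlike urlencode, which leaves ASCII letters and digits alone).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Decimal-only accumulation, modelled on the two lines the patch removes:
 * every byte after "&#" is treated as a decimal digit, including 'x'. */
static long old_decimal_decode(const char *buffer, int status)
{
    long ent = 0;
    int pos;

    for (pos = 2; pos < status; pos++)
        ent = ent * 10 + (buffer[pos] - '0');
    return ent;
}

/* Hypothetical helper: write the UTF-8 bytes of cp as %XX triplets.
 * out must have room for at least 13 bytes (4 triplets plus the NUL). */
static void utf8_percent_encode(unsigned long cp, char *out)
{
    unsigned char b[4];
    int n, i;

    assert(out != NULL && cp <= 0x10FFFFUL);
    if (cp < 0x80UL) {
        b[0] = (unsigned char)cp;
        n = 1;
    } else if (cp < 0x800UL) {
        b[0] = (unsigned char)(0xC0 | (cp >> 6));
        b[1] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000UL) {
        b[0] = (unsigned char)(0xE0 | (cp >> 12));
        b[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        b[2] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 3;
    } else {
        b[0] = (unsigned char)(0xF0 | (cp >> 18));
        b[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        b[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        b[3] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 4;
    }
    for (i = 0; i < n; i++)
        snprintf(out + 3 * i, 4, "%%%02X", (unsigned)b[i]);
}
```

With these helpers, old_decimal_decode("&#x0415", 7) yields 720415 (0xAFE1F), because 'x' - '0' is 72 and the remaining digits are appended in base 10. Encoding 720415 gives %F2%AF%B8%9F, while the intended code point 0x415 gives %D0%95.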

I found the bug in the file ext/mbstring/libmbfl/filters/mbfilter_htmlent.c
and added these features:

- decode hex entities &#xHHHH;
- detect invalid digits
- detect missing digits (no digits at all)
- detect values outside the range 0-0xffff

Invalid values are returned verbatim.
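
To make the intended behaviour concrete at the string level, here is a minimal standalone sketch of the rules listed above. It is not the libmbfl filter: parse_entity_body, utf8_put and decode_numeric_entities are hypothetical names, the output is assumed to be UTF-8, and, like the patch, it does not reject surrogate values. An entity &#0; would produce an embedded NUL byte in the output.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Parse the text between "&#" and ";". Returns the value, or -1 when
 * digits are missing, a digit is invalid, or the value exceeds 0xffff. */
static int parse_entity_body(const char *s, size_t len)
{
    int hex = (len > 0 && (s[0] == 'x' || s[0] == 'X'));
    size_t pos = hex ? 1 : 0;
    int base = hex ? 16 : 10;
    int ent = 0;

    if (pos >= len)
        return -1;                      /* no digits at all */
    for (; pos < len; pos++) {
        unsigned char c = (unsigned char)s[pos];
        int d;

        if (isdigit(c))
            d = c - '0';
        else if (hex && isxdigit(c))
            d = tolower(c) - 'a' + 10;
        else
            return -1;                  /* invalid digit */
        ent = ent * base + d;
        if (ent > 0xffff)
            return -1;                  /* out of range */
    }
    return ent;
}

/* Write cp (0..0xffff) as UTF-8; returns the number of bytes written. */
static size_t utf8_put(int cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Decode numeric entities from in to out; anything that fails to parse
 * is copied verbatim. Returns 0, or -1 if out is too small. */
static int decode_numeric_entities(const char *in, char *out, size_t out_size)
{
    size_t o = 0;

    assert(in != NULL && out != NULL);
    while (*in) {
        const char *semi = NULL;
        int ent = -1;

        if (in[0] == '&' && in[1] == '#'
            && (semi = strchr(in + 2, ';')) != NULL)
            ent = parse_entity_body(in + 2, (size_t)(semi - (in + 2)));
        if (ent >= 0) {
            if (o + 3 >= out_size)
                return -1;
            o += utf8_put(ent, out + o);
            in = semi + 1;              /* skip past the ';' */
        } else {
            if (o + 1 >= out_size)
                return -1;
            out[o++] = *in++;           /* verbatim copy */
        }
    }
    out[o] = '\0';
    return 0;
}
```

Under these assumptions, "&#x0415;" and "&#1045;" both decode to the two bytes D0 95, while "&#xZZ;", "&#;", "&#x;" and "&#x10000;" come back unchanged.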

The right place for this patch would apparently be
http://cvs.sourceforge.jp/cgi-bin/viewcvs.cgi/php-i18n/
but the project is no longer hosted there.

The patch for ext/mbstring/libmbfl/filters/mbfilter_htmlent.c follows:

173a174,217
> static int mbfl_decode_numeric_entity(char *s, int s_len)
> /*
>       s = numeric entity "ddd" or "xhhhh"
>       return: numeric value or -1 if not inside [0,0xffff] or invalid digits
> */
> {
>       int ent, pos, c, d;
> 
>       ent = 0;
> 
>       if (*s == 'x' || *s == 'X') {
>               /* hexadecimal base */
>               if ( s_len < 2 )
>                       return -1;  /* no digits found */
>               for (pos=1; pos<s_len; pos++) {
>                       c = s[pos];
>                       if (isdigit(c))
>                               d = c - '0';
>                       else if (isxdigit(c))
>                               d = tolower(c) - 'a' + 10;
>                       else
>                               return -1;  /* invalid hex digit */
>                       ent = (ent << 4) + d;
>                       if (ent > 0xffff)
>                               return -1;  /* too big */
>               }
> 
>       } else {
>               /* decimal base */
>               if ( s_len < 1 )
>                       return -1;  /* no digits found */
>               for (pos=0; pos<s_len; pos++) {
>                       c = s[pos];
>                       if (! isdigit(c) )
>                               return -1;  /* invalid dec char */
>                       ent = ent*10 + (c - '0');
>                       if (ent > 0xffff)
>                               return -1;  /* too big */
>               }
>       }
> 
>       return ent;
> }
> 
192,193c236,246
<                               for (pos=2; pos<filter->status; pos++) {
<                                       ent = ent*10 + (buffer[pos] - '0');
---
>                               ent = mbfl_decode_numeric_entity(&buffer[2], filter->status - 2);
>                               if( ent >= 0 ){
>                                       CK((*filter->output_function)(ent, filter->data));
>                                       filter->status = 0;
>                                       /*php_error_docref("ref.mbstring" TSRMLS_CC, E_NOTICE, "mbstring decoded '%s'=%d", buffer, ent);*/
>                               } else {
>                                       /* failure */
>                                       buffer[filter->status++] = ';';
>                                       buffer[filter->status] = 0;
>                                       /* php_error_docref("ref.mbstring" TSRMLS_CC, E_WARNING, "mbstring cannot decode '%s'", buffer); */
>                                       mbfl_filt_conv_html_dec_flush(filter);
195,197d247
<                               CK((*filter->output_function)(ent, filter->data));
<                               filter->status = 0;
<                               /*php_error_docref("ref.mbstring" TSRMLS_CC, E_NOTICE, "mbstring decoded '%s'=%d", buffer, ent);*/


Best regards,
 ___ 
/_|_\  Umberto Salsi
\/_\/  www.icosaedro.it

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
