mbstring does not correctly decode hexadecimal numeric entities in HTML code. For example:
echo urlencode( mb_convert_encoding("&#x0415;", "UTF-8", "HTML-ENTITIES") );
displays %F2%AF%B8%9F rather than the expected %D0%95.
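The wrong bytes can be reproduced by hand from the decimal-only loop that the patch below replaces: it treats every character after "&#" as a decimal digit, including the 'x'. The following is a minimal sketch in C, assuming the test input was the hex entity &#x0415; and an ASCII character set. The names old_decode and utf8_encode are made up for this illustration and are not part of libmbfl.

```c
#include <stddef.h>

/* Hypothetical stand-in for the old loop in mbfilter_htmlent.c:
   every character of the entity body is taken as a decimal digit. */
static long old_decode(const char *s, size_t s_len)
{
    long ent = 0;
    size_t pos;

    for (pos = 0; pos < s_len; pos++)
        ent = ent * 10 + (s[pos] - '0');   /* 'x' - '0' is 72 in ASCII */
    return ent;
}

/* Hypothetical helper: plain UTF-8 encoder for values up to 0x1FFFFF.
   Writes the bytes to out and returns how many were written. */
static int utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char) (0xF0 | ((cp >> 18) & 0x07));
    out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char) (0x80 | (cp & 0x3F));
    return 4;
}
```

For the entity body "x0415", old_decode returns 720415 (0xAFE1F), which utf8_encode turns into the bytes F2 AF B8 9F. The intended value 0x415 encodes as D0 95.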
This bug was detected by Nick Wedd <[EMAIL PROTECTED]> and reported in the
newsgroup comp.lang.php, Message-ID: <[EMAIL PROTECTED]>.
I found the bug in the file ext/mbstring/libmbfl/filters/mbfilter_htmlent.c
and added these features:
- decode hex entities &#xHHHH;
- detect invalid digits
- detect entities with no digits at all
- detect values outside the range 0-0xffff
Invalid entities are passed through verbatim.
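The rules listed above can be tried in isolation. What follows is a rough sketch and not the patch itself: parse_numeric_entity is a hypothetical helper that takes a complete entity string, whereas the real filter works on its internal buffer. A return value of -1 stands for "emit the text verbatim".

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: parse a complete entity such as "&#1045;" or
   "&#x415;". Returns the code point, or -1 if the entity is malformed
   or its value lies outside 0-0xffff. */
static long parse_numeric_entity(const char *s)
{
    size_t len = strlen(s);
    size_t pos = 2;
    int base = 10;
    long ent = 0;

    if (len < 4 || s[0] != '&' || s[1] != '#' || s[len - 1] != ';')
        return -1;                      /* not of the form &#...; */
    if (s[pos] == 'x' || s[pos] == 'X') {
        base = 16;
        pos++;
    }
    if (pos == len - 1)
        return -1;                      /* no digits at all */
    for (; pos < len - 1; pos++) {
        unsigned char c = (unsigned char) s[pos];
        int d;

        if (isdigit(c))
            d = c - '0';
        else if (base == 16 && isxdigit(c))
            d = tolower(c) - 'a' + 10;
        else
            return -1;                  /* invalid digit for this base */
        ent = ent * base + d;
        if (ent > 0xffff)
            return -1;                  /* out of range */
    }
    return ent;
}
```

Under these assumptions "&#1045;" and "&#x0415;" both yield 0x415, while "&#x;", "&#12a;" and "&#x10000;" yield -1.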
The right place for this patch would apparently be
http://cvs.sourceforge.jp/cgi-bin/viewcvs.cgi/php-i18n/
but the project is no longer hosted there.
The patch for ext/mbstring/libmbfl/filters/mbfilter_htmlent.c follows:
173a174,217
> static int mbfl_decode_numeric_entity(char *s, int s_len)
> /*
>  s = numeric entity "ddd" or "xhhhh"
>  return: numeric value or -1 if not inside [0,0xffff] or invalid digits
> */
> {
>     int ent, pos, c, d;
>
>     ent = 0;
>
>     if (*s == 'x' || *s == 'X') {
>         /* hexadecimal base */
>         if ( s_len < 2 )
>             return -1; /* no digits found */
>         for (pos=1; pos<s_len; pos++) {
>             c = (unsigned char) s[pos]; /* is*() need a non-negative value */
>             if (isdigit(c))
>                 d = c - '0';
>             else if (isxdigit(c))
>                 d = tolower(c) - 'a' + 10;
>             else
>                 return -1; /* invalid hex digit */
>             ent = (ent << 4) + d;
>             if (ent > 0xffff)
>                 return -1; /* too big */
>         }
>
>     } else {
>         /* decimal base */
>         if ( s_len < 1 )
>             return -1; /* no digits found */
>         for (pos=0; pos<s_len; pos++) {
>             c = (unsigned char) s[pos]; /* is*() need a non-negative value */
>             if (! isdigit(c) )
>                 return -1; /* invalid dec char */
>             ent = ent*10 + (c - '0');
>             if (ent > 0xffff)
>                 return -1; /* too big */
>         }
>     }
>
>     return ent;
> }
>
192,193c236,246
< for (pos=2; pos<filter->status; pos++) {
< ent = ent*10 + (buffer[pos] - '0');
---
> ent = mbfl_decode_numeric_entity(&buffer[2], filter->status - 2);
> if( ent >= 0 ){
>     CK((*filter->output_function)(ent, filter->data));
>     filter->status = 0;
>     /*php_error_docref("ref.mbstring" TSRMLS_CC, E_NOTICE, "mbstring decoded '%s'=%d", buffer, ent);*/
> } else {
>     /* failure */
>     buffer[filter->status++] = ';';
>     buffer[filter->status] = 0;
>     /* php_error_docref("ref.mbstring" TSRMLS_CC, E_WARNING, "mbstring cannot decode '%s'", buffer); */
>     mbfl_filt_conv_html_dec_flush(filter);
195,197d247
< CK((*filter->output_function)(ent, filter->data));
< filter->status = 0;
< /*php_error_docref("ref.mbstring" TSRMLS_CC, E_NOTICE, "mbstring decoded '%s'=%d", buffer, ent);*/
Best regards,
___
/_|_\ Umberto Salsi
\/_\/ www.icosaedro.it
--
PHP Internals - PHP Runtime Development Mailing List