#33898 [Opn->Bgs]: basename() misbehaves on multibyte characters

2005-07-28 Thread tony2001
 ID:   33898
 Updated by:   [EMAIL PROTECTED]
 Reported By:  feldgendler at mail dot ru
-Status:   Open
+Status:   Bogus
 Bug Type: Filesystem function related
 Operating System: Debian GNU/Linux i686
 PHP Version:  5.0.4
 New Comment:

See bug #33260.


Previous Comments:


[2005-07-28 11:15:46] feldgendler at mail dot ru

I've explored the source code of the php_basename() function, and here
is what I found:

When a multi-byte character (inc_len > 1) immediately follows a slash,
state is not changed to 1, because the code that sets it sits in the
case 1 branch and is skipped.

The following code:

    if (state == 0) {
        comp = c;
        state = 1;
    }

...needs to be inserted at the point marked below:

    while (cnt > 0) {
        inc_len = (*c == '\0' ? 1: php_mblen(c, cnt));

        switch (inc_len) {
            case -2:
            case -1:
                inc_len = 1;
                php_mblen(NULL, 0);
                break;
            case 0:
                goto quit_loop;
            case 1:
#if defined(PHP_WIN32) || defined(NETWARE)
                if (*c == '/' || *c == '\\') {
#else
                if (*c == '/') {
#endif
                    if (state == 1) {
                        state = 0;
                        cend = c;
                    }
                } else {
                    if (state == 0) {
                        comp = c;
                        state = 1;
                    }
                }
            default:
                -- HERE IT GOES --
                break;
        }
        c += inc_len;
        cnt -= inc_len;
    }

Can I expect that this bug will be fixed in CVS?
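
The behaviour described above can be reproduced outside PHP with a small stand-alone sketch. This is not the PHP source: utf8_len() is a hypothetical stand-in for php_mblen() that assumes UTF-8 input, and last_component() and its handle_multibyte flag are names invented for this illustration. With the flag set to 0, the scan starts a new component only on single-byte characters, as in the quoted loop. With the flag set to 1, it also starts one on multi-byte characters, which is the idea behind the proposed change. Note that the sketch keeps the slash branch separate from the multi-byte branch. Since case 1 in the quoted switch falls through into default, code inserted at the marked point would also run after a slash unless a break is added.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for php_mblen(): byte length of the UTF-8 sequence
 * starting at s, or -1 if the lead byte is invalid or the sequence would
 * run past the n remaining bytes. */
static int utf8_len(const char *s, size_t n)
{
    unsigned char b = (unsigned char)s[0];
    int len;

    if (b < 0x80)                len = 1;
    else if ((b & 0xE0) == 0xC0) len = 2;
    else if ((b & 0xF0) == 0xE0) len = 3;
    else if ((b & 0xF8) == 0xF0) len = 4;
    else return -1;

    return ((size_t)len <= n) ? len : -1;
}

/* Finds the last path component of s. Returns a pointer to its first byte
 * and stores its length in bytes in *out_len.
 * handle_multibyte == 0: only single-byte characters can start a component
 *                        (the behaviour described in the report).
 * handle_multibyte == 1: multi-byte characters can start one as well. */
static const char *last_component(const char *s, int handle_multibyte,
                                  size_t *out_len)
{
    const char *c = s, *comp = s, *cend = s;
    size_t cnt;
    int state = 0;  /* 0: between components, 1: inside a component */

    assert(s != NULL && out_len != NULL);
    cnt = strlen(s);

    while (cnt > 0) {
        int inc_len = utf8_len(c, cnt);

        if (inc_len < 1) {
            inc_len = 1;  /* invalid byte: step over it without changing state */
        } else if (inc_len == 1 && *c == '/') {
            if (state == 1) {  /* a slash ends the current component */
                state = 0;
                cend = c;
            }
        } else if (inc_len == 1 || handle_multibyte) {
            if (state == 0) {  /* first character of a new component */
                comp = c;
                state = 1;
            }
        }
        c += inc_len;
        cnt -= inc_len;
    }
    if (state == 1) {
        cend = c;  /* the component runs to the end of the string */
    }
    *out_len = (size_t)(cend - comp);
    return comp;
}
```

Called on a path whose last component consists only of two-byte characters, the first mode reports the preceding component ("english" in the test case) and the second mode reports the Cyrillic component.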



[2005-07-28 10:59:54] feldgendler at mail dot ru

Description:

The source code in my testcase is in UTF-8 encoding itself. The quoted
string contains Cyrillic letters. If I save the source code in KOI8-R
(single-byte) Cyrillic encoding, and change the second argument to
setlocale() to ru_RU.KOI8-R, the observed result is what I expect.
This shows that the bug only occurs on multi-byte characters, because
in KOI8-R all characters are single-byte.
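
The difference between the two encodings can be made concrete with a few lines of C. This is only an illustration: the byte values below are an assumed seven-letter Cyrillic word written out by hand in each encoding, and utf8_char_count() is a helper invented for this sketch. In KOI8-R each letter is one byte, so a byte-wise scan sees every character. In UTF-8 each of these letters takes two bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts characters in a UTF-8 string by counting every byte that is not a
 * continuation byte (bit pattern 10xxxxxx). Assumes valid UTF-8 input. */
static size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}

/* The same seven Cyrillic letters, spelled out as explicit bytes
 * (illustrative values, not taken from the PHP source). */
static const char WORD_UTF8[] =
    "\xd1\x80\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9";
static const char WORD_KOI8R[] = "\xd2\xd5\xd3\xd3\xcb\xc9\xca";
```

Here strlen(WORD_KOI8R) equals the number of letters, while strlen(WORD_UTF8) is twice utf8_char_count(WORD_UTF8).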

Relevant PHP configuration options:
--enable-mbstring=all
(--enable-zend-multibyte was not specified)

Relevant environment variables:
LANG=en_US.UTF-8
(LC_* are not set)

Reproduce code:
---
<?php

setlocale(LC_CTYPE, "en_US.UTF-8");
echo basename("english/русский");

?>

Expected result:

русский

Actual result:
--
english







#33898 [Opn->Bgs]: basename() misbehaves on multibyte characters

2005-07-28 Thread tony2001
 ID:   33898
 Updated by:   [EMAIL PROTECTED]
 Reported By:  feldgendler at mail dot ru
-Status:   Open
+Status:   Bogus
 Bug Type: Filesystem function related
 Operating System: Debian GNU/Linux i686
 PHP Version:  5.0.4
 New Comment:

> What do you mean? Doesn't PHP 5.0.4, with all its multi-byte
> capabilities, support Unicode?

Yes, full multibyte support is planned for 5.2.

> And last, what's wrong with my proposed modification?

We don't need a workaround for a particular problem while patches
fixing all multibyte-related problems are ready and being tested.
Also, please do use `diff -u` next time when you post patches. Thanks.


Previous Comments:


[2005-07-28 11:41:48] feldgendler at mail dot ru

A message in that bug says: "Sorry, but this is not supported yet.
You'll have to wait for PHP that supports unicode."

What do you mean? Doesn't PHP 5.0.4, with all its multi-byte
capabilities, support Unicode?

I've searched the bug database and found similar bugs (30105, 30014,
28981) that are currently in the "No feedback" state, not "Bogus".
Why is this bug "Bogus"?

And last, what's wrong with my proposed modification? Doesn't it fix
the bug?



[2005-07-28 11:31:26] [EMAIL PROTECTED]

See bug #33260.


