Re: [PHP] substr and UTF-8

2006-08-30 Thread Michael B Allen
On Wed, 30 Aug 2006 21:46:18 +0700
"Peter Lauri" <[EMAIL PROTECTED]> wrote:

>   function is_utf8_start($b) {
>       return (($b & 0x80) == 0) || ($b & 0x40);
>   }
> [/snip]
> 
> :) I think I will go with the mb_substr function, it works for me :)

Yeah, I guess that's the right thing to do. Otherwise, in a year you
won't remember what the cryptic masking is all about.
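
For the archive, a minimal sketch of the difference between the two calls. The sample string is made up (one ASCII letter followed by ten copies of U+0E01), and it assumes the mbstring extension is loaded:

```php
<?php
// Hypothetical sample text: "a" followed by ten copies of U+0E01 (3 bytes each in UTF-8).
$thaistring = 'a' . str_repeat("\xE0\xB8\x81", 10);   // 31 bytes, 11 characters

// substr() counts bytes, so a cut at 5 bytes lands inside the second Thai character.
$bytes = substr($thaistring, 0, 5);

// mb_substr() counts characters when the encoding is passed explicitly.
$chars = mb_substr($thaistring, 0, 5, 'UTF-8');

var_dump(mb_check_encoding($bytes, 'UTF-8'));  // false: the result ends mid-character
var_dump(mb_check_encoding($chars, 'UTF-8'));  // true
var_dump(strlen($chars));                      // 13 bytes: 1 ASCII + 4 * 3
```

Passing 'UTF-8' explicitly avoids depending on whatever internal encoding mbstring happens to be configured with.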

Mike

-- 
Michael B Allen
PHP Active Directory SSO
http://www.ioplex.com/

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



RE: [PHP] substr and UTF-8

2006-08-30 Thread Peter Lauri
[snip]
Actually this is false. I don't know what I was thinking. The high bit
is set in every byte of a multi-byte UTF-8 sequence. If it's not set,
the byte is an ASCII character.

The bytes are actually laid out as follows [1]:

U-00000000 - U-0000007F:  0xxxxxxx
U-00000080 - U-000007FF:  110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

So there's no way to recognize the last byte of a UTF-8 byte sequence on
its own, but you can tell whether a byte is the first one by looking at
bits 7 and 8 (0x40 and 0x80). Specifically, if bit 8 is not on, the
character is ASCII and thus the "start" of a new character. Otherwise,
if bit 7 is on, it's the start of a new multi-byte UTF-8 sequence; if
bit 7 is off, it's a continuation byte.

  function is_utf8_start($b) {
      return (($b & 0x80) == 0) || ($b & 0x40);
  }
[/snip]

:) I think I will go with the mb_substr function, it works for me :)




Re: [PHP] substr and UTF-8

2006-08-30 Thread Michael B Allen
On Wed, 30 Aug 2006 10:08:36 -0400
Michael B Allen <[EMAIL PROTECTED]> wrote:

> On Wed, 30 Aug 2006 18:34:20 +0700
> "Peter Lauri" <[EMAIL PROTECTED]> wrote:
> 
> > Hi group,
> > 
> > I want to limit the number of characters that are shown in a script. The
> > characters happen to be Thai, and the page is encoded in UTF-8. Everything
> > works, except when I want to cut the text (just take start of string).
> > 
> > I do:
> > 
> > echo substr($thaistring, 0, 30);
> > 
> > The beginning of the string works fine, but the last character usually
> > "breaks". How can I determine the start and end of a character?
> 
> The last byte of a UTF-8 character does not have bit 8 set whereas all
> preceding bytes do.

Actually this is false. I don't know what I was thinking. The high bit
is set in every byte of a multi-byte UTF-8 sequence. If it's not set,
the byte is an ASCII character.

The bytes are actually laid out as follows [1]:

U-00000000 - U-0000007F:  0xxxxxxx
U-00000080 - U-000007FF:  110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
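
To see the layout on real bytes, here is a small sketch that prints each byte of a string as eight binary digits. The helper name utf8_bits is made up for illustration, and the sample characters are my own picks:

```php
<?php
// Hypothetical helper: render every byte of $s as an 8-digit binary string.
function utf8_bits($s) {
    $out = array();
    for ($i = 0; $i < strlen($s); $i++) {
        $out[] = str_pad(decbin(ord($s[$i])), 8, '0', STR_PAD_LEFT);
    }
    return implode(' ', $out);
}

echo utf8_bits('A'), "\n";              // 01000001                    (U+0041, ASCII)
echo utf8_bits("\xC3\xA9"), "\n";       // 11000011 10101001           (U+00E9)
echo utf8_bits("\xE0\xB8\x81"), "\n";   // 11100000 10111000 10000001  (U+0E01, Thai)
```

Every byte after the first in a sequence begins with 10, which is what makes continuation bytes recognizable.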

So there's no way to recognize the last byte of a UTF-8 byte sequence on
its own, but you can tell whether a byte is the first one by looking at
bits 7 and 8 (0x40 and 0x80). Specifically, if bit 8 is not on, the
character is ASCII and thus the "start" of a new character. Otherwise,
if bit 7 is on, it's the start of a new multi-byte UTF-8 sequence; if
bit 7 is off, it's a continuation byte.

  // $b is one byte as an integer, e.g. ord($str[$i])
  function is_utf8_start($b) {
      return (($b & 0x80) == 0) || ($b & 0x40);
  }
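
If mbstring is not an option, the start-byte test could be used to cut after a given number of characters. This is only a sketch: utf8_cut_chars is a made-up name, is_utf8_start is repeated so the block runs on its own, and the input is assumed to be valid UTF-8.

```php
<?php
// $b is one byte as an integer, e.g. ord($s[$i]).
function is_utf8_start($b) {
    return (($b & 0x80) == 0) || ($b & 0x40);
}

// Hypothetical helper: return at most $maxchars characters from the start of $s.
// Assumes $s is valid UTF-8; it counts start bytes and cuts before the first extra one.
function utf8_cut_chars($s, $maxchars) {
    $count = 0;
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        if (is_utf8_start(ord($s[$i]))) {
            if ($count == $maxchars) {
                return substr($s, 0, $i);   // byte $i begins character number $maxchars + 1
            }
            $count++;
        }
    }
    return $s;   // the string has $maxchars characters or fewer
}

// Made-up sample: "a" followed by ten copies of U+0E01 (3 bytes each).
$thaistring = 'a' . str_repeat("\xE0\xB8\x81", 10);
echo strlen(utf8_cut_chars($thaistring, 5)), "\n";   // 13
```

Note that this counts code points, so a base letter and a following combining vowel or tone mark can still be separated by the cut.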

Mike

[1] http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

-- 
Michael B Allen
PHP Active Directory SSO
http://www.ioplex.com/




Re: [PHP] substr and UTF-8

2006-08-30 Thread Michael B Allen
On Wed, 30 Aug 2006 18:34:20 +0700
"Peter Lauri" <[EMAIL PROTECTED]> wrote:

> Hi group,
> 
> I want to limit the number of characters that are shown in a script. The
> characters happen to be Thai, and the page is encoded in UTF-8. Everything
> works, except when I want to cut the text (just take start of string).
> 
> I do:
> 
> echo substr($thaistring, 0, 30);
> 
> The beginning of the string works fine, but the last character usually
> "breaks". How can I determine the start and end of a character?

The last byte of a UTF-8 character does not have bit 8 set whereas all
preceding bytes do.

Mike

-- 
Michael B Allen
PHP Active Directory SSO
http://www.ioplex.com/




Re: [PHP] substr and UTF-8

2006-08-30 Thread Jochem Maas
Peter Lauri wrote:
> Hi group,
> 
> I want to limit the number of characters that are shown in a script. The
> characters happen to be Thai, and the page is encoded in UTF-8. Everything
> works, except when I want to cut the text (just take start of string).
> 
> I do:
> 
> echo substr($thaistring, 0, 30);
> 
> The beginning of the string works fine, but the last character usually
> "breaks". How can I determine the start and end of a character?

Become familiar with (and install) the mbstring extension.
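
A rough sketch of how a script might check for the extension before relying on it. The wrapper name utf8_left is made up, and the fallback (a PCRE match in UTF-8 mode) is just one possible substitute, not something suggested elsewhere in this thread:

```php
<?php
// Hypothetical wrapper: first $n characters of a UTF-8 string.
function utf8_left($s, $n) {
    if (extension_loaded('mbstring')) {
        return mb_substr($s, 0, $n, 'UTF-8');
    }
    // Fallback: /u treats pattern and subject as UTF-8, /s lets "." match newlines too.
    if (preg_match('/^.{0,' . (int)$n . '}/us', $s, $m)) {
        return $m[0];
    }
    return '';   // preg_match() does not match when $s is not valid UTF-8
}

echo utf8_left("abc\xE0\xB8\x81\xE0\xB8\x82", 4), "\n";   // "abc" plus the first Thai character
```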

> 
> I hope the problem is clear enough, is it? :)
> 
> Best regards,
> Peter Lauri
> 
> www.lauri.se - personal web site
> www.dwsasia.com - company web site
> 




[PHP] substr and UTF-8

2006-08-30 Thread Peter Lauri
Hi group,

I want to limit the number of characters that are shown in a script. The
characters happen to be Thai, and the page is encoded in UTF-8. Everything
works, except when I want to cut the text (just take start of string).

I do:

echo substr($thaistring, 0, 30);

The beginning of the string works fine, but the last character usually
"breaks". How can I determine the start and end of a character?

I hope the problem is clear enough, is it? :)

Best regards,
Peter Lauri

www.lauri.se - personal web site
www.dwsasia.com - company web site
