Re: [PHP] substr and UTF-8
On Wed, 30 Aug 2006 21:46:18 +0700
"Peter Lauri" <[EMAIL PROTECTED]> wrote:

> function is_utf8_start($b) {
>     return (($b & 0x80) == 0) || ($b & 0x40);
> }
> [/snip]
>
> :) I think I will go with the mb_substr function, it works for me :)

Yeah, I guess that's the right thing to do. Otherwise, in a year you
won't remember what the cryptic masking is all about.

Mike

--
Michael B Allen
PHP Active Directory SSO
http://www.ioplex.com/

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
RE: [PHP] substr and UTF-8
[snip]
Actually this is false. I don't know what I was thinking. The high bit
will be set in all bytes of a UTF-8 byte sequence. If it's not, it's an
ASCII character. The bytes are actually laid out as follows [1]:

  U-00000000 - U-0000007F: 0xxxxxxx
  U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
  U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
  U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

So there's no way to tell the last byte of a UTF-8 byte sequence, but
you can tell if it's the first byte by looking at bits 7 and 8.
Specifically, if bit 8 is not on, the character is ASCII and thus the
"start" of a new character. Otherwise, if bit 7 is on, it's the start of
a new UTF-8 byte sequence.

  function is_utf8_start($b) {
      return (($b & 0x80) == 0) || ($b & 0x40);
  }
[/snip]

:) I think I will go with the mb_substr function, it works for me :)
Re: [PHP] substr and UTF-8
On Wed, 30 Aug 2006 10:08:36 -0400
Michael B Allen <[EMAIL PROTECTED]> wrote:

> On Wed, 30 Aug 2006 18:34:20 +0700
> "Peter Lauri" <[EMAIL PROTECTED]> wrote:
>
> > Hi group,
> >
> > I want to limit the number of characters that are shown in a script. The
> > characters happen to be Thai, and the page is encoded in UTF-8. Everything
> > works, except when I want to cut the text (just take start of string).
> >
> > I do:
> >
> > echo substr($thaistring, 0, 30);
> >
> > The beginning of the string works fine, but the last character usually
> > "breaks". How can I determine the start and end of a character?
>
> The last byte of a UTF-8 character does not have bit 8 set whereas all
> preceding bytes do.

Actually this is false. I don't know what I was thinking. The high bit
will be set in all bytes of a UTF-8 byte sequence. If it's not, it's an
ASCII character. The bytes are actually laid out as follows [1]:

  U-00000000 - U-0000007F: 0xxxxxxx
  U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
  U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
  U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

So there's no way to tell the last byte of a UTF-8 byte sequence, but
you can tell if it's the first byte by looking at bits 7 and 8.
Specifically, if bit 8 is not on, the character is ASCII and thus the
"start" of a new character. Otherwise, if bit 7 is on, it's the start of
a new UTF-8 byte sequence.

  function is_utf8_start($b) {
      return (($b & 0x80) == 0) || ($b & 0x40);
  }

Mike

[1] http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
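The byte test in the message above can be turned into a character-safe
cut. What follows is only a minimal sketch, assuming valid UTF-8 input
and PHP 7 or later. The helper names is_utf8_start_byte and
utf8_take_chars are hypothetical, and ord() is added on the assumption
that the posted function expects an integer byte value rather than a
one-character string.

```php
<?php
// Hypothetical helpers; only the bit masks come from the message above.

// True if the byte at offset $i starts a character: either ASCII
// (high bit clear) or a UTF-8 lead byte (top two bits set).
// Continuation bytes have the form 10xxxxxx and return false.
function is_utf8_start_byte(string $s, int $i): bool
{
    $b = ord($s[$i]); // one-byte string -> integer 0..255
    return ($b & 0x80) === 0 || ($b & 0x40) !== 0;
}

// Return the first $maxChars characters (code points) of $s by
// counting start bytes and cutting just before the next one.
function utf8_take_chars(string $s, int $maxChars): string
{
    $seen = 0;
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        if (is_utf8_start_byte($s, $i)) {
            if ($seen === $maxChars) {
                return substr($s, 0, $i); // cut lands on a boundary
            }
            $seen++;
        }
    }
    return $s; // string is already short enough
}

// Six Thai code points, three bytes each ("\u{...}" needs PHP 7+).
$thaistring = "\u{0E2A}\u{0E27}\u{0E31}\u{0E2A}\u{0E14}\u{0E35}";
$firstTwo = utf8_take_chars($thaistring, 2); // 6 bytes, 2 characters
```

"Characters" here means code points. A Thai vowel or tone mark is its
own code point, so a cut like this can still separate a consonant from
the mark that follows it.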
Re: [PHP] substr and UTF-8
On Wed, 30 Aug 2006 18:34:20 +0700
"Peter Lauri" <[EMAIL PROTECTED]> wrote:

> Hi group,
>
> I want to limit the number of characters that are shown in a script. The
> characters happen to be Thai, and the page is encoded in UTF-8. Everything
> works, except when I want to cut the text (just take start of string).
>
> I do:
>
> echo substr($thaistring, 0, 30);
>
> The beginning of the string works fine, but the last character usually
> "breaks". How can I determine the start and end of a character?

The last byte of a UTF-8 character does not have bit 8 set whereas all
preceding bytes do.

Mike
Re: [PHP] substr and UTF-8
Peter Lauri wrote:

> Hi group,
>
> I want to limit the number of characters that are shown in a script. The
> characters happen to be Thai, and the page is encoded in UTF-8. Everything
> works, except when I want to cut the text (just take start of string).
>
> I do:
>
> echo substr($thaistring, 0, 30);
>
> The beginning of the string works fine, but the last character usually
> "breaks". How can I determine the start and end of a character?

Become familiar with (and install) the mbstring extension.

>
> I hope the problem is clear enough, is it? :)
>
> Best regards,
> Peter Lauri
>
> www.lauri.se - personal web site
> www.dwsasia.com - company web site
>
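With mbstring installed, the cut from the question might look like the
sketch below. It assumes the extension is loaded and PHP 7 or later (for
the "\u{...}" escapes); the sample string and the length of 3 are
placeholders for illustration, not values from the thread.

```php
<?php
// Requires the mbstring extension.

// Six Thai code points, three bytes each in UTF-8.
$thaistring = "\u{0E2A}\u{0E27}\u{0E31}\u{0E2A}\u{0E14}\u{0E35}";

// mb_substr() counts characters, not bytes. Passing the encoding
// explicitly avoids relying on the mb_internal_encoding() setting.
$short = mb_substr($thaistring, 0, 3, 'UTF-8');

$chars = mb_strlen($short, 'UTF-8'); // 3 characters
$bytes = strlen($short);             // 9 bytes

echo $short, "\n";
```

For the original case, the call would be
mb_substr($thaistring, 0, 30, 'UTF-8').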
[PHP] substr and UTF-8
Hi group,

I want to limit the number of characters that are shown in a script. The
characters happen to be Thai, and the page is encoded in UTF-8. Everything
works, except when I want to cut the text (just take start of string).

I do:

echo substr($thaistring, 0, 30);

The beginning of the string works fine, but the last character usually
"breaks". How can I determine the start and end of a character?

I hope the problem is clear enough, is it? :)

Best regards,
Peter Lauri

www.lauri.se - personal web site
www.dwsasia.com - company web site
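The breakage described above comes from substr() counting bytes while
each Thai character takes three bytes in UTF-8. A small sketch that
reproduces it, assuming PHP 7 or later; the sample string and the cut
positions are made up for illustration.

```php
<?php
// Six Thai code points; each one is three bytes in UTF-8.
$thaistring = "\u{0E2A}\u{0E27}\u{0E31}\u{0E2A}\u{0E14}\u{0E35}";

// strlen() and substr() work on bytes, not characters.
$byteLength = strlen($thaistring); // 18

// Cutting at byte 4 keeps one whole character plus one stray byte.
$cut = substr($thaistring, 0, 4);

// preg_match() with the /u modifier returns false when the subject is
// not valid UTF-8, so it doubles as a simple validity check.
$cutIsValid = preg_match('//u', $cut) === 1;                          // false
$wholeIsValid = preg_match('//u', substr($thaistring, 0, 3)) === 1;   // true
```

A cut at a multiple of three bytes happens to work for this string, but
a single ASCII character or space earlier in the text shifts every
boundary, so a fixed byte count such as 30 is not safe in general.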