RE: [PHP] substr and UTF-8

Peter Lauri Wed, 30 Aug 2006 07:49:13 -0700

[snip]
Actually this is false. I don't know what I was thinking. The high bit
is set in every byte of a multi-byte UTF-8 sequence. If it's not set,
the byte is an ASCII character.


The bytes are actually laid out as follows [1]:

U-00000000 - U-0000007F:        0xxxxxxx
U-00000080 - U-000007FF:        110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF:        1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF:        11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

So there's no way to tell the last byte of a UTF-8 byte sequence by
looking at that byte alone, but you can tell if a byte is the first one
by looking at bits 7 and 8 (counting from 1, i.e. masks 0x40 and 0x80).
Specifically, if bit 8 is not on, the character is ASCII and thus the
"start" of a new character. Otherwise, if bit 7 is on it's the start of
a new UTF-8 byte sequence.

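  // $b is one byte as an integer, e.g. ord($str[$i]), not a string.
  function is_utf8_start($b) {
      // ASCII byte (0xxxxxxx) or lead byte (11xxxxxx).
      return (($b & 0x80) == 0) || ($b & 0x40);
  }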
[/snip]
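
For what it's worth, here is a rough sketch of how the start-byte test
quoted above could be used to cut a string at a byte limit without
splitting a character. The helper name utf8_cut_bytes() is made up for
this example, and the sketch assumes the input is already valid UTF-8
and that the limit is not negative.

```php
<?php
// Same test as in the quoted message: true for ASCII bytes (0xxxxxxx)
// and for lead bytes (11xxxxxx), false for continuation bytes (10xxxxxx).
function is_utf8_start($b) {
    return (($b & 0x80) == 0) || (($b & 0x40) != 0);
}

// Hypothetical helper: keep at most $maxBytes bytes of $s, backing up
// so that the cut never lands inside a multi-byte character.
// Assumes $s is valid UTF-8 and $maxBytes >= 0.
function utf8_cut_bytes($s, $maxBytes) {
    if (strlen($s) <= $maxBytes) {
        return $s;
    }
    $end = $maxBytes;
    // $s[$end] is the first byte that would be dropped. If it is a
    // continuation byte, the cut is inside a character, so move left.
    while ($end > 0 && !is_utf8_start(ord($s[$end]))) {
        $end--;
    }
    return substr($s, 0, $end);
}

// "h\xC3\xA9llo" is "héllo" spelled out as UTF-8 bytes.
var_dump(utf8_cut_bytes("h\xC3\xA9llo", 2) === "h");         // bool(true)
var_dump(utf8_cut_bytes("h\xC3\xA9llo", 3) === "h\xC3\xA9"); // bool(true)
```

With a limit of 2 the cut would fall between the two bytes of the "é",
so the sketch backs up and keeps only "h".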

:) I think I will go with the mb_substr function, it works for me :)
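
A small comparison of the two functions, as a sketch only. It assumes
the mbstring extension is loaded and passes the encoding explicitly so
it does not depend on the configured default:

```php
<?php
$s = "h\xC3\xA9llo"; // "héllo" written as explicit UTF-8 bytes

// substr() counts bytes: 2 bytes is "h" plus the first half of "é".
$bytes = substr($s, 0, 2);

// mb_substr() counts characters when told the string is UTF-8.
$chars = mb_substr($s, 0, 2, 'UTF-8');

var_dump(strlen($bytes));          // int(2)
var_dump($chars === "h\xC3\xA9");  // bool(true)
```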

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
