On Wed, 2007-03-28 at 23:59 -0700, Erick Tryzelaar wrote:
> skaller wrote:
> > Ops like trim/strip will simply miss some whitespaces, they won't
> > do the wrong thing provided they treat high bit set chars as non-space.
> >
>
> That's only if you're using trim/strip to remove whitespace. You can
> also use them to trim substrings, or strip all the chars in a string.
> Those won't work on unicode.
I think they will: UTF-8 is designed to allow that.
That is, if you want to remove a substring S from T,
then searching T for S will never find a wrong substring,
unless S is itself not a legal UTF-8 string.
For example, if you search T for S = 0x89 or something,
then you will certainly mess up, but 0x89 on its own is a
continuation byte, not a legal UTF-8 character, so what do you expect?
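
To make the point concrete, here is a rough sketch in Python (not Felix) of a bytewise substring removal on UTF-8 data. The helper name remove_all and the sample text are made up for illustration; they are not from any Felix library.

```python
# Hypothetical helper for illustration; not a Felix library function.
def remove_all(text: bytes, sub: bytes) -> bytes:
    """Remove every occurrence of sub from text, comparing raw bytes."""
    return text.replace(sub, b"")

# Sample text with two accented letters, written as escapes.
t = "na\u00efve caf\u00e9".encode("utf-8")

# S is a complete character (e-acute, bytes C3 A9): the result stays valid.
ok = remove_all(t, "\u00e9".encode("utf-8"))

# S is a lone continuation byte (A9): this tears a character apart.
bad = remove_all(t, b"\xa9")
```

With a legal S the result still decodes as UTF-8. With the lone byte 0xA9 the result ends in a dangling lead byte and a strict decoder rejects it.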
Look at the encoding rules. I claim the following
rule holds: given any two Unicode characters
with UTF-8 encodings (for example, four bytes each):
c0 c1 c2 c3 d0 d1 d2 d3
then if any contiguous substring of the above string is a legal
UTF-8 char, it must be c0 c1 c2 c3 or d0 d1 d2 d3. No other
substring of the above string is a legal encoding.
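
The claim can be spot-checked by brute force. The following Python sketch (the function name and the sample characters are my own choices, and it relies on Python's strict UTF-8 decoder) lists every contiguous byte range of two concatenated encodings that decodes to exactly one character.

```python
def single_char_ranges(c: str, d: str) -> set:
    """Return all (start, end) byte ranges of enc(c)+enc(d) that decode,
    under strict UTF-8 rules, to exactly one character."""
    data = c.encode("utf-8") + d.encode("utf-8")
    found = set()
    for start in range(len(data)):
        for end in range(start + 1, len(data) + 1):
            try:
                s = data[start:end].decode("utf-8")
            except UnicodeDecodeError:
                continue  # not a legal UTF-8 string at all
            if len(s) == 1:
                found.add((start, end))
    return found

# Sample characters with 1-, 2-, 3- and 4-byte encodings.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]

def claim_holds() -> bool:
    """Check that only the two aligned ranges decode to a single char."""
    for c in samples:
        for d in samples:
            n, m = len(c.encode("utf-8")), len(d.encode("utf-8"))
            if single_char_ranges(c, d) != {(0, n), (n, n + m)}:
                return False
    return True
```

This only tests a handful of characters, so it is a sanity check rather than a proof; the proof is in the encoding rules themselves.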
A simpler invariant is: there is no UTF-8 encoding starting
with c1, c2 or c3; indeed a decoder can never get
out of sync. If you hit a byte
0xXX
then XX tells you whether it is the first byte of a sequence or a
continuation byte (10xxxxxx), and a first byte also tells you how
long its sequence is.
So given any pointer into a UTF-8 string you can find the
first byte of the sequence it is in the middle of: back up
until you reach a byte that is not a continuation byte.
The ranges of first bytes and continuation bytes are disjoint.
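
A minimal Python sketch of that resynchronisation step, assuming the input is valid UTF-8 (the function names are hypothetical):

```python
def is_continuation(b: int) -> bool:
    """True if b has the bit pattern 10xxxxxx (a continuation byte)."""
    return (b & 0xC0) == 0x80

def sequence_start(data: bytes, i: int) -> int:
    """Index of the first byte of the sequence that contains data[i].
    Assumes data is valid UTF-8 and 0 <= i < len(data)."""
    while i > 0 and is_continuation(data[i]):
        i -= 1
    return i

# 'a', EURO SIGN (E2 82 AC), 'b': five bytes in total.
data = "a\u20acb".encode("utf-8")
starts = [sequence_start(data, i) for i in range(len(data))]
```

Here starts comes out as [0, 1, 1, 1, 4]: every byte of the three-byte sequence maps back to index 1. In valid UTF-8 the loop steps back at most three bytes.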
So roughly speaking, ANY 'nice' stream operation on 8 bit chars will
also work for UTF-8 and preserve semantics (character meaning).
I won't define 'nice' here, but it includes not only searching
but also sorting: comparing the bytes as unsigned values gives
the same order as comparing the code points.
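
For the sorting case, here is a small Python check that ordering the encoded bytes agrees with ordering by code point. The word list is made up, and this says nothing about locale-aware collation, only about code point order.

```python
# Strings starting with 1-, 2-, 3- and 4-byte characters, in arbitrary order.
words = ["z", "\u00e9", "\U0001F600", "a", "\u20ac", "ab", "\u00e9a"]

# Python compares str values by code point.
by_code_point = sorted(words)

# Sort the raw UTF-8 bytes (compared as unsigned values), then decode.
by_bytes = [b.decode("utf-8")
            for b in sorted(w.encode("utf-8") for w in words)]

same_order = (by_code_point == by_bytes)
```

A sort routine that knows nothing about Unicode therefore still produces code point order on UTF-8 input, provided it compares bytes as unsigned.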
--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net
_______________________________________________
Felix-language mailing list
https://lists.sourceforge.net/lists/listinfo/felix-language