Paul Bijnens <[EMAIL PROTECTED]> writes:
>I have a program that reads and writes (among others) strings that
>should be utf8 encoded.  I say "should", because somewhere deep
>inside the dark corners of that program, sometimes, the utf8 flag on
>a string is lost. (I'm still investigating where, tips to attack
>such a problem are welcome.)
>
>When writing the string, the program clears the utf8 flag
>and writes a simple string of octets using:
>
>     $s = encode("utf8", $s) if $s =~ /[^\x00-\x7f]/;

Fine. If it has high-bit characters then the UTF-8 form is different
from the "as characters" form.

>
>Why would someone test for pure 7-bit strings instead of:
>
>     $s = encode("utf8", $s) if Encode::is_utf8($s);

That just tells you whether the flag is on. If the string contains
chars in the range 0x80..0xff then is_utf8 _may_ be false,
but you still need to do the encode.
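
A minimal sketch of that case, assuming a stock Encode; the sample
string "caf\xe9" is just an illustration, not taken from your program:

```perl
use strict;
use warnings;
use Encode qw(encode is_utf8);

# "\xe9" is e-acute (U+00E9) written as a byte escape, so perl stores
# the string as plain bytes with the utf8 flag off.
my $s = "caf\xe9";

my $flag_on  = is_utf8($s) ? 1 : 0;            # flag is off here
my $has_high = $s =~ /[^\x00-\x7f]/ ? 1 : 0;   # the regex test still fires

# Encoding is still required: U+00E9 becomes the two octets C3 A9.
my $octets = encode("utf8", $s);

printf "flag=%d high=%d chars=%d octets=%d\n",
    $flag_on, $has_high, length($s), length($octets);
```

A test on is_utf8 alone would skip the encode for this string and write
a bare 0xE9 octet, which is not valid UTF-8.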

>
>which seems superior to avoid double utf8 encodings,
>should the utf8-flag be lost.  And it's faster.
>
>Or even simply:     Encode::_utf8_off($s)

That works ONLY if is_utf8 was true.
If the string was the literal "\x80" with the flag off, clearing the
flag changes nothing, and a lone 0x80 octet isn't UTF-8.
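
Roughly, the difference looks like this. It is only a sketch: the
variable names are made up, and utf8::upgrade is used here just to force
the flag on for the demonstration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Same character twice: once with the utf8 flag forced on, once without.
my $flagged = "\xe9";
utf8::upgrade($flagged);        # internal bytes are now C3 A9, flag on
my $plain = "\x80";             # one byte 0x80, flag off

# Clearing the flag merely exposes whatever the internal bytes are.
Encode::_utf8_off($flagged);    # now the two octets C3 A9
Encode::_utf8_off($plain);      # no-op: still the single octet 80

# encode() gives the right answer regardless of the flag.
my $right = encode("utf8", "\x80");   # octets C2 80

printf "%vX  %vX  %vX\n", $flagged, $plain, $right;
```

So _utf8_off happens to give UTF-8 for the flagged string, but leaves
the unflagged one as an invalid lone octet, whereas encode handles both.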

>
>The problem is that I'm usually wrong.  Am I this time?
>Am I missing something?  Or do I need more coffee?
>
>
>-- 
>Paul Bijnens, Xplanation                            Tel  +32 16 397.511
>Technologielaan 21 bus 2, B-3001 Leuven, BELGIUM    Fax  +32 16 397.512
>http://www.xplanation.com/          email:  [EMAIL PROTECTED]
>***********************************************************************
>* I think I've got the hang of it now:  exit, ^D, ^C, ^\, ^Z, ^Q, F6, *
>* quit,  ZZ, :q, :q!,  M-Z, ^X^C,  logoff, logout, close, bye,  /bye, *
>* stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt,  abort,  hangup, *
>* PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e,  kill -1 $$,  shutdown, *
>* kill -9 1,  Alt-F4,  Ctrl-Alt-Del,  AltGr-NumLock,  Stop-A,  ...    *
>* ...  "Are you sure?"  ...   YES   ...   Phew ...   I'm out          *
>***********************************************************************
