Paul Bijnens <[EMAIL PROTECTED]> writes:
>I have a program that reads and writes (among others) strings that
>should be utf8 encoded.  I say "should", because somewhere deep
>inside the dark corners of that program, sometimes, the utf8 flag on
>a string is lost.  (I'm still investigating where, tips to attack
>such a problem are welcome.)
>
>When writing the string, the program clears the utf8 flag
>and writes a simple string of octets using:
>
>    $s = encode("utf8", $s) if $s =~ /[^\x00-\x7f]/;

Fine, if it has high bits then the UTF-8 form is different from the
"as characters" form.

>
>Why would someone test for pure 7-bit strings instead of:
>
>    $s = encode("utf8", $s) if Encode::is_utf8($s);

That just tells you whether the flag is on. If the string contains
chars in the range 0x80..0xff then is_utf8 _may_ be false, but you
still need to do the encode.

>
>which seems superior to avoid double utf8 encodings,
>should the utf8-flag be lost.  And it's faster.
>
>Or even simply:   Encode::_utf8_off($s)

That works ONLY if is_utf8 was true. If the string was the literal
"\x80" with the flag off, _utf8_off leaves it as that single byte,
and that isn't UTF-8.

>
>The problem is that I'm usually wrong.  Am I this time?
>Am I missing something?  Or do I need more coffee?
>
>
>--
>Paul Bijnens, Xplanation                            Tel  +32 16 397.511
>Technologielaan 21 bus 2, B-3001 Leuven, BELGIUM    Fax  +32 16 397.512
>http://www.xplanation.com/                 email:  [EMAIL PROTECTED]
>***********************************************************************
>* I think I've got the hang of it now:  exit, ^D, ^C, ^\, ^Z, ^Q, F6, *
>* quit,  ZZ, :q, :q!,  M-Z, ^X^C,  logoff, logout, close, bye,  /bye, *
>* stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt,  abort,  hangup, *
>* PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e,  kill -1 $$,  shutdown, *
>* kill -9 1,  Alt-F4,  Ctrl-Alt-Del,  AltGr-NumLock,  Stop-A,  ...    *
>* ...  "Are you sure?"  ...   YES   ...   Phew ...   I'm out          *
>***********************************************************************
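
To make the points above concrete, here is a small sketch, not taken from the program in question. It assumes a Perl with the Encode module and a source file without "use utf8"; the variable names are made up for illustration. It builds the same one-character string with the flag off and with the flag on, then compares what encode() and _utf8_off() do with each.

```perl
use strict;
use warnings;
use Encode qw(encode is_utf8);

# The same one-character string (U+00E9) in both internal representations.
my $flag_off = "\xe9";        # plain literal: one byte, utf8 flag off
my $flag_on  = "\xe9";
utf8::upgrade($flag_on);      # same character, stored as two bytes, flag on

# Perl treats them as the same string of characters.
my $same_chars = ($flag_off eq $flag_on) ? 1 : 0;

# encode() works on characters, so the flag does not change the result.
my $enc_off = unpack "H*", encode("utf8", $flag_off);
my $enc_on  = unpack "H*", encode("utf8", $flag_on);

# _utf8_off() only drops the flag; it never re-encodes anything.
my ($raw_off, $raw_on) = ($flag_off, $flag_on);
Encode::_utf8_off($raw_off);  # no-op: the flag was already off
Encode::_utf8_off($raw_on);   # exposes the internal UTF-8 bytes
my $hex_raw_off = unpack "H*", $raw_off;
my $hex_raw_on  = unpack "H*", $raw_on;

print "is_utf8: off=", (is_utf8($flag_off) ? 1 : 0),
      " on=", (is_utf8($flag_on) ? 1 : 0), "\n";
print "same characters: $same_chars\n";
print "encode():    $enc_off / $enc_on\n";
print "_utf8_off(): $hex_raw_off / $hex_raw_on\n";
```

Both encode() results should be the same two octets, while _utf8_off() gives valid UTF-8 only for the string that had the flag on. That is why testing is_utf8 before encoding would skip strings that still need it.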