Encode::is_utf8, v2.18, checking for well formed UTF-8, bug ?

I understand that is_utf8(<string>, 1) will check whether the given string contains well-formed UTF-8 -- having forced the string to utf8.
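
A minimal sketch of how such an experiment might be run. It assumes that "forcing the string to utf8" means turning the UTF8 flag on with Encode::_utf8_on (an internal Encode function) without validating the bytes. looks_well_formed is a hypothetical helper name and the byte strings are only sample inputs. Results for the higher ranges may well differ between Perl and Encode versions, so no expected output is shown.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Encode): copy the bytes, force the
# UTF8 flag on without any validation, then let is_utf8 do the check.
sub looks_well_formed {
    my ($bytes) = @_;
    my $copy = $bytes;              # leave the caller's string untouched
    Encode::_utf8_on($copy);        # internal: sets the flag, checks nothing
    return Encode::is_utf8($copy, 1) ? 1 : 0;
}

# Sample inputs: a valid two-byte sequence, a lone continuation byte,
# a redundant (overlong) sequence, and the start of the rejected range.
for my $bytes ("\xC2\x80", "\x80", "\xC0\x80", "\xF5\x80\x80\x80") {
    printf "%-10s => %d\n", uc unpack('H*', $bytes), looks_well_formed($bytes);
}
```

The helper works on a copy because _utf8_on changes how Perl interprets the string's buffer, and a flagged but malformed string is best not passed on to other code.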

Experiment shows that this does indeed reject strings that contain:

- any invalid bytes, ie:
  * bytes 0x80..0xBF outside a sequence
  * bytes outside 0x80..0xBF inside a sequence
- any redundant UTF-8 sequences, ie any sequence which is well-formed, but for which a shorter sequence exists.

So far, so good.

It also rejects all sequences in the range:

    U+0014_0000: 0: \xF5\x80\x80\x80
    U+001F_FFFF: 0: \xF7\xBF\xBF\xBF

But otherwise it accepts all sequences between U+0080: \xC2\x80 and U+7FFF_FFFF: \xFD\xBF\xBF\xBF\xBF\xBF.

I am content that the definition of utf8 allows for character values of at least 0x00..0x7FFF_FFFF. But there is a hole in the range ! Bug ??

It would be useful to have a check that spots:

- U+D800..U+DFFF -- nonsense values
- U+FFFD -- though could be meaningful
- U+FFFE -- though may be being used for BOM
- U+FFFF -- not really expected
- characters beyond U+10_FFFF -- nonsense values

The check should run across either a byte string or an already utf8 string.

A smart check could return a bit mask, so that one could detect the presence of each of the above cases (and others that I don't know of). Actually, it could also spot BOM marker(s) ?

I know that this can be done by decode/encode with UTF-8:

- decode('UTF-8', string) inserts U+FFFD for: U+D800..U+DFFF, U+FFFF and anything beyond U+10_FFFF. It leaves U+FFFD and U+FFFE. To detect invalids one has to look in the decoded string for \x{FFFD} or \x{FFFE}.
- encode('UTF-8', string, 1) will croak for: U+D800..U+DFFF, U+FFFF and anything beyond U+10_FFFF. It leaves U+FFFD and U+FFFE. To detect those one has to scan the encoded string.

But we seem to be doing a lot of work here... and apparently copying strings around to no good effect. (Though, I guess that at some point one will have to decode the string, if it is valid 'UTF-8'.)

Chris

PS: I find that decode('UTF-8', string, sub { $n++ ; return '?' ; }) simply doesn't work ! That is, the embedded sub does not appear to be called; instead decode seems to stop at the first error and quietly give up, returning the partly decoded string.

-- Chris Hall
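
For illustration, the bit-mask check wished for above could be sketched in plain Perl roughly as follows. This is only a sketch under stated assumptions: suspect_mask and the HAS_* flag names are hypothetical and not part of Encode, the input is assumed to be an already decoded (character) string, and a byte string would still have to be decoded first.

```perl
use strict;
use warnings;

# Hypothetical flag names; nothing here comes from Encode.
use constant {
    HAS_SURROGATE   => 0x01,    # U+D800..U+DFFF
    HAS_REPLACEMENT => 0x02,    # U+FFFD
    HAS_FFFE        => 0x04,    # U+FFFE
    HAS_FFFF        => 0x08,    # U+FFFF
    HAS_BEYOND      => 0x10,    # anything above U+10_FFFF
    HAS_BOM         => 0x20,    # U+FEFF
};

# Scan a character string once and OR together a flag for every
# questionable class of code point that occurs in it.
sub suspect_mask {
    my ($chars) = @_;
    my $mask = 0;
    for my $cp (map { ord($_) } split //, $chars) {
        if    ($cp >= 0xD800 && $cp <= 0xDFFF) { $mask |= HAS_SURROGATE;   }
        elsif ($cp == 0xFEFF)                  { $mask |= HAS_BOM;         }
        elsif ($cp == 0xFFFD)                  { $mask |= HAS_REPLACEMENT; }
        elsif ($cp == 0xFFFE)                  { $mask |= HAS_FFFE;        }
        elsif ($cp == 0xFFFF)                  { $mask |= HAS_FFFF;        }
        elsif ($cp >  0x10FFFF)                { $mask |= HAS_BEYOND;      }
    }
    return $mask;
}

# One replacement character plus one code point beyond U+10_FFFF.
my $mask = suspect_mask("ok\x{FFFD}" . chr(0x110000));
printf "mask = 0x%02X\n", $mask;    # prints "mask = 0x12"
```

A caller can then test individual bits, for example `$mask & HAS_BEYOND`, rather than searching the string once per case. Doing the scan in Perl code is slow compared with a check inside Encode itself, so this shows the interface rather than a practical implementation.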