What does one do if the encoding is unknown and all you have is a sequence of bytes?
Hi Folks,

Suppose that these hex bytes:

C3 83 C2 B1

show up in a message and the message contains no hint what its encoding is.

Perhaps it is 8859-1, in which case the message consists of four 1-byte characters:

C3 = Ã
83 = the “no break here” character
C2 = Â
B1 = ±

Perhaps it is UTF-8, in which case the message consists of two 2-byte characters:

C383 = 쎃
C2B1 = 슱

Or, perhaps it is some other encoding.

What does one do in such a situation?

/Roger
RE: What does one do if the encoding is unknown and all you have is a sequence of bytes?
> Suppose that these hex bytes: C3 83 C2 B1 show up in a message and the message contains no hint what its encoding is.
>
> Perhaps it is 8859-1, in which case the message consists of four 1-byte characters:
>
> C3 = Ã
> 83 = the “no break here” character
> C2 = Â
> B1 = ±
>
> Perhaps it is UTF-8, in which case the message consists of two 2-byte characters:
>
> C383 = 쎃
> C2B1 = 슱

Actually, that would be interpreting it as UTF-16, not as UTF-8. That can probably be quickly ruled out if the rest of the text is not obviously in UTF-16.

Interpreted as UTF-8, it would be:

C3 83 -- U+00C3 = Ã
C2 B1 -- U+00B1 = ±

That is more likely than the other two alternatives you cite.

Of course, you also have to consider serial corruptions as a possibility. It could have started out as UTF-8 C3 B1 -- U+00F1 = ñ. Then the C3 B1 got misinterpreted as Latin-1 (Ã ±), and those two characters were re-encoded as UTF-8, yielding C3 83 C2 B1.

--Ken
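The three readings discussed here, and the "serial corruption" scenario, can be checked with a short Python sketch. The variable names are illustrative only, and the repair step assumes the corruption was exactly one round of "UTF-8 bytes read as Latin-1, then re-encoded as UTF-8"; real data may have gone through something else.

```python
# The four bytes from the original question.
data = bytes.fromhex("C3 83 C2 B1")

# Reading 1: ISO 8859-1, four 1-byte characters (U+00C3 U+0083 U+00C2 U+00B1).
as_latin1 = data.decode("latin-1")

# Reading 2: UTF-8, two 2-byte sequences (U+00C3 U+00B1).
as_utf8 = data.decode("utf-8")

# Reading 3: big-endian UTF-16, two 16-bit code units (U+C383 U+C2B1).
as_utf16be = data.decode("utf-16-be")

# Serial corruption: map the UTF-8 reading back to single bytes via Latin-1,
# then decode those bytes as UTF-8. This undoes one round of double encoding.
repaired = as_utf8.encode("latin-1").decode("utf-8")

for label, text in [("latin-1", as_latin1), ("utf-8", as_utf8),
                    ("utf-16-be", as_utf16be), ("repaired", repaired)]:
    print(label, [f"U+{ord(ch):04X}" for ch in text])
```

The repaired string is the single character U+00F1 (ñ), which matches the scenario of an original C3 B1 that was double-encoded.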
Re: What does one do if the encoding is unknown and all you have is a sequence of bytes?
On 07/19/2013 11:51 AM, Costello, Roger L. wrote:

> Hi Folks,
>
> Suppose that these hex bytes: C3 83 C2 B1 show up in a message and the message contains no hint what its encoding is.
>
> Perhaps it is 8859-1, in which case the message consists of four 1-byte characters:
>
> C3 = Ã
> 83 = the “no break here” character
> C2 = Â
> B1 = ±
>
> Perhaps it is UTF-8, in which case the message consists of two 2-byte characters:
>
> C383 = 쎃
> C2B1 = 슱

That's not how UTF-8 works. Instead, in UTF-8 it would be:

C3 83 = LATIN CAPITAL LETTER A WITH TILDE
C2 B1 = PLUS-MINUS SIGN

It's unlikely that text in any other encoding that contains non-ASCII bytes will pass a UTF-8 validity test for inputs longer than just a few bytes. So you can rule UTF-8 in or out fairly easily. You can also look for BOMs to detect UTF-16 and UTF-32. After that, there are various heuristics that can be applied, and people have written tools that attempt to guess encodings. An example from Perl is

http://search.cpan.org/~dankogai/Encode-2.51/lib/Encode/Guess.pm

but it requires a list of possible encodings that it experiments with.

> Or, perhaps it is some other encoding.
>
> What does one do in such a situation?
>
> /Roger
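A minimal Python sketch of the first two checks mentioned here, a BOM lookup followed by a strict UTF-8 validity test. The function names `sniff_bom` and `is_valid_utf8` are made up for this illustration, and the UTF-8 BOM entry is an addition beyond the UTF-16 and UTF-32 BOMs named above.

```python
import codecs

# Known byte order marks. The UTF-32 entries come first because the
# UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True

sample = bytes.fromhex("C383C2B1")
print(sniff_bom(sample))      # no BOM present
print(is_valid_utf8(sample))  # well-formed UTF-8
```

Passing the validity test does not prove the text is UTF-8; it only means UTF-8 cannot be ruled out. Failing it does rule UTF-8 out.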
Re: What does one do if the encoding is unknown and all you have is a sequence of bytes?
On Jul 19, 2013, at 12:42 PM, Mark Davis ☕ m...@macchiato.com wrote:

> Popping up a level. ICU (and some other libraries) have heuristic encoding detection that will take a sequence of bytes and come up with a likely encoding id.

However, the ICU encoding detection typically requires more than 4 bytes (usually at least 10 characters' worth of bytes) in order to make a reasonable guess.

- Peter E
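To see why four bytes give a detector so little to work with, here is a rough Python sketch that simply tries a list of candidate encodings and reports which ones decode the input without error. This is not how ICU's detector works (ICU scores candidates statistically); the `CANDIDATES` list and the function name `plausible_encodings` are assumptions made for the illustration.

```python
# Hypothetical candidate list; a real application would pick encodings
# that are plausible for its own data sources.
CANDIDATES = ["ascii", "utf-8", "latin-1", "cp1252", "utf-16-be", "koi8-r"]

def plausible_encodings(data: bytes, candidates):
    """Return the candidates under which `data` decodes without error."""
    survivors = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding is ruled out
        survivors.append(name)
    return survivors

data = bytes.fromhex("C3 83 C2 B1")
# ASCII is rejected; every other candidate accepts these four bytes.
print(plausible_encodings(data, CANDIDATES))
```

Validity alone leaves five of the six candidates standing, so choosing among them needs statistical evidence that only a longer sample can supply.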
Re: What does one do if the encoding is unknown and all you have is a sequence of bytes?
Popping up a level. ICU (and some other libraries) have heuristic encoding detection that will take a sequence of bytes and come up with a likely encoding id.

Mark

On Fri, Jul 19, 2013 at 8:40 PM, Whistler, Ken ken.whist...@sap.com wrote:

> > Suppose that these hex bytes: C3 83 C2 B1 show up in a message and the message contains no hint what its encoding is.
> >
> > Perhaps it is 8859-1, in which case the message consists of four 1-byte characters:
> >
> > C3 = Ã
> > 83 = the “no break here” character
> > C2 = Â
> > B1 = ±
> >
> > Perhaps it is UTF-8, in which case the message consists of two 2-byte characters:
> >
> > C383 = 쎃
> > C2B1 = 슱
>
> Actually, that would be interpreting it as UTF-16, not as UTF-8. That can probably be quickly ruled out if the rest of the text is not obviously in UTF-16.
>
> Interpreted as UTF-8, it would be:
>
> C3 83 -- U+00C3 = Ã
> C2 B1 -- U+00B1 = ±
>
> That is more likely than the other two alternatives you cite.
>
> Of course, you also have to consider serial corruptions as a possibility. It could have started out as UTF-8 C3 B1 -- U+00F1 = ñ. Then the C3 B1 got misinterpreted as Latin-1 (Ã ±), and those two characters were re-encoded as UTF-8, yielding C3 83 C2 B1.
>
> --Ken