What does one do if the encoding is unknown and all you have is a sequence of bytes?

2013-07-19 Thread Costello, Roger L.
Hi Folks,

Suppose that these hex bytes:

C3 83 C2 B1 

show up in a message and the message contains no hint what its encoding is. 

Perhaps it is 8859-1, in which case the message consists of four 1-byte 
characters: 

C3 = Ã
83 = the “no break here” character
C2 = Â
B1 = ±

Perhaps it is UTF-8, in which case the message consists of two 2-byte 
characters:

C383 = 쎃
C2B1 = 슱

Or, perhaps it is some other encoding.

What does one do in such a situation?

/Roger




RE: What does one do if the encoding is unknown and all you have is a sequence of bytes?

2013-07-19 Thread Whistler, Ken


Costello, Roger L. wrote:

 Suppose that these hex bytes:
 
   C3 83 C2 B1
 
 show up in a message and the message contains no hint what its encoding is.
 
 Perhaps it is 8859-1, in which case the message consists of four 1-byte
 characters:
 
 C3 = Ã
 83 = the “no break here” character
 C2 = Â
 B1 = ±
 
 Perhaps it is UTF-8, in which case the message consists of two 2-byte
 characters:
 
 C383 = 쎃
 C2B1 = 슱

Actually, that would be interpreting it as UTF-16 (big-endian), not as
UTF-8. That can probably be quickly ruled out if the rest of the text is
not obviously in UTF-16.

Interpreted as UTF-8, it would be:

C3 83 -- U+00C3 = Ã
C2 B1 -- U+00B1 = ±

That is more likely than the other two alternatives you cite.
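
A minimal Python sketch of the three readings discussed here, assuming only the standard library codecs. The helper name `describe` is made up for illustration; it decodes the same four bytes under each encoding and lists the resulting code points.

```python
# The four bytes from the original question.
data = bytes.fromhex("C3 83 C2 B1")


def describe(raw: bytes, encoding: str) -> list:
    """Decode raw with the given encoding and list each code point as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in raw.decode(encoding)]


# Same bytes, three different interpretations.
for encoding in ("latin-1", "utf-8", "utf-16-be"):
    print(encoding, describe(data, encoding))
```

Latin-1 yields four code points, UTF-8 yields two (U+00C3 and U+00B1), and big-endian UTF-16 yields the two Hangul syllables from the original post.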

Of course, you also have to consider serial corruption as a possibility.

It could have started out as UTF-8 C3 B1 -- U+00F1 = ñ.

Then the C3 B1 got misinterpreted as Latin-1 (Ã and ±), and those two
characters were re-encoded as UTF-8, giving C3 83 C2 B1.
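
The serial-corruption scenario can be reproduced in a few lines of Python. This is a sketch under the assumption that the intermediate misreading was exactly Latin-1 (in practice it is often Windows-1252 instead); `undo_double_encoding` is a hypothetical helper name, not a library function.

```python
original = "ñ"

# Step 1: correct UTF-8 encoding of U+00F1.
utf8_once = original.encode("utf-8")        # bytes C3 B1

# Step 2: the bytes are wrongly read as Latin-1, one character per byte.
misread = utf8_once.decode("latin-1")       # U+00C3 U+00B1

# Step 3: the misread characters are encoded as UTF-8 again.
double_encoded = misread.encode("utf-8")    # bytes C3 83 C2 B1


def undo_double_encoding(raw: bytes) -> str:
    """Reverse one round of UTF-8 -> Latin-1 -> UTF-8 corruption.

    Raises UnicodeError if the bytes do not fit that pattern.
    """
    return raw.decode("utf-8").encode("latin-1").decode("utf-8")


print(double_encoded.hex(" "), "->", undo_double_encoding(double_encoded))
```

A successful reversal does not prove the data was corrupted this way; it only shows the bytes are consistent with that history.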

--Ken





Re: What does one do if the encoding is unknown and all you have is a sequence of bytes?

2013-07-19 Thread Karl Williamson

On 07/19/2013 11:51 AM, Costello, Roger L. wrote:

Hi Folks,

Suppose that these hex bytes:

C3 83 C2 B1

show up in a message and the message contains no hint what its encoding is.

Perhaps it is 8859-1, in which case the message consists of four 1-byte 
characters:

C3 = Ã
83 = the “no break here” character
C2 = Â
B1 = ±

Perhaps it is UTF-8, in which case the message consists of two 2-byte 
characters:

C383 = 쎃
C2B1 = 슱



That's not how UTF-8 works.  Instead in UTF-8 it would be:

 C3 83 = LATIN CAPITAL LETTER A WITH TILDE
 C2 B1 = PLUS-MINUS SIGN

It's unlikely that text in any other encoding will pass a UTF-8 validity
test once the input has more than a few non-ASCII bytes.  So you can rule
UTF-8 in or out fairly easily.  You can also look for BOMs to detect
UTF-16 and UTF-32.
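
Both checks can be sketched in Python with the standard `codecs` module. The function names `sniff_bom` and `is_valid_utf8` and the returned encoding labels are choices made for this example, and a strict decode is used here as the validity test.

```python
import codecs

# Longer BOMs are checked first, because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(raw: bytes):
    """Return an encoding name if raw starts with a known BOM, else None."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return None


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw is a well-formed UTF-8 byte sequence."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(sniff_bom(bytes.fromhex("C383C2B1")), is_valid_utf8(bytes.fromhex("C383C2B1")))
```

Pure ASCII input also passes the validity test, so a pass only says the bytes are compatible with UTF-8; it becomes convincing evidence when the input contains several multi-byte sequences.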


After that, there are various heuristics that can be applied, and people
have written tools that attempt to guess encodings.  An example from
Perl is Encode::Guess:

http://search.cpan.org/~dankogai/Encode-2.51/lib/Encode/Guess.pm

but it requires a list of candidate encodings to try.
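
The candidate-list idea can be illustrated in Python. This is not a port of the Perl module, only a sketch in the same spirit: the hypothetical function `plausible_encodings` keeps every candidate under which the bytes decode without error, and does no ranking or language analysis.

```python
def plausible_encodings(raw: bytes, candidates):
    """Return the candidate encodings under which raw decodes without error.

    An unknown encoding name raises LookupError, which is deliberately
    not caught here so that typos in the candidate list are noticed.
    """
    survivors = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        survivors.append(name)
    return survivors


data = bytes.fromhex("C3 83 C2 B1")
print(plausible_encodings(data, ["ascii", "utf-8", "latin-1"]))
```

For the four bytes in question more than one candidate survives, which is why a real detector needs further heuristics (and more input) to pick among them.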









Re: What does one do if the encoding is unknown and all you have is a sequence of bytes?

2013-07-19 Thread Peter Edberg

On Jul 19, 2013, at 12:42 PM, Mark Davis ☕ m...@macchiato.com wrote:

 Popping up a level.
 
 ICU (and some other libraries) have heuristic encoding detection that will
 take a sequence of bytes and come up with a likely encoding ID.

However, the ICU encoding detection typically requires more than 4 bytes
(usually at least 10 characters' worth of bytes) in order to make a
reasonable guess.

- Peter E

 
 



Re: What does one do if the encoding is unknown and all you have is a sequence of bytes?

2013-07-19 Thread Mark Davis ☕
Popping up a level.

ICU (and some other libraries) have heuristic encoding detection that will
take a sequence of bytes and come up with a likely encoding ID.


Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*

