On Sat, Mar 23, 2013 at 19:16:57 -0000, Mark wrote:
> I looked at a hex dump of the test_star.tar archive. For all files except
> the ...78.bin file, the o-umlaut character is represented by two bytes:
> 0xC3 0xB6. For the ...78.bin file the o-umlauts are represented by C3 83
> C2 B6 (see offsets 0x0C69 and 0x0C9D in the file).

This doesn't explain why it's happening in the first place, but I did
notice that the four-byte string appears to be the result of a double
latin1 -> UTF-8 conversion.

That is, the o-umlaut character in latin1 is the F6 byte; when
represented in UTF-8 that expands to the two bytes C3 B6.

Those bytes, if then treated as latin1 characters instead of UTF-8 for
some reason, would display as "Ã¶", and after another round of latin1 ->
UTF-8 conversion, would end up as C3 83 C2 B6....
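
The sequence above can be reproduced with a few lines of Python. This is only a sketch to confirm the byte arithmetic; the variable names are made up for illustration, and it says nothing about which program in the chain actually performs the second conversion.

```python
# Sketch: reproduce the suspected double latin1 -> UTF-8 conversion.
# All names here are illustrative, not taken from star or any other tool.

def hexbytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex, e.g. 'C3 B6'."""
    return " ".join(f"{b:02X}" for b in data)

original = "\u00f6"                       # o-umlaut

latin1_bytes = original.encode("latin-1")  # single byte in latin1
utf8_once = original.encode("utf-8")       # correct UTF-8 encoding

# Misinterpret the UTF-8 bytes as latin1 text, then encode to UTF-8 again.
misread = utf8_once.decode("latin-1")
utf8_twice = misread.encode("utf-8")

print(hexbytes(latin1_bytes))  # F6
print(hexbytes(utf8_once))     # C3 B6
print(hexbytes(utf8_twice))    # C3 83 C2 B6

# Undoing one round of the conversion recovers the original character.
repaired = utf8_twice.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired == original)    # True
```

The last step only works when every doubly-encoded character fits in latin1, which is the case for the o-umlaut here.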


                                                        Nathan

----------------------------------------------------------------------------
Nathan Stratton Treadway  -  [email protected]  -  Mid-Atlantic region
Ray Ontko & Co.  -  Software consulting services  -   http://www.ontko.com/
 GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt   ID: 1023D/ECFB6239
 Key fingerprint = 6AD8 485E 20B9 5C71 231C  0C32 15F3 ADCD ECFB 6239
