Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

Guy Harris Tue, 28 Jun 2011 10:30:15 -0700

On Jun 28, 2011, at 6:10 AM, Stig Bjørlykke wrote:

> On Tue, Jun 28, 2011 at 2:58 AM, Guy Harris <g...@alum.mit.edu> wrote:
>>        1) UN*Xes where LANG etc. aren't set to a locale with UTF-8 as the 
>> encoding (are you seeing the issue with Norwegian characters on your system? 
>>  If so, what's the setting of LANG?);
> 
> I only had issues with Norwegian characters in file names reported via
> simple_dialog(), and my LANG is empty.


OK, what OS are you using?  If it's a UN*X, try compiling and running the 
attached C program.  It writes your name out in UTF-8 and creates a file with 
your name as its name (yeah, complete with a space between "Stig" and 
"Bjørlykke", and with no ".txt" at the end).  Does it print your name correctly 
on your terminal/terminal emulator, and does the file's name show up correctly 
if you do "ls"?  If it's Windows, then you're probably just seeing bug 5715.

> Another problem is that we still have issues regarding UTF-8 strings
> in packets.  We should really fix that...

We have an issue regarding strings in packets in general.  Strings might be in 
any of a number of encodings, including ASCII (meaning that any byte with the 
8th bit set is something that shouldn't be there), other national variants of 
ISO 646, UTF-8, UTF-16, UCS-2 (meaning "only the Basic Multilingual Plane, with 
no surrogate pairs"), ISO 8859-x for various values of x, various ISO 
2022-based encodings (e.g., the EUC encodings), various national standards, 
various DOS and Windows code pages, various Mac OS encodings, EBCDIC, whatever 
encodings are used for SMS, etc.:

        http://en.wikipedia.org/wiki/Template:Character_encoding
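
To make the point concrete, here is a small sketch (in Python purely for 
brevity; the codec names are Python's own, and the helper try_decode() is 
made up for this illustration, not anything in Wireshark) showing one and the 
same octet read under a few of the encodings listed above:

```python
# One octet, several readings, depending on which encoding is assumed.
octet = b"\xf8"

def try_decode(octets, encoding):
    """Return the decoded text, or None if the octets are invalid in
    that encoding.  (Hypothetical helper, for illustration only.)"""
    try:
        return octets.decode(encoding)
    except UnicodeDecodeError:
        return None

# Codec names below are Python's standard codec names.
readings = {enc: try_decode(octet, enc)
            for enc in ("ascii", "utf-8", "iso-8859-1", "cp437", "cp037")}

for enc, text in readings.items():
    print(f"{enc:12} -> {text!r}")
```

The octet is invalid as ASCII and as UTF-8, and is three different characters 
in ISO 8859-1, DOS code page 437, and EBCDIC code page 037, so the octets 
alone don't tell you what the string says.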

I don't know whether all of the encodings in question can be mapped to Unicode 
without information loss.  An arbitrary string of octets definitely can't be 
mapped to UTF-8 without information loss; consider a putatively UTF-8-encoded 
string that contains an octet sequence that's not valid in UTF-8.
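
A minimal sketch of that information loss, again in Python and again with 
made-up helper names: two different octet strings, neither valid UTF-8, 
become indistinguishable once the bad octets are replaced with U+FFFD.

```python
# Two different octet strings; neither is valid UTF-8, because 0xF8 and
# 0xFF can never appear in a well-formed UTF-8 sequence.
first = b"Bj\xf8rlykke"    # what ISO 8859-1 would use for "Bjørlykke"
second = b"Bj\xffrlykke"   # differs from `first` in one octet

def utf8_for_display(octets):
    """Lossy: replace anything that isn't valid UTF-8 with U+FFFD."""
    return octets.decode("utf-8", errors="replace")

def utf8_strict(octets):
    """Lossless or nothing: return None if the octets aren't valid UTF-8."""
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        return None

print(first != second)                                        # inputs differ
print(utf8_for_display(first) == utf8_for_display(second))    # outputs don't
print(utf8_strict(first))                                     # no lossless mapping
```

After the lossy conversion there is no way to get back to the original 
octets, which is why the raw octets need to be kept around.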

Perhaps, in the Wireshark dissection engine, we should initially store string 
values as a pair {encoding, counted octet string} (counted so that octets with 
the value 0 don't cause problems), and:

        when putting them into a textual representation of the protocol tree
        or into columns or something else to be shown to humans, map them to
        UTF-8, with anything that can't be mapped to UTF-8 - including, if the
        encoding is putatively UTF-8, octet sequences that aren't valid UTF-8
        sequences - shown as the Unicode replacement character U+FFFD;

        when comparing them in a display filter, attempt to map them to UTF-8
        (and save the result), and:

                if the mapping fails, treat *all* comparisons except for
                inequality as failing, and treat comparisons for inequality as
                succeeding;

                if the mapping succeeds, compare the two strings;

        when making them available to software inside *Shark (C/C++ code, Lua
        code, Python code, etc.), attempt to convert them to whatever the
        appropriate representation is (presumably UTF-8), and have the
        routines to fetch those values support returning a "conversion failed"
        indication (or perhaps offer both a "convert for display to humans"
        version that uses U+FFFD for failure and a "convert for processing"
        version that returns "can't do it" for failure).
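
To illustrate the idea, here is a rough sketch of the three rules above.  It 
is in Python rather than C only to keep it short, and every name in it 
(PacketString, for_display, for_processing, filter_compare) is hypothetical; 
none of this is existing Wireshark API, and a real implementation would use 
Wireshark's own encoding identifiers rather than Python codec names.

```python
import operator
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class PacketString:
    """Hypothetical {encoding, counted octet string} pair."""
    encoding: str   # a Python codec name here, e.g. "utf-8" or "cp437"
    octets: bytes   # bytes carries its own length, so 0 octets are fine

    def for_display(self) -> str:
        # "Convert for display to humans": never fails; anything that
        # can't be mapped is shown as U+FFFD.
        return self.octets.decode(self.encoding, errors="replace")

    def for_processing(self) -> Optional[str]:
        # "Convert for processing": None means "can't do it".
        try:
            return self.octets.decode(self.encoding, errors="strict")
        except UnicodeDecodeError:
            return None

_OPS = {"==": operator.eq, "!=": operator.ne,
        "<": operator.lt, "<=": operator.le,
        ">": operator.gt, ">=": operator.ge}

def filter_compare(field: PacketString, op: str, literal: str) -> bool:
    """Display-filter comparison following the rule sketched above."""
    compare = _OPS[op]
    text = field.for_processing()
    if text is None:
        # Mapping failed: only inequality succeeds.
        return op == "!="
    return compare(text, literal)

good = PacketString("utf-8", b"Bj\xc3\xb8rlykke")
bad = PacketString("utf-8", b"Bj\xf8rlykke")   # 0xF8 is not valid UTF-8

print(good.for_display(), bad.for_display())
print(filter_compare(good, "==", "Bj\u00f8rlykke"))
print(filter_compare(bad, "==", "Bj\u00f8rlykke"),
      filter_compare(bad, "!=", "Bj\u00f8rlykke"))
```

The sketch leaves out the "save the result" part; a real version would 
presumably cache the converted string alongside the original octets so the 
conversion is done at most once per field.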

Here's the program I mentioned above:

norsk.c
Description: Binary data

___________________________________________________________________________
Sent via:    Wireshark-dev mailing list <wireshark-dev@wireshark.org>
Archives:    http://www.wireshark.org/lists/wireshark-dev
Unsubscribe: https://wireshark.org/mailman/options/wireshark-dev
             mailto:wireshark-dev-requ...@wireshark.org?subject=unsubscribe
