Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-07-11 Thread Stephen Fisher
On Tue, Jun 28, 2011 at 10:01:14AM -0700, Guy Harris wrote:

 I don't know what the various terminal emulators for Windows, e.g. 
 cmd.exe, do.

The popular SecureCRT terminal emulator defaults to the "default (same as 
local system)" character encoding, at least on Windows systems.  This is 
not compatible with UTF-8 in my experience.

___
Sent via:Wireshark-dev mailing list wireshark-dev@wireshark.org
Archives:http://www.wireshark.org/lists/wireshark-dev
Unsubscribe: https://wireshark.org/mailman/options/wireshark-dev
 mailto:wireshark-dev-requ...@wireshark.org?subject=unsubscribe


Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-07-11 Thread Guy Harris

On Jul 11, 2011, at 4:00 PM, Stephen Fisher wrote:

 The popular SecureCRT terminal emulator defaults to default (same as 
 local system) character encoding, at least on Windows systems.  This is 
 not compatible with UTF-8 in my experience.

Not surprising, given that "default/same as local system" probably means 
"local code page".  Win32 first appeared in NT 3.1 in 1993, and Unicode first 
appeared in 1991 (and Microsoft joined the group doing it in 1990, at least 
according to the Wikipedia article), so it could support Unicode from Day One, 
and they could get away with saying "if you want Unicode you have to use the 
Unicode versions of the APIs, and strings are UCS-2 in those versions of the 
APIs", with the legacy "ASCII"/"ANSI" APIs using code pages.  UN*X didn't have 
that advantage, so UN*X systems support Unicode using UTF-8 rather than with 
Shiny New APIs.

So, on Windows, consoles, whether from Microsoft or third parties, probably 
tend to, if not using UCS-2/UTF-16 characters, use the local code page.  For 
what it's worth, the Wikipedia article on the Win32 console:

http://en.wikipedia.org/wiki/Win32_console

claims that

Under Windows NT and CE based versions of Windows, the screen buffer 
uses four bytes per character cell: two bytes for character code, two bytes for 
attributes. The character is then encoded in a 16-bit subset of Unicode 
(UCS-2).[2] For backward compatibility, the console APIs exist in two versions: 
Unicode and non-Unicode. The non-Unicode versions of APIs can use code page 
switching to extend the range of displayed characters (but only if TrueType 
fonts are used for the console window, thereby extending the range of codes 
available). Even UTF-8 is available as code page 65001.

At least according to

http://msdn.microsoft.com/en-us/library/ms683458(v=VS.85).aspx

the device-independent I/O functions ReadFile() and WriteFile() (for UN*X 
folks, think read() and write()) don't support Unicode:

High-level I/O gives you a choice between the ReadFile and WriteFile 
functions and the ReadConsole and WriteConsole functions. They are identical, 
except for two important differences. The console functions support the use of 
either Unicode characters or the ANSI character set; the file I/O functions do 
not support Unicode. Also, the file I/O functions can be used to access files, 
pipes, and serial communications devices; the console functions can only be 
used with console handles. This distinction is important if an application 
relies on standard handles that may have been redirected.

and I suspect that the C library _read() and _write() functions, and the 
standard I/O library functions that are presumably built atop them, probably 
ultimately run atop ReadFile() and WriteFile(), so that they're 
device-independent.

On UN*X, you probably get similar behavior, *mutatis mutandis* (e.g., replacing 
the system code page setting with the code set portion of the setting of LANG 
or LC_CTYPE or whatever), so we can't guarantee, on Windows or UN*X, that what 
gets printed with printf() or fprintf() can always be done in UTF-8, so

1) we'd have to translate it to the appropriate character encoding

and

2) not all Unicode characters can necessarily be represented in that 
encoding.
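
To make points 1) and 2) concrete, here is a minimal sketch of such a translation, with ISO 8859-1 standing in for "the appropriate character encoding". utf8_to_latin1() is a made-up name for illustration, not a Wireshark or GLib routine, and a real implementation would more likely go through iconv or GLib's conversion functions. Anything that has no Latin-1 representation comes out as '?', which is exactly the information loss mentioned in 2).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Wireshark or GLib.
 * Converts a NUL-terminated UTF-8 string to ISO 8859-1, writing at most
 * outsize - 1 bytes plus a terminating NUL.  Characters above U+00FF and
 * malformed input bytes are replaced by '?'.
 * Returns the number of characters that had to be replaced. */
size_t utf8_to_latin1(const char *in, char *out, size_t outsize)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0, lost = 0;

    if (outsize == 0)
        return 0;
    while (*p != '\0' && n + 1 < outsize) {
        if (*p < 0x80) {
            /* ASCII is the same in both encodings. */
            out[n++] = (char)*p++;
        } else if ((*p == 0xC2 || *p == 0xC3) && (p[1] & 0xC0) == 0x80) {
            /* U+0080..U+00FF: two-byte sequences led by 0xC2 or 0xC3. */
            out[n++] = (char)(((*p & 0x03) << 6) | (p[1] & 0x3F));
            p += 2;
        } else {
            /* Not representable in Latin-1 (or malformed): substitute. */
            out[n++] = '?';
            lost++;
            p++;
            while ((*p & 0xC0) == 0x80)
                p++;    /* skip the rest of the multi-byte sequence */
        }
    }
    out[n] = '\0';
    return lost;
}
```

Called on the UTF-8 for Stig's name followed by a euro sign, this keeps the "ø" (one Latin-1 byte, 0xF8) and reports one lost character for the euro sign.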

In the best of all possible worlds, all UN*X systems would be configured to use 
UTF-8 encoding and all Windows systems would be configured to use code page 
65001, but


Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-06-29 Thread Graham Bloice
On 28/06/2011 18:27, Guy Harris wrote:
 On Jun 28, 2011, at 6:10 AM, Stig Bjørlykke wrote:

 On Tue, Jun 28, 2011 at 2:58 AM, Guy Harris g...@alum.mit.edu wrote:
1) UN*Xes where LANG etc. aren't set to a locale with UTF-8 as the 
 encoding (are you seeing the issue with Norwegian characters on your 
 system?  If so, what's the setting of LANG?);
 I only had issues with Norwegian characters in file names reported via
 simple_dialog(), and my LANG is empty.
 OK, what OS are you using?  If it's a UN*X, try compiling and running the 
 attached C program; does it print your name correctly on your 
 terminal/terminal emulator (it writes it out in UTF-8), and does the file it 
 creates (your name is its name - yeah, complete with a space between Stig 
 and Bjørlykke, and with no .txt at the end) have a name that shows up 
 correctly if you do ls?  If it's Windows, then you're probably just seeing 
 bug 5715.

 Another problem is that we still have issues regarding UTF-8 strings
 in packets.  We should really fix that...
 We have an issue regarding strings in packets in general.  Strings might be 
 in a number of encodings, including ASCII (meaning that any byte with the 8th 
 bit set is something that shouldn't be there), other national variants of ISO 
 646, UTF-8, UTF-16, UCS-2 (meaning only the Basic Multilingual plane, with 
 no surrogate pairs), ISO 8859/x for various values of x, various ISO 
 2022-based encodings (e.g., the EUC encodings), various national standards, 
 various DOS and Windows code pages, various Mac OS encodings, EBCDIC, 
 whatever encodings are used for SMS, etc., etc., etc, etc.:

   http://en.wikipedia.org/wiki/Template:Character_encoding

 I don't know whether all of the encodings in question can be mapped to 
 Unicode without information loss.  An arbitrary string of octets definitely 
 can't be mapped to UTF-8 without information loss; consider a putatively 
 UTF-8-encoded string that contains an octet sequence that's not valid in 
 UTF-8.

 Perhaps, in the Wireshark dissection engine, we should initially store string 
 values as a pair {encoding, counted octet string} (counted so that octets 
 with the value 0 don't cause problems), and:

   when putting them into a textual representation of the protocol tree or 
 into columns or something else to be shown to humans, map them to UTF-8, with 
 anything that can't be mapped to UTF-8 - including, if the encoding is 
 putatively UTF-8, octet sequences that aren't valid UTF-8 sequences - shown 
 as the Unicode replacement character U+FFFD;

   when comparing them in a display filter, attempt to map them to UTF-8 
 (and save the result), and:

   if the mapping fails, treat *all* comparisons except for 
 inequality as failing, and treat comparisons for inequality as succeeding;

   if the mapping succeeds, compare the two strings;

   when making them available to software inside *Shark (C/C++ code, Lua 
 code, Python code, etc.), attempt to convert them to whatever the appropriate 
 representation is (presumably UTF-8), and have the routines to fetch those 
 values support returning a conversion failed indication (or perhaps offer 
 both a convert for display to humans version that uses U+FFFD for failure 
 and a convert for processing version that returns can't do it for 
 failure).

 Here's the program I mentioned above:
For reference, here's the test executable output on Win7, using the SDK 7.0
build environment (a cmd.prompt):

c:\temp>test
Stig Bj├©rlykke
Now creating a file with Stig's name as its name

c:\temp>dir
 Volume in drive C has no label.
 Volume Serial Number is D845-44D4

 Directory of c:\temp

29/06/2011  10:30    <DIR>          .
29/06/2011  10:30    <DIR>          ..
29/06/2011  10:30                17 Stig Bjørlykke
29/06/2011  10:28            77,312 test.exe
               2 File(s)         77,329 bytes
               2 Dir(s)  65,078,947,840 bytes free

The output of the executable was the same using Powershell.

-- 
Regards,

Graham Bloice


Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-06-29 Thread Guy Harris

On Jun 29, 2011, at 2:37 AM, Graham Bloice wrote:

 For reference, here's the test executable output on Win7, using the SDK 7.0 
 build environment (a cmd.prompt):

Not surprisingly, it doesn't work.

Microsoft introduced Unicode support when they introduced Win32; as they were 
introducing a new API, they could make the versions of the API that support 
Unicode take UCS-2 (later UTF-16) strings as arguments.  They also offered 
ASCII versions, which took strings in the local code page as arguments.  This 
also applies to the C library's routines, such as open()/_open().

UN*X systems already had a well-established API when they introduced Unicode 
support, and they had what amounted to code pages (the various ISO 8859/x 
encodings, the EUC encodings, assorted other encodings); instead, they added a 
new code page, with UTF-8 encoding.

The program was written for UN*X, to test whether, in the user's locale, UTF-8 
strings work.  In Windows, the ASCII API it was using to create a file would 
take your local code page, not UTF-8, as the string encoding, and I suspect 
cmd.exe also expects ASCII output from programs - such as when the test 
program was printing Stig's name - to be in the local code page, not UTF-8.

This is why GLib has file functions that do mapping on file names; the page at

http://developer.gnome.org/glib/stable/glib-File-Utilities.html

says

There is a group of functions which wrap the common POSIX functions 
dealing with filenames (g_open(), g_rename(), g_mkdir(), g_stat(), g_unlink(), 
g_remove(), g_fopen(), g_freopen()). The point of these wrappers is to make it 
possible to handle file names with any Unicode characters in them on Windows 
without having to use ifdefs and the wide character API in the application code.

The pathname argument should be in the GLib file name encoding. On 
POSIX this is the actual on-disk encoding which might correspond to the locale 
settings of the process (or the G_FILENAME_ENCODING environment variable), or 
not.

On Windows the GLib file name encoding is UTF-8. Note that the 
Microsoft C library does not use UTF-8, but has separate APIs for current 
system code page and wide characters (UTF-16). The GLib wrappers call the wide 
character API if present (on modern Windows systems), otherwise convert to/from 
the system code page.

Another group of functions allows to open and read directories in the 
GLib file name encoding. These are g_dir_open(), 
g_dir_read_name(), g_dir_rewind(), g_dir_close().

This is also why we have our own copies of some of those functions on Windows, 
and wrap them ourselves (so that we don't require GLib 2.6, which introduced 
them, for all platforms).


Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-06-29 Thread Stig Bjørlykke
On Tue, Jun 28, 2011 at 7:01 PM, Guy Harris g...@alum.mit.edu wrote:
 In any case, that means that using strerror() is probably not going to be 
 sufficient to fix the problem.  What we might want to do is use UTF-8 
 everywhere we can, and, for non-GUI output, convert to the appropriate 
 character encoding - whatever that might be - at the last minute.

Ok, what about trying to convert back to the locale encoding when outputting
error messages from tshark?
Something like the attached patch, maybe?


-- 
Stig Bjørlykke


tshark-print-locale.patch
Description: Binary data

Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-06-29 Thread Guy Harris

On Jun 29, 2011, at 1:45 PM, Stig Bjørlykke wrote:

 Ok, what about trying to convert back to locale when output error
 messages from tshark?
 Something like the attached patch, maybe?

Something like that, but with a g_free() of string afterwards. :-)


Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-06-28 Thread Jakub Zawadzki
On Mon, Jun 27, 2011 at 05:58:35PM -0700, Guy Harris wrote:
  We have about 240 calls to strerror().
 
 ...and, unfortunately, a variant that converts to UTF-8 and is API-compatible 
 is non-trivial, as any version that allocates a buffer for the result of the 
 conversion would leak memory if we just globally replaced strerror() with 
 ws_strerror().

g_strerror() [1]?

Returns :
  a UTF-8 string describing the error code. If the error code is
  unknown, it returns "unknown error (code)".
  The string can only be used until the next call to g_strerror()

[1] 
http://developer.gnome.org/glib/2.28/glib-String-Utility-Functions.html#g-strerror
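
The "can only be used until the next call" contract is what sidesteps the leak: the result lives in storage owned by the function, so callers never free it. A rough sketch of that contract in plain C follows. The name ws_strerror() is borrowed from the quoted message as a hypothetical; this is not how g_strerror() is implemented, and the locale-to-UTF-8 conversion step is only marked with a comment.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical drop-in for strerror(); not an existing Wireshark function.
 * Returns a pointer to a static buffer, so nothing is allocated and
 * nothing can leak, but the result is overwritten by the next call and
 * the function is not thread-safe. */
const char *ws_strerror(int errnum)
{
    static char buf[256];

    /* A real version would convert strerror()'s result from the locale's
     * encoding to UTF-8 at this point; this sketch just copies it. */
    snprintf(buf, sizeof buf, "%s", strerror(errnum));
    return buf;
}
```

The cost of this design is visible in something like printf("%s / %s", ws_strerror(ENOENT), ws_strerror(EACCES)): both arguments point at the same buffer, so only one message survives.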


Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-06-28 Thread Stig Bjørlykke
On Tue, Jun 28, 2011 at 9:35 AM, Jakub Zawadzki
darkjames...@darkjames.pl wrote:
 g_strerror() ?

Yes, of course :)  Thank you.


-- 
Stig Bjørlykke


Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-06-28 Thread Graham Bloice
On 28/06/2011 01:58, Guy Harris wrote:

   2) Windows, where Unicode generally means UTF-16, and APIs that 
 return strings encoded as sequences of octets rather than hexadectets 
 probably return strings in the local code page.

Is this a first sighting of a new word, "hexadectet"? Google doesn't have an
entry for it.

-- 
Regards,

Graham Bloice




Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-06-28 Thread Jakub Zawadzki
On Tue, Jun 28, 2011 at 10:14:34AM +0200, Stig Bjørlykke wrote:
 On Tue, Jun 28, 2011 at 9:35 AM, Jakub Zawadzki
 darkjames...@darkjames.pl wrote:
  g_strerror() ?
 
 Yes, of course :)  Thank you.

no problem ;-)

Btw. I know that nowadays I'm the only one who uses non-utf locales on console,
but when we print on console (stdout/stderr) I think we should use strerror() 
from libc,
i.e. strerror() which don't recode message to utf-8.

but well, it's nothing very important...


Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-06-28 Thread Stig Bjørlykke
On Tue, Jun 28, 2011 at 12:22 PM, Jakub Zawadzki
darkjames...@darkjames.pl wrote:
 Btw. I know that nowadays I'm the only one who uses non-utf locales on 
 console,
 but when we print on console (stdout/stderr) I think we should use strerror() 
 from libc,
 i.e. strerror() which don't recode message to utf-8.

Do we always know where the error message is used?
I suspect file_open_error_message() is used both in GUI and tshark.


-- 
Stig Bjørlykke


Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-06-28 Thread Stig Bjørlykke
On Tue, Jun 28, 2011 at 2:58 AM, Guy Harris g...@alum.mit.edu wrote:
        1) UN*Xes where LANG etc. aren't set to a locale with UTF-8 as the 
 encoding (are you seeing the issue with Norwegian characters on your system?  
 If so, what's the setting of LANG?);

I only had issues with Norwegian characters in file names reported via
simple_dialog(), and my LANG is empty.

Another problem is that we still have issues regarding UTF-8 strings
in packets.  We should really fix that...


-- 
Stig Bjørlykke


Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-06-28 Thread Guy Harris

On Jun 28, 2011, at 2:25 AM, Graham Bloice wrote:

 On 28/06/2011 01:58, Guy Harris wrote:
 
  2) Windows, where Unicode generally means UTF-16, and APIs that 
 return strings encoded as sequences of octets rather than hexadectets 
 probably return strings in the local code page.
 
 Is this a first sighting of a new word hexadectet?

No.

 Google doesn't have an entry for it.

An entry where?

When I did a Google search for "hexadectet", it assumed I meant "hexadentate", 
but when I told it that I really did mean "hexadectet", it found items such as

http://tools.ietf.org/id/draft-denog-v6ops-addresspartnaming-02.txt

4.7. Hexadectet

   Hexadectet is directly derived from IPv4's octet, thus technically
   correct and probably convenient to get used to. On the other
   hand, it is much harder to pronounce.

and

http://www.imc.org/ietf-822/old-archive1/msg02577.html

from 1992:

I hasten to admit that modeling the path between the decoder and
the richtext parser as a hexadectet stream has its own problems,
mainly in that it makes the richtext parser harder to write (one
has to be careful before using any standard string manipulation
routines).



Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-06-28 Thread Guy Harris

On Jun 28, 2011, at 3:22 AM, Jakub Zawadzki wrote:

 Btw. I know that nowadays I'm the only one who uses non-utf locales on 
 console,
 but when we print on console (stdout/stderr) I think we should use strerror() 
 from libc,
 i.e. strerror() which don't recode message to utf-8.

It's more complicated than that.

There are many sources of strings in the non-GUI output of the programs in the 
Wireshark suite:

the message text itself - that's generally ASCII;

file names - internally to those programs, those are in UTF-8;

error strings for errno values and signal-name strings from signals - 
those might be in the current locale for strerror()/strsignal() and would be in 
UTF-8 with g_strerror()/g_strsignal();

etc.

In addition, the non-GUI output of the program can be sent either to the 
terminal or to files.

Output to the terminal should be in whatever character set the terminal 
expects.  I'm not sure what would indicate the character set the terminal 
expects.  On my machine, the terminal is Terminal.app, and can handle UTF-8 
output; on other UN*Xes, in the GUI, it's probably similar.  For consoles 
(which I'm using here to mean "no GUI, just the console of a 
workstation/personal computer") it might be less capable.  For real terminals, 
it's almost certainly less capable; I'm not sure whether there's ever been a 
real serial-port terminal that handles UTF-8.  I don't know what the various 
terminal emulators for Windows, e.g. cmd.exe, do.

Output to files, whether it's the result of redirecting the standard output or 
error of a command-line program to a file, or of one of the "export to a text 
file" operations in Wireshark, or..., is another matter.  It might be that the 
character encoding should be the same as would be used on a terminal.

In any case, that means that using strerror() is probably not going to be 
sufficient to fix the problem.  What we might want to do is use UTF-8 
everywhere we can, and, for non-GUI output, convert to the appropriate 
character encoding - whatever that might be - at the last minute.
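
On UN*X, one candidate for "whatever that might be" is the codeset of the current locale, which nl_langinfo(CODESET) reports once setlocale() has picked up LANG/LC_CTYPE. That only says what the locale claims, and the terminal may well disagree, so take this as a sketch of one possible indicator rather than an answer. locale_codeset() is a made-up helper name, and the code is POSIX-only.

```c
#include <langinfo.h>
#include <locale.h>

/* Hypothetical helper, POSIX only.
 * Adopts the character-type locale from the environment (LC_ALL,
 * LC_CTYPE, LANG) and returns the name of its character encoding,
 * e.g. "UTF-8" or "ISO-8859-1".  If the environment names no usable
 * locale, setlocale() fails, the "C" locale stays in effect, and the
 * result is whatever name the C library uses for that locale's codeset. */
const char *locale_codeset(void)
{
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}
```

The returned name is the sort of thing one would hand to a converter (iconv_open() or g_convert(), for instance) as the target encoding for output written at the last minute.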


Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-06-28 Thread Guy Harris

On Jun 28, 2011, at 3:33 AM, Stig Bjørlykke wrote:

 Do we always know where the error message is used?
 I suspect file_open_error_message() is used both in GUI and tshark.

Yes - it's in epan.


Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-06-28 Thread Guy Harris

On Jun 28, 2011, at 6:10 AM, Stig Bjørlykke wrote:

 On Tue, Jun 28, 2011 at 2:58 AM, Guy Harris g...@alum.mit.edu wrote:
1) UN*Xes where LANG etc. aren't set to a locale with UTF-8 as the 
 encoding (are you seeing the issue with Norwegian characters on your system? 
  If so, what's the setting of LANG?);
 
 I only had issues with Norwegian characters in file names reported via
 simple_dialog(), and my LANG is empty.

OK, what OS are you using?  If it's a UN*X, try compiling and running the 
attached C program; does it print your name correctly on your terminal/terminal 
emulator (it writes it out in UTF-8), and does the file it creates (your name 
is its name - yeah, complete with a space between Stig and Bjørlykke, and 
with no .txt at the end) have a name that shows up correctly if you do ls?  
If it's Windows, then you're probably just seeing bug 5715.

 Another problem is that we still have issues regarding UTF-8 strings
 in packets.  We should really fix that...

We have an issue regarding strings in packets in general.  Strings might be in 
a number of encodings, including ASCII (meaning that any byte with the 8th bit 
set is something that shouldn't be there), other national variants of ISO 646, 
UTF-8, UTF-16, UCS-2 (meaning only the Basic Multilingual plane, with no 
surrogate pairs), ISO 8859/x for various values of x, various ISO 2022-based 
encodings (e.g., the EUC encodings), various national standards, various DOS 
and Windows code pages, various Mac OS encodings, EBCDIC, whatever encodings 
are used for SMS, etc., etc., etc, etc.:

http://en.wikipedia.org/wiki/Template:Character_encoding

I don't know whether all of the encodings in question can be mapped to Unicode 
without information loss.  An arbitrary string of octets definitely can't be 
mapped to UTF-8 without information loss; consider a putatively UTF-8-encoded 
string that contains an octet sequence that's not valid in UTF-8.
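
For what "not valid in UTF-8" means in practice, here is a minimal validity check over a counted octet string. is_valid_utf8() is a hypothetical name (GLib's g_utf8_validate() is the real-world equivalent); the byte ranges follow the well-formed sequence table in the Unicode standard, so overlong forms, surrogates, and anything above U+10FFFF are rejected.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if the len octets at s form
 * well-formed UTF-8.  The length is explicit, so embedded 0 octets are
 * treated as ordinary (valid) characters. */
bool is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned char lo = 0x80, hi = 0xBF; /* allowed range of 2nd octet */
        size_t need;                        /* continuation octets needed */

        if (c < 0x80) {
            i++;
            continue;
        } else if (c >= 0xC2 && c <= 0xDF) {
            need = 1;                       /* 0xC0, 0xC1 would be overlong */
        } else if (c >= 0xE0 && c <= 0xEF) {
            need = 2;
            if (c == 0xE0) lo = 0xA0;       /* reject overlong forms */
            if (c == 0xED) hi = 0x9F;       /* reject surrogates */
        } else if (c >= 0xF0 && c <= 0xF4) {
            need = 3;
            if (c == 0xF0) lo = 0x90;       /* reject overlong forms */
            if (c == 0xF4) hi = 0x8F;       /* reject > U+10FFFF */
        } else {
            return false;                   /* stray continuation or bad lead */
        }

        if (len - i <= need)
            return false;                   /* sequence is truncated */
        if (s[i + 1] < lo || s[i + 1] > hi)
            return false;
        for (size_t k = 2; k <= need; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return false;
        i += need + 1;
    }
    return true;
}
```

A lone 0xC3 (the first half of "ø"), the overlong pair 0xC0 0x80, and the encoded surrogate 0xED 0xA0 0x80 all fail this check, and none of them has a Unicode character it could be mapped to.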

Perhaps, in the Wireshark dissection engine, we should initially store string 
values as a pair {encoding, counted octet string} (counted so that octets with 
the value 0 don't cause problems), and:

when putting them into a textual representation of the protocol tree or 
into columns or something else to be shown to humans, map them to UTF-8, with 
anything that can't be mapped to UTF-8 - including, if the encoding is 
putatively UTF-8, octet sequences that aren't valid UTF-8 sequences - shown as 
the Unicode replacement character U+FFFD;

when comparing them in a display filter, attempt to map them to UTF-8 
(and save the result), and:

if the mapping fails, treat *all* comparisons except for 
inequality as failing, and treat comparisons for inequality as succeeding;

if the mapping succeeds, compare the two strings;

when making them available to software inside *Shark (C/C++ code, Lua 
code, Python code, etc.), attempt to convert them to whatever the appropriate 
representation is (presumably UTF-8), and have the routines to fetch those 
values support returning a "conversion failed" indication (or perhaps offer 
both a "convert for display to humans" version that uses U+FFFD for failure and 
a "convert for processing" version that returns "can't do it" for failure).
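
The display-filter rule in the list above (a failed mapping makes every comparison fail except inequality, which succeeds) can be sketched as follows. Everything here is hypothetical: the enum, the function name, and the convention that a NULL pointer stands for "mapping to UTF-8 failed" are illustration only, not dfilter code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical comparison operators for a string field. */
enum cmp_op { CMP_EQ, CMP_NE, CMP_LT, CMP_GT };

/* Hypothetical helper.  a and b are the UTF-8 mappings of the two
 * operands, or NULL if the mapping failed for that operand. */
bool filter_compare(const char *a, enum cmp_op op, const char *b)
{
    int r;

    /* Mapping failed: only "!=" is considered true. */
    if (a == NULL || b == NULL)
        return op == CMP_NE;

    /* Mapping succeeded: ordinary comparison of the two strings. */
    r = strcmp(a, b);
    switch (op) {
    case CMP_EQ: return r == 0;
    case CMP_NE: return r != 0;
    case CMP_LT: return r < 0;
    case CMP_GT: return r > 0;
    }
    return false;
}
```

One consequence of the rule is that a field whose value could not be mapped satisfies neither "==" nor "<" nor ">" against anything, including another unmappable value, much like NaN in floating-point comparisons.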

Here's the program I mentioned above:



norsk.c
Description: Binary data

Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-06-28 Thread Guy Harris

On Jun 28, 2011, at 10:01 AM, Guy Harris wrote:

 In any case, that means that using strerror() is probably not going to be 
 sufficient to fix the problem.  What we might want to do is use UTF-8 
 everywhere we can, and, for non-GUI output, convert to the appropriate 
 character encoding - whatever that might be - at the last minute.

And then there's input.

Input from the GUI is in UTF-8.

We don't have any programs that read interactive user input from the command 
line, unless I've missed something, *but* we have programs that take arguments 
from the command line.  If you're typing commands interactively, those are 
presumably in the encoding of the terminal or terminal emulator.  If you're 
running commands from a script file, they're in whatever the character encoding 
is for the file.

I note that GLib, at least, appears to allow the file name character encoding 
to differ from the locale's character encoding.  I originally didn't see why 
this made sense, but I guess they might differ if, say, you're looking at 
somebody else's files and you're not both using UTF-8 and you're using 
different encodings, which could conceivably happen on UN*X.

(As


http://developer.gnome.org/glib/stable/glib-Character-Set-Conversion.html#g-get-filename-charsets

notes, "on Unix, regardless of the locale character set or G_FILENAME_ENCODING 
value, the actual file names present on a system might be in any random 
encoding or just gibberish", but I digress.)


Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-06-28 Thread Guy Harris

On Jun 28, 2011, at 10:27 AM, Guy Harris wrote:

   when putting them into a textual representation of the protocol tree or 
 into columns or something else to be shown to humans, map them to UTF-8, with 
 anything that can't be mapped to UTF-8 - including, if the encoding is 
 putatively UTF-8, octet sequences that aren't valid UTF-8 sequences - shown 
 as the Unicode replacement character U+FFFD;

...and, for display conversions, we might want to convert control 
characters to Control Pictures symbols (0x0000 to 0x001F convert to 0x2400 to 
0x241F: ␀, ␁, etc. through ␟; 0x007F converts to 0x2421, i.e. ␡ - in the font 
in which this message is being displayed to me, those have the control 
character abbreviations displayed in really really small letters, diagonally 
from upper left to lower right; unfortunately, I see nothing for C1 control 
characters).
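
That mapping is small enough to write down directly. A sketch, with control_picture() as a made-up name operating on Unicode code points; C1 controls (0x0080 to 0x009F) are passed through unchanged, since there is nothing to map them to:

```c
#include <stdint.h>

/* Hypothetical helper: maps a C0 control character or DEL to the
 * corresponding symbol in the Control Pictures block; every other code
 * point, including the C1 controls, is returned unchanged. */
uint32_t control_picture(uint32_t cp)
{
    if (cp <= 0x001F)
        return 0x2400 + cp;     /* U+2400 SYMBOL FOR NULL and onward */
    if (cp == 0x007F)
        return 0x2421;          /* U+2421 SYMBOL FOR DELETE */
    return cp;
}
```

A display conversion would apply this per code point before encoding the result as UTF-8, so that, say, an embedded 0x0A shows up as ␊ instead of breaking the line.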

Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-06-28 Thread Guy Harris

On Jun 28, 2011, at 10:43 AM, Guy Harris wrote:

 On Jun 28, 2011, at 10:27 AM, Guy Harris wrote:
 
  when putting them into a textual representation of the protocol tree or 
 into columns or something else to be shown to humans, map them to UTF-8, 
 with anything that can't be mapped to UTF-8 - including, if the encoding is 
 putatively UTF-8, octet sequences that aren't valid UTF-8 sequences - shown 
 as the Unicode replacement character U+FFFD;
 
 ...and, for display conversions, we might want to convert control 
 characters to Control Pictures symbols (0x0000 to 0x001F convert to 0x2400 
 to 0x241F: ␀, ␁, etc. through ␟; 0x007F converts to 0x2421, i.e. ␡ - in the 
 font in which this message is being displayed to me, those have the control 
 character abbreviations displayed in really really small letters, diagonally 
 from upper left to lower right; unfortunately, I see nothing for C1 control 
 characters).

http://en.wikipedia.org/wiki/Template:Unicode_chart_Control_Pictures

That claims that this is "as of Unicode 6.0", so, if true, either they have a 
different name for control pictures for C1 control characters or there aren't 
any.  (I have no idea what those other symbols are doing in there.)

U+FFFD is often shown as a white question mark inside a black diamond:

http://en.wikipedia.org/wiki/Specials_(Unicode_block)

Oh, and if we're going to be extremely completist, there are the EBCDIC control 
characters, for which there are not always control pictures; see table 5.1:

ftp://kermit.columbia.edu/kermit/ucsterminal/control.txt

This was from 1998.  I don't know whether any of the proposals were accepted.

Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-06-28 Thread Guy Harris

On Jun 28, 2011, at 10:27 AM, Guy Harris wrote:

 We have an issue regarding strings in packets in general.  Strings might be 
 in a number of encodings, including ASCII (meaning that any byte with the 8th 
 bit set is something that shouldn't be there), other national variants of ISO 
 646, UTF-8, UTF-16, UCS-2 (meaning only the Basic Multilingual plane, with 
 no surrogate pairs), ISO 8859/x for various values of x, various ISO 
 2022-based encodings (e.g., the EUC encodings), various national standards, 
 various DOS and Windows code pages, various Mac OS encodings, EBCDIC, 
 whatever encodings are used for SMS, etc., etc., etc, etc.:
 
   http://en.wikipedia.org/wiki/Template:Character_encoding

As long as I'm piling up a ton of information about humanity's twisty little 
maze of character encodings, all different:

SMS:

https://secure.wikimedia.org/wikipedia/en/wiki/GSM_03.38


Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-06-28 Thread Stig Bjørlykke
On Tue, Jun 28, 2011 at 7:27 PM, Guy Harris g...@alum.mit.edu wrote:
 OK, what OS are you using?

Snow:~ stig$ uname -a
Darwin Snow.local 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun  7
16:33:36 PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386 i386
Snow:~ stig$ echo $LANG

Snow:~ stig$ gcc norsk.c -o norsk && ./norsk
Stig Bjørlykke
Now creating a file with Stig's name as its name
Snow:~ stig$ ls -l Stig\ Bjørlykke
-rw-r--r--  1 stig  staff  16 Jun 28 21:20 Stig Bjørlykke

Everything works here.  I don't know anything about Windows.


-- 
Stig Bjørlykke


Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-06-28 Thread Guy Harris

On Jun 28, 2011, at 12:25 PM, Stig Bjørlykke wrote:

 On Tue, Jun 28, 2011 at 7:27 PM, Guy Harris g...@alum.mit.edu wrote:
 OK, what OS are you using?
 
 Snow:~ stig$ uname -a
 Darwin ...

Well, that answers *that* question. :-)

So the locale's encoding should probably be UTF-8, given that it's OS X.

However, if LANG is blank, you presumably don't have Terminal set up to "Set 
local environment variables on startup" (Preferences > Settings > Advanced, at 
the bottom); I think I turned that on a while ago, perhaps to get some UN*X 
software to correctly handle UTF-8.  Just out of curiosity, if you set that (or 
if you explicitly set LANG to something appropriate ending in .UTF-8, whether 
it's no_NO.UTF-8, nn_NO.UTF-8, nb_NO.UTF-8, en_NO.UTF-8, or some other 
setting), does that make the GUI problem go away with a version of Wireshark 
*without* the

http://anonsvn.wireshark.org/viewvc?revision=37812&view=revision

changes?

(Whether it has other side-effects is another matter; it might, for example, 
affect the parsing and output of numbers and dates, for better or for worse.)


Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-06-28 Thread Stig Bjørlykke
On Tue, Jun 28, 2011 at 9:37 PM, Guy Harris g...@alum.mit.edu wrote:
 However, if LANG is blank, you presumably don't have Terminal set up to "Set 
 local environment variables on startup" (Preferences > Settings > Advanced, 
 at the bottom);

Actually I have "Set local environment variables on startup" checked.
I also have "Character encoding: Unicode (UTF-8)".
I use English as my preferred language and Norway as region.

 Just out of curiosity, if you set that (or if you explicitly set LANG to 
 something appropriate ending in .UTF-8, whether it's no_NO.UTF-8, 
 nn_NO.UTF-8, nb_NO.UTF-8, en_NO.UTF-8, or some other setting), does that make 
 the GUI problem go away with a version of Wireshark *without* the

        http://anonsvn.wireshark.org/viewvc?revision=37812&view=revision

 changes?

Normally I run Wireshark.app generated from 'make osx-install', and
getenv("LANG") returns ".UTF-8".  No luck with rev  37812.

When running from command line with LANG=no_NO.UTF-8 I get this:
(process:65298): Gtk-WARNING **: Locale not supported by C library.
Using the fallback 'C' locale.

, but I get a correct error message with rev  37812 and æøå.pcap or
Проверка.pcap as filename.

So; if I run with a UTF-8 locale the g_locale_to_utf8() will not do
any conversion, and when running with a locale without UTF-8 (or not
legal) we get the error in bug 5715.
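
Whether the locale's encoding is UTF-8 can be checked directly instead of being inferred from LANG. A minimal sketch using only POSIX calls (setlocale() and nl_langinfo(CODESET)); the helper names are hypothetical, and GLib's g_get_charset() could presumably serve the same purpose inside Wireshark:

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>
#include <strings.h>

/* Hypothetical helper: returns 1 if the codeset name denotes UTF-8.
 * Both spellings are accepted because platforms differ on the hyphen. */
static int codeset_is_utf8(const char *codeset)
{
    assert(codeset != NULL);
    return strcasecmp(codeset, "UTF-8") == 0 ||
           strcasecmp(codeset, "UTF8") == 0;
}

/* Hypothetical helper: adopts the locale from the environment (LANG,
 * LC_ALL, ...) and returns the name of its character encoding.  If the
 * requested locale is not supported, setlocale() fails and the "C"
 * locale stays in effect, so the codeset reported is that of "C". */
static const char *current_codeset(void)
{
    setlocale(LC_ALL, "");
    return nl_langinfo(CODESET);
}
```

Calling codeset_is_utf8(current_codeset()) at startup would tell whether a locale-to-UTF-8 conversion is expected to be a no-op, and the same string could be shown in an about box or pasted into a bug report.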

The bug was reported for Windows, but I don't know how it works there.
I have tested on OSX and Ubuntu Linux.


Maybe we should include the locale in our about box?
We may use it in bug reports.


-- 
Stig Bjørlykke

[Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-06-27 Thread Stig Bjørlykke
Hi.

When looking at bug 5715 I found that we use both UTF8 (from file
names) and locale (from strerror()) in the error messages presented
from simple_dialog().  In vsimple_dialog() we convert all messages
with g_locale_to_utf8(), which will wrongly convert the file name
(like in the bug report).  When using Norwegian characters in the file
name the text in the dialog is empty.

Any ideas how we should fix this?  Convert all messages from
strerror() when putting the text into the error string and remove the
conversion in vsimple_dialog()?
We have about 240 calls to strerror().

https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=5715


-- 
Stig Bjørlykke


Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

2011-06-27 Thread Guy Harris

On Jun 27, 2011, at 11:54 AM, Stig Bjørlykke wrote:

 When looking at bug 5715 I found that we use both UTF8 (from file
 names) and locale (from strerror()) in the error messages presented
 from simple_dialog().  In vsimple_dialog() we convert all messages
 with g_locale_to_utf8(), which will wrongly convert the file name
 (like in the bug report).  When using Norwegian characters in the file
 name the text in the dialog is empty.

I suspect this wouldn't be an issue on my machine, given that if, on my 
machine, g_locale_to_utf8() behaves differently from strcpy(), there's either a 
misconfiguration or a bug in g_locale_to_utf8():

$ echo $LANG
en_US.UTF-8

I.e., this issue should, modulo bugs, only show up in locales where the 
character encoding isn't UTF-8, meaning:

1) UN*Xes where LANG etc. aren't set to a locale with UTF-8 as the 
encoding (are you seeing the issue with Norwegian characters on your system?  
If so, what's the setting of LANG?);

2) Windows, where Unicode generally means UTF-16, and APIs that 
return strings encoded as sequences of octets rather than hexadectets probably 
return strings in the local code page.

 Any ideas how we should fix this?  Convert all messages from
 strerror() when putting the text into the error string and remove the
 conversion in vsimple_dialog()?

I would say yes, given that GTK+ uses UTF-8 as the string encoding for all 
GUI functions, and I think any other toolkit we might use as an alternative 
would also use some encoding of Unicode (UTF-8 or UTF-16, most likely).

 We have about 240 calls to strerror().

...and, unfortunately, a variant that converts to UTF-8 and is API-compatible 
is non-trivial, as any version that allocates a buffer for the result of the 
conversion would leak memory if we just globally replaced strerror() with 
ws_strerror().

(Of course, strerror() is also not thread-safe, so there might be other reasons 
to avoid routines with such an API; the latest shiniest Single UNIX 
Specification has strerror_r(), which takes a buffer that it fills in, which 
has its own issues (as in how big a buffer do you need?), and I don't know 
how many platforms have it.

But if you're doing enough calls to strerror() that throwing a mutex around 
strerror() in your wrapper causes performance problems, those performance 
problems are probably the least of your problems)
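
One way to sidestep both the leak and the thread-safety problem is a wrapper that serializes the strerror() call with a mutex and copies the message into a buffer supplied by the caller. A minimal sketch, assuming POSIX threads; the name ws_strerror_buf() is hypothetical, and the locale-to-UTF-8 conversion (for example with g_locale_to_utf8()) is deliberately left out to keep the example dependency-free:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Serializes access to strerror(), whose result may live in a static
 * buffer shared by all threads. */
static pthread_mutex_t strerror_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical wrapper: copies the message for errnum into the
 * caller-supplied buffer, truncating if necessary, and returns buf.
 * Nothing is allocated, so there is nothing for the caller to free.
 * The text is still in the locale's encoding; a real version would
 * convert it to UTF-8 before returning. */
static char *ws_strerror_buf(int errnum, char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);

    pthread_mutex_lock(&strerror_lock);
    const char *msg = strerror(errnum);
    /* snprintf() always NUL-terminates when buflen > 0. */
    snprintf(buf, buflen, "%s", msg != NULL ? msg : "Unknown error");
    pthread_mutex_unlock(&strerror_lock);

    return buf;
}
```

The price is that this is not a drop-in replacement: each call site has to supply a buffer, so the strerror() calls would need editing rather than a global search-and-replace.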