Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
On Tue, Jun 28, 2011 at 10:01:14AM -0700, Guy Harris wrote:

> I don't know what the various terminal emulators for Windows, e.g.
> cmd.exe, do.

The popular SecureCRT terminal emulator defaults to "default" (same as local system) character encoding, at least on Windows systems. This is not compatible with UTF-8 in my experience.

___
Sent via:    Wireshark-dev mailing list
Archives:    http://www.wireshark.org/lists/wireshark-dev
Unsubscribe: https://wireshark.org/mailman/options/wireshark-dev
             mailto:wireshark-dev-requ...@wireshark.org?subject=unsubscribe
Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
On Jul 11, 2011, at 4:00 PM, Stephen Fisher wrote:

> The popular SecureCRT terminal emulator defaults to "default" (same as
> local system) character encoding, at least on Windows systems. This is
> not compatible with UTF-8 in my experience.

Not surprising, given that "default"/"same as local system" probably means "local code page".

Win32 first appeared in NT 3.1 in 1993, and Unicode first appeared in 1991 (and Microsoft joined the group doing it in 1990, at least according to the Wikipedia article), so it could support Unicode from Day One, and they could get away with saying "if you want Unicode you have to use the Unicode versions of the APIs, and strings are UCS-2 in those versions of the APIs", with the legacy "ASCII"/"ANSI" APIs using code pages.

UN*X didn't have that advantage, so UN*X systems support Unicode using UTF-8 rather than with Shiny New APIs.

So, on Windows, consoles, whether from Microsoft or third parties, probably tend to, if not using UCS-2/UTF-16 characters, use the local code page.

For what it's worth, the Wikipedia article on the Win32 console:

    http://en.wikipedia.org/wiki/Win32_console

claims that

    Under Windows NT and CE based versions of Windows, the screen buffer uses four bytes per character cell: two bytes for character code, two bytes for attributes. The character is then encoded in a 16-bit subset of Unicode (UCS-2). For backward compatibility, the console APIs exist in two versions: Unicode and non-Unicode. The non-Unicode versions of APIs can use code page switching to extend the range of displayed characters (but only if TrueType fonts are used for the console window, thereby extending the range of codes available). Even UTF-8 is available as "code page 65001".

At least according to

    http://msdn.microsoft.com/en-us/library/ms683458(v=VS.85).aspx

the device-independent I/O functions ReadFile() and WriteFile() (for UN*X folks, think read() and write()) don't support Unicode:

    High-level I/O gives you a choice between the ReadFile and WriteFile functions and the ReadConsole and WriteConsole functions. They are identical, except for two important differences. The console functions support the use of either Unicode characters or the ANSI character set; the file I/O functions do not support Unicode. Also, the file I/O functions can be used to access files, pipes, and serial communications devices; the console functions can only be used with console handles. This distinction is important if an application relies on standard handles that may have been redirected.

and I suspect that the C library _read() and _write() functions, and the "standard I/O library" functions that are presumably built atop them, probably ultimately run atop ReadFile() and WriteFile(), so that they're device-independent.

On UN*X, you probably get similar behavior, *mutatis mutandis* (e.g., replacing "the system code page setting" with "the code set portion of the setting of LANG or LC_CTYPE" or whatever), so we can't guarantee, on Windows or UN*X, that what gets printed with printf() or fprintf() can always be done in UTF-8, so 1) we'd have to translate it to the appropriate character encoding and 2) not all Unicode characters can necessarily be represented in that encoding.

In the best of all possible worlds, all UN*X systems would be configured to use UTF-8 encoding and all Windows systems would be configured to use code page 65001, but
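Whether printf() output can safely be written as UTF-8 depends on the codeset of the current locale. The following is a minimal POSIX-only sketch of checking that, assuming nl_langinfo(CODESET) is available (it is not on Windows). The names codeset_is_utf8() and locale_uses_utf8() are hypothetical helpers for illustration, not existing Wireshark routines.

```c
#include <ctype.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: does a codeset name such as "UTF-8" or "utf8"
 * denote UTF-8? Hyphens and underscores are ignored, case is folded. */
int codeset_is_utf8(const char *codeset)
{
    static const char want[] = "utf8";
    size_t i = 0;

    if (codeset == NULL)
        return 0;
    for (; *codeset != '\0'; codeset++) {
        if (*codeset == '-' || *codeset == '_')
            continue;               /* accept "UTF-8", "utf_8", "UTF8" */
        if (want[i] == '\0' ||
            tolower((unsigned char)*codeset) != want[i])
            return 0;               /* extra or mismatching character */
        i++;
    }
    return want[i] == '\0';         /* all of "utf8" must have been seen */
}

/* Hypothetical helper: adopt the user's LC_CTYPE settings and report
 * whether the resulting locale's codeset is UTF-8. */
int locale_uses_utf8(void)
{
    setlocale(LC_CTYPE, "");
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

If locale_uses_utf8() returns 0, UTF-8 text would need converting before being written to the terminal, and some characters might have no representation in the target encoding.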
Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
On Jun 27, 2011, at 11:54 AM, Stig Bjørlykke wrote:

> When looking at bug 5715 I found that we use both UTF8 (from file
> names) and locale (from strerror()) in the error messages presented
> from simple_dialog(). In vsimple_dialog() we convert all messages
> with g_locale_to_utf8(), which will wrongly convert the file name
> (like in the bug report). When using Norwegian characters in the file
> name the text in the dialog is empty.

I suspect this wouldn't be an issue on my machine, given that if, on my machine, g_locale_to_utf8() behaves differently from strcpy(), there's either a misconfiguration or a bug in g_locale_to_utf8():

    $ echo $LANG
    en_US.UTF-8

I.e., this issue should, modulo bugs, only show up in locales where the character encoding isn't UTF-8, meaning:

    1) UN*Xes where LANG etc. aren't set to a locale with UTF-8 as the encoding (are you seeing the issue with Norwegian characters on your system? If so, what's the setting of LANG?);

    2) Windows, where "Unicode" generally means "UTF-16", and APIs that return strings encoded as sequences of octets rather than hexadectets probably return strings in the local code page.

> Any ideas how we should fix this? Convert all messages from
> strerror() when putting the text into the error string and remove the
> conversion in vsimple_dialog()?

I would say "yes", given that GTK+ uses UTF-8 as the string encoding for all GUI functions, and I think any other toolkit we might use as an alternative would also use some encoding of Unicode (UTF-8 or UTF-16, most likely).

> We have about 240 calls to strerror().

...and, unfortunately, a variant that converts to UTF-8 and is API-compatible is non-trivial, as any version that allocates a buffer for the result of the conversion would leak memory if we just globally replaced strerror() with ws_strerror().

(Of course, strerror() is also not thread-safe, so there might be other reasons to avoid routines with such an API; the latest shiniest Single UNIX Specification has strerror_r(), which takes a buffer that it fills in, which has its own issues (as in "how big a buffer do you need?"), and I don't know how many platforms have it. But if you're doing enough calls to strerror() that throwing a mutex around strerror() in your wrapper causes performance problems, those performance problems are probably the least of your problems.)
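One way around both the leak and the thread-safety concern is a wrapper that copies the message into a buffer supplied by the caller, with a mutex around the strerror() call. The sketch below assumes POSIX threads are available. The name ws_strerror_copy() is hypothetical and is not an existing Wireshark function. It leaves the text in the locale's encoding, so a conversion to UTF-8 would still be a separate step.

```c
#include <errno.h>      /* E* constants that callers pass in */
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Serializes our own calls to strerror(), whose result may live in a
 * shared static buffer. It cannot protect against code elsewhere that
 * calls strerror() directly. */
static pthread_mutex_t strerror_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Hypothetical wrapper: copies the message for errnum into buf,
 * truncating if necessary, and returns buf. Nothing is allocated,
 * so there is nothing for the caller to free.
 */
char *ws_strerror_copy(int errnum, char *buf, size_t buflen)
{
    if (buf == NULL || buflen == 0)
        return buf;

    pthread_mutex_lock(&strerror_lock);
    const char *msg = strerror(errnum);
    strncpy(buf, msg, buflen - 1);
    buf[buflen - 1] = '\0';         /* strncpy() may not terminate */
    pthread_mutex_unlock(&strerror_lock);
    return buf;
}
```

The cost of this design is that it is not API-compatible with strerror(), so each of the roughly 240 call sites would need a buffer of its own.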
Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
On Mon, Jun 27, 2011 at 05:58:35PM -0700, Guy Harris wrote:

> > We have about 240 calls to strerror().
>
> ...and, unfortunately, a variant that converts to UTF-8 and is
> API-compatible is non-trivial, as any version that allocates a buffer
> for the result of the conversion would leak memory we just globally
> replaced strerror() with ws_strerror().

g_strerror() [1]?

    Returns : a UTF-8 string describing the error code. If the error code is unknown, it returns "unknown error ()". The string can only be used until the next call to g_strerror()

[1] http://developer.gnome.org/glib/2.28/glib-String-Utility-Functions.html#g-strerror
Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
On Tue, Jun 28, 2011 at 9:35 AM, Jakub Zawadzki wrote:

> g_strerror() ?

Yes, of course :)
Thank you.

--
Stig Bjørlykke
Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
On 28/06/2011 01:58, Guy Harris wrote:

> 2) Windows, where "Unicode" generally means "UTF-16", and APIs that
> return strings encoded as sequences of octets rather than hexadectets
> probably return strings in the local code page.

Is this a first sighting of a new word "hexadectet"? Google doesn't have an entry for it.

--
Regards, Graham Bloice
Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
On Tue, Jun 28, 2011 at 10:14:34AM +0200, Stig Bjørlykke wrote:

> On Tue, Jun 28, 2011 at 9:35 AM, Jakub Zawadzki wrote:
> > g_strerror() ?
>
> Yes, of course :) Thank you.

No problem ;-)

Btw. I know that nowadays I'm the only one who uses non-UTF locales on the console, but when we print to the console (stdout/stderr) I think we should use strerror() from libc, i.e. the strerror() which doesn't recode the message to UTF-8.

But well, it's nothing very important...
Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
On Tue, Jun 28, 2011 at 12:22 PM, Jakub Zawadzki wrote:

> Btw. I know that nowadays I'm the only one who uses non-utf locales on
> console, but when we print on console (stdout/stderr) I think we should
> use strerror() from libc, i.e. strerror() which don't recode message to
> utf-8.

Do we always know where the error message is used?
I suspect file_open_error_message() is used both in GUI and tshark.

--
Stig Bjørlykke
Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
On Tue, Jun 28, 2011 at 2:58 AM, Guy Harris wrote:

> 1) UN*Xes where LANG etc. aren't set to a locale with UTF-8 as the
> encoding (are you seeing the issue with Norwegian characters on your
> system? If so, what's the setting of LANG?);

I only had issues with Norwegian characters in file names reported via simple_dialog(), and my LANG is empty.

Another problem is that we still have issues regarding UTF-8 strings in packets. We should really fix that...

--
Stig Bjørlykke
Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
On Jun 28, 2011, at 2:25 AM, Graham Bloice wrote:

> On 28/06/2011 01:58, Guy Harris wrote:
>>
>> 2) Windows, where "Unicode" generally means "UTF-16", and APIs that
>> return strings encoded as sequences of octets rather than hexadectets
>> probably return strings in the local code page.
>>
> Is this a first sighting of a new word "hexadectet"?

No.

> Google doesn't have an entry for it.

An entry where? When I did a Google search for "hexadectet", it assumed I meant "hexadentate", but when I told it that I really did mean "hexadectet", it found items such as

    http://tools.ietf.org/id/draft-denog-v6ops-addresspartnaming-02.txt

    "4.7. Hexadectet

    "Hexadectet" is directly derived from IPv4's "octet", thus technically correct and probably convenient to get used to. On the other hand, it is much harder to pronounce."

and

    http://www.imc.org/ietf-822/old-archive1/msg02577.html

from 1992:

    "I hasten to admit that modeling the path between the decoder and the richtext parser as a hexadectet stream has its own problems, mainly in that it makes the richtext parser harder to write (one has to be careful before using any standard string manipulation routines)."
Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
On Jun 28, 2011, at 3:22 AM, Jakub Zawadzki wrote:

> Btw. I know that nowadays I'm the only one who uses non-utf locales on
> console, but when we print on console (stdout/stderr) I think we should
> use strerror() from libc, i.e. strerror() which don't recode message to
> utf-8.

It's more complicated than that.

There are many sources of strings in the non-GUI output of the programs in the Wireshark suite:

    the message text itself - that's generally ASCII;

    file names - internally to those programs, those are in UTF-8;

    error strings for errno values and signal-name strings from signals - those might be in the current locale for strerror()/strsignal() and would be in UTF-8 with g_strerror()/g_strsignal();

    etc.

In addition, the non-GUI output of the program can be sent either to the terminal or to files.

Output to the terminal should be in whatever character set the terminal expects. I'm not sure what would indicate the character set the terminal expects. On my machine, the "terminal" is Terminal.app, and can handle UTF-8 output; on other UN*Xes, in the GUI, it's probably similar. For consoles (which I'm using here to mean "no GUI, just the console of a workstation/personal computer") it might be less capable. For real terminals, it's almost certainly less capable; I'm not sure whether there's ever been a real serial-port terminal that handles UTF-8. I don't know what the various terminal emulators for Windows, e.g. cmd.exe, do.

Output to files, whether it's the result of redirecting the standard output or error of a command-line program to a file, or of one of the "export to a text file" operations in Wireshark, or..., is another matter. It might be that the character encoding should be the same as would be used on a terminal.

In any case, that means that using strerror() is probably not going to be sufficient to fix the problem. What we might want to do is use UTF-8 everywhere we can, and, for non-GUI output, convert to the appropriate character encoding - whatever that might be - at the last minute.
Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
On Jun 28, 2011, at 3:33 AM, Stig Bjørlykke wrote:

> Do we always know where the error message is used?
> I suspect file_open_error_message() is used both in GUI and tshark.

Yes - it's in epan.
Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
On Jun 28, 2011, at 6:10 AM, Stig Bjørlykke wrote:

> On Tue, Jun 28, 2011 at 2:58 AM, Guy Harris wrote:
>> 1) UN*Xes where LANG etc. aren't set to a locale with UTF-8 as the
>> encoding (are you seeing the issue with Norwegian characters on your
>> system? If so, what's the setting of LANG?);
>
> I only had issues with Norwegian characters in file names reported via
> simple_dialog(), and my LANG is empty.

OK, what OS are you using? If it's a UN*X, try compiling and running the attached C program; does it print your name correctly on your terminal/terminal emulator (it writes it out in UTF-8), and does the file it creates (your name is its name - yeah, complete with a space between "Stig" and "Bjørlykke", and with no ".txt" at the end) have a name that shows up correctly if you do "ls"?

If it's Windows, then you're probably just seeing bug 5715.

> Another problem is that we still have issues regarding UTF-8 strings
> in packets. We should really fix that...

We have an issue regarding strings in packets in general. Strings might be in a number of encodings, including ASCII (meaning that any byte with the 8th bit set is something that shouldn't be there), other national variants of ISO 646, UTF-8, UTF-16, UCS-2 (meaning "only the Basic Multilingual Plane, with no surrogate pairs"), ISO 8859/x for various values of x, various ISO 2022-based encodings (e.g., the EUC encodings), various national standards, various DOS and Windows code pages, various Mac OS encodings, EBCDIC, whatever encodings are used for SMS, etc., etc., etc., etc.:

    http://en.wikipedia.org/wiki/Template:Character_encoding

I don't know whether all of the encodings in question can be mapped to Unicode without information loss. An arbitrary string of octets definitely can't be mapped to UTF-8 without information loss; consider a putatively UTF-8-encoded string that contains an octet sequence that's not valid in UTF-8.

Perhaps, in the Wireshark dissection engine, we should initially store string values as a pair {encoding, counted octet string} (counted so that octets with the value 0 don't cause problems), and:

    when putting them into a textual representation of the protocol tree or into columns or something else to be shown to humans, map them to UTF-8, with anything that can't be mapped to UTF-8 - including, if the encoding is putatively UTF-8, octet sequences that aren't valid UTF-8 sequences - shown as the Unicode replacement character U+FFFD;

    when comparing them in a display filter, attempt to map them to UTF-8 (and save the result), and:

        if the mapping fails, treat *all* comparisons except for inequality as failing, and treat comparisons for inequality as succeeding;

        if the mapping succeeds, compare the two strings;

    when making them available to software inside *Shark (C/C++ code, Lua code, Python code, etc.), attempt to convert them to whatever the appropriate representation is (presumably UTF-8), and have the routines to fetch those values support returning a "conversion failed" indication (or perhaps offer both a "convert for display to humans" version that uses U+FFFD for failure and a "convert for processing" version that returns "can't do it" for failure).

Here's the program I mentioned above:

    [Attachment: norsk.c (binary data)]
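The display-filter rule proposed above (a failed mapping makes every comparison fail except inequality) can be sketched as follows. All type and function names here (pkt_string, cmp_op, pkt_string_compare()) are hypothetical and do not exist in Wireshark. The sketch also assumes that the cached UTF-8 mapping is NUL-terminated, or NULL when the mapping failed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical comparison operators for a display filter. */
typedef enum { CMP_EQ, CMP_NE, CMP_LT, CMP_GT } cmp_op;

/* Hypothetical {encoding, counted octet string} pair. */
typedef struct {
    const char          *encoding;  /* e.g. "UTF-8", "ISO-8859-1" */
    const unsigned char *octets;    /* raw packet bytes; may contain 0 */
    size_t               length;    /* number of octets */
    const char          *utf8;      /* saved UTF-8 mapping, NULL on failure */
} pkt_string;

bool pkt_string_compare(const pkt_string *a, const pkt_string *b, cmp_op op)
{
    /* If either side could not be mapped, only inequality succeeds. */
    if (a->utf8 == NULL || b->utf8 == NULL)
        return op == CMP_NE;

    /* Both mapped: compare the UTF-8 forms byte by byte. */
    int order = strcmp(a->utf8, b->utf8);
    switch (op) {
    case CMP_EQ: return order == 0;
    case CMP_NE: return order != 0;
    case CMP_LT: return order < 0;
    case CMP_GT: return order > 0;
    }
    return false;
}
```

One consequence worth noting is that two unmappable strings compare as unequal even when their octets are identical, much as NaN does in floating-point comparisons.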
Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
On Jun 28, 2011, at 10:01 AM, Guy Harris wrote:

> In any case, that means that using strerror() is probably not going to
> be sufficient to fix the problem. What we might want to do is use UTF-8
> everywhere we can, and, for non-GUI output, convert to the appropriate
> character encoding - whatever that might be - at the last minute.

And then there's input.

Input from the GUI is in UTF-8.

We don't have any programs that read interactive user input from the command line, unless I've missed something, *but* we have programs that take arguments from the command line. If you're typing commands interactively, those are presumably in the encoding of the terminal or terminal emulator. If you're running commands from a script file, they're in whatever the character encoding is for the file.

I note that GLib, at least, appears to allow the file name character encoding to differ from the locale's character encoding. I originally didn't see why this made sense, but I guess they might differ if, say, you're looking at somebody else's files and you're not both using UTF-8 and you're using different encodings, which could conceivably happen on UN*X. (As

    http://developer.gnome.org/glib/stable/glib-Character-Set-Conversion.html#g-get-filename-charsets

notes, "on Unix, regardless of the locale character set or G_FILENAME_ENCODING value, the actual file names present on a system might be in any random encoding or just gibberish", but I digress.)
Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
On Jun 28, 2011, at 10:27 AM, Guy Harris wrote:

> when putting them into a textual representation of the protocol tree or
> into columns or something else to be shown to humans, map them to UTF-8,
> with anything that can't be mapped to UTF-8 - including, if the encoding
> is putatively UTF-8, octet sequences that aren't valid UTF-8 sequences -
> shown as the Unicode replacement character U+FFFD;

...and, for "for display" conversions, we might want to convert control characters to "Control Pictures" symbols (0x0000 to 0x001F convert to 0x2400 to 0x241F: ␀, ␁, etc. through ␟; 0x007F converts to 0x2421, i.e. ␡ - in the font in which this message is being displayed to me, those have the control character abbreviations displayed in really really small letters, diagonally from upper left to lower right; unfortunately, I see nothing for C1 control characters).
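A "convert for display" pass along these lines could be sketched as below, for input that is putatively UTF-8. Invalid octets become U+FFFD, and C0 controls and DEL become the corresponding Control Pictures. The function names are hypothetical. Emitting one U+FFFD per invalid octet is an assumption of this sketch, since the thread does not say how runs of bad octets should be handled.

```c
#include <stddef.h>
#include <string.h>

/* Returns the length (1-4) of the well-formed UTF-8 sequence at s, or 0
 * if it is invalid (bad lead byte, truncation, overlong form, surrogate,
 * or value above U+10FFFF). Requires avail >= 1. */
static size_t utf8_seq_len(const unsigned char *s, size_t avail)
{
    unsigned char c = s[0];
    size_t n, i;

    if (c < 0x80) return 1;
    else if (c >= 0xC2 && c <= 0xDF) n = 2;
    else if (c >= 0xE0 && c <= 0xEF) n = 3;
    else if (c >= 0xF0 && c <= 0xF4) n = 4;
    else return 0;

    if (avail < n) return 0;
    for (i = 1; i < n; i++)
        if ((s[i] & 0xC0) != 0x80) return 0;

    if (c == 0xE0 && s[1] < 0xA0) return 0;   /* overlong 3-byte form */
    if (c == 0xED && s[1] > 0x9F) return 0;   /* UTF-16 surrogates */
    if (c == 0xF0 && s[1] < 0x90) return 0;   /* overlong 4-byte form */
    if (c == 0xF4 && s[1] > 0x8F) return 0;   /* above U+10FFFF */
    return n;
}

/* Hypothetical display conversion. Writes a NUL-terminated UTF-8 string
 * to out and returns its length, or (size_t)-1 if out is too small. */
size_t utf8_for_display(const unsigned char *in, size_t len,
                        char *out, size_t outcap)
{
    size_t i = 0, o = 0;

    while (i < len) {
        unsigned char tmp[4];
        size_t n = utf8_seq_len(in + i, len - i), w;

        if (n == 0) {
            /* Invalid octet: emit U+FFFD (EF BF BD), skip one octet. */
            tmp[0] = 0xEF; tmp[1] = 0xBF; tmp[2] = 0xBD;
            w = 3; n = 1;
        } else if (n == 1 && (in[i] < 0x20 || in[i] == 0x7F)) {
            /* U+2400..U+241F are E2 90 80..9F; U+2421 is E2 90 A1. */
            tmp[0] = 0xE2; tmp[1] = 0x90;
            tmp[2] = (in[i] == 0x7F) ? 0xA1 : (unsigned char)(0x80 + in[i]);
            w = 3;
        } else {
            memcpy(tmp, in + i, n);           /* valid: copy unchanged */
            w = n;
        }
        if (o + w + 1 > outcap) return (size_t)-1;
        memcpy(out + o, tmp, w);
        o += w;
        i += n;
    }
    if (o + 1 > outcap) return (size_t)-1;
    out[o] = '\0';
    return o;
}
```

C1 control characters (U+0080 to U+009F) pass through unchanged here, since no Control Pictures exist for them.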
Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
On Jun 28, 2011, at 10:43 AM, Guy Harris wrote:

> On Jun 28, 2011, at 10:27 AM, Guy Harris wrote:
>
>> when putting them into a textual representation of the protocol tree or
>> into columns or something else to be shown to humans, map them to UTF-8,
>> with anything that can't be mapped to UTF-8 - including, if the encoding
>> is putatively UTF-8, octet sequences that aren't valid UTF-8 sequences -
>> shown as the Unicode replacement character U+FFFD;
>
> ...and, for "for display" conversions, we might want to convert control
> characters to "Control Pictures" symbols (0x0000 to 0x001F convert to
> 0x2400 to 0x241f: ␀, ␁, etc. through ␟; 0x007F converts to 0x2421, i.e.
> ␡ - in the font in which this message is being displayed to me, those
> have the control character abbreviations displayed in really really
> small letters, diagonally from upper left to lower right; unfortunately,
> I see nothing for C1 control characters).

    http://en.wikipedia.org/wiki/Template:Unicode_chart_Control_Pictures

That claims that this is "as of Unicode 6.0", so, if true, either they have a different name for control pictures for C1 control characters or there aren't any. (I have no idea what those other symbols are doing in there.)

U+FFFD is often shown as a white question mark inside a black diamond:

    http://en.wikipedia.org/wiki/Specials_(Unicode_block)

Oh, and if we're going to be extremely completist, there are the EBCDIC control characters, for which there are not always control pictures; see table 5.1:

    ftp://kermit.columbia.edu/kermit/ucsterminal/control.txt

This was from 1998. I don't know whether any of the proposals were accepted.
Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
On Jun 28, 2011, at 10:27 AM, Guy Harris wrote:

> We have an issue regarding strings in packets in general. Strings might
> be in a number of encodings, including ASCII (meaning that any byte with
> the 8th bit set is something that shouldn't be there), other national
> variants of ISO 646, UTF-8, UTF-16, UCS-2 (meaning "only the Basic
> Multilingual plane, with no surrogate pairs"), ISO 8859/x for various
> values of x, various ISO 2022-based encodings (e.g., the EUC encodings),
> various national standards, various DOS and Windows code pages, various
> Mac OS encodings, EBCDIC, whatever encodings are used for SMS, etc.,
> etc., etc, etc.:
>
> http://en.wikipedia.org/wiki/Template:Character_encoding

As long as I'm piling up a ton of information about humanity's twisty little maze of character encodings, all different:

SMS:

    https://secure.wikimedia.org/wikipedia/en/wiki/GSM_03.38
Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
On Tue, Jun 28, 2011 at 7:27 PM, Guy Harris wrote:

> OK, what OS are you using?

    Snow:~ stig$ uname -a
    Darwin Snow.local 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun 7 16:33:36 PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386 i386
    Snow:~ stig$ echo $LANG

    Snow:~ stig$ gcc norsk.c -o norsk && ./norsk
    Stig Bjørlykke
    Now creating a file with Stig's name as its name
    Snow:~ stig$ ls -l Stig\ Bjørlykke
    -rw-r--r-- 1 stig staff 16 Jun 28 21:20 Stig Bjørlykke

Everything works here. I don't know anything about Windows.

--
Stig Bjørlykke
Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
On Jun 28, 2011, at 12:25 PM, Stig Bjørlykke wrote:

> On Tue, Jun 28, 2011 at 7:27 PM, Guy Harris wrote:
>> OK, what OS are you using?
>
> Snow:~ stig$ uname -a
> Darwin ...

Well, that answers *that* question. :-)

So the locale's encoding should probably be UTF-8, given that it's OS X.

However, if LANG is blank, you presumably don't have Terminal set up to "Set local environment variables on startup" (Preferences > Settings > Advanced, at the bottom); I think I turned that on a while ago, perhaps to get some UN*X software to correctly handle UTF-8.

Just out of curiosity, if you set that (or if you explicitly set LANG to something appropriate ending in ".UTF-8", whether it's no_NO.UTF-8, nn_NO.UTF-8, nb_NO.UTF-8, en_NO.UTF-8, or some other setting), does that make the GUI problem go away with a version of Wireshark *without* the

    http://anonsvn.wireshark.org/viewvc?revision=37812&view=revision

changes? (Whether it has other side-effects is another matter; it might, for example, affect the parsing and output of numbers and dates, for better or for worse.)
Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
On Tue, Jun 28, 2011 at 9:37 PM, Guy Harris wrote:

> However, if LANG is blank, you presumably don't have Terminal set up to
> "Set local environment variables on startup" (Preferences > Settings >
> Advanced, at the bottom);

Actually I have "Set local environment variables on startup" checked. I also have "Character encoding: Unicode (UTF-8)". I use English as my preferred language and Norway as region.

> Just out of curiosity, if you set that (or if you explicitly set LANG to
> something appropriate ending in ".UTF-8", whether it's no_NO.UTF-8,
> nn_NO.UTF-8, nb_NO.UTF-8, en_NO.UTF-8, or some other setting), does that
> make the GUI problem go away with a version of Wireshark *without* the
>
> http://anonsvn.wireshark.org/viewvc?revision=37812&view=revision
>
> changes?

Normally I run Wireshark.app generated from 'make osx-install', and getenv("LANG") returns ".UTF-8". No luck with rev < 37812.

When running from command line with LANG=no_NO.UTF-8 I get this:

    (process:65298): Gtk-WARNING **: Locale not supported by C library.
    Using the fallback 'C' locale.

but I get a correct error message with rev < 37812 and "æøå.pcap" or "Проверка.pcap" as filename.

So, if I run with a UTF-8 locale, g_locale_to_utf8() will not do any conversion, and when running with a locale without UTF-8 (or one that is not legal) we get the error in bug 5715.

The bug was reported for Windows, but I don't know how it works there. I have tested on OSX and Ubuntu Linux.

Maybe we should include the locale in our about box? We may use it in bug reports.

--
Stig Bjørlykke
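For the about-box idea, a minimal sketch of collecting the relevant locale information might look like this. The function name format_locale_summary() and the output format are hypothetical. Note that calling setlocale(LC_CTYPE, "") changes the process locale, which a GTK+ program will normally have done already during start-up.

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: formats a one-line locale summary suitable for an
 * about box or a bug report. Returns what snprintf() returns. */
int format_locale_summary(char *buf, size_t buflen)
{
    const char *lang = getenv("LANG");
    /* setlocale() returns NULL when the C library does not support the
     * locale requested by the environment. */
    const char *active = setlocale(LC_CTYPE, "");

    return snprintf(buf, buflen, "LANG=%s, LC_CTYPE locale=%s",
                    lang != NULL ? lang : "(unset)",
                    active != NULL ? active : "(not supported by C library)");
}
```

Reporting both values would distinguish the cases seen in this thread: an empty LANG, a LANG of ".UTF-8", and a LANG naming a locale the C library does not support.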
Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
On 28/06/2011 18:27, Guy Harris wrote:
> On Jun 28, 2011, at 6:10 AM, Stig Bjørlykke wrote:
>
>> On Tue, Jun 28, 2011 at 2:58 AM, Guy Harris wrote:
>>> 1) UN*Xes where LANG etc. aren't set to a locale with UTF-8 as the
>>> encoding (are you seeing the issue with Norwegian characters on your
>>> system? If so, what's the setting of LANG?);
>> I only had issues with Norwegian characters in file names reported via
>> simple_dialog(), and my LANG is empty.
>
> OK, what OS are you using? If it's a UN*X, try compiling and running the
> attached C program; does it print your name correctly on your
> terminal/terminal emulator (it writes it out in UTF-8), and does the file it
> creates (your name is its name - yeah, complete with a space between "Stig"
> and "Bjørlykke", and with no ".txt" at the end) have a name that shows up
> correctly if you do "ls"? If it's Windows, then you're probably just seeing
> bug 5715.
>
>> Another problem is that we still have issues regarding UTF-8 strings
>> in packets. We should really fix that...
>
> We have an issue regarding strings in packets in general. Strings might be
> in a number of encodings, including ASCII (meaning that any byte with the 8th
> bit set is something that shouldn't be there), other national variants of ISO
> 646, UTF-8, UTF-16, UCS-2 (meaning "only the Basic Multilingual plane, with
> no surrogate pairs"), ISO 8859/x for various values of x, various ISO
> 2022-based encodings (e.g., the EUC encodings), various national standards,
> various DOS and Windows code pages, various Mac OS encodings, EBCDIC,
> whatever encodings are used for SMS, etc., etc., etc., etc.:
>
>     http://en.wikipedia.org/wiki/Template:Character_encoding
>
> I don't know whether all of the encodings in question can be mapped to
> Unicode without information loss. An arbitrary string of octets definitely
> can't be mapped to UTF-8 without information loss; consider a putatively
> UTF-8-encoded string that contains an octet sequence that's not valid in
> UTF-8.
>
> Perhaps, in the Wireshark dissection engine, we should initially store string
> values as a pair {encoding, counted octet string} (counted so that octets
> with the value 0 don't cause problems), and:
>
>     when putting them into a textual representation of the protocol tree or
>     into columns or something else to be shown to humans, map them to UTF-8,
>     with anything that can't be mapped to UTF-8 - including, if the encoding
>     is putatively UTF-8, octet sequences that aren't valid UTF-8 sequences -
>     shown as the Unicode replacement character U+FFFD;
>
>     when comparing them in a display filter, attempt to map them to UTF-8
>     (and save the result), and:
>
>         if the mapping fails, treat *all* comparisons except for
>         inequality as failing, and treat comparisons for inequality as
>         succeeding;
>
>         if the mapping succeeds, compare the two strings;
>
>     when making them available to software inside *Shark (C/C++ code, Lua
>     code, Python code, etc.), attempt to convert them to whatever the
>     appropriate representation is (presumably UTF-8), and have the routines
>     to fetch those values support returning a "conversion failed" indication
>     (or perhaps offer both a "convert for display to humans" version that
>     uses U+FFFD for failure and a "convert for processing" version that
>     returns "can't do it" for failure).
>
> Here's the program I mentioned above:

For reference, here's the test executable output on Win7, using the SDK 7.0
build environment (a cmd.prompt):

c:\temp>test
Stig Bj├©rlykke
Now creating a file with Stig's name as its name

c:\temp>dir
 Volume in drive C has no label.
 Volume Serial Number is D845-44D4

 Directory of c:\temp

29/06/2011  10:30                   .
29/06/2011  10:30                   ..
29/06/2011  10:30                17 Stig Bjørlykke
29/06/2011  10:28            77,312 test.exe
               2 File(s)         77,329 bytes
               2 Dir(s)  65,078,947,840 bytes free

The output of the executable was the same using Powershell.

-- 
Regards,

Graham Bloice
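The display and filter rules in the proposal quoted above can be sketched in a few lines of C. This is only an illustration, not Wireshark code: the type counted_string and the functions utf8_seq_len(), to_display_utf8(), is_valid_utf8() and filter_compare() are hypothetical names, only the "putatively UTF-8" case is handled, and emitting one U+FFFD per offending octet is an assumption (the proposal does not say how many replacement characters a bad sequence should produce).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type: a string as captured, before any conversion. */
typedef struct {
    const char *encoding;       /* e.g. "UTF-8"; only UTF-8 is handled here */
    const unsigned char *data;  /* raw octets, may contain octets of value 0 */
    size_t len;                 /* number of octets in data */
} counted_string;

/* Length (1-4) of the well-formed UTF-8 sequence starting at p,
 * or 0 if the octets at p do not begin a well-formed sequence. */
static size_t utf8_seq_len(const unsigned char *p, size_t avail)
{
    size_t n, i;
    unsigned long cp, min;

    if (avail == 0) return 0;
    if (p[0] < 0x80) return 1;
    if ((p[0] & 0xE0) == 0xC0)      { n = 2; cp = p[0] & 0x1F; min = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0) { n = 3; cp = p[0] & 0x0F; min = 0x800; }
    else if ((p[0] & 0xF8) == 0xF0) { n = 4; cp = p[0] & 0x07; min = 0x10000; }
    else return 0;                  /* stray continuation or invalid lead octet */
    if (avail < n) return 0;        /* sequence is cut off */
    for (i = 1; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values above U+10FFFF. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return n;
}

/* "Convert for display": copy well-formed sequences, turn every offending
 * octet into U+FFFD. Returns octets written, or (size_t)-1 if out is too small. */
static size_t to_display_utf8(const counted_string *s, unsigned char *out, size_t cap)
{
    static const unsigned char repl[3] = { 0xEF, 0xBF, 0xBD }; /* U+FFFD */
    size_t i = 0, o = 0;

    assert(s != NULL && out != NULL);
    while (i < s->len) {
        size_t n = utf8_seq_len(s->data + i, s->len - i);
        const unsigned char *src = n ? s->data + i : repl;
        size_t w = n ? n : sizeof repl;
        if (o + w > cap) return (size_t)-1;
        memcpy(out + o, src, w);
        o += w;
        i += n ? n : 1;
    }
    return o;
}

/* "Convert for processing": succeeds only if every octet is well-formed. */
static bool is_valid_utf8(const counted_string *s)
{
    size_t i = 0;
    while (i < s->len) {
        size_t n = utf8_seq_len(s->data + i, s->len - i);
        if (n == 0) return false;
        i += n;
    }
    return true;
}

typedef enum { CMP_EQ, CMP_NE } cmp_op;

/* Filter rule: if either mapping fails, only inequality succeeds. */
static bool filter_compare(const counted_string *a, const counted_string *b, cmp_op op)
{
    bool equal;
    if (!is_valid_utf8(a) || !is_valid_utf8(b))
        return op == CMP_NE;
    equal = a->len == b->len && memcmp(a->data, b->data, a->len) == 0;
    return op == CMP_EQ ? equal : !equal;
}
```

With this sketch a string holding the octets 'B', 'j', 0xF8, 'r' is displayed as "Bj", U+FFFD, "r", compares unequal to everything, and never compares equal, not even to itself. Caching the converted value ("and save the result") is left out for brevity.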
Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
On Jun 29, 2011, at 2:37 AM, Graham Bloice wrote:
> For reference, here's the test executable output on Win7, using the SDK 7.0
> build environment (a cmd.prompt):

Not surprisingly, it doesn't work.

Microsoft introduced Unicode support when they introduced Win32; as they
were introducing a new API, they could make the versions of the API that
support Unicode take UCS-2 (later UTF-16) strings as arguments. They also
offered "ASCII" versions, which took strings in the local code page as
arguments. This also applies to the C library's routines, such as
open()/_open().

UN*X systems already had a well-established API when they introduced
Unicode support, and they had what amounted to code pages (the various ISO
8859/x encodings, the EUC encodings, assorted other encodings); instead,
they added a new "code page", with UTF-8 encoding.

The program was written for UN*X, to test whether, in the user's locale,
UTF-8 strings work. In Windows, the ASCII API it was using to create a file
would take your local code page, not UTF-8, as the string encoding, and I
suspect cmd.exe also expects "ASCII" output from programs - such as when the
test program was printing Stig's name - to be in the local code page, not
UTF-8.

This is why GLib has file functions that do mapping on file names; the page
at

    http://developer.gnome.org/glib/stable/glib-File-Utilities.html

says

    There is a group of functions which wrap the common POSIX functions
    dealing with filenames (g_open(), g_rename(), g_mkdir(), g_stat(),
    g_unlink(), g_remove(), g_fopen(), g_freopen()). The point of these
    wrappers is to make it possible to handle file names with any Unicode
    characters in them on Windows without having to use ifdefs and the
    wide character API in the application code.

    The pathname argument should be in the GLib file name encoding. On
    POSIX this is the actual on-disk encoding which might correspond to
    the locale settings of the process (or the G_FILENAME_ENCODING
    environment variable), or not.

    On Windows the GLib file name encoding is UTF-8. Note that the
    Microsoft C library does not use UTF-8, but has separate APIs for
    current system code page and wide characters (UTF-16). The GLib
    wrappers call the wide character API if present (on modern Windows
    systems), otherwise convert to/from the system code page.

    Another group of functions allows to open and read directories in the
    GLib file name encoding. These are g_dir_open(), g_dir_read_name(),
    g_dir_rewind(), g_dir_close().

This is also why we have our own copies of some of those functions on
Windows, and wrap them ourselves (so that we don't require GLib 2.6, which
introduced them, for all platforms).
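To make the wrapper idea above concrete, here is a rough sketch of what such a wrapper might look like. It is neither GLib's implementation nor Wireshark's own copy; utf8_fopen() is a hypothetical name. On Windows it converts the UTF-8 name to UTF-16 with MultiByteToWideChar() and calls _wfopen(); elsewhere it assumes file names on disk are already UTF-8 and passes the name straight to fopen(), whereas GLib would also consult the locale or G_FILENAME_ENCODING.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _WIN32
#include <stdlib.h>
#include <windows.h>
#endif

/* Hypothetical helper: open a file whose name is given in UTF-8. */
static FILE *utf8_fopen(const char *utf8_name, const char *mode)
{
#ifdef _WIN32
    wchar_t *wname;
    wchar_t wmode[16];
    FILE *fp;
    int n;

    assert(utf8_name != NULL && mode != NULL);

    /* First call asks how many UTF-16 units are needed (including the NUL). */
    n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, utf8_name, -1, NULL, 0);
    if (n <= 0)
        return NULL;            /* name is not valid UTF-8 */
    if (MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 16) <= 0)
        return NULL;

    wname = malloc((size_t)n * sizeof *wname);
    if (wname == NULL)
        return NULL;
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, utf8_name, -1, wname, n);

    /* The wide-character API takes UTF-16, so no code page is involved. */
    fp = _wfopen(wname, wmode);
    free(wname);
    return fp;
#else
    assert(utf8_name != NULL && mode != NULL);
    /* Assumption: on-disk file names are UTF-8, so no mapping is needed. */
    return fopen(utf8_name, mode);
#endif
}
```

Calling utf8_fopen("Stig Bj\xC3\xB8rlykke", "w") instead of fopen() is the kind of change the test program would need for the file name to come out right on Windows.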
Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
On Tue, Jun 28, 2011 at 7:01 PM, Guy Harris wrote:
> In any case, that means that using strerror() is probably not going to be
> sufficient to fix the problem. What we might want to do is use UTF-8
> everywhere we can, and, for non-GUI output, convert to the appropriate
> character encoding - whatever that might be - at the last minute.

Ok, what about trying to convert back to the locale encoding when
outputting error messages from tshark?
Something like the attached patch, maybe?

-- 
Stig Bjørlykke

tshark-print-locale.patch
Description: Binary data
Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
On Jun 29, 2011, at 1:45 PM, Stig Bjørlykke wrote:
> Ok, what about trying to convert back to the locale encoding when
> outputting error messages from tshark?
> Something like the attached patch, maybe?

Something like that, but with a g_free() of "string" afterwards. :-)
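The patch itself is not preserved in the archive, so the following is not it. It is only a sketch of the "convert at the last minute, then free the result" pattern being discussed, written against POSIX iconv() rather than GLib so that it stands alone. utf8_to_codeset() is a hypothetical helper; in GLib the corresponding call would presumably be g_locale_from_utf8(), whose result is newly allocated and must be released with g_free(), which is the point of the remark above.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string to `codeset`
 * (assumed to be a byte-oriented encoding such as "ISO-8859-1").
 * Returns a malloc()ed string that the caller must free(), or NULL if the
 * conversion is not possible. */
static char *utf8_to_codeset(const char *utf8, const char *codeset)
{
    iconv_t cd;
    size_t in_left, out_left, cap;
    char *in, *out, *buf;

    assert(utf8 != NULL && codeset != NULL);

    cd = iconv_open(codeset, "UTF-8");
    if (cd == (iconv_t)-1)
        return NULL;                    /* unknown target encoding */

    in_left = strlen(utf8);
    cap = in_left * 4 + 8;              /* generous bound, plus room for NUL */
    buf = malloc(cap);
    if (buf == NULL) {
        iconv_close(cd);
        return NULL;
    }

    in = (char *)utf8;                  /* iconv() does not modify the input */
    out = buf;
    out_left = cap - 1;
    if (iconv(cd, &in, &in_left, &out, &out_left) == (size_t)-1 ||
        iconv(cd, NULL, NULL, &out, &out_left) == (size_t)-1) {
        /* Invalid input or a character the target encoding lacks. */
        free(buf);
        iconv_close(cd);
        return NULL;
    }
    *out = '\0';
    iconv_close(cd);
    return buf;
}
```

A caller would convert the message just before writing it to stderr, print the converted copy (or fall back to the UTF-8 original if NULL comes back), and then free() it. In a real program the target encoding would come from the locale, for example nl_langinfo(CODESET) after setlocale(LC_ALL, ""), rather than being passed in by hand.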