Your message dated Mon, 1 Mar 2010 14:09:38 -0600 with message-id <[email protected]> and subject line Re: mawk: UTF-8 multibyte characters are not handled properly has caused the Debian Bug report #404980, regarding mawk: UTF-8 multibyte characters are not handled properly to be marked as done.
This means that you claim that the problem has been dealt with. If this is not the case it is now your responsibility to reopen the Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this message is talking about, this may indicate a serious mail system misconfiguration somewhere. Please contact [email protected] immediately.)

--
404980: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=404980
Debian Bug Tracking System
Contact [email protected] with problems
--- Begin Message ---
Package: gawk
Version: 1:3.1.4-2
Severity: important

gawk does not handle UTF-8 multibyte characters properly. Here's an example:

$ cat example.txt

A Only_a_singlebyte_character_here_(UTF-8:_41)
Ö A_letter_which_takes_two_bytes_(UTF-8:_c3_96)
€ A_currency_symbol_which_takes_three_bytes_(UTF-8:_e2_82_ac)

$ cat example.txt | awk '{ printf "%-5s%s\n",$1, $2 }'

A    Only_a_singlebyte_character_here_(UTF-8:_41)
Ö   A_letter_which_takes_two_bytes_(UTF-8:_c3_96)
€  A_currency_symbol_which_takes_three_bytes_(UTF-8:_e2_82_ac)

As we can see the format specifier %-5s does not calculate field widths correctly when string contains multibyte characters. Unfortunately this makes gawk's field widths mostly unusable with UTF-8 locale.

-- System Information:
Debian Release: 3.1
  APT prefers testing
  APT policy: (850, 'testing'), (800, 'unstable')
Architecture: i386 (i686)
Kernel: Linux 2.6.8-2-k7
Locale: LANG=fi_FI.UTF-8, LC_CTYPE=fi_FI.UTF-8 (charmap=UTF-8)

Versions of packages gawk depends on:
ii  libc6  2.3.2.ds1-22  GNU C Library: Shared libraries an

-- no debconf information
--- End Message ---
--- Begin Message ---
clone 404980 -1
retitle -1 mawk: Please add a function wrapping wcswidth()
severity -1 wishlist
tags -1 + upstream
thanks

Hi Teemu,

Teemu Likonen wrote:

> $ cat example.txt
>
> A Only_a_singlebyte_character_here_(UTF-8:_41)
> Ö A_letter_which_takes_two_bytes_(UTF-8:_c3_96)
> € A_currency_symbol_which_takes_three_bytes_(UTF-8:_e2_82_ac)
>
> $ cat example.txt | awk '{ printf "%-5s%s\n",$1, $2 }'
>
> A    Only_a_singlebyte_character_here_(UTF-8:_41)
> Ö   A_letter_which_takes_two_bytes_(UTF-8:_c3_96)
> €  A_currency_symbol_which_takes_three_bytes_(UTF-8:_e2_82_ac)
>
> As we can see the format specifier %-5s does not calculate field widths
> correctly when string contains multibyte characters.

This behavior is shared with C printf, and sadly it is required. POSIX is clear about this: the numeric argument to a %s format is a number of bytes. See the target of the “File Format Notation” link in
http://www.opengroup.org/onlinepubs/9699919799/utilities/awk.html#tag_20_06_13_10

So closing. On the other hand, the functionality you are asking for would be very nice to have in some form.

> Unfortunately this
> makes gawk's field widths mostly unusable with UTF-8 locale.

In C, it is understandable why it was chosen to use number of bytes, to avoid nonobvious buffer overflow bugs with sprintf(). That problem does not apply to awk, so maybe it would be possible to convince the Open Group people to change the behavior (or add a new function)?

See http://unix.org/2008edition/ for the latest standards, and http://austingroupbugs.net/main_page.php to contact the standards bodies.

Jonathan
--- End Message ---
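
The difference between the byte-counted width that POSIX requires for %s and the column-counted width the report asks for can be illustrated with a short sketch. It is written in Python rather than awk so that it behaves the same regardless of locale or awk implementation. The names pad_bytes, display_width and pad_columns are hypothetical helpers invented for this illustration; display_width is only a rough approximation of what wcswidth() computes (it treats combining marks as width 0 and East Asian wide/fullwidth characters as width 2), not a binding to the C function.

```python
import unicodedata

# Sample rows taken from the report: (first field, second field).
# "\u00d6" is the two-byte letter (c3 96), "\u20ac" the three-byte symbol (e2 82 ac).
ROWS = [
    ("A", "Only_a_singlebyte_character_here_(UTF-8:_41)"),
    ("\u00d6", "A_letter_which_takes_two_bytes_(UTF-8:_c3_96)"),
    ("\u20ac", "A_currency_symbol_which_takes_three_bytes_(UTF-8:_e2_82_ac)"),
]


def pad_bytes(field: str, width: int) -> str:
    """Left-justify like "%-5s" in the report: the width counts UTF-8 bytes."""
    raw = field.encode("utf-8")
    return raw.ljust(width).decode("utf-8")


def display_width(text: str) -> int:
    """Approximate terminal columns (assumption: a simplified wcswidth())."""
    columns = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks occupy no column of their own
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        columns += 2 if wide else 1
    return columns


def pad_columns(field: str, width: int) -> str:
    """Left-justify so that the result occupies `width` terminal columns."""
    return field + " " * max(0, width - display_width(field))


if __name__ == "__main__":
    print("byte-counted padding (what the report observed):")
    for first, second in ROWS:
        print(pad_bytes(first, 5) + second)

    print("column-counted padding (what the report asked for):")
    for first, second in ROWS:
        print(pad_columns(first, 5) + second)
```

With byte counting, the second column starts one position earlier for each extra byte in the first field; with column counting, all three rows line up. A wcswidth()-style function in awk, as suggested in the retitled wishlist bug, would allow the same manual padding there.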

