Your message dated Mon, 1 Mar 2010 14:09:38 -0600
with message-id <[email protected]>
and subject line Re: mawk: UTF-8 multibyte characters are not handled properly
has caused the Debian Bug report #404980,
regarding mawk: UTF-8 multibyte characters are not handled properly
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact [email protected]
immediately.)


-- 
404980: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=404980
Debian Bug Tracking System
Contact [email protected] with problems
--- Begin Message ---
Package: gawk
Version: 1:3.1.4-2
Severity: important


gawk does not handle UTF-8 multibyte characters properly. Here's an
example:


$ cat example.txt

A Only_a_singlebyte_character_here_(UTF-8:_41)
Ö A_letter_which_takes_two_bytes_(UTF-8:_c3_96)
€ A_currency_symbol_which_takes_three_bytes_(UTF-8:_e2_82_ac)


$ cat example.txt | awk '{ printf "%-5s%s\n",$1, $2 }'

A    Only_a_singlebyte_character_here_(UTF-8:_41)
Ö   A_letter_which_takes_two_bytes_(UTF-8:_c3_96)
€  A_currency_symbol_which_takes_three_bytes_(UTF-8:_e2_82_ac)


As we can see the format specifier %-5s does not calculate field widths
correctly when string contains multibyte characters. Unfortunately this
makes gawk's field widths mostly unusable with UTF-8 locale.


-- System Information:
Debian Release: 3.1
  APT prefers testing
  APT policy: (850, 'testing'), (800, 'unstable')
Architecture: i386 (i686)
Kernel: Linux 2.6.8-2-k7
Locale: LANG=fi_FI.UTF-8, LC_CTYPE=fi_FI.UTF-8 (charmap=UTF-8)

Versions of packages gawk depends on:
ii  libc6                       2.3.2.ds1-22 GNU C Library: Shared libraries an

-- no debconf information


--- End Message ---
--- Begin Message ---
clone 404980 -1
retitle -1 mawk: Please add a function wrapping wcswidth()
severity -1 wishlist
tags -1 + upstream
thanks

Hi Teemu,

Teemu Likonen wrote:

> $ cat example.txt
> 
> A Only_a_singlebyte_character_here_(UTF-8:_41)
> Ö A_letter_which_takes_two_bytes_(UTF-8:_c3_96)
> € A_currency_symbol_which_takes_three_bytes_(UTF-8:_e2_82_ac)
> 
> 
> $ cat example.txt | awk '{ printf "%-5s%s\n",$1, $2 }'
> 
> A    Only_a_singlebyte_character_here_(UTF-8:_41)
> Ö   A_letter_which_takes_two_bytes_(UTF-8:_c3_96)
> €  A_currency_symbol_which_takes_three_bytes_(UTF-8:_e2_82_ac)
> 
> 
> As we can see the format specifier %-5s does not calculate field widths
> correctly when string contains multibyte characters.

This behavior is shared with C printf, and sadly it is required.
POSIX is clear about this: the numeric argument to a %s format is a
number of bytes.  See the target of the “File Format Notation” link in
http://www.opengroup.org/onlinepubs/9699919799/utilities/awk.html#tag_20_06_13_10
So closing.
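
To make the arithmetic concrete, here is a small Python sketch that
emulates byte-counted padding. It is only an illustration of the rule,
not code from awk or libc, and pad_bytes is a hypothetical helper name.
The number of spaces it adds (4, 3 and 2) matches the output quoted
above.

```python
def pad_bytes(s: str, width: int) -> bytes:
    """Left-justify like %-Ns when the field width is counted in bytes.

    Hypothetical helper for illustration only: it pads with spaces until
    the UTF-8 byte length of the result reaches `width`.
    """
    raw = s.encode("utf-8")
    return raw + b" " * max(0, width - len(raw))


for ch in ("A", "Ö", "€"):
    nbytes = len(ch.encode("utf-8"))
    nspaces = len(pad_bytes(ch, 5)) - nbytes
    print(f"{ch!r}: {nbytes} byte(s), {nspaces} space(s) of padding")
```

Every padded field is exactly 5 bytes long, but it takes up 5, 4 and
3 terminal columns respectively, hence the ragged second column.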

On the other hand, the functionality you are asking for would be very
nice to have in some form.

> Unfortunately this
> makes gawk's field widths mostly unusable with UTF-8 locale.

In C, it is understandable why counting bytes was chosen: it avoids
nonobvious buffer overflow bugs with sprintf().  That problem does
not apply to awk, so maybe it would be possible to convince the
Open Group people to change the behavior (or add a new function)?
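
For what such a function might do, here is a rough Python sketch of
padding by display columns instead of bytes. It is an assumption-laden
approximation: display_width and pad_columns are hypothetical names,
and the width rule (combining marks count 0, East Asian wide and
fullwidth characters count 2, everything else counts 1) only
approximates what wcswidth() reports in a UTF-8 locale.

```python
import unicodedata


def display_width(s: str) -> int:
    """Approximate the number of terminal columns `s` occupies.

    Not a real wcswidth(): combining marks count as 0 columns, East Asian
    wide/fullwidth characters as 2, and everything else as 1.
    """
    width = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def pad_columns(s: str, width: int) -> str:
    """Left-justify `s` in a field of `width` display columns."""
    return s + " " * max(0, width - display_width(s))


for ch in ("A", "Ö", "€"):
    print(pad_columns(ch, 5) + "|")
```

With this rule all three sample characters get four spaces of padding,
so the second column lines up.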

See http://unix.org/2008edition/ for the latest standards and
http://austingroupbugs.net/main_page.php to contact the standards
bodies.

Jonathan


--- End Message ---
