Your message dated Fri, 5 Jan 2007 10:01:04 +0200
with message-id <[EMAIL PROTECTED]>
and subject line Not a bogofilter bug
has caused the attached Bug report to be marked as done.
This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.
(NB: If you are a system administrator and have no idea what I am
talking about this indicates a serious mail system misconfiguration
somewhere. Please contact me immediately.)
Debian bug tracking system administrator
(administrator, Debian Bugs database)
--- Begin Message ---
Package: bogofilter
Version: 1.1.3-1
Severity: serious
I report this as "serious" because this _should_ be fixed before Etch is
released. This bug causes bogofilter to work incorrectly in UTF-8
systems (which is Etch's default).
Debian Etch uses UTF-8 locales and charset by default. Bogofilter uses
ISO-8859-1 as its default system charset. This usually causes garbage
words to be written to the user's ~/.bogofilter/wordlist.db, since the
default charset for the _database_ is Unicode/UTF-8.
To reproduce:
Use UTF-8 locale:
$ locale
LANG=fi_FI.UTF-8
LC_CTYPE="fi_FI.UTF-8"
LC_NUMERIC="fi_FI.UTF-8"
LC_TIME="fi_FI.UTF-8"
LC_COLLATE="fi_FI.UTF-8"
LC_MONETARY="fi_FI.UTF-8"
LC_MESSAGES=en_US.UTF-8
LC_PAPER="fi_FI.UTF-8"
LC_NAME="fi_FI.UTF-8"
LC_ADDRESS="fi_FI.UTF-8"
LC_TELEPHONE="fi_FI.UTF-8"
LC_MEASUREMENT="fi_FI.UTF-8"
LC_IDENTIFICATION="fi_FI.UTF-8"
LC_ALL=
Use Bogofilter's default system charset (or set it to some
other 8-bit charset):
charset_default=iso-8859-1
Use Bogofilter's default word database charset (Unicode/UTF-8):
unicode=yes
(These can be defined in /etc/bogofilter.cf or ~/.bogofilter.cf)
Some background information: the letter "ä" is U+00E4 LATIN SMALL LETTER
A WITH DIAERESIS, and its UTF-8 encoding takes two bytes: $c3 $a4.
$ mv ~/.bogofilter/wordlist.db ~/wordlist.db-backup
$ echo "äiti" | bogofilter -n
$ bogoutil -d ~/.bogofilter/wordlist.db
head:äiti 0 1 20061213
This example shows that the letter "ä" is encoded _twice_ with UTF-8.
The "echo" command prints the letter "ä" encoded in UTF-8, but Bogofilter
assumes the input is ISO-8859-1 and encodes each byte separately: $c3
becomes "Ã" (U+00C3 LATIN CAPITAL LETTER A WITH TILDE) and $a4 becomes
"¤" (U+00A4 CURRENCY SIGN).
With the lines
charset_default=utf-8
unicode=yes
in the /etc/bogofilter.cf file, characters are encoded correctly.
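The double encoding described in this report can be reproduced outside
Bogofilter. The following is only a minimal Python sketch of the effect,
not Bogofilter's actual code: the function name double_encode is
hypothetical, and it simply assumes that UTF-8 input is misread as
ISO-8859-1 and then converted to UTF-8 for storage.

```python
def double_encode(text: str) -> bytes:
    """Encode text as UTF-8, misread those bytes as ISO-8859-1,
    then encode the misread characters as UTF-8 again."""
    utf8_bytes = text.encode("utf-8")          # what a UTF-8 terminal sends
    misread = utf8_bytes.decode("iso-8859-1")  # every byte becomes one character
    return misread.encode("utf-8")             # what ends up being stored


if __name__ == "__main__":
    print("ä".encode("utf-8").hex(" "))            # c3 a4
    print(double_encode("ä").hex(" "))             # c3 83 c2 a4
    print(double_encode("äiti").decode("utf-8"))   # Ã¤iti
```

The two original bytes $c3 $a4 turn into four bytes, $c3 $83 $c2 $a4,
which a UTF-8 terminal displays as "Ã¤". When the input is decoded as
UTF-8 in the first place, no such expansion happens.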
-- System Information:
Debian Release: 4.0
APT prefers testing
APT policy: (900, 'testing')
Architecture: i386 (i686)
Shell: /bin/sh linked to /bin/dash
Kernel: Linux 2.6.18-3-k7
Locale: LANG=fi_FI.UTF-8, LC_CTYPE=fi_FI.UTF-8 (charmap=UTF-8)
Versions of packages bogofilter depends on:
ii bogofilter-bdb 1.1.3-1 a fast Bayesian spam filter (Berke
bogofilter recommends no packages.
-- no debconf information
--- End Message ---
--- Begin Message ---
Since bogofilter seems to be working as expected, I close this.
--- End Message ---