Bug#402898: /etc/bogofilter.cf should define UTF-8 as default charset since it is Debian's default

2006-12-24 Thread Clint Adams
> Having lines
>   charset_default=utf-8
>   unicode=yes

Isn't unicode=yes already the default?


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of "unsubscribe". Trouble? Contact [EMAIL PROTECTED]



2006-12-26 Thread Samuel Thibault
Clint Adams wrote on Sun 24 Dec 2006 09:23:44 -0500:
> > Having lines
> >   charset_default=utf-8
> >   unicode=yes
> 
> Isn't unicode=yes already the default?

Nope, iso-8859-1 is the default (see configure.ac).  But UTF-8 could be
made the default by applying the attached patch (yes, I had to fix the
configure.ac script).

But actually, text tools should rather use the current locale's charset
(from nl_langinfo(CODESET)), instead of hardcoding it in configuration
files...
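
As a rough illustration of that suggestion, here is a minimal Python sketch
that asks the C library for the current locale's charset through
nl_langinfo(CODESET). The helper name locale_charset and the iso-8859-1
fallback are assumptions made for this example; it does not show how
bogofilter itself is implemented.

```python
import codecs
import locale


def locale_charset(fallback="iso-8859-1"):
    """Return the charset of the current locale, or `fallback` if unknown.

    `locale_charset` and its fallback value are hypothetical, for illustration.
    """
    try:
        # Adopt the user's LC_CTYPE settings (LANG, LC_ALL, ...) for this process.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        return fallback
    if not hasattr(locale, "nl_langinfo"):
        # nl_langinfo is not available on every platform.
        return fallback
    name = locale.nl_langinfo(locale.CODESET)
    try:
        # Normalise spellings such as "UTF-8" or "ANSI_X3.4-1968".
        return codecs.lookup(name).name
    except LookupError:
        return fallback


if __name__ == "__main__":
    print(locale_charset())
```

On a system where a locale such as fi_FI.UTF-8 is installed and selected,
nl_langinfo(CODESET) should report UTF-8; the fallback only applies when
the locale cannot be set or names a charset Python does not know.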

Samuel
diff -ur bogofilter-1.1.3/configure bogofilter-1.1.3-mine/configure
--- bogofilter-1.1.3/configure  2006-12-03 05:17:15.0 +0100
+++ bogofilter-1.1.3-mine/configure 2006-12-27 01:06:32.0 +0100
@@ -6137,6 +6137,7 @@
 #define DEFAULT_CHARSET "$withval"
 _ACEOF
 
+   DEFAULT_CHARSET="$withval"
 
 fi
 
diff -ur bogofilter-1.1.3/configure.ac bogofilter-1.1.3-mine/configure.ac
--- bogofilter-1.1.3/configure.ac   2006-12-03 04:55:30.0 +0100
+++ bogofilter-1.1.3-mine/configure.ac  2006-12-27 01:05:28.0 +0100
@@ -336,6 +336,7 @@
AC_DEFINE_UNQUOTED(DEFAULT_CHARSET, 
["$withval"], 
[Use specified default charset instead of iso-8859-1])
+   [DEFAULT_CHARSET="$withval"]
 )
 
 AC_SUBST(ENCODING)
Only in bogofilter-1.1.3-mine: configure-stamp
diff -ur bogofilter-1.1.3/debian/rules bogofilter-1.1.3-mine/debian/rules
--- bogofilter-1.1.3/debian/rules   2006-12-27 01:05:50.0 +0100
+++ bogofilter-1.1.3-mine/debian/rules  2006-12-27 01:09:19.0 +0100
@@ -26,11 +26,11 @@
 
$(INSTALL) -d obj-db obj-qdbm obj-sqlite
 
-   cd obj-db && CFLAGS="$(CFLAGS)" ../configure --with-database=db \
+   cd obj-db && CFLAGS="$(CFLAGS)" ../configure --with-database=db --with-charset=utf-8 \
        --prefix=/usr --mandir=\$${prefix}/share/man --sysconfdir=/etc
-   cd obj-qdbm && CPPFLAGS="-I/usr/include/qdbm" CFLAGS="$(CFLAGS)" ../configure --with-database=qdbm --program-suffix=-qdbm \
+   cd obj-qdbm && CPPFLAGS="-I/usr/include/qdbm" CFLAGS="$(CFLAGS)" ../configure --with-database=qdbm --program-suffix=-qdbm --with-charset=utf-8 \
        --prefix=/usr --mandir=\$${prefix}/share/man --sysconfdir=/etc
-   cd obj-sqlite && CFLAGS="$(CFLAGS)" ../configure --with-database=sqlite --program-suffix=-sqlite \
+   cd obj-sqlite && CFLAGS="$(CFLAGS)" ../configure --with-database=sqlite --program-suffix=-sqlite --with-charset=utf-8 \
        --prefix=/usr --mandir=\$${prefix}/share/man --sysconfdir=/etc && \
        sed -i 's/^INTEGRITY_TESTS.*/INTEGRITY_TESTS=t.lock1/' src/tests/Makefile
 
Only in bogofilter-1.1.3-mine: obj-db
Only in bogofilter-1.1.3-mine: obj-qdbm
Only in bogofilter-1.1.3-mine: obj-sqlite


2006-12-26 Thread Clint Adams
> > >   charset_default=utf-8
> > >   unicode=yes
> > 
> > Isn't unicode=yes already the default?
> 
> Nope, iso-8859-1 is (see configure.ac).  But it could be by applying the
> attached patch (yes, I had to fix the configure.ac script).
> 
> But actually, text tools should rather use the current locale's charset
> (from nl_langinfo(CODESET)), instead of hardcoding it in configuration
> files...

We are talking about two different things.  unicode=yes/no sets the
charset used in the database.  charset_default sets the charset assumed
for messages without proper headers.  I have seen no instances of mail
in the wild where the charset was unspecified yet was actually proper
UTF-8.


2006-12-26 Thread Samuel Thibault
Clint Adams wrote on Tue 26 Dec 2006 19:26:48 -0500:
> > > >   charset_default=utf-8
> > > >   unicode=yes
> > > 
> > > Isn't unicode=yes already the default?
> > 
> > Nope, iso-8859-1 is (see configure.ac).  But it could be by applying the
> > attached patch (yes, I had to fix the configure.ac script).
> > 
> > But actually, text tools should rather use the current locale's charset
> > (from nl_langinfo(CODESET)), instead of hardcoding it in configuration
> > files...
> 
> We are talking about two different things.  unicode=yes/no sets the
> charset used in the database.

Ah, sorry.  Yes, unicode is the default.

> charset_default sets the charset assumed for messages without proper
> headers.  I have seen no instances of mail in the wild where the
> charset was unspecified yet was actually proper UTF-8.

Ah, ok, sorry, then the bug is probably not valid, I guess.

Samuel



2006-12-27 Thread Teemu Likonen
Clint Adams wrote (26.12.2006 at 19.26):

> charset_default sets the charset assumed for messages without proper
> headers.  I have seen no instances of mail in the wild where the
> charset was unspecified yet was actually proper UTF-8.

I didn't know that bogofilter is able to check message headers for
correct encoding. I use KMail (KDE's email client) and it converts
messages to locale charset before sending them to bogofilter. How do
other programs behave? What is the correct behaviour (if there is one)?

If the problem is just that KMail's conversion (practically) corrupts
the bogofilter database whenever the locale's charset differs from
bogofilter's charset_default, then, yes, this is not a bogofilter bug.


2006-12-27 Thread Teemu Likonen
Clint Adams wrote (24.12.2006 at 9.23):

> > Having lines
> >   charset_default=utf-8
> >   unicode=yes
> 
> Isn't unicode=yes already the default?

Yes, it is the default. I think it's a good idea to define "unicode=yes"
explicitly because defaults may change (in this case, I don't believe it
will, though).


2006-12-27 Thread Samuel Thibault
Teemu Likonen wrote on Wed 27 Dec 2006 10:23:39 +0200:
> I didn't know that bogofilter is able to check message headers for
> correct encoding. I use KMail (KDE's email client) and it converts
> messages to locale charset before sending them to bogofilter. How do
> other programs behave? What is the correct behaviour (if there is one)?

I'd say the correct behavior is to just keep the message intact.

Samuel



2006-12-27 Thread Teemu Likonen
Samuel Thibault wrote (27.12.2006 at 10.49):

> Teemu Likonen wrote on Wed 27 Dec 2006 10:23:39 +0200:
> > I didn't know that bogofilter is able to check message headers for
> > correct encoding. I use KMail (KDE's email client) and it converts
> > messages to locale charset before sending them to bogofilter. How do
> > other programs behave? What is the correct behaviour (if there is
> > one)?
> 
> I'd say the correct behavior is to just keep the message intact.

I checked how bogofilter works with messages with different encodings
and Content-Type headers. Bogofilter works as it should: it checks the
message's Content-Type header and gets the charset from there. With
"unicode=yes" (which is the default) bogofilter converts the message to
UTF-8 and stores the words in its database.

If the charset is not defined in the message's Content-Type header,
bogofilter uses its own charset_default setting (default is ISO-8859-1).
I think
ISO-8859-1 is a good default: I believe most of the messages without
Content-Type headers are in some kind of Western European charset.
Probably most of the spam is English.
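
The behaviour described above can be sketched in a few lines of Python.
This is only an illustration built on Python's standard email module, not
bogofilter's actual code; the function name body_as_utf8 and the constant
CHARSET_DEFAULT are invented for the example, and only single-part
messages are handled.

```python
from email import message_from_bytes

# Plays the role of bogofilter's charset_default setting (assumed name).
CHARSET_DEFAULT = "iso-8859-1"


def body_as_utf8(raw, charset_default=CHARSET_DEFAULT):
    """Return the message body re-encoded as UTF-8 (hypothetical helper)."""
    msg = message_from_bytes(raw)
    # Charset from the Content-Type header, or the configured default.
    charset = msg.get_content_charset() or charset_default
    payload = msg.get_payload(decode=True)
    if payload is None:
        # Multipart containers have no payload of their own; not handled here.
        return b""
    try:
        text = payload.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset name in the header: fall back to the default.
        text = payload.decode(charset_default, errors="replace")
    return text.encode("utf-8")


if __name__ == "__main__":
    declared = b"Content-Type: text/plain; charset=utf-8\n\n" + "äiti\n".encode("utf-8")
    undeclared = b"Subject: test\n\n" + "äiti\n".encode("utf-8")
    print(body_as_utf8(declared))    # header says UTF-8, bytes pass through
    print(body_as_utf8(undeclared))  # no header, bytes misread as ISO-8859-1
```

The second call reproduces the KMail situation: the body is already UTF-8
but carries no charset declaration, so it is interpreted with the default
charset and ends up encoded twice.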

So, my bug report was pretty pointless from bogofilter's point of view.
:) I guess this bug can be closed. At least I downgraded the severity to
"normal".

There remains this KMail problem, though. Maybe it's worth filing a new
report.



2006-12-27 Thread Samuel Thibault
Teemu Likonen wrote on Thu 28 Dec 2006 00:16:20 +0200:
> If charset is not defined in message's Content-Type headers, bogofilter
> uses its own charset_default setting (default is ISO-8859-1). I think
> ISO-8859-1 is a good default: I believe most of the messages without
> Content-Type headers are in some kind of Western European charset.
> Probably most of the spam is English.

Maybe cp1252 would be even more useful, since it is a superset of
iso-8859-1 and it is used by a lot of mailers running on another
well-known OS.
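
A caveat on that claim: the two charsets agree on every printable
ISO-8859-1 character, but cp1252 reuses the 0x80-0x9F range (C1 control
codes in ISO-8859-1) for printable characters such as curly quotes. The
following Python sketch lists the byte values that decode differently. It
relies on Python's own codec tables rather than on anything in bogofilter,
and the helper name differing_bytes is made up for the example.

```python
def differing_bytes(a="cp1252", b="iso-8859-1"):
    """List byte values that the two charsets decode differently."""
    diff = []
    for value in range(256):
        raw = bytes([value])
        try:
            same = raw.decode(a) == raw.decode(b)
        except UnicodeDecodeError:
            # A byte left undefined by one charset counts as a difference.
            same = False
        if not same:
            diff.append(value)
    return diff


if __name__ == "__main__":
    diff = differing_bytes()
    print(len(diff), hex(diff[0]), hex(diff[-1]))
```

Every byte outside that range, including all accented letters at 0xA0
and above, decodes to the same character under both charsets.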

Samuel



2006-12-27 Thread Teemu Likonen
Samuel Thibault wrote (27.12.2006 at 23.25):

> Maybe cp1252 would be even more useful, since it is a superset of
> iso-8859-1 and it is used by a lot of mailers running on another
> well-known OS.

Indeed. Then it would be "charset_default=Windows-1252" or
"charset_default=cp1252".


2006-12-13 Thread Teemu Likonen
Package: bogofilter
Version: 1.1.3-1
Severity: serious

I report this as "serious" because it _should_ be fixed before Etch is
released. This bug causes bogofilter to work incorrectly on UTF-8
systems (UTF-8 is Etch's default).

Debian Etch uses UTF-8 locales and charset by default, whereas
Bogofilter uses ISO-8859-1 as its default charset. This usually puts
garbage words into the user's ~/.bogofilter/wordlist.db, since the
default charset for the _database_ is Unicode/UTF-8.

To reproduce:


Use UTF-8 locale:

$ locale
LANG=fi_FI.UTF-8
LC_CTYPE="fi_FI.UTF-8"
LC_NUMERIC="fi_FI.UTF-8"
LC_TIME="fi_FI.UTF-8"
LC_COLLATE="fi_FI.UTF-8"
LC_MONETARY="fi_FI.UTF-8"
LC_MESSAGES=en_US.UTF-8
LC_PAPER="fi_FI.UTF-8"
LC_NAME="fi_FI.UTF-8"
LC_ADDRESS="fi_FI.UTF-8"
LC_TELEPHONE="fi_FI.UTF-8"
LC_MEASUREMENT="fi_FI.UTF-8"
LC_IDENTIFICATION="fi_FI.UTF-8"
LC_ALL=


Use Bogofilter's default system charset (or set it to some
other 8-bit charset):
  charset_default=iso-8859-1
Use Bogofilter's default word database charset (Unicode/UTF-8):
  unicode=yes

(These can be defined in /etc/bogofilter.cf or ~/.bogofilter.cf)


Some background information: letter "ä" is U+00E4 LATIN SMALL LETTER
A WITH DIAERESIS and in UTF-8 encoding it takes two bytes: $c3 $a4.


$ mv ~/.bogofilter/wordlist.db ~/wordlist.db-backup
$ echo "äiti" | bogofilter -n
$ bogoutil -d ~/.bogofilter/wordlist.db
head:Ã¤iti 0 1 20061213

This example shows that the letter "ä" is encoded _twice_ with UTF-8.
The command "echo" prints letter "ä" encoded with UTF-8, Bogofilter
thinks it is in ISO-8859-1 and encodes both bytes separately: $c3
becomes "Ã" (U+00C3 LATIN CAPITAL LETTER A WITH TILDE) and $a4 becomes
"¤" (U+00A4 CURRENCY SIGN).
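
The same double encoding can be reproduced outside bogofilter. The short
Python sketch below only mimics the byte-level steps described above
(UTF-8 input, misread as ISO-8859-1, stored as UTF-8); the variable names
are chosen for the example and the trailing newline that echo adds is
left out.

```python
word = "äiti"

# What echo emits in a UTF-8 locale: "ä" becomes the two bytes c3 a4.
utf8_bytes = word.encode("utf-8")

# The bytes are misread as ISO-8859-1, one character per byte.
misread = utf8_bytes.decode("iso-8859-1")

# The misread text is then stored in the UTF-8 database.
stored = misread.encode("utf-8")

# Undoing the two steps in reverse order recovers the original word.
recovered = stored.decode("utf-8").encode("iso-8859-1").decode("utf-8")

if __name__ == "__main__":
    print(utf8_bytes.hex(" "))  # c3 a4 69 74 69
    print(misread)              # Ã¤iti
    print(stored.hex(" "))      # c3 83 c2 a4 69 74 69
    print(recovered)            # äiti
```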

With the lines
  charset_default=utf-8
  unicode=yes
in /etc/bogofilter.cf, characters are encoded correctly.


-- System Information:
Debian Release: 4.0
  APT prefers testing
  APT policy: (900, 'testing')
Architecture: i386 (i686)
Shell:  /bin/sh linked to /bin/dash
Kernel: Linux 2.6.18-3-k7
Locale: LANG=fi_FI.UTF-8, LC_CTYPE=fi_FI.UTF-8 (charmap=UTF-8)

Versions of packages bogofilter depends on:
ii  bogofilter-bdb    1.1.3-1    a fast Bayesian spam filter (Berke

bogofilter recommends no packages.

-- no debconf information