[GENERAL] Using tsearch2 in a Bayesian filter

Alban Hertroys Sun, 06 Apr 2008 04:14:11 -0700

Hi all,

In my spare time I've started on a general purpose Bayesian filter based on the now built-in tsearch2 functionality. The ability to stem words from a message into lexemes, the removal of stop words, and GiST indexes look promising enough to attempt this. However, my experience with tsearch is somewhat limited, so I have a few questions...

The messages entering the filter will be in different languages and encodings. For example, I get a lot of Cyrillic spam these days, while I get a lot of English messages and a few in Dutch. The spam especially is likely to lie about its encoding. Some messages will be plain text, but many will be HTML.

- Is it possible to stem words from that wide a variety of content?
- If so, what approach would be best?

- Do I need to strip out the HTML tags or can they serve as lexemes themselves?

Next, to determine the probability of a lexeme being of a certain classification (for example spam or not spam), I need to be able to count the number of occurrences of that lexeme in a text. I can't store a probability, as the numbers aren't fixed[*] (was hoping to abuse score() here, but that's probably a no-op). I haven't found any tsearch functions to determine the number of occurrences of each lexeme in a text. Ideally I'd have a result set with (lexeme, number of occurrences) tuples, so that I can use that directly in a query.

- How do I determine the number of occurrences of each lexeme in a text?
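To make the desired result concrete, here is a minimal sketch in Python of the (lexeme, number of occurrences) tuples I have in mind. It is not tsearch: the function name `lexeme_counts`, the tiny `STOP_WORDS` set and the regex tokenizer are all assumptions made up for illustration, and no stemming is done at all.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real setup would take this from a dictionary.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def lexeme_counts(text):
    """Return (lexeme, occurrences) tuples for one text, most frequent first.

    Assumption: lowercased runs of letters stand in for lexemes here.
    Real stemming ("pills" -> "pill") is deliberately left out.
    """
    # [^\W\d_]+ matches runs of letters only, including non-ASCII letters.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common()


if __name__ == "__main__":
    for lexeme, occurrences in lexeme_counts("Buy cheap pills, buy the cheap pills now"):
        print(lexeme, occurrences)
```

Something with this shape, but produced by tsearch inside the database, is what I would like to join against in a query.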

Thanks for your time.

[*] As more messages enter the system, there will be more occurrences of lexemes in messages and in classifications. If I start out with one lexeme occurring once in a single message, the chance that lexeme is in a message is 1. As soon as another message arrives not containing that lexeme, the chance is 0.5. The number of messages and the occurrences of lexemes in messages and classifications are continuously moving numbers, so I will need the numbers the probability was based on (might still decide to add a column with the probability calculated from those numbers for speed, of course).
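The arithmetic in this footnote can be sketched as follows, again in Python and only as an illustration. The class name `LexemeStats` and its methods are hypothetical; the point is that only the raw counts are stored and the probability is derived from them on demand. A real filter would presumably keep one such set of counts per classification.

```python
from collections import Counter


class LexemeStats:
    """Keeps raw counts so the probability can be recomputed at any time."""

    def __init__(self):
        self.total_messages = 0          # messages seen so far
        self.messages_with = Counter()   # lexeme -> messages containing it

    def add_message(self, lexemes):
        """Record one message; each lexeme counts once per message."""
        self.total_messages += 1
        self.messages_with.update(set(lexemes))

    def probability(self, lexeme):
        """Chance that a message contains the lexeme, from current counts."""
        if self.total_messages == 0:
            return 0.0
        return self.messages_with[lexeme] / self.total_messages


if __name__ == "__main__":
    stats = LexemeStats()
    stats.add_message(["cheap"])
    print(stats.probability("cheap"))   # one message, it contains the lexeme
    stats.add_message(["hello"])
    print(stats.probability("cheap"))   # a second message without the lexeme
```

With one message containing the lexeme the result is 1.0, and after a second message without it the result drops to 0.5, matching the example above.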


Regards,

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.

