Hi all,

In my spare time I've started on a general-purpose Bayesian filter based on the now built-in tsearch2 functionality. Its ability to stem the words of a message into lexemes and to remove stop words, together with GiST indexes, looks promising enough to attempt this. However, my experience with tsearch is somewhat limited, so I have a few questions...

The messages entering the filter will be in different languages and encodings. For example, I get a lot of Cyrillic spam these days, while I also get a lot of English messages and a few in Dutch. Spam in particular is likely to lie about its encoding. Some messages will be plain text, but many will be HTML.
- Is it possible to stem words from that wide a variety of content?
- If so, what approach would be best?
- Do I need to strip out the HTML tags or can they serve as lexemes themselves?

Next, to determine the probability that a lexeme belongs to a certain classification (for example spam or not spam), I need to be able to count the number of occurrences of that lexeme in a text. I can't store a probability, as the numbers aren't fixed[*] (I was hoping to abuse score() here, but that's probably a no-op). I haven't found any tsearch functions that return the number of occurrences of each lexeme in a text. Ideally I'd have a result set of (lexeme, number of occurrences) tuples, so that I can use it directly in a query.
- How do I determine the number of occurrences of each lexeme in a text?
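
To show the shape of the result I'm after, here is a rough sketch in Python rather than SQL. It is only an illustration: lowercasing and whitespace splitting stand in for real stemming, the stop word list is made up, and lexeme_counts is a hypothetical helper, not a tsearch function.

```python
from collections import Counter

# Made-up stop word list, for illustration only.
STOP_WORDS = frozenset({"the", "a", "of"})

def lexeme_counts(text, stop_words=STOP_WORDS):
    """Return sorted (lexeme, occurrences) pairs for one text.

    Lowercasing and punctuation stripping are a crude stand-in
    for the stemming that tsearch would do.
    """
    tokens = (t.strip(".,!?;:()\"'").lower() for t in text.split())
    counts = Counter(t for t in tokens if t and t not in stop_words)
    return sorted(counts.items())

for lexeme, occurrences in lexeme_counts("Buy the pills, buy pills!"):
    print(lexeme, occurrences)
```

What I'd like is the equivalent of those tuples coming straight out of a query.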

Thanks for your time.

[*] As more messages enter the system, there will be more occurrences of lexemes in messages and in classifications. If I start out with one lexeme occurring once in a single message, the chance that that lexeme is in a message is 1. As soon as another message arrives that does not contain that lexeme, the chance is 0.5. The number of messages and the occurrences of lexemes in messages and in classifications keep changing, so I will need to store the numbers the probability was based on (I might still decide to add a column with the probability calculated from those numbers for speed, of course).
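
A minimal sketch of that bookkeeping, again in Python purely for illustration. The class and method names are made up, and it only tracks "lexeme occurs in a message", not the per-classification counts:

```python
class LexemeStats:
    """Keep the raw counts and derive the probability on demand."""

    def __init__(self):
        self.messages = 0      # total number of messages seen
        self.containing = {}   # lexeme -> number of messages containing it

    def add_message(self, lexemes):
        self.messages += 1
        # Count each lexeme once per message, however often it occurs.
        for lexeme in set(lexemes):
            self.containing[lexeme] = self.containing.get(lexeme, 0) + 1

    def probability(self, lexeme):
        """Fraction of messages seen so far that contain the lexeme."""
        if self.messages == 0:
            return 0.0
        return self.containing.get(lexeme, 0) / self.messages

stats = LexemeStats()
stats.add_message(["pills"])
first = stats.probability("pills")    # one message, which contains the lexeme
stats.add_message(["hello"])
second = stats.probability("pills")   # a second message without it
print(first, second)
```

Because only the counts are stored, the probability shifts by itself as new messages arrive.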


Alban Hertroys

If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.

