Sven R. Kunze wrote:
> the following stemming results made me curious:
>
> select to_tsvector('german', 'systeme');
> 'system':1
> select to_tsvector('german', 'systemes');
> 'system':1
> select to_tsvector('german', 'systems');
> 'system':1
> select to_tsvector('german', 'systemen');
> 'system':1
> select to_tsvector('german', 'system');
> 'syst':1
>
> First of all, this seems to be a bug in the German stemmer. Where can I
> fix it?

As far as I understand, the stemmer is not perfect; it is just a
"best effort" at German stemming. It does not have a dictionary of
valid German words, but uses an algorithm based only on the letters
that occur in the word.

This web page describes the algorithm:
http://snowball.tartarus.org/algorithms/german/stemmer.html

I guess that the Snowball folks (and PostgreSQL) would be interested
if you could come up with a better algorithm.

In this specific case, the stemmer goes wrong because "System" is a
foreign word whose ending is atypical for German. The algorithm cannot
distinguish between "System" and, say, "lautem" or "bestem", where
"-em" really is an inflectional ending.

> Second, and more importantly, as I understand it, the stemmed version of
> a word should be considered normalized. That is, all other versions of
> that stem should be mapped to it as well. The interesting problem here
> is that PostgreSQL maps the stem itself ('system') to a completely
> different stem ('syst').
>
> Should a stem not remain stable even when to_tsvector is called on it
> multiple times?

That's a possible position, but consider that a stem is not necessarily
a valid German word. If you treat it as a German word (by stemming it),
the results might not be what you desire.

For example:

test=> select to_tsvector('german', 'linsen');
 to_tsvector
-------------
 'lins':1
(1 row)

test=> select to_tsvector('german', 'lins');
 to_tsvector
-------------
 'lin':1
(1 row)

I guess that your real problem here is that a search for "system"
will not find "systeme", which is indeed unfortunate.
But until somebody comes up with a better stemming algorithm,
cases like that can always occur.

Yours,
Laurenz Albe
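
For illustration only, here is a rough Python sketch of how purely
letter-based suffix stripping can produce the results quoted in this
thread. It is not the Snowball or PostgreSQL code. It loosely follows
only the first suffix-removal step as described on the Snowball page
linked above, and it leaves out the later steps, the "niss" rule and
the special handling of "ß", "u" and "y". The function names r1_start
and strip_step1 are made up for this example.

```python
# Toy sketch of the first suffix-removal step of a German stemmer.
# Assumption: simplified from the description on the Snowball page;
# this is NOT the code PostgreSQL actually runs.

VOWELS = set("aeiouyäöü")
# Letters after which a trailing "s" is treated as an ending.
S_ENDINGS = set("bdfghklmnrt")


def r1_start(word):
    """Index where the region R1 begins.

    R1 starts after the first non-vowel that follows a vowel, but is
    moved right so that at least three letters precede it.
    """
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return max(i + 1, 3)
    return len(word)  # no such position: R1 is empty


def strip_step1(word):
    """Remove one inflectional-looking suffix if it lies inside R1."""
    start = r1_start(word)
    # Try longer suffixes before shorter ones.
    for suffix in ("ern", "em", "er", "en", "es", "e"):
        if word.endswith(suffix):
            if len(word) - len(suffix) >= start:
                return word[: -len(suffix)]
            return word  # longest match is outside R1: leave the word alone
    # A final "s" is removed only after one of the listed consonants.
    if (
        len(word) >= 2
        and word.endswith("s")
        and len(word) - 1 >= start
        and word[-2] in S_ENDINGS
    ):
        return word[:-1]
    return word


if __name__ == "__main__":
    for w in ("systeme", "systemes", "systems", "systemen", "system",
              "linsen", "lins", "lautem", "bestem"):
        print(w, "->", strip_step1(w))
```

The sketch has no notion of whether its input is a real word or an
already-stemmed form, so applying it to "system" or "lins" strips
another suffix, just as "-em" is stripped from "lautem" and "bestem".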