Additionally, your reference for step 1c comes from the "Snowball English (Porter2)" algorithm. The implementation used in SQLite is of the original "Porter" algorithm, described here: http://tartarus.org/~martin/PorterStemmer/
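To make the difference concrete, here is a minimal sketch of the two step 1c rules side by side. It is not the SQLite source: the function names are hypothetical, it works on ordinary forward-order lower-case strings (the FTS3 code operates on a reversed copy of the word), and the Porter2 version ignores the upper-case 'Y' marker that the full Snowball algorithm uses. The original Porter rule is "(*v*) Y -> I" (change a final y to i if the stem before it contains a vowel); the Porter2 rule is the one quoted further down in this thread.

```c
#include <stddef.h>
#include <string.h>

/* Original Porter vowel test: a, e, i, o, u, or a 'y' preceded by a consonant. */
int is_vowel_porter(const char *w, size_t i) {
    switch (w[i]) {
    case 'a': case 'e': case 'i': case 'o': case 'u':
        return 1;
    case 'y':
        return i > 0 && !is_vowel_porter(w, i - 1);
    default:
        return 0;
    }
}

/* Original Porter step 1c, "(*v*) Y -> I": replace a final 'y' with 'i'
** if the stem before it contains a vowel.  Modifies w in place. */
void step1c_porter(char *w) {
    size_t n = strlen(w);
    size_t i;
    if (n < 2 || w[n - 1] != 'y') return;
    for (i = 0; i + 1 < n; i++) {
        if (is_vowel_porter(w, i)) {
            w[n - 1] = 'i';
            return;
        }
    }
}

/* Porter2 (Snowball English) step 1c, simplified to lower case: replace a
** final 'y' with 'i' if it is preceded by a non-vowel that is not the first
** letter of the word.  Modifies w in place. */
void step1c_porter2(char *w) {
    size_t n = strlen(w);
    /* n >= 3 guarantees the preceding letter is not the first letter. */
    if (n < 3 || w[n - 1] != 'y') return;
    if (strchr("aeiouy", w[n - 2]) == NULL) {
        w[n - 1] = 'i';
    }
}
```

With these definitions, "cry" is left alone by step1c_porter (the stem "cr" has no vowel) but becomes "cri" under step1c_porter2, while "say" becomes "sai" under step1c_porter and is left alone by step1c_porter2. Both turn "amplify" into "amplifi".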
HTH.
-Shane

On Wed, Feb 24, 2010 at 10:05 AM, D. Richard Hipp <d...@hwaci.com> wrote:
> We got the Porter stemmer code directly from Martin Porter.
>
> I'm sorry it does not work like you want it to.  Unfortunately, we
> cannot change it now without introducing a serious incompatibility
> with the millions and millions of applications already in the field
> that are using the existing implementation.
>
> FTS3 has a pluggable stemmer module.  You can write your own stemmer
> that works "correctly" if you like, and link it in for use in your
> applications.  We will also investigate making your recommended
> changes for FTS4.  However, in order to maintain backwards
> compatibility of FTS3, we cannot change the stemmer algorithm, even to
> fix a "bug".
>
> On Feb 24, 2010, at 9:59 AM, James Berry wrote:
>
> > Can somebody please clarify the bug reporting process for sqlite? My
> > understanding is that it's not possible to file bug reports
> > directly, and that the advice is to write to the user list first.
> > I've done that (below) but have no response so far and am concerned
> > that this means the bug report will just be forgotten by others, as
> > well as by me.
> >
> > How does this bug move from a message on a list to a ticket (and
> > ultimately a patch, we hope) in the system?
> >
> > James
> >
> > On Feb 22, 2010, at 2:51 PM, James Berry wrote:
> >
> >> I'm writing to report a bug in the porter-stemmer algorithm
> >> supplied as part of the FTS3 implementation.
> >>
> >> The stemmer has an inverted logic error that prevents it from
> >> properly stemming words of the following form:
> >>
> >>     dry -> dri
> >>     cry -> cri
> >>
> >> This means, for instance, that the following words don't stem the
> >> same:
> >>
> >>     dried -> dri   -doesn't match- dry
> >>     cried -> cri   -doesn't match- cry
> >>
> >> The bug seems to have been introduced as a simple logic error by
> >> whoever wrote the stemmer code. The original description of step 1c
> >> is here: http://snowball.tartarus.org/algorithms/english/stemmer.html
> >>
> >>     Step 1c:
> >>         replace suffix y or Y by i if preceded by a non-vowel which is
> >>         not the first letter of the word (so cry -> cri, by -> by,
> >>         say -> say)
> >>
> >> But the code in sqlite reads like this:
> >>
> >>     /* Step 1c */
> >>     if( z[0]=='y' && hasVowel(z+1) ){
> >>       z[0] = 'i';
> >>     }
> >>
> >> In other words, sqlite turns the y into an i only if it is preceded
> >> by a vowel (say -> sai), while the algorithm intends this to be
> >> done if it is _not_ preceded by a vowel.
> >>
> >> But there are two other problems in that same line of code:
> >>
> >> (1) hasVowel checks whether a vowel exists anywhere in the string,
> >>     not just in the next character, which is incorrect, and goes
> >>     against the step 1c directions above. (amplify would not be
> >>     properly stemmed to amplifi, for instance)
> >>
> >> (2) The check for the first letter is not performed (for words
> >>     like "by", etc)
> >>
> >> I've fixed both of those errors in the patch below:
> >>
> >>     /* Step 1c */
> >> -   if( z[0]=='y' && hasVowel(z+1) ){
> >> +   if( z[0]=='y' && isConsonant(z+1) && z[2] ){
> >>       z[0] = 'i';
> >>     }
> >>
> >> _______________________________________________
> >> sqlite-users mailing list
> >> sqlite-users@sqlite.org
> >> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>
> D. Richard Hipp
> d...@hwaci.com