drh,

Thanks for the response: it's nice to know that the report was actually seen.

It would be hubris indeed to claim to fix an implementation bug in Porter's 
code. The code in sqlite didn't match any of Porter's code I could find, so I 
assumed it came from elsewhere: but maybe I missed something. In any event, the 
authorship wasn't clear to me from the sources. The real point that I had 
missed was that, as Shane Harrelson points out, step 1c changed between the 
original Porter stemmer and the Porter2 stemmer; the step I quote below, and 
which I "fixed", is from the Porter2 algorithm, which in this case improves 
on Porter. So in essence, I guess my patch moves Porter a bit closer to 
Porter2.
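
To make the difference in step 1c concrete, here is a minimal sketch in C of 
the two rules as I read them. It works on plain forward strings rather than on 
the reversed buffer the sqlite stemmer uses, the function names are made up 
for illustration, and the Porter2 version ignores that algorithm's earlier 
pass that marks a consonant 'y' as 'Y'.

```c
#include <string.h>

/* Plain a/e/i/o/u test. */
static int is_aeiou(char c){
  return c=='a' || c=='e' || c=='i' || c=='o' || c=='u';
}

/* Porter's definition: 'y' counts as a vowel only when it follows a
** consonant; at the start of a word or after a vowel it is a consonant. */
static int is_vowel_at(const char *w, int i){
  if( is_aeiou(w[i]) ) return 1;
  if( w[i]=='y' && i>0 ) return !is_vowel_at(w, i-1);
  return 0;
}

/* Original Porter step 1c: (*v*) Y -> I
** Replace a final 'y' with 'i' if the stem before it contains a vowel. */
static void step1c_porter(char *w){
  int n = (int)strlen(w);
  int i;
  if( n<2 || w[n-1]!='y' ) return;
  for(i=0; i<n-1; i++){
    if( is_vowel_at(w, i) ){ w[n-1] = 'i'; return; }
  }
}

/* Porter2 step 1c: replace a final 'y' with 'i' if it is preceded by a
** non-vowel that is not the first letter of the word.
** Assumption: any letter other than a/e/i/o/u/y counts as a non-vowel. */
static void step1c_porter2(char *w){
  int n = (int)strlen(w);
  if( n<3 || w[n-1]!='y' ) return;
  if( !is_aeiou(w[n-2]) && w[n-2]!='y' ) w[n-1] = 'i';
}
```

With these, "cry" stays "cry" under the Porter rule but becomes "cri" under 
the Porter2 rule, while "say" becomes "sai" under Porter and stays "say" under 
Porter2. Both leave "by" alone and both turn "happy" into "happi".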

I understand the complication that changes to the stemmer would cause an 
incompatibility. It might be interesting to implement the porter2 algorithm for 
fts4; I'm not sure how the two compare in terms of performance. 

Thanks again,

James


On Feb 24, 2010, at 7:05 AM, D. Richard Hipp wrote:

> We got the Porter stemmer code directly from Martin Porter.
> 
> I'm sorry it does not work like you want it to.  Unfortunately, we  
> cannot change it now without introducing a serious incompatibility  
> with the millions and millions of applications already in the field  
> that are using the existing implementation.
> 
> FTS3 has a pluggable stemmer module.  You can write your own stemmer  
> that works "correctly" if you like, and link it in for use in your  
> applications.  We will also investigate making your recommended  
> changes for FTS4.  However, in order to maintain backwards  
> compatibility of FTS3, we cannot change the stemmer algorithm, even to  
> fix a "bug".
> 
> On Feb 24, 2010, at 9:59 AM, James Berry wrote:
> 
>> Can somebody please clarify the bug reporting process for sqlite? My  
>> understanding is that it's not possible to file bug reports  
>> directly, and that the advice is to write to the user list first.  
>> I've done that (below) but have had no response so far, and am  
>> concerned that this means the bug report will just be forgotten by  
>> others, as well as by me.
>> 
>> How does this bug move from a message on a list to a ticket (and  
>> ultimately a patch, we hope) in the system?
>> 
>> James
>> 
>> On Feb 22, 2010, at 2:51 PM, James Berry wrote:
>> 
>>> I'm writing to report a bug in the porter-stemmer algorithm  
>>> supplied as part of the FTS3 implementation.
>>> 
>>> The stemmer has an inverted logic error that prevents it from  
>>> properly stemming words of the following form:
>>> 
>>>     dry -> dri
>>>     cry -> cri
>>> 
>>> This means, for instance, that the following words don't stem the  
>>> same:
>>> 
>>>     dried -> dri   -doesn't match-   dry
>>>     cried -> cri   -doesn't match-   cry
>>> 
>>> The bug seems to have been introduced as a simple logic error by  
>>> whoever wrote the stemmer code. The original description of step 1c  
>>> is here: http://snowball.tartarus.org/algorithms/english/stemmer.html
>>> 
>>>     Step 1c:
>>>         replace suffix y or Y by i if preceded by a non-vowel which
>>>         is not the first letter of the word (so cry -> cri,
>>>         by -> by, say -> say)
>>>     
>>> But the code in sqlite reads like this:
>>> 
>>> /* Step 1c */
>>> if( z[0]=='y' && hasVowel(z+1) ){
>>>  z[0] = 'i';
>>> }
>>> 
>>> In other words, sqlite turns the y into an i only if it is preceded  
>>> by a vowel (say -> sai), while the algorithm intends this to be  
>>> done if it is _not_ preceded by a vowel.
>>> 
>>> But there are two other problems in that same line of code:
>>> 
>>>     (1) hasVowel checks whether a vowel exists anywhere in the
>>>         string, not just in the next character, which is incorrect,
>>>         and goes against the step 1c directions above. (amplify
>>>         would not be properly stemmed to amplifi, for instance)
>>> 
>>>     (2) The check for the first letter is not performed (for words
>>>         like "by", etc)
>>> 
>>> I've fixed both of those errors in the patch below:
>>> 
>>> /* Step 1c */
>>> -  if( z[0]=='y' && hasVowel(z+1) ){
>>> + if( z[0]=='y' && isConsonant(z+1) && z[2] ){
>>>   z[0] = 'i';
>>> }
>>> 
>>> _______________________________________________
>>> sqlite-users mailing list
>>> sqlite-users@sqlite.org
>>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> 
> D. Richard Hipp
> d...@hwaci.com
