Re: [sqlite] Bug in porter stemmer

2010-02-24 Thread James Berry
Can somebody please clarify the bug reporting process for sqlite? My 
understanding is that it's not possible to file bug reports directly, and that 
the advice is to write to the user list first. I've done that (below) but have 
had no response so far, and I am concerned that this means the bug report will 
just be forgotten by others, as well as by me.

How does this bug move from a message on a list to a ticket (and ultimately a 
patch, we hope) in the system?

James

On Feb 22, 2010, at 2:51 PM, James Berry wrote:

 I'm writing to report a bug in the porter-stemmer algorithm supplied as part 
 of the FTS3 implementation.
 
 The stemmer has an inverted logic error that prevents it from properly 
 stemming words of the following form:
 
   dry -> dri
   cry -> cri
 
 This means, for instance, that the following words don't stem the same:
 
   dried -> dri   -doesn't match-   dry
   cried -> cri   -doesn't match-   cry
 
 The bug seems to have been introduced as a simple logic error by whoever 
 wrote the stemmer code. The original description of step 1c is here: 
 http://snowball.tartarus.org/algorithms/english/stemmer.html
 
   Step 1c:
     replace suffix y or Y by i if preceded by a non-vowel which is
     not the first letter of the word (so cry -> cri, by -> by, say -> say)
   
 But the code in sqlite reads like this:
 
   /* Step 1c */
   if( z[0]=='y' && hasVowel(z+1) ){
     z[0] = 'i';
   }
 
 In other words, sqlite turns the y into an i only if it is preceded by a 
 vowel (say -> sai), while the algorithm intends this to be done if it is 
 _not_ preceded by a vowel.
 
 But there are two other problems in that same line of code:
 
   (1) hasVowel checks whether a vowel exists anywhere in the string, not 
 just in the next character, which is incorrect, and goes against the step 1c 
 directions above. (amplify would not be properly stemmed to amplifi, for 
 instance)
 
   (2) The check for the first letter is not performed (for words like 
 by, etc)
 
 I've fixed both of those errors in the patch below:
 
    /* Step 1c */
 -  if( z[0]=='y' && hasVowel(z+1) ){
 +  if( z[0]=='y' && isConsonant(z+1) && z[2] ){
      z[0] = 'i';
    }
 
 ___
 sqlite-users mailing list
 sqlite-users@sqlite.org
 http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users



Re: [sqlite] Bug in porter stemmer

2010-02-24 Thread D. Richard Hipp
We got the Porter stemmer code directly from Martin Porter.

I'm sorry it does not work like you want it to.  Unfortunately, we  
cannot change it now without introducing a serious incompatibility  
with the millions and millions of applications already in the field  
that are using the existing implementation.

FTS3 has a pluggable stemmer module.  You can write your own stemmer  
that works correctly if you like, and link it in for use in your  
applications.  We will also investigate making your recommended  
changes for FTS4.  However, in order to maintain backwards  
compatibility of FTS3, we cannot change the stemmer algorithm, even to  
fix a bug.


D. Richard Hipp
d...@hwaci.com





Re: [sqlite] Bug in porter stemmer

2010-02-24 Thread Shane Harrelson
Additionally, your algorithm reference for step 1c is from the Snowball
English (Porter2) algorithm.
The implementation used in SQLite is for the original Porter algorithm
discussed here:
http://tartarus.org/~martin/PorterStemmer/
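
For comparison, the original Porter paper states step 1c as "(*v*) Y -> I": a final y becomes i whenever the stem before it contains a vowel (happy -> happi, sky -> sky). The following is a minimal sketch of that rule on an ordinary forward string, assuming lowercase ASCII input. The names `porter_is_consonant` and `porter1_step1c` are invented for this illustration and are not taken from the SQLite source.

```c
#include <assert.h>
#include <string.h>

/* Porter's definition: a consonant is any letter other than a, e, i, o, u,
** and other than a y that is preceded by a consonant. */
static int porter_is_consonant(const char *word, size_t i){
  switch( word[i] ){
    case 'a': case 'e': case 'i': case 'o': case 'u':
      return 0;
    case 'y':
      /* y is a consonant at the start of the word or after a vowel */
      return i==0 ? 1 : !porter_is_consonant(word, i-1);
    default:
      return 1;
  }
}

/* Original Porter step 1c: (*v*) Y -> I.
** Rewrites a final y to i if any letter before it is a vowel. */
static void porter1_step1c(char *word){
  size_t n;
  assert( word!=NULL );
  n = strlen(word);
  if( n<2 || word[n-1]!='y' ) return;
  for(size_t i=0; i+1<n; i++){
    if( !porter_is_consonant(word, i) ){
      word[n-1] = 'i';
      return;
    }
  }
}
```

Under this rule "cry" stays "cry", because the stem "cr" has no vowel, which is the behavior described in the report.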

HTH.
-Shane





Re: [sqlite] Bug in porter stemmer

2010-02-24 Thread James Berry
drh,

Thanks for the response: it's nice to know that the report was actually seen.

It would be hubris indeed to claim to fix an implementation bug in Porter's 
code. The code in sqlite didn't match any of Porter's code I could find, so I 
assumed it came from elsewhere, but maybe I missed something. In any event, the 
authorship wasn't clear to me from the sources. The real point that I had 
missed was that, as Shane Harrelson points out, step 1c changed between the 
original Porter stemmer and the Porter2 stemmer. The step I quoted, and 
which I fixed, is from the Porter2 algorithm, which in this case improves on 
the original. So in essence I guess my patch moves porter a bit 
closer to porter2.

I understand that changing the stemmer would cause an incompatibility. It 
might be interesting to implement the porter2 algorithm for fts4; I'm not sure 
how the two compare in terms of performance.

Thanks again,

James




Re: [sqlite] Bug in porter stemmer

2010-02-24 Thread Scott Hess
Actually, I think a new version of the tokenizer would have to be a
distinct tokenizer (i.e., porter versus porter1 versus porter2,
whatever).  fts4 should not interpret the meaning of an explicit
tokenizer differently from fts3, but it could use a different default
tokenizer.
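
A rough sketch of that idea, not of any real SQLite interface: a name-to-stemmer table in which an explicit name always resolves to the same function, and only the default differs by module version. Everything here is hypothetical, including the "porter2" entry, the placeholder stemmers, and the defaults chosen (FTS3's real default tokenizer is "simple", not "porter").

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*stem_fn)(char *word);

/* Placeholder stemmers; real ones would implement the full algorithms. */
static void stem_porter(char *word){ (void)word; }
static void stem_porter2(char *word){ (void)word; }

struct tokenizer_entry {
  const char *name;
  stem_fn stem;
};

static const struct tokenizer_entry registry[] = {
  { "porter",  stem_porter  },  /* meaning stays frozen for existing indexes */
  { "porter2", stem_porter2 },  /* hypothetical new, separately named entry */
};

/* An explicit name resolves the same way no matter which module asks. */
static stem_fn find_stemmer(const char *name){
  assert( name!=NULL );
  for(size_t i=0; i<sizeof(registry)/sizeof(registry[0]); i++){
    if( strcmp(registry[i].name, name)==0 ) return registry[i].stem;
  }
  return NULL;
}

/* Only the default is allowed to vary with the module version.
** The defaults used here are made up for the example. */
static stem_fn default_stemmer(int fts_version){
  return find_stemmer(fts_version>=4 ? "porter2" : "porter");
}
```

The point of keeping two names is that an index built with "porter" keeps matching queries stemmed with "porter", while new tables can opt in to the newer rules.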

[Don't take this as gospel, but that's my understanding of the lay of the land.]

-scott



[sqlite] Bug in porter stemmer

2010-02-22 Thread James Berry
I'm writing to report a bug in the porter-stemmer algorithm supplied as part of 
the FTS3 implementation.

The stemmer has an inverted logic error that prevents it from properly stemming 
words of the following form:

  dry -> dri
  cry -> cri

This means, for instance, that the following words don't stem the same:

  dried -> dri   -doesn't match-   dry
  cried -> cri   -doesn't match-   cry

The bug seems to have been introduced as a simple logic error by whoever wrote 
the stemmer code. The original description of step 1c is here: 
http://snowball.tartarus.org/algorithms/english/stemmer.html

  Step 1c:
    replace suffix y or Y by i if preceded by a non-vowel which is
    not the first letter of the word (so cry -> cri, by -> by, say -> say)

But the code in sqlite reads like this:

  /* Step 1c */
  if( z[0]=='y' && hasVowel(z+1) ){
    z[0] = 'i';
  }

In other words, sqlite turns the y into an i only if it is preceded by a vowel 
(say -> sai), while the algorithm intends this to be done if it is _not_ 
preceded by a vowel.

But there are two other problems in that same line of code:

(1) hasVowel checks whether a vowel exists anywhere in the string, not 
just in the next character, which is incorrect, and goes against the step 1c 
directions above. (amplify would not be properly stemmed to amplifi, for 
instance)

(2) The check for the first letter is not performed (for words like 
by, etc)

I've fixed both of those errors in the patch below:

   /* Step 1c */
-  if( z[0]=='y' && hasVowel(z+1) ){
+  if( z[0]=='y' && isConsonant(z+1) && z[2] ){
     z[0] = 'i';
   }
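
To make the patched condition concrete, here is a minimal, self-contained sketch. It assumes, as the `z+1` indexing in the snippet suggests, that the stemmer holds the word reversed, so `z[0]` is the last letter and `z[1]` is the letter before it. The names `is_consonant_rev`, `step1c_patched` and `apply_step1c` are hypothetical and written only for this illustration; they are simplified stand-ins for lowercase ASCII input, not the helpers in the SQLite source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static int is_plain_vowel(char c){
  return c=='a' || c=='e' || c=='i' || c=='o' || c=='u';
}

/* z points into a reversed, NUL-terminated word, so z[1] is the letter
** that precedes z[0] in normal reading order. A y counts as a consonant
** when it starts the word or follows a vowel. */
static int is_consonant_rev(const char *z){
  if( z[0]==0 ) return 0;
  if( z[0]!='y' ) return !is_plain_vowel(z[0]);
  return z[1]==0 || !is_consonant_rev(z+1);
}

/* Condition from the patch: final y, preceded by a consonant, and that
** consonant is not the first letter of the word (z[2] must exist). */
static int step1c_patched(const char *z){
  return z[0]=='y' && is_consonant_rev(z+1) && z[2]!=0;
}

/* Apply the rule to a forward word of fewer than 64 characters by
** testing a reversed copy and rewriting the final letter in place. */
static void apply_step1c(char *word){
  char rev[64];
  size_t n;
  assert( word!=NULL );
  n = strlen(word);
  if( n==0 || n>=sizeof(rev) ) return;
  for(size_t i=0; i<n; i++) rev[i] = word[n-1-i];
  rev[n] = 0;
  if( step1c_patched(rev) ) word[n-1] = 'i';
}
```

With these definitions "cry" becomes "cri", while "by" and "say" are left alone, matching the three examples in the step 1c text quoted above.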
