[sqlite] FTS statistics and stemming

Stephen Woodbridge Sat, 05 Jul 2008 20:31:07 -0700

Hi,

First let me say that FTS3 is really awesome. This is my first 
experience playing with FTS and it works very nicely with the PORTER 
stemming.


My particular use for FTS is not document text but addresses and it 
would be very useful if there were a way to analyze the FTS index to get 
statistics on the keys. I could then use this information to make a 
custom parser/stemmer that could eliminate stop words.

For example, Rd, road, st, street, etc. would be over-represented and 
not very discriminating, so these should/could be removed. Ideally this 
list would be generated by loading the data, then analyzing the 
index, then updating the stemmer to remove the new stop words, and 
analyzing and adjusting again if needed.
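
To make the idea concrete, here is a rough sketch in Python of how such 
a stop word list might be derived from the raw address strings before 
they ever reach FTS. It does not read the FTS index at all; the function 
name stop_word_candidates, the whitespace tokenizing, and the 
min_fraction cutoff are all assumptions made up for illustration.

```python
from collections import Counter

def stop_word_candidates(addresses, min_fraction=0.6):
    """Return tokens that appear in at least min_fraction of the rows.

    Hypothetical helper: tokenizes on whitespace and lowercases, which is
    only a stand-in for whatever the real tokenizer/stemmer does.
    """
    doc_freq = Counter()
    for text in addresses:
        # set() so a token counts once per row, however often it repeats
        doc_freq.update(set(text.lower().split()))
    cutoff = min_fraction * len(addresses)
    return sorted(tok for tok, n in doc_freq.items() if n >= cutoff)

# Made-up sample data, not real addresses
sample = ["12 Main St", "40 Oak St", "7 Elm Rd", "9 Main St"]
candidates = stop_word_candidates(sample)
print(candidates)
```

The candidates could then be fed into the custom tokenizer, the data 
reloaded, and the analysis repeated until the list stops changing.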

Is this possible? How?

If I had to code this, where would I start? I would like to get a list of 
the keys and a count of how many rows a given key is represented 
in. I assume a token that appears multiple times in a document is 
represented by a list of offsets, so I could also get a count of the 
number of times it shows up in each document somehow. I think I have figured 
this much out by reading all the posts on FTS in the archive.
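
The numbers I am after look roughly like the output of this Python 
sketch, which computes them from the source text rather than from the 
FTS index. The name token_stats and the simple whitespace tokenizing are 
assumptions for illustration only; the real counts would depend on the 
tokenizer and stemmer in use.

```python
from collections import Counter

def token_stats(rows):
    """Map each token to (rows containing it, total occurrences).

    Hypothetical helper; lowercased whitespace tokens stand in for
    whatever keys the FTS tokenizer would actually produce.
    """
    rows_with = Counter()
    occurrences = Counter()
    for text in rows:
        tokens = text.lower().split()
        occurrences.update(tokens)      # every appearance counts
        rows_with.update(set(tokens))   # each row counts at most once
    return {tok: (rows_with[tok], occurrences[tok]) for tok in occurrences}

# Made-up sample rows
stats = token_stats(["1 Main St", "2 St James St", "3 Oak Rd"])
for tok, (n_rows, n_total) in sorted(stats.items()):
    print(tok, n_rows, n_total)
```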

Thanks,
   -Steve
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
