An n-gram is a sequence of N consecutive letters from a word or set of words; see http://en.wikipedia.org/wiki/N-gram
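To make the definition concrete, here is a minimal sketch in Python of splitting a word into character n-grams and comparing two words by the n-grams they share. The function names `char_ngrams` and `dice_similarity` are made up for this illustration, and the Dice coefficient is just one common way to score overlap. It is not what fts does, and it is not the superimposed-coding scheme from the paper discussed below.

```python
def char_ngrams(word, n=3):
    """Return the overlapping character n-grams of a word, in order."""
    word = word.lower()
    # A word shorter than n letters yields no n-grams.
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def dice_similarity(a, b, n=3):
    """Score two words by shared n-grams (Dice coefficient, 0.0 to 1.0).

    Assumption for illustration only: this is one simple overlap measure,
    not an algorithm taken from fts or from the cited paper.
    """
    grams_a = set(char_ngrams(a, n))
    grams_b = set(char_ngrams(b, n))
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


if __name__ == "__main__":
    print(char_ngrams("search"))                      # trigrams of "search"
    print(dice_similarity("search", "serach", n=2))   # a transposed misspelling
```

Because a misspelled word still shares some of its n-grams with the intended word, a score like this can rank near matches without any language-specific rules.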
On 29/08/2007, Uma Krishnan <[EMAIL PROTECTED]> wrote:
>
> Hello Scott,
>
> I have several clarifications with respect to full text search. I'm a
> newbie in open source development, so please bear with me if some of the
> questions are irrelevant/obvious/nonsense.
>
> I was given to understand that the Porter stemming algorithm implemented
> in fts2 is not robust enough (or rather, Snowball is more accurate). If
> fts2 (or 3) has to be made more robust, what should the next step be?
> The following URL (I thought) gave the steps to follow rather succinctly:
>
> http://web.njit.edu/~wu/teaching/CIS634/GoodProjects/AccessLisa/documentation.php
>
> At what stage would n-gram kick in (I assume n-gram would be used in
> conjunction with Snowball/Porter)? Which would be a good n-gram algorithm
> to implement?
>
> Finally, what's the rationale for SQLite having its own search? Why not
> use something like luceneC?
>
> Thanks in advance
>
> Uma
>
> Scott Hess <[EMAIL PROTECTED]> wrote:
> Porter stemmer is already in there. The main issue with Porter is
> that it's English only.
>
> There is no general game-plan for fuzzy search at this time, though if
> someone wants to step into the breach, go for it! Even a prototype
> which demonstrates the concepts and problems but isn't
> production-ready would be worth something.
>
> My current focus for the next generation is international support
> (this is more of a Google Gears project, but with focus on SQLite so
> there is likely to be stuff checked in on the SQLite side), and more
> scalable/manageable indexing. Not a lot of focus on things like
> quality and recall, mostly because I'm not aware of any major users
> with enough of an installed baseline to even generate decent metrics.
> [Basically, solving concrete identified problems rather than looking
> for ill-defined potential problems.]
>
> -scott
>
>
> > On 8/24/07, Uma Krishnan wrote:
> > Would it not be more useful to first implement the Porter stemmer
> > algorithm, and then to implement n-gram (as I understand it, n-gram is
> > for cross-column fuzzy search)? What is the general game plan for FTS3
> > with regard to fuzzy search?
> >
> > Thanks in advance
> >
> > "Cesar D. Rodas" wrote:
> > On 23/08/07, Scott Hess wrote:
> > > On 8/20/07, Cesar D. Rodas wrote:
> > > > As far as I know (I could be wrong), SQLite Full Text Search only
> > > > matches whole words, right? It could not be
> > > > And also no FT extension to a db (as far as I know) is misspelling
> > > > tolerant,
> > >
> > > Yes, fts is matching exactly. There is some primitive support for
> > > English stemming using the Porter stemmer, but, honestly, it's not
> > > well-exercised.
> > >
> > > > And
> > > > I've found this paper that talks about *Using Superimposed Coding Of
> > > > N-Gram Lists For Efficient Inexact Matching*
> > > >
> > > > http://citeseer.ist.psu.edu/cache/papers/cs/22812/http:zSzzSzwww.novodynamics.comzSztrenklezSzpaperszSzatc92v.pdf/william92using.pdf
> > > >
> > > > I was reading it and it is not so hard to implement. It costs extra
> > > > storage space, but I think the benefits outweigh that.
> > > >
> > > > Also, following this paper, there could be a way to match
> > > > fragments of words... what do you think of it?
> > >
> > > It's an interesting paper, and I must say that anything which involves
> > > Bloom Filters automatically draws my attention :-).
> >
> > Yeah. I am doing some investigation into that, I love that too. And
> > I noticed that with n-grams you get a filter to stop common
> > words, and they could be used as a stemming-like algorithm that is
> > independent of the language.
> >
> > I was thinking of implementing this
> > http://www.mail-archive.com/sqlite-users%40sqlite.org/msg26923.html
> > when I finish up some things. What do you think of it?
> >
> > > While I think spelling-suggestion might be valuable for fts in the
> > > longer term, I'm not very enthusiastic about this particular model.
> > > It seems much more useful in the standard indexing model of building
> > > the index, manually tweaking it, and then doing a ton of queries
> > > against it. fts is really fairly constrained, because many use-cases
> > > are more along the lines of update the index quite a bit, and query it
> > > only a few times.
> > >
> > > Also, I think the concepts in the paper might have very significant
> > > problems handling Unicode, because the bit vectors will get so very
> > > large. I may be wrong, sometimes the overlapping-vector approach can
> > > have surprising relevance depending on the frequency distribution of
> > > the things in the vector. It would need some experimentation to
> > > figure that out.
> > >
> > > Certainly something to bookmark, though.
> > >
> > > Thanks,
> > > scott
> > >
> > > -----------------------------------------------------------------------------
> > > To unsubscribe, send email to [EMAIL PROTECTED]
> > > -----------------------------------------------------------------------------

--
Cesar D. Rodas
http://www.cesarodas.com/
Mobile Phone: 595 961 974165
Phone: 595 21 645590
[EMAIL PROTECTED]
[EMAIL PROTECTED]