Hmm, and a clarification on the n-gram case ... there are no current plans to implement any n-gram capabilities in fts. This kind of thing has been discussed, but since it still seems like a nice-to-have rather than a must-have, no time is being spent on it. I suspect that this kind of index requires a materially different model than fts has been using, which might argue for making it a completely different virtual table.
-scott

On 8/29/07, Scott Hess wrote:
> A primary constraint of the porter algorithm in fts is that it's
> completely unencumbered open-source. That may-or-may-not make it a
> great stemmer, of course :-). One of the reasons it's in there in the
> first place is as an example of an alternative to the very basic
> "simple" fts tokenizer. One of the near-term goals with Google Gears
> is to improve the tokenizer, and that will probably extend benefits
> out to fts (since Google Gears is also open-source).
>
> Thanks for the link, I'm always looking for reading material!
>
> As far as SQLite having inbuilt search, some projects (Google Gears,
> for example) wanted to use SQLite for reasons other than fulltext
> search. Rather than try to integrate two distinct projects, we
> decided that it might be cleaner to just make one project a strict
> subsidiary of the other. So you get fts basically for free once
> you've integrated SQLite into your project. A side benefit is that
> you don't have to make decisions about where to store your index data,
> and there are no problems with making sure index data and database
> data conform to the same transaction model; these things just happen
> naturally. This will hopefully make fulltext search more applicable
> in projects where searching is not the core functionality of the
> project.
>
> -scott
>
>
> On 8/29/07, Uma Krishnan wrote:
> > Hello Scott,
> >
> > I have several clarifications with respect to full text search. I'm a
> > newbie in open source development, so please bear with me if some of the
> > questions are irrelevant/obvious/nonsense.
> >
> > I was given to understand that the Porter stemming algorithm implemented in
> > fts2 is not robust enough (or rather snowball is more accurate). If fts2 (or
> > 3) has to be made more robust, then what should be the next step? The
> > following url (I thought) gave the steps to follow rather succinctly:
> >
> > http://web.njit.edu/~wu/teaching/CIS634/GoodProjects/AccessLisa/documentation.php
> >
> > At what stage would n-gram kick in (I assume n-gram would be in conjunction
> > with snowball/Porter)? Which would be a good n-gram algorithm to implement?
> >
> > Finally, what's the rationale in having sqlite's own search? Why not use
> > something like luceneC?
> >
> > Thanks in advance
> >
> > Uma
> >
> > Scott Hess wrote:
> > Porter stemmer is already in there. The main issue with Porter is
> > that it's English only.
> >
> > There is no general game-plan for fuzzy search at this time, though if
> > someone wants to step into the breach, go for it! Even a prototype
> > which demonstrates the concepts and problems but isn't
> > production-ready would be worth something.
> >
> > My current focus for the next generation is international support
> > (this is more of a Google Gears project, but with focus on SQLite so
> > there is likely to be stuff checked in on the SQLite side), and more
> > scalable/manageable indexing. Not a lot of focus on things like
> > quality and recall, mostly because I'm not aware of any major users
> > with enough of an installed baseline to even generate decent metrics.
> > [Basically, solving concrete identified problems rather than looking
> > for ill-defined potential problems.]
> >
> > -scott
> >
> >
> > On 8/24/07, Uma Krishnan wrote:
> > > Would it not be more useful to first implement the Porter stemmer algorithm,
> > > and then to implement n-gram (as I understand n-gram is for cross column
> > > fuzzy search?). What is the general game plan for FTS3 with regard to
> > > fuzzy search?
> > >
> > > Thanks in advance
> > >
> > > "Cesar D. Rodas" wrote:
> > > On 23/08/07, Scott Hess wrote:
> > > > On 8/20/07, Cesar D. Rodas wrote:
> > > > > As far as I know (I can be wrong) SQLite Full Text Search only matches
> > > > > whole words, right? It could not be
> > > > > And also no FT extension to db (as far as I know) is
> > > > > misspelling-tolerant,
> > > >
> > > > Yes, fts is matching exactly. There is some primitive support for
> > > > English stemming using the Porter stemmer, but, honestly, it's not
> > > > well-exercised.
> > > >
> > > > > And
> > > > > I've found this paper that talks about *Using Superimposed Coding Of
> > > > > N-Gram Lists For Efficient Inexact Matching*
> > > > >
> > > > > http://citeseer.ist.psu.edu/cache/papers/cs/22812/http:zSzzSzwww.novodynamics.comzSztrenklezSzpaperszSzatc92v.pdf/william92using.pdf
> > > > >
> > > > > I was reading it and it is not so hard to implement, but it costs extra
> > > > > storage space, though I think the benefits outweigh that.
> > > > >
> > > > > Also, following this paper, there could be a way to match fragments
> > > > > of words... what do you think of it?
> > > >
> > > > It's an interesting paper, and I must say that anything which involves
> > > > Bloom Filters automatically draws my attention :-).
> > >
> > > Yeah. I am doing some investigations about that, I love that too. And
> > > I was seeing that with n-grams you get a filter to stop common
> > > words, and it could be used as a stemming-like algorithm but independent
> > > of the language.
> > >
> > > I was thinking to implement this
> > > http://www.mail-archive.com/sqlite-users%40sqlite.org/msg26923.html
> > > when I finish up some things. What do you think of it?
> > >
> > > > While I think spelling-suggestion might be valuable for fts in the
> > > > longer term, I'm not very enthusiastic about this particular model.
> > > > It seems much more useful in the standard indexing model of building
> > > > the index, manually tweaking it, and then doing a ton of queries
> > > > against it. fts is really fairly constrained, because many use-cases
> > > > are more along the lines of update the index quite a bit, and query it
> > > > only a few times.
> > > >
> > > > Also, I think the concepts in the paper might have very significant
> > > > problems handling Unicode, because the bit vectors will get so very
> > > > large. I may be wrong, sometimes the overlapping-vector approach can
> > > > have surprising relevance depending on the frequency distribution of
> > > > the things in the vector. It would need some experimentation to
> > > > figure that out.
> > > >
> > > > Certainly something to bookmark, though.
> > > >
> > > > Thanks,
> > > > scott
> > > >
> > > > -----------------------------------------------------------------------------
> > > > To unsubscribe, send email to [EMAIL PROTECTED]
> > > > -----------------------------------------------------------------------------
> > >
> > > --
> > > Cesar D. Rodas
> > > http://www.cesarodas.com/
> > > Mobile Phone: 595 961 974165
> > > Phone: 595 21 645590
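
For anyone following the n-gram part of this thread, here is a minimal sketch of what misspelling-tolerant matching over character n-grams can look like. It is not fts code and it is not the superimposed-coding scheme from the paper Cesar linked (no bit vectors or Bloom filters here); it is a plain inverted index from trigrams to words, scored with the Dice coefficient. The function names, the `$` padding character, and the 0.5 threshold are all assumptions made up for illustration.

```python
from collections import defaultdict


def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word.

    The word is lowercased and padded with '$' on both sides so that
    prefixes and suffixes contribute their own n-grams (an assumption
    of this sketch, not something fts does).
    """
    padded = "$" * (n - 1) + word.lower() + "$" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def build_index(words, n=3):
    """Map each n-gram to the set of words that contain it."""
    index = defaultdict(set)
    for word in words:
        for gram in char_ngrams(word, n):
            index[gram].add(word)
    return index


def fuzzy_lookup(index, query, n=3, threshold=0.5):
    """Return (word, score) pairs whose Dice similarity to the query
    is at least `threshold`, best match first."""
    query_grams = char_ngrams(query, n)

    # Count how many of the query's n-grams each indexed word shares.
    shared_counts = defaultdict(int)
    for gram in query_grams:
        for word in index.get(gram, ()):
            shared_counts[word] += 1

    results = []
    for word, shared in shared_counts.items():
        # Dice coefficient: 2 * |A & B| / (|A| + |B|)
        score = 2 * shared / (len(query_grams) + len(char_ngrams(word, n)))
        if score >= threshold:
            results.append((word, score))
    return sorted(results, key=lambda r: (-r[1], r[0]))


if __name__ == "__main__":
    idx = build_index(["stemming", "stemmer", "search", "tokenizer"])
    # A misspelled query still finds its closest indexed word.
    print(fuzzy_lookup(idx, "stemer"))
```

Note that every indexed word costs one posting per n-gram, which is the extra storage Cesar mentions, and every update has to touch all of those postings. That is at least consistent with Scott's point that this style of index fits build-once, query-often workloads better than update-heavy ones.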