Re: [sqlite] FTS2 suggestion

Scott Hess Wed, 29 Aug 2007 10:40:57 -0700

Hmm, and a clarification on the n-gram case ... there are no current
plans to implement any n-gram capabilities in fts.  This kind of thing
has been discussed, but since it still seems like a nice-to-have type
thing and not a must-have type thing, no time is being spent on it.  I
have somewhat of a suspicion that this kind of index requires a
materially different model than fts has been using, which might
encourage it to be a completely different virtual table.
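
For readers new to the term, here is a minimal sketch of what
character n-gram matching means, written in Python. Nothing here
exists in fts; the function names (char_ngrams, dice_similarity,
fuzzy_lookup), the '$' padding and the threshold are all assumptions
made up for illustration, not a proposed design.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word.

    The word is lowercased and padded with '$' (an arbitrary marker chosen
    for this sketch) so that prefixes and suffixes get their own n-grams.
    """
    padded = "$" * (n - 1) + word.lower() + "$" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice_similarity(a, b, n=3):
    """Dice coefficient over the two words' n-gram sets, from 0.0 to 1.0."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def fuzzy_lookup(query, vocabulary, threshold=0.4, n=3):
    """Return (score, term) pairs scoring at least `threshold`, best first.

    This scans the whole vocabulary; a real index would instead map each
    n-gram to the terms containing it.
    """
    scored = [(dice_similarity(query, term, n), term) for term in vocabulary]
    return sorted(((s, t) for s, t in scored if s >= threshold), reverse=True)


if __name__ == "__main__":
    # A transposed query still shares several trigrams with the intended term.
    print(fuzzy_lookup("serach", ["search", "stemmer", "sqlite"]))
```

Note that one indexed term turns into many n-grams, each of which
would need its own posting data; that is part of why such an index
looks like a different model from the term-based one.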


-scott


On 8/29/07, Scott Hess <[EMAIL PROTECTED]> wrote:
> A primary constraint of the porter algorithm in fts is that it's
> completely unencumbered open-source.  That may-or-may-not make it a
> great stemmer, of course :-).  One of the reasons it's in there in the
> first place is as an example of an alternative to the very basic
> "simple" fts tokenizer.  One of the near-term goals with Google Gears
> is to improve the tokenizer, and that will probably extend benefits
> out to fts (since Google Gears is also open-source).
>
> Thanks for the link, I'm always looking for reading material!
>
> As far as SQLite having inbuilt search, some projects (Google Gears,
> for example) wanted to use SQLite for reasons other than fulltext
> search.  Rather than try to integrate two distinct projects, we
> decided that it might be cleaner to just make one project a strict
> subsidiary of the other.  So you get fts basically for free once
> you've integrated SQLite into your project.  A side benefit is that
> you don't have to make decisions about where to store your index data,
> and there are no problems with making sure index data and database
> data conform to the same transaction model, these things just happen
> naturally.  This will hopefully make fulltext search more applicable
> in projects where searching is not the core functionality of the
> project.
>
> -scott
>
>
> On 8/29/07, Uma Krishnan <[EMAIL PROTECTED]> wrote:
> > Hello Scott,
> >
> > I have several clarifications with respect to full text search. I'm a 
> > newbie in open source development, so please bear with me if some of the 
> > questions are irrelevant/obvious/nonsense.
> >
> > I was given to understand that the Porter stemming algorithm implemented in
> > fts2 is not robust enough (or rather, that Snowball is more accurate). If fts2 (or
> > 3) has to be made more robust, what should the next step be? The
> > following URL (I thought) gave the steps to follow rather succinctly:
> >
> > http://web.njit.edu/~wu/teaching/CIS634/GoodProjects/AccessLisa/documentation.php
> >
> > At what stage would n-gram kick in? (I assume n-gram would be used in conjunction
> > with Snowball/Porter.) Which would be a good n-gram algorithm to implement?
> >
> > Finally, what's the rationale for SQLite having its own search? Why not use
> > something like luceneC?
> >
> > Thanks in advance
> >
> > Uma
> >
> > Scott Hess <[EMAIL PROTECTED]> wrote:
> > Porter stemmer is already in there.  The main issue with Porter is
> > that it's English only.
> >
> > There is no general game-plan for fuzzy search at this time, though if
> > someone wants to step into the breach, go for it!  Even a prototype
> > which demonstrates the concepts and problems but isn't
> > production-ready would be worth something.
> >
> > My current focus for the next generation is international support
> > (this is more of a Google Gears project, but with focus on SQLite so
> > there is likely to be stuff checked in on the SQLite side), and more
> > scalable/manageable indexing.  Not a lot of focus on things like
> > quality and recall, mostly because I'm not aware of any major users
> > with enough of an installed baseline to even generate decent metrics.
> > [Basically, solving concrete identified problems rather than looking
> > for ill-defined potential problems.]
> >
> > -scott
> >
> >
> > On 8/24/07, Uma Krishnan  wrote:
> > > Would it not be more useful to first implement the Porter stemmer algorithm,
> > > and then to implement n-gram (as I understand it, n-gram is for cross-column
> > > fuzzy search)? What is the general game plan for FTS3 with regard to
> > > fuzzy search?
> > >
> > >   Thanks in advance
> > >
> > > "Cesar D. Rodas"  wrote:
> > >   On 23/08/07, Scott Hess wrote:
> > > > On 8/20/07, Cesar D. Rodas wrote:
> > > > > > As far as I know (I could be wrong), SQLite Full Text Search only
> > > > > > matches whole words, right? It could not be
> > > > > > And also no FT extension to a db (as far as I know) is misspelling
> > > > > > tolerant,
> > > >
> > > > Yes, fts is matching exactly. There is some primitive support for
> > > > English stemming using the Porter stemmer, but, honestly, it's not
> > > > well-exercised.
> > > >
> > > > > And
> > > > > I've found this Paper that talks about *Using Superimposed Coding Of 
> > > > > N-Gram
> > > > > Lists For Efficient Inexact Matching*
> > > >
> > > > http://citeseer.ist.psu.edu/cache/papers/cs/22812/http:zSzzSzwww.novodynamics.comzSztrenklezSzpaperszSzatc92v.pdf/william92using.pdf
> > > > >
> > > > > > I was reading it, and it is not so hard to implement. It costs extra
> > > > > > storage space, but I think the benefits outweigh that.
> > > > > >
> > > > > > Also, following this paper, there could be a way to match fragments
> > > > > > of words... what do you think of it?
> > > >
> > > > It's an interesting paper, and I must say that anything which involves
> > > > Bloom Filters automatically draws my attention :-).
> > >
> > > > Yeah. I am doing some investigation into that; I love them too. And
> > > > I noticed that with n-grams you get a filter to stop common
> > > > words, and that they could be used as a stemming-like algorithm that is
> > > > independent of the language.
> > > >
> > > > I was thinking of implementing this
> > > > http://www.mail-archive.com/sqlite-users%40sqlite.org/msg26923.html
> > > > when I finish up some things. What do you think of it?
> > >
> > > > While I think spelling-suggestion might be valuable for fts in the
> > > > longer term, I'm not very enthusiastic about this particular model.
> > > > It seems much more useful in the standard indexing model of building
> > > > the index, manually tweaking it, and then doing a ton of queries
> > > > against it. fts is really fairly constrained, because many use-cases
> > > > are more along the lines of update the index quite a bit, and query it
> > > > only a few times.
> > > >
> > > > Also, I think the concepts in the paper might have very significant
> > > > problems handling Unicode, because the bit vectors will get so very
> > > > large. I may be wrong, sometimes the overlapping-vector approach can
> > > > have surprising relevance depending on the frequency distribution of
> > > > the things in the vector. It would need some experimentation to
> > > > figure that out.
> > > >
> > > > Certainly something to bookmark, though.
> > > >
> > > > Thanks,
> > > > scott
> > > >
> > > > -----------------------------------------------------------------------------
> > > > To unsubscribe, send email to [EMAIL PROTECTED]
> > > > -----------------------------------------------------------------------------
> > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Cesar D. Rodas
> > > http://www.cesarodas.com/
> > > Mobile Phone: 595 961 974165
> > > Phone: 595 21 645590
> > > [EMAIL PROTECTED]
> > > [EMAIL PROTECTED]
> > >
> > > -----------------------------------------------------------------------------
> > > To unsubscribe, send email to [EMAIL PROTECTED]
> > > -----------------------------------------------------------------------------
> > >
> > >
> > >
> >
>
