Agreed, HUGE thanks for FTS. Hopefully my original post didn't come off 
ungrateful. I was just confused by limitations that looked like they could have 
been removed during the initial design (at least more easily than they can 
now). Scott's reply helps me understand this better, and perhaps gives some 
starting points for finding a solution.

The idea of using the tokenizer output and doing a direct match is intriguing. 
A full content scan is expensive (that is the point of indexing), but I guess 
it is usually less expensive than a full index scan for single rows 
(especially for large indexes), and it would eliminate the current limitations.
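
To make the direct-match idea concrete, here is a rough sketch in Python. It is 
not FTS3 code. The tokenizer and the names simple_tokenize and row_matches are 
made up for illustration, and it assumes an implicit AND of plain terms only 
(no phrases, prefixes, OR, or NOT).

```python
import re


def simple_tokenize(text):
    """Lowercase the text and split on runs of non-alphanumeric ASCII.

    Hypothetical stand-in for a tokenizer; this is not the FTS3 C code.
    """
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def row_matches(text, query):
    """Return True if every query term occurs in the tokenized text.

    Assumption: implicit AND between plain terms, nothing else.
    """
    doc_terms = set(simple_tokenize(text))
    return all(term in doc_terms for term in simple_tokenize(query))


# Rows that were already selected some other way (for example by rowid).
rows = {
    1: "Full text search in SQLite",
    2: "Why file when you can search",
}

# Check each row's content directly, without consulting any index.
print(row_matches(rows[1], "sqlite search"))  # True
print(row_matches(rows[2], "sqlite search"))  # False
```

The cost is one tokenizer pass per candidate row, so this only looks attractive 
when some other constraint has already narrowed the rows down to a few.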

As far as continued development, there is a "tracker FTS" branch available that 
appears to be active. See 
http://git.gnome.org/cgit/tracker/tree/src/libtracker-fts. The commit log 
suggests that work on it is ongoing: 
http://git.gnome.org/cgit/tracker/log/?qt=grep&q=FTS.

The tracker-fts code adds ranking and some other important functionality, but 
it is hard to separate from the rest of tracker. The tracker-fts files are 
public domain (SQLite license) but they have some dependencies on other parts 
of tracker that are not. Also, at least as of a few months ago, I think they 
were based on an earlier version of FTS3.

Supposing someone wanted to update FTS3, how would they get write access to the 
main code repository?

John

-----Original Message-----
From: sqlite-users-boun...@sqlite.org [mailto:sqlite-users-boun...@sqlite.org] 
On Behalf Of P Kishor
Sent: Friday, October 16, 2009 4:23 PM
To: General Discussion of SQLite Database
Subject: Re: [sqlite] Why FTS3 has the limitations it does

On Fri, Oct 16, 2009 at 3:12 PM, Scott Hess <sh...@google.com> wrote:
> On Wed, Oct 14, 2009 at 11:35 PM, John Crenshaw
> <johncrens...@priacta.com> wrote:
>> The severe limitations on FTS3 seemed odd to me, but I figured I could
>> live with them. Then I started finding that various queries were giving
>> strange "out of context" errors with the MATCH operator, even though I
>> was following all the documented rules. As a result I started looking
>> deeply into what is going on with FTS3 and I found something that
>> bothers me.
>>
>> These limitations are really completely arbitrary. They should be
>> removable.
>
> fts is mostly the way it is because that was the amount that got done
> before I lost the motivation to carry it further.  The set of possible
> improvements is vast, but they need a motivated party to carry them
> forward.  Some of the integration with SQLite is the way it is mostly
> because it was decided to keep fts outside of SQLite core.  Feel free
> to dive in and improve it.
>
>> You can only use a single index to query a table, after that everything
>> else has to be done with a scan of the results, fair enough. But with
>> FTS3, the match operator works ONLY when the match expression is
>> selected for the index. This means that if a query could allow a row to
>> be selected by either rowid, or a MATCH expression, you can have a
>> problem. If the rowid is selected for use as the index, the MATCH won't
>> be used as the index, and you get errors. Similarly, a query with two
>> MATCH expressions will only be able to use one as the index, so you get
>> errors from the second.
>
> The MATCH code probes term->doclist; there is no facility for probing
> by docid.  At minimum the document will need to be tokenized.
> Worst-case, you could tokenize it to an in-memory segment and probe
> that, which would make good re-use of existing code.  Most efficient
> would be to somehow match directly against the tokenizer output (you
> could look at the snippeting code for hints there).
>
>> My first question is, why was FTS designed like this in the first place?
>
> Because running MATCH against a subset of the table was not considered
> an important use case when designing it?
>
>> Surely this was clear during the design stage, when the design could
>> have been easily changed to accommodate the lookups required for a MATCH
>> function. Is there some compelling performance benefit? Something I
>> missed?
>
> "Easily" is all relative.  There were plenty of hard problems to be
> solved without looking around for a bunch of easy ones to tack on.
>
>> My second question is, can we expect this to change at some point?
>
> Probably not unless someone out there decides to.  I got kind of
> burned out on fts about a year back.


With immense gratitude expressed here to Scott, I feel a bit
disappointed that FTS has fallen out of the core, and out of
"continued development and improvement." It is really a brilliant
piece of work that makes sqlite eminently more usable for a number of
tasks. I was setting up a "tagging" system, and what a nightmare that
was until I realized that I don't have to develop one. I can just use
FTS! It sort of fits right in Google's philosophy that I summarized in
my write-up a while back

http://punkish.org/Why-File-When-You-Can-Full-Text-Search


I don't have the smarts to work on FTS, only the smarts to realize
that it is a great thing to use. I hope someone will pick it up and
that eventually we will have FTS4. Makes life so much easier instead
of dicking around with external search mechanisms such as Lucene or
swish-e or htdig, etc.




>
>> All that is needed is the ability to lookup by a combination
>> of docid and term. Isn't a hash already built while creating a list of
>> terms for storage? What if that hash were stored, indexed by docid?
>
> In database world, space==time.  Storing more data means the system gets 
> slower.
>
> -scott
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>



-- 
Puneet Kishor http://www.punkish.org
Carbon Model http://carbonmodel.org
Charter Member, Open Source Geospatial Foundation http://www.osgeo.org
Science Commons Fellow, http://sciencecommons.org/about/whoweare/kishor
Nelson Institute, UW-Madison http://www.nelson.wisc.edu
-----------------------------------------------------------------------
Assertions are politics; backing up assertions with evidence is science
=======================================================================
Sent from Madison, WI, United States