[sqlite] FTS3

Martin Pfeifle Tue, 02 Jun 2009 11:00:59 -0700

Some further questions regarding FTS3.

Am I correct that the doclist of a given term is never split across two blocks 
(BLOBs)?
Can we somehow limit the size of such BLOBs?
I ran some tests in which I inserted millions of addresses into FTS3, all of 
which contained a certain term.
I ended up with some BLOBs larger than 1 MB.
Can I somehow avoid this?
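
One way to check how large the index BLOBs get in a test like this is to query the FTS3 shadow tables directly. The sketch below assumes a Python build whose bundled SQLite includes FTS3; the table name addr, the helper names and the row count are made up for illustration. It only measures BLOB sizes; it does not limit them.

```python
import sqlite3


def largest_index_blob(conn, table):
    """Return the size in bytes of the largest index BLOB of an FTS3 table.

    Index nodes live in two shadow tables: <table>_segments.block and
    <table>_segdir.root, so both are inspected.
    """
    sizes = []
    for shadow, column in (("segments", "block"), ("segdir", "root")):
        row = conn.execute(
            f'SELECT COALESCE(MAX(LENGTH({column})), 0) FROM "{table}_{shadow}"'
        ).fetchone()
        sizes.append(row[0])
    return max(sizes)


def demo(n_rows=2000):
    """Insert n_rows addresses that all share two terms, then measure."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE VIRTUAL TABLE addr USING fts3(line)")
    except sqlite3.OperationalError:
        return None  # this SQLite build was compiled without FTS3
    with conn:  # one transaction; the index is written out on commit
        conn.executemany(
            "INSERT INTO addr(line) VALUES (?)",
            ((f"{i} Main Street",) for i in range(n_rows)),
        )
    return largest_index_blob(conn, "addr")
```

Calling demo() with increasing n_rows shows how the largest BLOB grows as more rows share the same term.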
Best Martin

________________________________
From: Martin Pfeifle <martinpfei...@yahoo.de>
To: General Discussion of SQLite Database <sqlite-users@sqlite.org>
Sent: Friday, 29 May 2009, 08:59:45
Subject: Re: [sqlite] FTS3

One further question:

fts3.c contains a comment describing how the stored format depends on the 
compile-time settings:
** Result formats differ with the setting of DL_DEFAULTS.  Examples:
**
** DL_DOCIDS: [1] [3] [7]
** DL_POSITIONS: [1 0[0 4] 1[17]] [3 1[5]]
** DL_POSITIONS_OFFSETS: [1 0[0,0,3 4,23,26] 1[17,102,105]] [3 1[5,20,23]]
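
To make the three formats concrete, here is a minimal sketch that builds a doclist for one term as nested Python lists. It mirrors only the logical layout shown in the comment above (docid, then column, then positions, then optional start and end offsets), not the byte-level encoding used by fts3.c. The function name, the input shape and the simple word tokenizer are assumptions made for illustration.

```python
import re


def build_doclist(docs, term, mode="DL_POSITIONS"):
    """Build a logical doclist for `term`.

    docs maps docid -> list of column texts. mode is one of
    "DL_DOCIDS", "DL_POSITIONS" or "DL_POSITIONS_OFFSETS".
    """
    doclist = []
    for docid in sorted(docs):
        columns = []
        for col, text in enumerate(docs[docid]):
            hits = []
            # Token position is the index of the word within the column.
            for pos, match in enumerate(re.finditer(r"\w+", text)):
                if match.group().lower() != term:
                    continue
                if mode == "DL_POSITIONS_OFFSETS":
                    hits.append((pos, match.start(), match.end()))
                else:
                    hits.append(pos)
            if hits:
                columns.append((col, hits))
        if columns:
            # DL_DOCIDS keeps only the docid and drops everything else.
            doclist.append(docid if mode == "DL_DOCIDS" else (docid, columns))
    return doclist


docs = {
    1: ["main st main", "x"],
    3: ["x", "old main"],
    7: ["none", "none"],
}
docids_only = build_doclist(docs, "main", "DL_DOCIDS")
with_positions = build_doclist(docs, "main", "DL_POSITIONS")
with_offsets = build_doclist(docs, "main", "DL_POSITIONS_OFFSETS")
```

Each richer mode stores strictly more data per hit, which is why DL_DOCIDS gives the smallest index.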

I also found one functional limitation that applies if we use only DL_DOCIDS to 
reduce the overall size:

/*
** By default, only positions and not offsets are stored in the doclists.
** To change this so that offsets are stored too, compile with
**
**          -DDL_DEFAULT=DL_POSITIONS_OFFSETS
**
** If DL_DEFAULT is set to DL_DOCIDS, your table can only be inserted
** into (no deletes or updates).
*/

Are there any other functional drawbacks if we go for DL_DOCIDS only, e.g. when 
searching for "term1 term2" in a document?
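
If the quoted query is read as a phrase search, the sketch below illustrates why position data matters. It assumes the usual inverted-index approach, in which a phrase matches when the second term's position directly follows the first term's. With docids alone, only the AND of the two terms can be computed. Both function names and the single-column simplification are hypothetical; this is not code from fts3.c.

```python
def and_match(docids_a, docids_b):
    """Documents containing both terms; needs docids only."""
    return sorted(set(docids_a) & set(docids_b))


def phrase_match(pos_a, pos_b):
    """Documents where term b directly follows term a.

    pos_a and pos_b map docid -> list of token positions of the term.
    """
    matches = []
    for docid in pos_a.keys() & pos_b.keys():
        following = set(pos_b[docid])
        if any(p + 1 in following for p in pos_a[docid]):
            matches.append(docid)
    return sorted(matches)


# Doc 1 contains "term1 term2"; doc 2 contains "term2 ... term1".
term1 = {1: [0], 2: [3]}
term2 = {1: [1], 2: [0]}
both_terms = and_match(term1.keys(), term2.keys())
as_phrase = phrase_match(term1, term2)
```

Here both_terms contains documents 1 and 2, while as_phrase contains only document 1, a distinction that cannot be made from docids alone.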

Best Martin

________________________________
From: D. Richard Hipp <d...@hwaci.com>
To: General Discussion of SQLite Database <sqlite-users@sqlite.org>
Sent: Tuesday, 26 May 2009, 12:27:59
Subject: Re: [sqlite] FTS3

On May 26, 2009, at 5:03 AM, Martin Pfeifle wrote:

> Dear all,
> we need full and fuzzy text search for addresses.
> Currently we are looking into Lucene and SQLite's FTS extension.
> For us it is crucial to understand the file structures and the  
> concepts behind the libraries.
> Is there a self-contained, comprehensive document for FTS3 (besides  
> the comments in fts3.c) ?

There is no information on FTS3 apart from the code comments and the  
README files in the source tree.

The file formats for FTS3 and Lucene are completely different at the  
byte level.  But if you dig deeper, you will find that they both use  
the same underlying concepts and ideas and really are two different  
implementations of the same algorithm.  During development, we were  
constantly testing the performance and index size of FTS3 against  
CLucene using the Enron email corpus.  Our goal was for FTS3 to run  
significantly faster than CLucene and to generate an index that was no  
larger in size.  That goal was easily met at the time, though we have  
not tested FTS3 against CLucene lately to see if anything has changed.

One of the issues with CLucene that FTS3 sought to address was that  
when inserting new elements into the index, the insertion time was  
unpredictable.  Usually the insertions would be very fast.  But Lucene  
will occasionally take a very long time for a single insertion in  
order to merge multiple smaller indices into larger indices.  This was  
seen as undesirable.  FTS3 strives to give much better worst-case  
insertion times by doing index merges incrementally and spreading the  
cost of index merges across many inserts.
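
As a rough illustration of merging small indices into larger ones, here is a toy model of level-based segment merging. It is not taken from fts3.c or from Lucene: the class name, the fan-out of 4 and the (term, docid) posting shape are all assumptions. Each new segment lands on level 0, and only when a level fills up are its segments merged into one segment on the next level, so most inserts touch very little data.

```python
import heapq

FANOUT = 4  # hypothetical: number of segments a level holds before merging


class TieredIndex:
    """Toy index whose segments are sorted lists of (term, docid) pairs."""

    def __init__(self):
        self.levels = []  # levels[i] is the list of segments on level i

    def add_segment(self, postings):
        """Add a small, freshly built segment on level 0."""
        self._push(0, sorted(postings))

    def _push(self, level, segment):
        while len(self.levels) <= level:
            self.levels.append([])
        self.levels[level].append(segment)
        if len(self.levels[level]) == FANOUT:
            # Merge the full level into one sorted segment and promote it.
            merged = list(heapq.merge(*self.levels[level]))
            self.levels[level] = []
            self._push(level + 1, merged)


index = TieredIndex()
for docid in range(16):
    index.add_segment([("main", docid)])
```

After 16 single-posting segments, levels 0 and 1 are empty and level 2 holds one merged segment. In this model a posting is rewritten once per level it passes through, rather than on every insert.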

D. Richard Hipp
d...@hwaci.com

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
