Re: [sqlite] fts table insert performance

2007-12-04 Thread Scott Hess
Sorry for the delayed response.  Was waiting for time to dig into this a little.

Nov 18, 2007 2:05 AM Wang Yun <[EMAIL PROTECTED]>:
> I insert rfc txt files into a full text search table, 4119 txt files are
> 188MB totally. After insert, database file is 443MB.
> Logic is below, it's not the real code.

I've attached a tcl script I used in experimenting with this.  I'm
doing this on a Redhat 9 box (kernel 2.4.22 with some additional
patches), running on a local disk both for input and the database.
The version of rfc I'm using has 4756 files with 252M of data.  I
compiled tclsqlite3 with flags:

-O6 -DNDEBUG=1 -DHAVE_FDATASYNC=1 -DSQLITE_CORE=1
-DSQLITE_ENABLE_FTS1=1 -DSQLITE_ENABLE_BROKEN_FTS1=1
-DSQLITE_ENABLE_FTS2=1 -DSQLITE_ENABLE_BROKEN_FTS2=1
-DSQLITE_ENABLE_FTS3=1 -DHAVE_DLOPEN=1

[Listing all this detail because various things can cause various
differences.  The most-relevant parameters are probably -O6
-DNDEBUG=1.]

> This cost 154 seconds, I use fts2 and my PC is Intel 2.33GHz, 2 CPUs.
> If I don't use fts, just insert into normal table, will cost 11 seconds.

Your results are a bit worse than I'd expect.  I get 6.2s for the
simple-table case, and about 45s for the fts3 case, not quite your 14x
slow-down.  fts2 and fts3 should be nearly identical for this kind of
test.

Basic table, one transaction:
2.190u 1.060s 0:06.20 52.4% 0+0k 0+0io 393pf+0w
-rw-r--r--  1 shess eng 239505408 Dec  4 11:15 dbs/baseline.db

Basic table, per-insert transaction:
3.280u 1.890s 0:34.19 15.1% 0+0k 0+0io 393pf+0w
-rw-r--r--  1 shess eng 239505408 Dec  4 11:16 dbs/baseline.db

fts3, one transaction:
40.820u 1.550s 0:44.44 95.3%0+0k 0+0io 414pf+0w
-rw-r--r--  1 shess eng 307900416 Dec  4 11:17 dbs/baseline.db

fts3, per-insert transaction:
61.300u 3.330s 1:43.04 62.7%0+0k 0+0io 414pf+0w
-rw-r--r--  1 shess eng 348348416 Dec  4 11:21 dbs/baseline.db

I'm not seeing as much bloat as you describe - perhaps I'm using a
bigger page size.

> I don't know when sqlite will update the full text index, after each insert?

fts2/3 update the index after each transaction, and also before each
insert where an explicit rowid is less than the maximum rowid seen in
the current transaction, and also before any query run against the
table.  In this case, it should only be updating at the end of the
transaction, and as needed when the in-memory table fills up.

> How can I improve the performance?

The optimal case for fts2/3 is to do many inserts per transaction,
letting the table select rowid/docid.  In that case it will collect
the new data in memory and flush it to disk less frequently.  For a
test of this size, it might also help to have a bigger page cache so
that SQLite doesn't have to flush the journal file to disk.  You could
define kPendingThreshold in the fts3.c (or fts2.c) source code to be
larger (this doesn't seem to help me at all, though).

Without going in and profiling things, though, my guess is that the
time in this test is dominated by tokenization.  I'm basing this on
the high CPU utilization, and the lack of impact from tweaking
kPendingThreshold and pragma cache_size.  Possibly there's also some
cost from segment merges, though those should generally be fairly I/O
dominated.

-scott

-
To unsubscribe, send email to [EMAIL PROTECTED]
-

[sqlite] fts table insert performance

2007-11-18 Thread Wang Yun
Hi

I insert rfc txt files into a full text search table, 4119 txt files are
188MB totally. After insert, database file is 443MB.
Logic is below, it's not the real code.


BEGIN_TRANSACTION

CREATE TABLE Notes ( rowid INTEGER PRIMARY KEY, docid, CreateTime,
LastModifyTime )
CREATE VIRTUAL TABLE NotesFTS using fts2( Title, Content )

for( 4119 txt file )
{
open and read file content into a string

INSERT INTO NotesFTS ( Title, Content ) VALUES ( ..., ... )
INSERT INTO Notes ( docid, CreateTime, LastModifyTime ) VALUES (
last_insert_id, DATETIME('NOW'), DATETIME('NOW') )
}

COMMIT


This cost 154 seconds, I use fts2 and my PC is Intel 2.33GHz, 2 CPUs.
If I don't use fts, just insert into normal table, will cost 11 seconds.

I don't know when sqlite will update the full text index, after each insert?
How can I improve the performance?

Thanks,
Wang Yun




-
To unsubscribe, send email to [EMAIL PROTECTED]
-