On 05/02/2014 04:13 PM, Andrew Moss wrote:
On 2 May 2014 07:57, Dan Kennedy <danielk1...@gmail.com> wrote:

On 05/01/2014 03:30 PM, andrewmo wrote:

We are using the FTS3 extension to sqlite to store large numbers of
short (~300 byte) documents. This is working very well and providing us
with very fast text search, but the behaviour around deletion of
documents has me confused.

Our system must control the size of the database and will delete the
oldest documents when the database size breaches a certain limit. I now
understand from comments on this mailing list and elsewhere that this
is not an optimal pattern for the FTS extension, as doclists for the
oldest documents are the least likely to be 'merged'.

My question is, does this actually work at all? If I delete a row from
my FTS4 table (resulting in a new empty doclist being added to the
index), and I subsequently add many (1000s of) new documents and call
the 'merge' function several times (automerge is also enabled), is
there any guarantee that the empty doclist and the populated doclist
that it superseded will ever be removed? My testing suggests this isn't
the case.

I have a 1GB database with 6 million documents. If I keep adding new
documents at around 1 per second and deleting documents when the size
of the data goes beyond 1GB, the size of the index seems to grow and
the number of documents I can store in the 1GB file seems to decrease
in a linear manner.

Calling the 'optimize' function seems to solve this issue (removing
all the dead doclists), but that isn't practical for our software, as
it implies some downtime for our high-availability service due to the
long execution time of the optimize function (it could be minutes for a
1GB file).
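
(For reference, 'optimize' is issued via the special INSERT form below;
the table name "docs" is only a stand-in for our real table name.)

  -- merge every b-tree in the FTS index into a single b-tree; on a
  -- large index this can run for minutes, which is why we cannot use
  -- it on our live service
  INSERT INTO docs(docs) VALUES('optimize');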

I have seen this post from 2008
(http://sqlite.1065341.n5.nabble.com/fts3-database-grows-td42069.html).
However, it predates the 'automerge' and manual merge features, and
from the documentation I assumed these new features would delete all
the data related to deleted documents. Am I incorrect in my assumption?
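
(Again for reference, the automerge feature is enabled with the command
below; "docs" is a placeholder table name.)

  -- enable automatic incremental merging of index b-trees
  INSERT INTO docs(docs) VALUES('automerge=1');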

Thanks for any clarification you can offer.

Normally, when you write to an FTS index (either to add new doclists or to
add delete markers) the new entries are accumulated in-memory for a while
and then flushed to a new "level-0" b-tree. A level-0 b-tree is often
roughly 1MB in size. Once there are 16 level-0 b-trees, they are merged
and written to a single level-1 b-tree. Once there are 16 level-1
b-trees...
And so on.
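
To put rough numbers on that hierarchy (taking the ~1MB level-0 size
and the default 16-way merges above, and ignoring the space recovered
when delete markers are discarded):

  level-0 b-tree  ~ 1 MB
  level-1 b-tree  ~ 16 MB    (16 level-0 b-trees merged together)
  level-2 b-tree  ~ 256 MB   (16 level-1 b-trees merged together)
  level-N b-tree  ~ 16^N MB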

So when an entry is deleted from the FTS index, a delete marker is added.
But the original doclists are not actually deleted until the delete marker
and the doclists are merged into the same b-tree. Delete markers are
discarded when they are merged into the oldest b-tree in the index.
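
To make that concrete, here is roughly what a deletion amounts to (the
table name "docs" is hypothetical):

  -- deleting a row from an FTS4 table only appends delete markers for
  -- its terms to the newest part of the index; the doclists originally
  -- written for the row remain on disk until a merge brings the markers
  -- and those doclists into the same b-tree
  DELETE FROM docs WHERE docid = 123;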

At first glance it seems (to me) that this means the index might grow to
anything up to 16 times its "optimized" size. But I think it's actually
worse than that.

Say your entire database fits into a single level-N b-tree. You keep adding
data (and delete markers) until there are 15 level-N b-trees and almost
enough data to create the 16th in lower levels. So at this point the FTS
index is 16 times its optimal size. If you then add even more data so that
the 16th level-N b-tree is created, everything gets merged together and
we're back in the optimal state - everything in a single b-tree. However -
this b-tree is deemed to be a level-N+1 b-tree. Meaning that this time,
much more data will have to be added before everything is merged together
again.
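
As a worked illustration of that worst case (the sizes are hypothetical
and just follow the 16-way scheme above):

  optimal state:           1 level-N b-tree of size S
  worst case before merge: 15 level-N b-trees + nearly-full lower levels,
                           roughly 16 * S of index data
  after that merge:        1 level-(N+1) b-tree, so roughly 16x as much
                           new data must be written before the next collapse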

So I'm thinking a solution might be:

   * Fix FTS so that it detects this case - when a merge includes so many
     delete markers that the output is small enough to be deemed a level-N
     b-tree rather than a level-N+1 b-tree - and

   * Instead of using the default 16-way merges, the app could arrange to
     periodically invoke the "merge=X,Y" command with a smaller Y value
     (say 2) to limit the maximum size of the index to Y times its optimal
     size (instead of 16 times). (A sketch of such a command follows this
     list.)
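
For example, a single step of that second approach might look like the
following (the table name "docs" and the X value of 500 are arbitrary
placeholders; the Y=2 part is what limits the growth):

  -- do a bounded amount of merge work, combining groups of at least 2
  -- b-trees at a time instead of the default 16
  INSERT INTO docs(docs) VALUES('merge=500,2');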

It is an interesting problem. And the above is just guesswork... It
would be good to verify experimentally that the index really does grow
indefinitely with this kind of input before trying to "fix" anything.

Dan.



Thanks for your response Dan. Our application currently runs
"merge=300,8" (in a loop until no more modifications are being made)
whenever it deletes records due to the database size or after a certain
number of insertions have been made.
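
Each iteration of that loop is just the documented special INSERT form
(our real table name differs; "docs" is a stand-in), and we stop once
the statement no longer modifies the database (which can be detected
with sqlite3_total_changes()):

  -- one iteration of the incremental merge loop
  INSERT INTO docs(docs) VALUES('merge=300,8');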

I did read the advice in the documentation that this should be "run in
an idle thread". But adding an extra thread to trigger this seemed like
an unnecessary complication to our system, where spiky insertion
performance isn't too much of an issue.

We are currently running a test (it has been going for 6 days so far)
that adds records to our database at 1 per second (a realistic customer
scenario) and deletes records (in chunks of 8000) whenever the database
size goes above 1GB, while tracking the number of records the database
contains.
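
Roughly, each cycle of the test does something like the following (the
table name "docs" is a stand-in, and we assume docid order matches
insertion order):

  -- current database size in bytes = page_count * page_size
  PRAGMA page_count;
  PRAGMA page_size;

  -- if the size is over 1GB, delete the oldest 8000 documents
  DELETE FROM docs WHERE docid IN (
    SELECT docid FROM docs ORDER BY docid LIMIT 8000
  );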

If the database is working, then I expect the number of records in the
database to drop initially, as records are deleted due to database size
and the b-trees grow, but to settle down over time as the b-trees are
merged. If, on the other hand, the graph of number of records vs. time
keeps dropping in a linear manner, that would suggest it is not working.

The insertion rate we are using is actually much lower than the
database can handle, so I might alter our test to increase the number
of insertions to get a faster picture of the overall trend.

A query like:

  SELECT level, count(*) AS ntree FROM yourftstablename_segdir GROUP BY level;

will tell you how many b-trees there currently are at each level, which
might help you figure out what is going on and when you might expect a
merge to actually start removing data from the index. More on the
schema of the underlying data structure used by the FTS index is here:

  http://www.sqlite.org/fts3.html#section_9

Dan.