On 2 May 2014 07:57, Dan Kennedy <danielk1...@gmail.com> wrote:

> On 05/01/2014 03:30 PM, andrewmo wrote:
>
>> We are using the FTS3 extension to SQLite to store large numbers of short
>> (~300 byte) documents. This is working very well and providing us with
>> very fast text search, but the behaviour around deletion of documents has
>> me confused.
>>
>> Our system must control the size of the database and will delete the
>> oldest documents when the database size breaches a certain limit. I now
>> understand from comments on this mailing list and elsewhere that this is
>> not an optimal pattern for the FTS extension, as doclists for the oldest
>> documents are the least likely to be 'merged'.
>>
>> My question is, does this actually work at all? If I delete a row from my
>> FTS4 table (resulting in a new empty doclist being added to the index),
>> then subsequently add many (1000s of) new documents and call the 'merge'
>> function several times (automerge is also enabled), is there any guarantee
>> that the empty doclist and the populated doclist it superseded will ever
>> be removed? My testing suggests this isn't the case.
>>
>> I have a 1GB database with 6 million documents. If I keep adding new
>> documents at around 1 per second and deleting documents when the size of
>> the data goes beyond 1GB, the size of the index seems to grow and the
>> number of documents I can store in the 1GB file seems to decrease in a
>> linear manner.
>>
>> Calling the 'optimize' function seems to solve this issue (removing all
>> the dead doclists), but that isn't practical for our software, as it
>> implies some downtime for our high-availability service due to the long
>> execution time of the optimize function (could be minutes for a 1GB file).
>>
>> I have seen this post from 2008
>> (http://sqlite.1065341.n5.nabble.com/fts3-database-grows-td42069.html).
>> However, it predates the 'automerge' and manual merge features, and from
>> the documentation I assumed these new features would delete all the data
>> related to deleted documents. Am I incorrect in my assumption?
>>
>> Thanks for any clarification you can offer.
>>
>
> Normally, when you write to an FTS index (either to add new doclists or to
> add delete markers) the new entries are accumulated in-memory for a while
> and then flushed to a new "level-0" b-tree. A level-0 b-tree is often
> roughly 1MB in size. Once there are 16 level-0 b-trees, they are merged
> and written to a single level-1 b-tree. Once there are 16 level-1
> b-trees...
> And so on.
>
> So when an entry is deleted from the FTS index, a delete marker is added.
> But the original doclists are not actually deleted until the delete marker
> and the doclists are merged into the same b-tree. Delete markers are
> discarded when they are merged into the oldest b-tree in the index.
>
> At first glance it seems (to me) that this means the index might grow to
> anything up to 16 times its "optimized" size. But I think it's actually
> worse than that.
>
> Say your entire database fits into a single level-N b-tree. You keep adding
> data (and delete markers) until there are 15 level-N b-trees and almost
> enough data to create the 16th in lower levels. So at this point the FTS
> index is 16 times its optimal size. If you then add even more data so that
> the 16th level-N b-tree is created, everything gets merged together and
> we're back in the optimal state - everything in a single b-tree. However -
> this b-tree is deemed to be a level-N+1 b-tree. Meaning that this time,
> much more data will have to be added before everything is merged together
> again.
>
> So I'm thinking a solution might be:
>
>   * Fix FTS so that it detects this case - when a merge includes so many
>     delete markers that the output is small enough to be deemed a level-N
>     b-tree, not a level-N+1 b-tree - and
>
>   * Instead of using the default 16-way merges, the app could arrange
>     to periodically invoke the "merge=X,Y" command with a smaller Y value
>     (say 2) to limit the maximum size of the index to Y times its optimal
>     size (instead of 16 times).
>
> It is an interesting problem. And the above is just guesswork... It would
> be good to verify experimentally that the index really does grow
> indefinitely with this kind of input before trying to "fix" anything.
>
> Dan.
>
>

Thanks for your response, Dan. Our application currently runs "merge=300,8"
(in a loop until no more modifications are being made) whenever it deletes
records due to the database size limit, or after a certain number of
insertions have been made.
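
For reference, the loop is roughly the following (a sketch rather than our
exact code: "docs" stands in for our table name, and the stop condition is
the one given in the FTS4 documentation - if sqlite3_total_changes()
increases by less than two across a merge command, the command found
nothing left to do). Dropping Y from 8 to 2, as you suggest, would only
mean changing the second parameter:

    import sqlite3

    def run_merge_until_done(conn, table="docs", pages=300, segments=8):
        # Repeatedly issue the FTS4 'merge=X,Y' special command until it
        # reports that there is no further work to do.
        while True:
            before = conn.total_changes
            conn.execute("INSERT INTO %s(%s) VALUES('merge=%d,%d')"
                         % (table, table, pages, segments))
            conn.commit()
            # Per the FTS4 docs, a change count difference of less than
            # two means the merge command was a no-op.
            if conn.total_changes - before < 2:
                break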

I did read the advice in the documentation that this should be run in an
idle thread, but adding an extra thread to trigger it seemed like an
unnecessary complication for our system, where spiky insertion performance
isn't much of an issue.
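
For comparison, the idle-thread variant the documentation suggests might
look something like this (again only a sketch; each thread needs its own
connection, and it reuses the run_merge_until_done() loop above):

    import sqlite3
    import threading
    import time

    def idle_merge_worker(db_path, table="docs", interval=60.0):
        # Background thread with its own connection (sqlite3 connections
        # are not shared across threads by default) that runs the merge
        # loop from the previous sketch every `interval` seconds.
        conn = sqlite3.connect(db_path)
        while True:
            run_merge_until_done(conn, table)
            time.sleep(interval)

    # threading.Thread(target=idle_merge_worker,
    #                  args=("docs.db",), daemon=True).start()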

We are currently running a test (it has been going for 6 days so far) that
adds records to our database at 1 per second (a realistic customer
scenario) and deletes records (in chunks of 8000) whenever the database
size goes above 1GB, while tracking the number of records the database
contains.
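
The deletion side of the test looks roughly like this (sketch only: the
table name, the 1GB limit, and the assumption that the lowest docids are
the oldest records are all specific to our test; it measures live pages
rather than raw file size, because pages freed by DELETE go onto the
freelist without shrinking the file):

    import sqlite3

    SIZE_LIMIT = 1024 * 1024 * 1024   # the 1GB cap used in the test
    CHUNK = 8000                      # records deleted per pass

    def db_used_bytes(conn):
        # Live data only: total pages minus freelist pages.
        page_size = conn.execute("PRAGMA page_size").fetchone()[0]
        page_count = conn.execute("PRAGMA page_count").fetchone()[0]
        freelist = conn.execute("PRAGMA freelist_count").fetchone()[0]
        return (page_count - freelist) * page_size

    def trim_if_needed(conn):
        # Delete the oldest CHUNK documents (lowest docids) once the
        # live data exceeds the limit.
        if db_used_bytes(conn) > SIZE_LIMIT:
            conn.execute("DELETE FROM docs WHERE docid IN "
                         "(SELECT docid FROM docs ORDER BY docid LIMIT ?)",
                         (CHUNK,))
            conn.commit()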

If the merging is working, I expect the number of records in the database
to drop initially, as records are deleted to hold the size down while the
b-trees grow, but then to settle down over time as the b-trees are merged.
If, on the other hand, the graph of record count versus time keeps dropping
in a roughly linear manner, that would suggest it is not working.
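
Another way to watch what is happening would be to query the FTS shadow
tables directly; something like the following (Python, assuming an FTS4
table named "docs" and the documented %_segments/%_segdir shadow-table
layout) should show the index size and the per-level segment counts you
describe:

    import sqlite3

    def fts_index_stats(conn, table="docs"):
        # Approximate index size: leaf/interior blocks in <table>_segments
        # plus the segment root nodes stored inline in <table>_segdir.
        blocks = conn.execute(
            "SELECT COALESCE(SUM(LENGTH(block)), 0) FROM %s_segments"
            % table).fetchone()[0]
        roots = conn.execute(
            "SELECT COALESCE(SUM(LENGTH(root)), 0) FROM %s_segdir"
            % table).fetchone()[0]
        # One row per segment b-tree, grouped by level.
        levels = conn.execute(
            "SELECT level, COUNT(*) FROM %s_segdir GROUP BY level"
            % table).fetchall()
        return blocks + roots, levels

    conn = sqlite3.connect("docs.db")     # hypothetical path
    size, levels = fts_index_stats(conn)
    print("index bytes:", size)
    for level, count in levels:
        print("level", level, ":", count, "segment(s)")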

The insertion rate we are using is actually much lower than the database
can handle, so I might alter the test to increase the insertion rate and
get a faster picture of the overall trend.