more efficient DB compaction (fewer seeks)
------------------------------------------

                 Key: COUCHDB-738
                 URL: https://issues.apache.org/jira/browse/COUCHDB-738
             Project: CouchDB
          Issue Type: Improvement
          Components: Database Core
    Affects Versions: 0.11, 0.10.1, 0.9.2
            Reporter: Adam Kocoloski
            Assignee: Adam Kocoloski


CouchDB's database compaction algorithm walks the by_seq btree, then does a 
lookup in the by_id btree for every document in the database.  It does this 
because the #full_doc_info{} record with the full revision tree is only stored 
in the by_id tree.  I'm proposing instead to store duplicate copies of 
#full_doc_info{} in both trees, and to have the compactor use the by_seq tree 
exclusively.  The net effect is significantly fewer calls to pread(), and a 
compaction I/O pattern where reads tend to be clustered close to each other in 
the file.
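
To make the difference in access pattern concrete, here is a minimal toy model 
in Python (not CouchDB's Erlang code).  CountingTree, FullDocInfo, and the 
compact_* functions are hypothetical names invented for this sketch, and the 
old-format by_seq entry is simplified to hold just a docid.  It only counts 
point lookups; it does not model disk seeks or btree node layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FullDocInfo:
    """Toy stand-in for #full_doc_info{}: id, update seq, revision tree."""
    doc_id: str
    update_seq: int
    rev_tree: tuple


class CountingTree:
    """Dict-backed stand-in for a btree that counts point lookups."""

    def __init__(self, entries):
        self._entries = dict(entries)
        self.lookups = 0

    def lookup(self, key):
        self.lookups += 1
        return self._entries[key]

    def walk(self):
        # Ordered traversal of all values; not counted as a point lookup.
        return [self._entries[k] for k in sorted(self._entries)]


def compact_current(by_seq, by_id):
    """Current scheme (simplified): walk by_seq, then one by_id lookup per doc."""
    return [by_id.lookup(doc_id) for doc_id in by_seq.walk()]


def compact_proposed(by_seq):
    """Proposed scheme: by_seq already holds the full record, so walk it once."""
    return by_seq.walk()


def build_trees(n_docs):
    """Build a by_id tree plus old-format and new-format by_seq trees."""
    infos = [FullDocInfo(f"doc-{i}", i + 1, (f"1-rev{i}",)) for i in range(n_docs)]
    by_id = CountingTree((d.doc_id, d) for d in infos)
    old_by_seq = CountingTree((d.update_seq, d.doc_id) for d in infos)
    new_by_seq = CountingTree((d.update_seq, d) for d in infos)
    return by_id, old_by_seq, new_by_seq
```

In this model both functions return the same records in update-sequence order, 
but compact_current performs one by_id lookup per document while 
compact_proposed performs none.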

If the by_id tree is fully cached, or if the id tree nodes are located near the 
seq tree nodes, the performance improvement is small but noticeable (~10% in 
some simple tests).  On the other hand, in the worst-case scenario of 
randomly-generated docids and a database much larger than main memory, the 
improvement is huge.  Joe Williams did some simple benchmarks with a 
50k-document, 600 MB database on a 256 MB VPS.  The compaction time for that DB 
dropped from 15m to 2m20s, more than 6x faster.
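
As a quick check of the arithmetic, the snippet below recomputes the speedup and 
the average time per document from the figures quoted above.  The function 
names are just for illustration; the only inputs are the 15m and 2m20s timings 
and the 50k document count.

```python
def speedup(before_s, after_s):
    """Ratio of the old compaction time to the new one."""
    return before_s / after_s


def ms_per_doc(total_s, n_docs):
    """Average compaction time per document, in milliseconds."""
    return total_s * 1000.0 / n_docs


BEFORE_S = 15 * 60       # 15m, compaction time before the patch
AFTER_S = 2 * 60 + 20    # 2m20s, compaction time with the patch
N_DOCS = 50_000          # documents in the benchmark database

print(f"speedup: {speedup(BEFORE_S, AFTER_S):.2f}x")
print(f"before:  {ms_per_doc(BEFORE_S, N_DOCS):.1f} ms/doc")
print(f"after:   {ms_per_doc(AFTER_S, N_DOCS):.1f} ms/doc")
```

That works out to roughly 6.4x overall, or about 18 ms per document before the 
patch versus about 2.8 ms after.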

Storing the #full_doc_info{} in the seq tree also allows for some similar 
optimizations in the replicator.

This patch might have downsides when documents have a large number of edits.  
These include an increase in the size of the database and slower view indexing.  
I expect both to be small effects.

The patch can be applied directly to tr...@934272.  Existing DBs are still 
readable, new updates will be written in the new format, and databases can be 
fully upgraded by compacting.
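
The upgrade path can be pictured with another small Python sketch.  This is an 
illustration of the idea only, not the patch itself: the dict-based trees, 
FullDocInfo, read_seq_entry, and compact are all hypothetical, and an 
old-format by_seq entry is simplified to a bare docid.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FullDocInfo:
    """Toy stand-in for #full_doc_info{}."""
    doc_id: str
    update_seq: int
    rev_tree: tuple


def read_seq_entry(entry, by_id):
    """Return a FullDocInfo for either an old-format or new-format entry."""
    if isinstance(entry, FullDocInfo):
        return entry        # new format: the full record is stored inline
    return by_id[entry]     # old format (simplified): entry is only a docid


def compact(by_seq, by_id):
    """Rewrite every by_seq entry in the new format."""
    return {seq: read_seq_entry(entry, by_id) for seq, entry in by_seq.items()}
```

A database with a mix of old and new entries stays readable through 
read_seq_entry, and after compact every by_seq entry carries the full record.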
