Hi Konrad,

I'll take a stab at this; if I'm wrong, hopefully someone will correct me.
The on-disk B-tree is written in an append-only fashion rather than modified in place. Append-only updates mean that every inner node of the B-tree along the path from the root to the updated leaf has to be rewritten for each update. Initially, when there are very few inner nodes, the amount of disk space used per update is relatively constant; since the tree has a large fan-out, its depth does not change much at first.

In the second graph you are seeing a tree of depth 1 (just the root) being written over and over to disk, and you get the corresponding expected linear growth. With a higher revision limit, however, the old revisions are kept in the tree, so the tree grows taller and fatter with each update. As you make more updates, more inner nodes need to be rewritten per update, which causes the growth to accelerate. Eventually you hit the revision limit, old revisions are discarded, the tree stops getting taller or fatter, and the number of inner nodes that have to be rewritten per update becomes relatively constant again (though larger than in the _revs_limit=1 case). I suspect the first graph becomes linear above 1000 updates and does not keep accelerating. A rough sketch of this growth pattern is below, after the quoted message.

Cheers,
Randall

2010/5/31 Konrad Förstner <[email protected]>:
> Hi,
>
> I have an issue with CouchDB and posted the question on stackoverflow
> [1], but did not get any helpful answer. It would be great if somebody
> could answer it here or on stackoverflow (there I also had a problem
> with the compaction, which was just a timing issue, as explained in
> the comment).
>
> I was wondering why my CouchDB database was growing so fast, so I
> wrote a little test script [2]. The script changes an attribute of a
> CouchDB document 1200 times and records the size of the database after
> each change. After performing these 1200 writes, the database does a
> compaction step and the size is measured again. Finally, the script
> plots the database size against the revision numbers. The benchmark is
> run twice:
>
> * The first time, the default number of document revisions (=1000) is
>   used (_revs_limit).
>
> * The second time, the number of document revisions is set to 1.
>
> The first run produces the following plot:
> http://www.flickr.com/photos/konradfoerstner/4656011444/
>
> The second run produces this plot:
> http://www.flickr.com/photos/konradfoerstner/4656012732/
>
> For me this is quite unexpected behavior. In the first run I would
> have expected linear growth, as every change produces a new revision.
> Once the 1000 revisions are reached, the size should stay constant, as
> the older revisions are discarded.
>
> In the second run, the first revision should result in a certain
> database size that is then kept during the following writes, as every
> new revision leads to the deletion of the previous one.
>
> I could understand if a little overhead were needed to manage the
> changes, but this growth behavior seems weird to me. Can anybody
> explain this phenomenon or correct the assumptions that led to my
> wrong expectations?
>
> Many thanks in advance
>
> Konrad
>
> [1] http://stackoverflow.com/questions/2921151/why-do-my-couchdb-databases-grow-so-fast
> [2] http://github.com/konrad/couchdb-benchmarking
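P.S. To make the two curves concrete, here is a rough toy model in Python. It is not CouchDB's actual on-disk format: NODE_SIZE, FANOUT and the depth calculation are assumptions made purely for illustration, and it only counts appended inner nodes, ignoring document bodies and the nodes getting wider. It is just enough to show linear growth when _revs_limit=1, and growth that speeds up and then levels off when old revisions are retained.

# Toy model of the append-only B-tree growth described above -- not
# CouchDB's real on-disk format. NODE_SIZE, FANOUT and tree_depth()
# are assumptions for illustration; the fan-out is kept artificially
# small so the depth changes within 1200 updates.

NODE_SIZE = 4096   # assumed bytes per serialized B-tree node
FANOUT = 10        # assumed keys per inner node (real fan-out is larger)

def tree_depth(num_entries):
    # Levels needed to hold num_entries keys (1 = just the root).
    depth, capacity = 1, FANOUT
    while num_entries > capacity:
        depth += 1
        capacity *= FANOUT
    return depth

def simulate(updates, revs_limit):
    # Each update appends one node per level of the current tree; old
    # revisions stay in the tree until revs_limit of them have piled up.
    size, sizes = 0, []
    for i in range(1, updates + 1):
        live_revs = min(i, revs_limit)
        size += tree_depth(live_revs) * NODE_SIZE
        sizes.append(size)
    return sizes

default_limit = simulate(1200, revs_limit=1000)   # cf. first plot
limit_of_one = simulate(1200, revs_limit=1)       # cf. second plot
# limit_of_one grows by a constant NODE_SIZE per update (linear), while
# default_limit's per-update cost rises in steps as the retained
# revisions deepen the tree, then levels off at a larger constant.

The fan-out here is deliberately tiny so the effect is visible within 1200 updates; with a realistic fan-out you would see the same shape, just spread over many more updates.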
