I am investigating alternative methods of storing the tokenized text of a book in a database. The text is broken up into individual tokens (e.g. a word, a punctuation mark, etc.), and each token is assigned a separate ID and exists as a separate object. There can be hundreds of thousands of token objects. The existing method I've employed in a relational database is a table Token(id, data, position), where position is an integer used to ensure the tokens are rendered in proper document order. The obvious problem with this use of "position" is insertions and deletions, which require an update on all subsequent tokens, and this is expensive. For example, after deleting a token:

UPDATE Token SET position = position - 1 WHERE position > old_token_position
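To make the cost concrete, here is a minimal sketch of the relational approach in an in-memory SQLite database (the table and column names follow the Token(id, data, position) schema above; the sample tokens are just illustrative):

```python
import sqlite3

# In-memory database with the Token(id, data, position) schema from the post.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Token (id INTEGER PRIMARY KEY, data TEXT, position INTEGER)"
)

tokens = ["The", "quick", "brown", "fox", "."]
conn.executemany(
    "INSERT INTO Token (data, position) VALUES (?, ?)",
    [(t, i) for i, t in enumerate(tokens)],
)

# Delete the token at position 1 ("quick") ...
conn.execute("DELETE FROM Token WHERE position = 1")

# ... and now every subsequent token must be renumbered. This is the
# expensive part: the UPDATE touches all rows after the deletion point.
cur = conn.execute("UPDATE Token SET position = position - 1 WHERE position > 1")
print(cur.rowcount)  # rows shifted: everything after the deleted token

rows = [r[0] for r in conn.execute("SELECT data FROM Token ORDER BY position")]
print(rows)
```

With five tokens only three rows are shifted, but with hundreds of thousands of tokens a deletion near the start of the book rewrites nearly the whole table.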
I was hoping that, with CouchDB's support for documents containing arrays, I could avoid an explicit position altogether and just rely on the implicit position each object has within the token array (this would also facilitate text revisions). In this approach, however, it is critical that the entire document not have to be re-loaded and re-saved when a change is made (as I imagine this would be even slower than the SQL UPDATE); I was hoping that an insertion or deletion could be applied as a patch so that it could be done efficiently. But from asking my question on Twitter, it appears that the approach I took with the relational database is also what would be required by CouchDB.

Is there a more elegant way to store my data set in CouchDB? Note that I am very new to CouchDB and am ignorant of many of its features.

Thanks!
Weston

Previous conversation on Twitter:

So let's say I have a huge CouchDB document, like an object with millions of properties. Are updates efficient i.e. can be patches?
http://twitter.com/#!/westonruter/status/29401798345

@westonruter you can use an _update function to update that saving wire-transport. But a doc that large sounds like wrong architecture.
http://twitter.com/#!/CouchDB/status/29410715036

@CouchDB I'm looking to represent the text of a book, where each word (token) is a discrete object in an ordered set. Best way to represent?
http://twitter.com/#!/westonruter/status/29415056023

@westonruter knee-jerk idea: import each word as a separate document and store the position number with it. Use a view to sort.
http://twitter.com/#!/CouchDB/status/29415298940

@westonruter but there may be smarter ways to do that, best to ask on the [email protected] mailing list: http://bit.ly/agv3ye
http://twitter.com/#!/CouchDB/status/29415412524

@CouchDB That's exactly what I was hoping to avoid: storing token positions, and instead just using the objects' implicit array positions.
http://twitter.com/#!/westonruter/status/29415975571

@westonruter Documents are atomic, you PUT the whole document each revision. (Can you use a view on a bunch of smaller documents instead?)
http://twitter.com/#!/natevw/status/29415990407

@westonruter got it. the megadoc may work out, but _update still reads it from disk into memory fully, so it's not "ideal".
http://twitter.com/#!/CouchDB/status/29416178621

@westonruter I might be the wrong tool for the job. -- But do write the mailing list to see what the others come up with :)
http://twitter.com/#!/CouchDB/status/29416215398

@westonruter OTOH if you don't care to make views on the data at all, you could split it into doc attachments (which can be PUT separately).
http://twitter.com/#!/natevw/status/29416254893

--
Weston Ruter
http://weston.ruter.net/
@westonruter <http://twitter.com/westonruter> - Google Profile <http://www.google.com/profiles/WestonRuter#about>
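P.S. For concreteness, the per-token-document idea suggested in the thread above (one small document per token, sorted by a view keyed on position) might look roughly like this. This is only a sketch emulated in plain Python, not actual CouchDB code; in real CouchDB the map function would emit(doc.position, doc.data) and view rows come back sorted by key:

```python
# One small document per token, each carrying an explicit position field.
# The _id values and sample words are made up for illustration.
docs = [
    {"_id": "token-3", "data": "fox", "position": 3},
    {"_id": "token-0", "data": "The", "position": 0},
    {"_id": "token-2", "data": "brown", "position": 2},
    {"_id": "token-1", "data": "quick", "position": 1},
]

def by_position_view(documents):
    """Emulate a CouchDB view keyed on position: emit (key, value)
    pairs and return them sorted by key, as a view query would."""
    rows = [(d["position"], d["data"]) for d in documents]
    return sorted(rows)

# Reassemble the text in document order from the view rows.
text = " ".join(value for _key, value in by_position_view(docs))
print(text)
```

Note that this still stores an explicit position per token, which is exactly what I was hoping to avoid; it only avoids the single mega-document (and the insert/delete renumbering problem remains).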
