Weston Ruter wrote:
> I am investigating alternative methods of storing the tokenized text of a
> book in a database. The text is broken up into individual tokens (e.g. word,
> punctuation mark, etc.) and each is assigned a separate ID and exists as a
> separate object. There can be hundreds of thousands of token objects. The
> existing method I've employed in a relational database is to have a table
> Token(id, data, position), where position is an integer used to ensure the
> tokens are rendered in the proper document order. The obvious problem with
> this use of "position" is with insertions and deletions, which make an
> update necessary on all subsequent tokens, and this is expensive, e.g. after
> deleting:
>
> UPDATE Token SET position = position - 1 WHERE position > old_token_position
>
> I was hoping that with CouchDB's support for documents containing arrays I
> could avoid an explicit position altogether and just rely on the implicit
> position each object has by where it lies in the token array (this would
> also facilitate text revisions). In this approach, however, it is critical
> that the entire document not have to be re-loaded and re-saved when a change
> is made (as I imagine this would be even slower than the SQL UPDATE); I was
> hoping that an insertion or deletion could be applied as a patch so that it
> could be done efficiently. But from asking my question on Twitter, it
> appears that the existing approach I took with the relational database is
> also what would be required by CouchDB.
That is correct. Storing the tokens of a book in an array in a single
document would require retrieving, modifying and saving the complete
document for every change. Storing the tokens as separate documents with an
increasing ID would of course involve the same kind of updating as you are
doing in your relational setup. It sounds like a linked-list kind of
storage, where every token has pointers to the previous and next token,
might better fit your needs for reconstructing a book from the tokens.

> Is there a more elegant way to store my data set in CouchDB?

If I were to use CouchDB I think I'd use a document per token. I'd test how
expensive updating the sequence IDs is (using the HTTP bulk document API
[0]) and, depending on how often sequence updates need to happen, I might
switch to a linked-list kind of approach. (You could use the same in a
relational database, of course.) Rough sketches of both approaches follow
at the end of this message.

Are you planning on storing more than just the tokens and their order? If
not, I'm wondering what the use of storing a book as a list of tokens
actually is. Sounds like a plain text file would do the job as well, but
I'm sure there is a point. :o)

> Note that I am very new to CouchDB and am ignorant of a lot of its
> features.

The Definitive Guide [1] is a nice read.

Nils.

[0] http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API
[1] http://guide.couchdb.org/

------------------------------------------------------------------------
VPRO
phone: +31(0)356712911
e-mail: [email protected]
web: www.vpro.nl
------------------------------------------------------------------------
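
A minimal sketch of the first option: keeping a position field on every
token document and testing how expensive a bulk shift of those positions is
via the HTTP bulk document API. The database name ("book"), the local
CouchDB URL, the field name "position" and the use of the Python requests
library are all assumptions made for illustration, not anything prescribed
by CouchDB.

import requests

DB = "http://localhost:5984/book"

def shift_positions_after(deleted_position):
    """Decrement the position of every token after a deleted one,
    sending all updates in a single _bulk_docs request."""
    # Fetch all documents; a view keyed on position would be more
    # efficient, but _all_docs keeps the sketch short.
    rows = requests.get(DB + "/_all_docs",
                        params={"include_docs": "true"}).json()["rows"]
    changed = []
    for row in rows:
        doc = row["doc"]
        if doc.get("position", -1) > deleted_position:
            doc["position"] -= 1
            changed.append(doc)  # the fetched doc already carries its _rev
    # One HTTP round trip for all updates.
    resp = requests.post(DB + "/_bulk_docs", json={"docs": changed})
    resp.raise_for_status()
    return len(changed)

Timing this function for a realistic number of tokens would give a feel for
whether the position-shifting approach is acceptable.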
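
And a sketch of the linked-list idea: each token document points to its
previous and next token, so an insertion or deletion only touches the two
neighbouring documents instead of everything that follows. Again, the field
names (data, prev, next) and the database URL are assumptions for
illustration only; boundary cases (inserting at the very start or end) are
left out.

import requests

DB = "http://localhost:5984/book"

def reconstruct(first_token_id):
    """Rebuild the token sequence by walking the next-pointers.
    One GET per token; in practice you would batch the reads."""
    tokens = []
    token_id = first_token_id
    while token_id:
        doc = requests.get(DB + "/" + token_id).json()
        tokens.append(doc["data"])
        token_id = doc.get("next")  # absent on the last token
    return tokens

def insert_between(new_id, data, prev_id, next_id):
    """Insert a new token between two existing ones.
    Only three documents are written, regardless of book length."""
    prev_doc = requests.get(DB + "/" + prev_id).json()
    next_doc = requests.get(DB + "/" + next_id).json()
    new_doc = {"_id": new_id, "data": data, "prev": prev_id, "next": next_id}
    prev_doc["next"] = new_id
    next_doc["prev"] = new_id
    # The two fetched neighbours include their _rev, so a single
    # _bulk_docs call can write all three documents at once.
    resp = requests.post(DB + "/_bulk_docs",
                         json={"docs": [new_doc, prev_doc, next_doc]})
    resp.raise_for_status()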
