Weston Ruter wrote:
> I am investigating alternative methods of storing the tokenized text of a
> book in a database. The text is broken up into individual tokens (e.g. word,
> punctuation mark, etc.) and each is assigned a separate ID and exists as a
> separate object. There can be hundreds of thousands of token objects. The
> existing method I've employed in a relational database is to have a table
> Token(id, data, position), where position is an integer used to ensure the
> tokens are rendered in the proper document order. The obvious problem with
> this use of "position" is with insertions and deletions, which make an
> update necessary on all subsequent tokens, and this is expensive, e.g. after
> deleting:
>
> UPDATE Token SET position = position - 1 WHERE position > old_token_position
>
> I was hoping that with CouchDB's support for documents containing arrays I
> could avoid an explicit position altogether and just rely on the implicit
> position each object has by where it lies in the token array (this would
> also facilitate text revisions). In this approach, however, it is critical
> that the entire document not have to be re-loaded and re-saved when a change
> is made (as I imagine this would be even slower than the SQL UPDATE); I was
> hoping that an insertion or deletion could be applied as a patch so that it
> could be done efficiently. But from asking my question on Twitter, it
> appears that the existing approach I took with the relational database is
> also what would be required by CouchDB.
That is correct. Storing the tokens of a book in an array in a single
document would require retrieving, modifying and saving the complete
document for every change. Storing the tokens as separate documents with an
increasing ID would of course involve the same kind of updating as you are
doing in your relational setup. It sounds like a linked-list kind of
storage, where every token has pointers to the previous and next token,
might better fit your needs for reconstructing a book from the tokens.

> Is there a more elegant way to store my data set in CouchDB?

If I were to use CouchDB I think I'd use a document per token. I'd test how
expensive updating the sequence IDs is (using the HTTP bulk document API
[0]) and, depending on how often sequence updates need to happen, I might
switch to a linked-list kind of approach. (You could use the same in a
relational database, of course.) Rough sketches of both approaches follow
at the end of this message.

Are you planning on storing more than just the tokens and their order? If
not, I'm wondering what the use of storing a book as a list of tokens
actually is. Sounds like a plain text file would do the job as well, but
I'm sure there is a point. :o)

> Note that I am very new to CouchDB and am ignorant of a lot of its
> features.

The Definitive Guide [1] is a nice read.

Nils.

[0] http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API
[1] http://guide.couchdb.org/

------------------------------------------------------------------------
VPRO
phone: +31(0)356712911
e-mail: [email protected]
web: www.vpro.nl
------------------------------------------------------------------------
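
A minimal sketch of the first option: keeping a position field on every
token document and testing how expensive a bulk shift of those positions is
via the HTTP bulk document API. The database name ("book"), the local
CouchDB URL, the field name "position" and the use of the Python requests
library are all assumptions made for illustration, not anything prescribed
by CouchDB.

import requests

DB = "http://localhost:5984/book"

def shift_positions_after(deleted_position):
    """Decrement the position of every token after a deleted one,
    sending all updates in a single _bulk_docs request."""
    # Fetch all documents; a view keyed on position would be more
    # efficient, but _all_docs keeps the sketch short.
    rows = requests.get(DB + "/_all_docs",
                        params={"include_docs": "true"}).json()["rows"]
    changed = []
    for row in rows:
        doc = row["doc"]
        if doc.get("position", -1) > deleted_position:
            doc["position"] -= 1
            changed.append(doc)  # the fetched doc already carries its _rev
    # One HTTP round trip for all updates.
    resp = requests.post(DB + "/_bulk_docs", json={"docs": changed})
    resp.raise_for_status()
    return len(changed)

Timing this function for a realistic number of tokens would give a feel for
whether the position-shifting approach is acceptable.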
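
And a sketch of the linked-list idea: each token document points to its
previous and next token, so an insertion or deletion only touches the two
neighbouring documents instead of everything that follows. Again, the field
names (data, prev, next) and the database URL are assumptions for
illustration only; boundary cases (inserting at the very start or end) are
left out.

import requests

DB = "http://localhost:5984/book"

def reconstruct(first_token_id):
    """Rebuild the token sequence by walking the next-pointers.
    One GET per token; in practice you would batch the reads."""
    tokens = []
    token_id = first_token_id
    while token_id:
        doc = requests.get(DB + "/" + token_id).json()
        tokens.append(doc["data"])
        token_id = doc.get("next")  # absent on the last token
    return tokens

def insert_between(new_id, data, prev_id, next_id):
    """Insert a new token between two existing ones.
    Only three documents are written, regardless of book length."""
    prev_doc = requests.get(DB + "/" + prev_id).json()
    next_doc = requests.get(DB + "/" + next_id).json()
    new_doc = {"_id": new_id, "data": data, "prev": prev_id, "next": next_id}
    prev_doc["next"] = new_id
    next_doc["prev"] = new_id
    # The two fetched neighbours include their _rev, so a single
    # _bulk_docs call can write all three documents at once.
    resp = requests.post(DB + "/_bulk_docs",
                         json={"docs": [new_doc, prev_doc, next_doc]})
    resp.raise_for_status()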
