Re: Using CouchDB to represent the tokenized text of a book

Nils Breunese Wed, 03 Nov 2010 09:37:24 -0700

Weston Ruter wrote:

> Specifically, I'm looking at books that are in a constant flux, i.e. books 
> that are being edited. The application here is for Bible translations in 
> particular, where each word token needs to be keyed into other metadata, like 
> link to source word, insertion datetime, translator, etc. Now that I think of 
> it, in order to be referencable, each token would have to exist as a separate 
> document anyway since parts of documents aren't indexed by ID, I wouldn't 
> think.


That's right. You'll definitely want to use a document per token here.
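
For illustration, a token document could look roughly like the sketch below. Only `_id` is CouchDB's own field; every other field name and all the ID values are made up for this example, based on the metadata you listed (link to source word, insertion datetime, translator).

```python
import json
from datetime import datetime, timezone


def make_token_doc(token_id, text, source_word_id, translator, inserted_at):
    """Build one token document as a plain dict (field names are hypothetical)."""
    return {
        "_id": token_id,                 # document ID that other data can link to
        "type": "token",                 # assumed convention for telling doc kinds apart
        "text": text,
        "source_word": source_word_id,   # link to the source word
        "translator": translator,
        "inserted_at": inserted_at.isoformat(),
    }


# Placeholder values, not real data.
doc = make_token_doc(
    "token-000123",
    "beginning",
    "source-word-0001",
    "translator-1",
    datetime(2010, 11, 3, 16, 37, 24, tzinfo=timezone.utc),
)
print(json.dumps(doc, indent=2))
```

Anything that needs to point at this token, such as a note or an alignment record, can then simply store "token-000123".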

> I never thought about using a linked list before for this application, good 
> idea. It would certainly speed up the update process, but it would make 
> retrieving all tokens for a structure between a start token and end very slow 
> as there would need to be a separate query for each of the tokens in the 
> structure to look up each next token to retrieve.

Yep, that's the trade-off of linked lists. O(1) for inserts once you have the 
neighbouring token, but O(n) to walk to a given position. Arrays are the other 
way around: O(1) for lookups by index, O(n) for inserts.
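
A minimal sketch of the linked-list side of that trade-off, using a plain dict as a stand-in for the database. The `next` field and the helper names are assumptions for this example, not anything CouchDB provides.

```python
# In-memory stand-in for the database: document ID -> token document.
db = {
    "t1": {"_id": "t1", "text": "In", "next": "t2"},
    "t2": {"_id": "t2", "text": "the", "next": "t3"},
    "t3": {"_id": "t3", "text": "beginning", "next": None},
}


def insert_after(db, prev_id, new_id, text):
    """Insert a token after prev_id; touches two documents whatever the list length."""
    prev = db[prev_id]
    db[new_id] = {"_id": new_id, "text": text, "next": prev["next"]}
    prev["next"] = new_id


def read_range(db, start_id, end_id):
    """Walk from start_id to end_id, counting one lookup per token visited."""
    texts, lookups, current = [], 0, start_id
    while current is not None:
        doc = db[current]
        lookups += 1
        texts.append(doc["text"])
        if current == end_id:
            break
        current = doc["next"]
    return texts, lookups


insert_after(db, "t2", "t2a", "very")
texts, lookups = read_range(db, "t1", "t3")
print(texts, lookups)
```

Against a real database each of those lookups would be a separate request, which is the slowness you describe.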

> As I mentioned above, metadata and related data are both going to be 
> externally attached to each token at various sources, so each token needs to 
> be referenced by ID. This fact alone invalidates a single-document approach 
> because parts of a document can't be linked to, correct?

Correct. Well, you could maybe construct a document with sections which have 
IDs of their own, but that doesn't sound very relaxing.

Nils Breunese.
------------------------------------------------------------------------
 VPRO
 phone:  +31(0)356712911
 e-mail: [email protected]
 web:    www.vpro.nl
------------------------------------------------------------------------
