Re: Using CouchDB to represent the tokenized text of a book

Dirkjan Ochtman Wed, 03 Nov 2010 06:17:10 -0700

On Wed, Nov 3, 2010 at 14:04, Weston Ruter <[email protected]> wrote:
> That is a good idea, but the problem with Bible translations in particular
> is the issue of overlapping hierarchies: like chapter and verse don't always
> fall along same divisions as section and paragraph. So the data model I've
> been moving toward is standoff markup, where there is a set of tokens
> (words, punctuation) for the entire book and then a set of structures
> (paragraphs, verses, etc) that refer to the start token and end token, so
> when getting a structure it needs to retrieve all tokens from start to end.
> The use of standoff markup and overlapping hierarchies makes your idea of
> using sorting buckets not feasible, I don't think. Thanks for the idea
> though!


Not sure I agree. My "buckets" are somewhat arbitrary and don't
actually have to map to any real structure. The trick is just that by
prefixing each token index with a bucket index, you don't have to
update all tokens anymore; you only have to update the tokens inside
the bucket (or in the next bucket too, if you happen to be moving a
token into it). Your standoff markup (I'm not really used to that
term, so no clue if I'm using it correctly) would still work, only you
would now reference tokens by bucket and token index, not just token
index.
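
A minimal sketch of the bucket idea, assuming a plain in-memory list of
buckets rather than actual CouchDB documents. The class name, method
names, and the bucket size of 4 are all hypothetical and only there to
show that an edit such as an insertion renumbers tokens in one bucket,
while a structure is still just a start key and an end key:

```python
class BucketedTokens:
    """Tokens addressed by (bucket, index) instead of one global index.

    Hypothetical illustration only; not a CouchDB API.
    """

    def __init__(self, words, bucket_size):
        # Split the flat token list into arbitrary fixed-size buckets.
        self.buckets = [
            list(words[i:i + bucket_size])
            for i in range(0, len(words), bucket_size)
        ]

    def insert(self, bucket, index, word):
        """Insert a token; return how many existing tokens were renumbered."""
        tokens = self.buckets[bucket]
        tokens.insert(index, word)
        # Only tokens after the insertion point in this bucket shift.
        return len(tokens) - index - 1

    def slice(self, start, end):
        """Return all tokens from key `start` to key `end`, inclusive.

        Both keys are (bucket, index) pairs, as a standoff structure
        (verse, paragraph, ...) would store them.
        """
        out = []
        for b in range(start[0], end[0] + 1):
            tokens = self.buckets[b]
            lo = start[1] if b == start[0] else 0
            hi = end[1] if b == end[0] else len(tokens) - 1
            out.extend(tokens[lo:hi + 1])
        return out


text = "In the beginning God created the heaven and the earth ."
book = BucketedTokens(text.split(), bucket_size=4)

# A structure that overlaps two buckets: from (0, 2) to (1, 1).
before = book.slice((0, 2), (1, 1))

# Inserting at the front of bucket 1 leaves buckets 0 and 2 untouched.
renumbered = book.insert(1, 0, "had")
after = book.slice((0, 2), (1, 1))
```

Note that any structure whose start or end key points into the edited
bucket after the insertion point would need its index adjusted as well;
keys into other buckets stay valid. In CouchDB the pair could presumably
be emitted from a view as an array key like [bucket, index], so fetching
a structure becomes a startkey/endkey range query.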

Cheers,

Dirkjan
