I am investigating alternative methods of storing the tokenized text of a book in a database. The text is broken up into individual tokens (e.g. a word, a punctuation mark, etc.), and each token is assigned a separate ID and exists as a separate object. There can be hundreds of thousands of token objects. The existing method I've employed in a relational database is a table Token(id, data, position), where position is an integer used to ensure the tokens are rendered in proper document order. The obvious problem with this use of "position" is insertions and deletions, which require an update on all subsequent tokens, and this is expensive. For example, after deleting a token:

UPDATE Token SET position = position - 1 WHERE position > old_token_position
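To make the cost concrete, here is a minimal sketch of the relational approach in an in-memory SQLite database (the table and column names follow the Token(id, data, position) schema above; the sample tokens are just illustrative):

```python
import sqlite3

# In-memory database with the Token(id, data, position) schema from the post.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Token (id INTEGER PRIMARY KEY, data TEXT, position INTEGER)"
)

tokens = ["The", "quick", "brown", "fox", "."]
conn.executemany(
    "INSERT INTO Token (data, position) VALUES (?, ?)",
    [(t, i) for i, t in enumerate(tokens)],
)

# Delete the token at position 1 ("quick") ...
conn.execute("DELETE FROM Token WHERE position = 1")

# ... and now every subsequent token must be renumbered. This is the
# expensive part: the UPDATE touches all rows after the deletion point.
cur = conn.execute("UPDATE Token SET position = position - 1 WHERE position > 1")
print(cur.rowcount)  # rows shifted: everything after the deleted token

rows = [r[0] for r in conn.execute("SELECT data FROM Token ORDER BY position")]
print(rows)
```

With five tokens only three rows are shifted, but with hundreds of thousands of tokens a deletion near the start of the book rewrites nearly the whole table.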
I was hoping that, with CouchDB's support for documents containing arrays, I could avoid an explicit position altogether and just rely on the implicit position each object has within the token array (this would also facilitate text revisions). In this approach, however, it is critical that the entire document not have to be re-loaded and re-saved when a change is made (as I imagine this would be even slower than the SQL UPDATE); I was hoping that an insertion or deletion could be applied as a patch so that it could be done efficiently. But from asking my question on Twitter, it appears that the approach I took with the relational database is also what would be required by CouchDB.

Is there a more elegant way to store my data set in CouchDB? Note that I am very new to CouchDB and am ignorant of many of its features.

Thanks!
Weston

Previous conversation on Twitter:

So let's say I have a huge CouchDB document, like an object with millions of properties. Are updates efficient i.e. can be patches?
http://twitter.com/#!/westonruter/status/29401798345

@westonruter you can use an _update function to update that saving wire-transport. But a doc that large sounds like wrong architecture.
http://twitter.com/#!/CouchDB/status/29410715036

@CouchDB I'm looking to represent the text of a book, where each word (token) is a discrete object in an ordered set. Best way to represent?
http://twitter.com/#!/westonruter/status/29415056023

@westonruter knee-jerk idea: import each word as a separate document and store the position number with it. Use a view to sort.
http://twitter.com/#!/CouchDB/status/29415298940

@westonruter but there may be smarter ways to do that, best to ask on the [email protected] mailing list: http://bit.ly/agv3ye
http://twitter.com/#!/CouchDB/status/29415412524

@CouchDB That's exactly what I was hoping to avoid: storing token positions, and instead just using the objects' implicit array positions.
http://twitter.com/#!/westonruter/status/29415975571

@westonruter Documents are atomic, you PUT the whole document each revision. (Can you use a view on a bunch of smaller documents instead?)
http://twitter.com/#!/natevw/status/29415990407

@westonruter got it. the megadoc may work out, but _update still reads it from disk into memory fully, so it's not "ideal".
http://twitter.com/#!/CouchDB/status/29416178621

@westonruter I might be the wrong tool for the job. -- But do write the mailing list to see what the others come up with :)
http://twitter.com/#!/CouchDB/status/29416215398

@westonruter OTOH if you don't care to make views on the data at all, you could split it into doc attachments (which can be PUT separately).
http://twitter.com/#!/natevw/status/29416254893

--
Weston Ruter
http://weston.ruter.net/
@westonruter <http://twitter.com/westonruter> - Google Profile <http://www.google.com/profiles/WestonRuter#about>
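P.S. For concreteness, the per-token-document idea suggested in the thread above (one small document per token, sorted by a view keyed on position) might look roughly like this. This is only a sketch emulated in plain Python, not actual CouchDB code; in real CouchDB the map function would emit(doc.position, doc.data) and view rows come back sorted by key:

```python
# One small document per token, each carrying an explicit position field.
# The _id values and sample words are made up for illustration.
docs = [
    {"_id": "token-3", "data": "fox", "position": 3},
    {"_id": "token-0", "data": "The", "position": 0},
    {"_id": "token-2", "data": "brown", "position": 2},
    {"_id": "token-1", "data": "quick", "position": 1},
]

def by_position_view(documents):
    """Emulate a CouchDB view keyed on position: emit (key, value)
    pairs and return them sorted by key, as a view query would."""
    rows = [(d["position"], d["data"]) for d in documents]
    return sorted(rows)

# Reassemble the text in document order from the view rows.
text = " ".join(value for _key, value in by_position_view(docs))
print(text)
```

Note that this still stores an explicit position per token, which is exactly what I was hoping to avoid; it only avoids the single mega-document (and the insert/delete renumbering problem remains).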
