Thanks a lot for the replies, Dirkjan and Nils. My replies inline below:
On Wed, Nov 3, 2010 at 11:35 AM, Dirkjan Ochtman <[email protected]> wrote:
On Wed, Nov 3, 2010 at 11:16, Weston Ruter <[email protected]> wrote:
I am investigating alternative methods of storing the tokenized text of a book in a database. The text is broken up into individual tokens (e.g. word, punctuation mark, etc.) and each is assigned a separate ID and exists as a separate object. There can be hundreds of thousands of token objects. The existing method I've employed in a relational database is to have a table Token(id, data, position), where position is an integer used to ensure the tokens are rendered in the proper document order. The obvious problem with this use of "position" is with insertions and deletions, which cause an update to be necessary on all subsequent tokens, and this is expensive, e.g. after deleting: UPDATE Token SET position = position - 1 WHERE position > old_token_position
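The deletion case described above can be sketched with an in-memory SQLite database; the table follows the Token(id, data, position) model from the message, and the sample words are purely illustrative:

```python
import sqlite3

# Illustrative Token(id, data, position) table from the message.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Token (id INTEGER PRIMARY KEY, data TEXT, position INTEGER)")
conn.executemany(
    "INSERT INTO Token (data, position) VALUES (?, ?)",
    [("In", 0), ("the", 1), ("beginning", 2), (",", 3)],
)

def delete_token(conn, position):
    # Deleting one token forces a shift of every subsequent position --
    # this is the expensive O(n) update the message complains about.
    conn.execute("DELETE FROM Token WHERE position = ?", (position,))
    conn.execute("UPDATE Token SET position = position - 1 WHERE position > ?", (position,))

delete_token(conn, 1)
print([row[0] for row in conn.execute("SELECT data FROM Token ORDER BY position")])
# -> ['In', 'beginning', ',']
```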
A "book" doesn't really sound like something that suffers from a lot
of insertions and deletions in the middle...
Specifically, I'm looking at books that are in constant flux, i.e. books that are being edited. The application here is for Bible translations in particular, where each word token needs to be keyed into other metadata, like a link to the source word, insertion datetime, translator, etc. Now that I think of it, in order to be referenceable, each token would have to exist as a separate document anyway, since parts of documents aren't indexed by ID, I wouldn't think.
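One token-per-document shape that would fit the metadata listed above might look like the following; every field name here is a guess for illustration (source-word link, insertion datetime, translator), not an established schema:

```python
# Hypothetical per-token document, so each token has its own ID that
# external metadata and cross-references can point at.
token_doc = {
    "_id": "token-000123",            # stable, referenceable token ID
    "data": "beginning",
    "source_word": "strongs-H7225",   # illustrative link to a source-language word
    "inserted_at": "2010-11-03T11:16:00Z",
    "translator": "weston",
}
print(token_doc["_id"])  # -> token-000123
```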
I was hoping that, with CouchDB's support for documents containing arrays, I could avoid an explicit position entirely and just rely on the implicit position each object has within the token array (this would also facilitate text revisions). In this approach, however, it is critical that the entire document not have to be re-loaded and re-saved when a change is made (as I imagine this would be even slower than a SQL UPDATE); I was hoping that an insertion or deletion could be done in a patch manner so that it could be done efficiently. But from asking my question on Twitter, it appears that the existing approach I took with the relational database is also what would be required by CouchDB.
Yeah, CouchDB doesn't support patching documents, so you'd have to
update the whole document. My gut feeling says you don't want a large
document here.
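The whole-document update Dirkjan describes can be illustrated with a toy in-memory stand-in for a document store with CouchDB-like semantics (whole-document GET/PUT, optimistic concurrency via a revision counter); this is a sketch of the semantics, not CouchDB's actual API:

```python
import copy

# Toy stand-in for a CouchDB-like store: documents are read and written
# whole, and a stale revision is rejected (like CouchDB's 409 conflict).
db = {}

def get(doc_id):
    return copy.deepcopy(db[doc_id])  # always returns the complete document

def put(doc_id, doc):
    current_rev = db.get(doc_id, {}).get("_rev", 0)
    if doc.get("_rev", 0) != current_rev:
        raise ValueError("update conflict")
    doc = dict(doc)
    doc["_rev"] = current_rev + 1
    db[doc_id] = doc
    return doc["_rev"]

# Even a one-token insertion is a whole-document round trip:
put("genesis", {"tokens": ["In", "beginning"]})
doc = get("genesis")
doc["tokens"].insert(1, "the")   # the only real change...
put("genesis", doc)              # ...but the entire array is re-saved
print(get("genesis")["tokens"])  # -> ['In', 'the', 'beginning']
```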
Is there a more elegant way to store my data set in CouchDB?
It sounds like you want to come up with a kind of index value that will prevent you from having to update all the documents (updating only a subset instead), and then use that value as a sorting bucket.
For instance, in your book model, save the page number with the word,
update all the other words on that page that come after it, sort by
page then word order. But this type of idea could work in either
relational databases or CouchDB. If pages are too large still, you
could add paragraphs (or add chapters before pages).
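Dirkjan's sorting-bucket idea can be sketched as a composite (page, order) key; the field names and sample data here are illustrative:

```python
# "Sorting bucket" sketch: each token carries a (page, order) pair, and
# documents sort by the composite key. An insertion renumbers only the
# tokens on the affected page, not the whole book.
tokens = [
    {"data": "In", "page": 1, "order": 0},
    {"data": "the", "page": 1, "order": 1},
    {"data": "beginning", "page": 2, "order": 0},
]

def insert_token(tokens, page, order, data):
    # Shift only the tokens on this one page that come after the
    # insertion point -- a much smaller update set.
    for t in tokens:
        if t["page"] == page and t["order"] >= order:
            t["order"] += 1
    tokens.append({"data": data, "page": page, "order": order})

insert_token(tokens, 1, 1, "very")
tokens.sort(key=lambda t: (t["page"], t["order"]))
print([t["data"] for t in tokens])  # -> ['In', 'very', 'the', 'beginning']
```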
That is a good idea, but the problem with Bible translations in particular is the issue of overlapping hierarchies: chapter and verse don't always fall along the same divisions as section and paragraph. So the data model I've been moving toward is standoff markup, where there is a set of tokens (words, punctuation) for the entire book and then a set of structures (paragraphs, verses, etc.) that refer to a start token and an end token, so getting a structure means retrieving all tokens from start to end. The use of standoff markup and overlapping hierarchies makes your idea of using sorting buckets infeasible, I think. Thanks for the idea though!
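The standoff-markup model described above can be sketched as follows: tokens are stored once in document order, and overlapping structures just point at start/end token IDs. All names and sample data are illustrative:

```python
# Standoff markup sketch: one token stream, plus structures that mark
# spans by token ID. Structures may overlap freely (verse vs. paragraph).
tokens = [  # (token_id, data), in document order
    ("t1", "In"), ("t2", "the"), ("t3", "beginning"), ("t4", "God"),
]
structures = [
    {"type": "verse", "ref": "Gen 1:1", "start": "t1", "end": "t4"},
    {"type": "paragraph", "start": "t1", "end": "t3"},  # overlaps the verse
]

def tokens_for(structure):
    # Retrieve every token from the structure's start to its end.
    ids = [tid for tid, _ in tokens]
    i, j = ids.index(structure["start"]), ids.index(structure["end"])
    return [data for _, data in tokens[i : j + 1]]

print(tokens_for(structures[1]))  # -> ['In', 'the', 'beginning']
```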
Cheers,
Dirkjan
On Wed, Nov 3, 2010 at 11:46 AM, Nils Breunese <[email protected]> wrote:
Weston Ruter wrote:
I am investigating alternative methods of storing the tokenized text of a book in a database. The text is broken up into individual tokens (e.g. word, punctuation mark, etc.) and each is assigned a separate ID and exists as a separate object. There can be hundreds of thousands of token objects. The existing method I've employed in a relational database is to have a table Token(id, data, position), where position is an integer used to ensure the tokens are rendered in the proper document order. The obvious problem with this use of "position" is with insertions and deletions, which cause an update to be necessary on all subsequent tokens, and this is expensive, e.g. after deleting: UPDATE Token SET position = position - 1 WHERE position > old_token_position
I was hoping that, with CouchDB's support for documents containing arrays, I could avoid an explicit position entirely and just rely on the implicit position each object has within the token array (this would also facilitate text revisions). In this approach, however, it is critical that the entire document not have to be re-loaded and re-saved when a change is made (as I imagine this would be even slower than a SQL UPDATE); I was hoping that an insertion or deletion could be done in a patch manner so that it could be done efficiently. But from asking my question on Twitter, it appears that the existing approach I took with the relational database is also what would be required by CouchDB.
That is correct. Storing the tokens of a book in an array in a single document would require retrieving, modifying and saving the complete document for a change. Storing the tokens as separate documents with an increasing ID would of course involve the same kind of updating as you are doing in your relational setup.
It sounds like a linked-list kind of storage scenario, where every token has pointers to the previous and next token, might better fit your needs for reconstructing a book from the tokens.
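Nils's linked-list suggestion can be sketched with per-token documents that store the ID of the next token; the document shape and IDs are illustrative. It makes the trade-off concrete: an insertion touches only two documents, but reading a range requires one lookup per token:

```python
# Linked-list sketch: each token document points at the next token's ID.
docs = {
    "t1": {"data": "In", "next": "t2"},
    "t2": {"data": "beginning", "next": None},
}

def insert_after(docs, prev_id, new_id, data):
    # O(1): only the new document and its predecessor are written.
    docs[new_id] = {"data": data, "next": docs[prev_id]["next"]}
    docs[prev_id]["next"] = new_id

def read_range(docs, start_id, end_id):
    # O(n) lookups: each token must be fetched to learn the next ID --
    # this is the slow structure retrieval mentioned in the reply below.
    out, cur = [], start_id
    while cur is not None:
        out.append(docs[cur]["data"])
        if cur == end_id:
            break
        cur = docs[cur]["next"]
    return out

insert_after(docs, "t1", "t3", "the")
print(read_range(docs, "t1", "t2"))  # -> ['In', 'the', 'beginning']
```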
I never thought about using a linked list for this application before; good idea. It would certainly speed up the update process, but it would make retrieving all tokens for a structure between a start token and an end token very slow, as there would need to be a separate query for each token in the structure to look up the next token to retrieve.
Is there a more elegant way to store my data set in CouchDB?
If I were to use CouchDB, I think I'd use a document per token. I'd test how expensive updating the sequence IDs is (using the HTTP Bulk Document API [0]) and, depending on how often sequence updates need to happen, I might switch to a linked-list kind of approach. (You could use the same in a relational database, of course.)
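Batching the position updates Nils mentions would go through CouchDB's _bulk_docs endpoint (POST /db/_bulk_docs with a {"docs": [...]} body) rather than one PUT per token. The sketch below only builds the payload; the database URL, document IDs and _rev values are illustrative, and each document must carry its current _rev or CouchDB reports a conflict for that entry:

```python
import json

# Illustrative batch of renumbered token documents for _bulk_docs.
updated = [
    {"_id": "token-42", "_rev": "1-abc", "data": "beginning", "position": 2},
    {"_id": "token-43", "_rev": "1-def", "data": ",", "position": 3},
]
payload = json.dumps({"docs": updated})

# Against a real server one would POST this body, e.g. with an HTTP client:
#   POST http://localhost:5984/book/_bulk_docs
#   Content-Type: application/json
print(payload)
```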
Are you planning on storing more than just the tokens and their order? If not, I'm wondering what the use of storing a book as a list of tokens actually is. Sounds like a plain text file would do the job as well, but I'm sure there is a point. :o)
As I mentioned above, metadata and related data are both going to be externally attached to each token from various sources, so each token needs to be referenced by ID. This fact alone invalidates a single-document approach, because parts of a document can't be linked to, correct?
Note that I am very new to CouchDB and am ignorant of a lot of its features.
The Definitive Guide [1] is a nice read.
Thanks for the advice!
Nils.
[0] http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API
[1] http://guide.couchdb.org/
------------------------------------------------------------------------
VPRO
phone: +31(0)356712911
e-mail: [email protected]
web: www.vpro.nl
------------------------------------------------------------------------