Thanks a lot for the replies, Dirkjan and Nils. My replies inline below:
On Wed, Nov 3, 2010 at 11:35 AM, Dirkjan Ochtman <[email protected]> wrote:
On Wed, Nov 3, 2010 at 11:16, Weston Ruter <[email protected]> wrote:
I am investigating alternative methods of storing the tokenized text of a book in a database. The text is broken up into individual tokens (e.g. word, punctuation mark, etc.) and each is assigned a separate ID and exists as a separate object. There can be hundreds of thousands of token objects. The existing method I've employed in a relational database is to have a table Token(id, data, position), where position is an integer used to ensure the tokens are rendered in the proper document order. The obvious problem with this use of "position" is with insertions and deletions, which cause an update to be necessary on all subsequent tokens, and this is expensive, e.g. after deleting: UPDATE Token SET position = position - 1 WHERE position > old_token_position
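The deletion case described above can be sketched with an in-memory SQLite database; the table follows the Token(id, data, position) model from the message, and the sample words are purely illustrative:

```python
import sqlite3

# Illustrative Token(id, data, position) table from the message.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Token (id INTEGER PRIMARY KEY, data TEXT, position INTEGER)")
conn.executemany(
    "INSERT INTO Token (data, position) VALUES (?, ?)",
    [("In", 0), ("the", 1), ("beginning", 2), (",", 3)],
)

def delete_token(conn, position):
    # Deleting one token forces a shift of every subsequent position --
    # this is the expensive O(n) update the message complains about.
    conn.execute("DELETE FROM Token WHERE position = ?", (position,))
    conn.execute("UPDATE Token SET position = position - 1 WHERE position > ?", (position,))

delete_token(conn, 1)
print([row[0] for row in conn.execute("SELECT data FROM Token ORDER BY position")])
# -> ['In', 'beginning', ',']
```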
A "book" doesn't really sound like something that suffers from a lot
of insertions and deletions in the middle...
Specifically, I'm looking at books that are in constant flux, i.e. books that are being edited. The application here is for Bible translations in particular, where each word token needs to be keyed into other metadata, like a link to the source word, insertion datetime, translator, etc. Now that I think of it, in order to be referenceable, each token would have to exist as a separate document anyway, since parts of documents aren't indexed by ID, I wouldn't think.
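One token-per-document shape that would fit the metadata listed above might look like the following; every field name here is a guess for illustration (source-word link, insertion datetime, translator), not an established schema:

```python
# Hypothetical per-token document, so each token has its own ID that
# external metadata and cross-references can point at.
token_doc = {
    "_id": "token-000123",            # stable, referenceable token ID
    "data": "beginning",
    "source_word": "strongs-H7225",   # illustrative link to a source-language word
    "inserted_at": "2010-11-03T11:16:00Z",
    "translator": "weston",
}
print(token_doc["_id"])  # -> token-000123
```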
I was hoping that, with CouchDB's support for documents containing arrays, I could avoid an explicit position entirely and just rely on the implicit position each object has within the token array (this would also facilitate text revisions). In this approach, however, it is critical that the entire document not have to be re-loaded and re-saved when a change is made (as I imagine this would be even slower than a SQL UPDATE); I was hoping that an insertion or deletion could be done in a patch manner so that it could be done efficiently. But from asking my question on Twitter, it appears that the existing approach I took with the relational database is also what would be required by CouchDB.
Yeah, CouchDB doesn't support patching documents, so you'd have to
update the whole document. My gut feeling says you don't want a large
document here.
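The whole-document update Dirkjan describes can be illustrated with a toy in-memory stand-in for a document store with CouchDB-like semantics (whole-document GET/PUT, optimistic concurrency via a revision counter); this is a sketch of the semantics, not CouchDB's actual API:

```python
import copy

# Toy stand-in for a CouchDB-like store: documents are read and written
# whole, and a stale revision is rejected (like CouchDB's 409 conflict).
db = {}

def get(doc_id):
    return copy.deepcopy(db[doc_id])  # always returns the complete document

def put(doc_id, doc):
    current_rev = db.get(doc_id, {}).get("_rev", 0)
    if doc.get("_rev", 0) != current_rev:
        raise ValueError("update conflict")
    doc = dict(doc)
    doc["_rev"] = current_rev + 1
    db[doc_id] = doc
    return doc["_rev"]

# Even a one-token insertion is a whole-document round trip:
put("genesis", {"tokens": ["In", "beginning"]})
doc = get("genesis")
doc["tokens"].insert(1, "the")   # the only real change...
put("genesis", doc)              # ...but the entire array is re-saved
print(get("genesis")["tokens"])  # -> ['In', 'the', 'beginning']
```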
Is there a more elegant way to store my data set in CouchDB?
It sounds like you want to come up with a kind of index value that will prevent you from having to update all the documents (updating only a subset instead), and then use that value as a sorting bucket.
For instance, in your book model, save the page number with the word,
update all the other words on that page that come after it, sort by
page then word order. But this type of idea could work in either
relational databases or CouchDB. If pages are too large still, you
could add paragraphs (or add chapters before pages).
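Dirkjan's sorting-bucket idea can be sketched as a composite (page, order) key; the field names and sample data here are illustrative:

```python
# "Sorting bucket" sketch: each token carries a (page, order) pair, and
# documents sort by the composite key. An insertion renumbers only the
# tokens on the affected page, not the whole book.
tokens = [
    {"data": "In", "page": 1, "order": 0},
    {"data": "the", "page": 1, "order": 1},
    {"data": "beginning", "page": 2, "order": 0},
]

def insert_token(tokens, page, order, data):
    # Shift only the tokens on this one page that come after the
    # insertion point -- a much smaller update set.
    for t in tokens:
        if t["page"] == page and t["order"] >= order:
            t["order"] += 1
    tokens.append({"data": data, "page": page, "order": order})

insert_token(tokens, 1, 1, "very")
tokens.sort(key=lambda t: (t["page"], t["order"]))
print([t["data"] for t in tokens])  # -> ['In', 'very', 'the', 'beginning']
```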
That is a good idea, but the problem with Bible translations in particular is the issue of overlapping hierarchies: chapter and verse don't always fall along the same divisions as section and paragraph. So the data model I've been moving toward is standoff markup, where there is a set of tokens (words, punctuation) for the entire book and then a set of structures (paragraphs, verses, etc.) that refer to a start token and an end token, so getting a structure means retrieving all tokens from start to end. The use of standoff markup and overlapping hierarchies makes your idea of using sorting buckets infeasible, I think. Thanks for the idea though!
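The standoff-markup model described above can be sketched as follows: tokens are stored once in document order, and overlapping structures just point at start/end token IDs. All names and sample data are illustrative:

```python
# Standoff markup sketch: one token stream, plus structures that mark
# spans by token ID. Structures may overlap freely (verse vs. paragraph).
tokens = [  # (token_id, data), in document order
    ("t1", "In"), ("t2", "the"), ("t3", "beginning"), ("t4", "God"),
]
structures = [
    {"type": "verse", "ref": "Gen 1:1", "start": "t1", "end": "t4"},
    {"type": "paragraph", "start": "t1", "end": "t3"},  # overlaps the verse
]

def tokens_for(structure):
    # Retrieve every token from the structure's start to its end.
    ids = [tid for tid, _ in tokens]
    i, j = ids.index(structure["start"]), ids.index(structure["end"])
    return [data for _, data in tokens[i : j + 1]]

print(tokens_for(structures[1]))  # -> ['In', 'the', 'beginning']
```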
Cheers,
Dirkjan
On Wed, Nov 3, 2010 at 11:46 AM, Nils Breunese <[email protected]> wrote:
Weston Ruter wrote:
I am investigating alternative methods of storing the tokenized text of a book in a database. The text is broken up into individual tokens (e.g. word, punctuation mark, etc.) and each is assigned a separate ID and exists as a separate object. There can be hundreds of thousands of token objects. The existing method I've employed in a relational database is to have a table Token(id, data, position), where position is an integer used to ensure the tokens are rendered in the proper document order. The obvious problem with this use of "position" is with insertions and deletions, which cause an update to be necessary on all subsequent tokens, and this is expensive, e.g. after deleting: UPDATE Token SET position = position - 1 WHERE position > old_token_position
I was hoping that, with CouchDB's support for documents containing arrays, I could avoid an explicit position entirely and just rely on the implicit position each object has within the token array (this would also facilitate text revisions). In this approach, however, it is critical that the entire document not have to be re-loaded and re-saved when a change is made (as I imagine this would be even slower than a SQL UPDATE); I was hoping that an insertion or deletion could be done in a patch manner so that it could be done efficiently. But from asking my question on Twitter, it appears that the existing approach I took with the relational database is also what would be required by CouchDB.
That is correct. Storing the tokens of a book in an array in a single document would require retrieving, modifying and saving the complete document for a change. Storing the tokens as separate documents with an increasing ID would of course involve the same kind of updating as you are doing in your relational setup.
It sounds like a linked-list kind of storage scenario, where every token has pointers to the previous and next token, might better fit your needs for reconstructing a book from the tokens.
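Nils's linked-list suggestion can be sketched with per-token documents that store the ID of the next token; the document shape and IDs are illustrative. It makes the trade-off concrete: an insertion touches only two documents, but reading a range requires one lookup per token:

```python
# Linked-list sketch: each token document points at the next token's ID.
docs = {
    "t1": {"data": "In", "next": "t2"},
    "t2": {"data": "beginning", "next": None},
}

def insert_after(docs, prev_id, new_id, data):
    # O(1): only the new document and its predecessor are written.
    docs[new_id] = {"data": data, "next": docs[prev_id]["next"]}
    docs[prev_id]["next"] = new_id

def read_range(docs, start_id, end_id):
    # O(n) lookups: each token must be fetched to learn the next ID --
    # this is the slow structure retrieval mentioned in the reply below.
    out, cur = [], start_id
    while cur is not None:
        out.append(docs[cur]["data"])
        if cur == end_id:
            break
        cur = docs[cur]["next"]
    return out

insert_after(docs, "t1", "t3", "the")
print(read_range(docs, "t1", "t2"))  # -> ['In', 'the', 'beginning']
```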
I never thought about using a linked list for this application before; good idea. It would certainly speed up the update process, but it would make retrieving all tokens for a structure between a start token and an end token very slow, as there would need to be a separate query for each token in the structure to look up the next token to retrieve.
Is there a more elegant way to store my data set in CouchDB?
If I were to use CouchDB, I think I'd use a document per token. I'd test how expensive updating the sequence IDs is (using the HTTP Bulk Document API [0]) and, depending on how often sequence updates need to happen, I might switch to a linked-list kind of approach. (You could use the same in a relational database, of course.)
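Batching the position updates Nils mentions would go through CouchDB's _bulk_docs endpoint (POST /db/_bulk_docs with a {"docs": [...]} body) rather than one PUT per token. The sketch below only builds the payload; the database URL, document IDs and _rev values are illustrative, and each document must carry its current _rev or CouchDB reports a conflict for that entry:

```python
import json

# Illustrative batch of renumbered token documents for _bulk_docs.
updated = [
    {"_id": "token-42", "_rev": "1-abc", "data": "beginning", "position": 2},
    {"_id": "token-43", "_rev": "1-def", "data": ",", "position": 3},
]
payload = json.dumps({"docs": updated})

# Against a real server one would POST this body, e.g. with an HTTP client:
#   POST http://localhost:5984/book/_bulk_docs
#   Content-Type: application/json
print(payload)
```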
Are you planning on storing more than just the tokens and their order? If not, I'm wondering what the use of storing a book as a list of tokens actually is. Sounds like a plain text file would do the job as well, but I'm sure there is a point. :o)
As I mentioned above, metadata and related data are both going to be externally attached to each token from various sources, so each token needs to be referenced by ID. This fact alone invalidates a single-document approach, because parts of a document can't be linked to, correct?
Note that I am very new to CouchDB and am ignorant of a lot of its features.
The Definitive Guide [1] is a nice read.
Thanks for the advice!
Nils.
[0] http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API
[1] http://guide.couchdb.org/
------------------------------------------------------------------------
VPRO
phone: +31(0)356712911
e-mail: [email protected]
web: www.vpro.nl
------------------------------------------------------------------------