Beware real numbers: https://issues.apache.org/jira/browse/COUCHDB-227

FB

On Wed, Nov 3, 2010 at 10:42 AM, Kevin R. Coombes <[email protected]> wrote:

> Why not avoid rewriting the subset at all?
>
> Use position as a real number instead of an integer. An insertion between
> positions 100 and 101 can be assigned the position 100.5. Since you can
> always find something between any two positions to make an insertion, you
> don't have to update anything else. For deletions, you just have to allow
> gaps; the real-number positions indicate this anyway. You recover a
> coherent section of the book (in CouchDB fashion) by specifying the
> startkey and endkey on a query based on the (real-number) position.
>
> Kevin
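A minimal Python sketch of the midpoint scheme (all names are illustrative) shows both the trick and the hazard in the ticket above: repeated insertion between the same two neighbours exhausts the precision of an IEEE-754 double.

    def position_between(before, after):
        """Pick a position strictly between two existing token positions."""
        mid = (before + after) / 2.0
        if mid == before or mid == after:
            # The gap has collapsed below double precision; at this point
            # the neighbouring tokens must be renumbered after all.
            raise ValueError("no representable position between "
                             "%r and %r" % (before, after))
        return mid

    print(position_between(100.0, 101.0))  # 100.5, as in the example above

    # Inserting repeatedly at the same spot halves the gap each time and
    # fails after roughly fifty rounds:
    lo, hi, rounds = 100.0, 101.0, 0
    try:
        while True:
            hi = position_between(lo, hi)
            rounds += 1
    except ValueError:
        print("gap exhausted after", rounds, "insertions")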
> On 11/3/2010 8:04 AM, Weston Ruter wrote:
>
>> Thanks a lot for the replies, Dirkjan and Nils. My replies are inline
>> below.
>>
>> On Wed, Nov 3, 2010 at 11:35 AM, Dirkjan Ochtman <[email protected]> wrote:
>>
>>> On Wed, Nov 3, 2010 at 11:16, Weston Ruter <[email protected]> wrote:
>>>
>>>> I am investigating alternative methods of storing the tokenized text
>>>> of a book in a database. The text is broken up into individual tokens
>>>> (e.g. word, punctuation mark, etc.), and each is assigned a separate
>>>> ID and exists as a separate object. There can be hundreds of thousands
>>>> of token objects. The existing method I've employed in a relational
>>>> database is to have a table Token(id, data, position), where position
>>>> is an integer used to ensure the tokens are rendered in the proper
>>>> document order. The obvious problem with this use of "position" is
>>>> with insertions and deletions, which cause an update to be necessary
>>>> on all subsequent tokens, and this is expensive, e.g. after deleting:
>>>>
>>>>     UPDATE Token SET position = position - 1
>>>>     WHERE position > old_token_position
>>>
>>> A "book" doesn't really sound like something that suffers from a lot
>>> of insertions and deletions in the middle...
>>
>> Specifically, I'm looking at books that are in constant flux, i.e. books
>> that are being edited. The application here is for Bible translations in
>> particular, where each word token needs to be keyed into other metadata:
>> a link to the source word, insertion datetime, translator, etc. Now that
>> I think of it, in order to be referenceable, each token would have to
>> exist as a separate document anyway, since parts of documents aren't
>> indexed by ID, I wouldn't think.
>>
>>>> I was hoping that, with CouchDB's support for documents containing
>>>> arrays, I could avoid an explicit position altogether and just rely on
>>>> the implicit position each object has in the context of where it lies
>>>> in the token array (this would also facilitate text revisions). In
>>>> this approach, however, it is critical that the entire document not
>>>> have to be re-loaded and re-saved when a change is made (as I imagine
>>>> this would be even slower than an SQL UPDATE); I was hoping that an
>>>> insertion or deletion could be done in a patch manner so that it could
>>>> be handled efficiently. But from asking my question on Twitter, it
>>>> appears that the existing approach I took with the relational database
>>>> is also what would be required by CouchDB.
>>>
>>> Yeah, CouchDB doesn't support patching documents, so you'd have to
>>> update the whole document. My gut feeling says you don't want a large
>>> document here.
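What "update the whole document" costs in the one-big-document model, as a rough Python sketch; the requests library and all URL and field names here are assumptions for illustration, not an established client:

    import requests

    DB = "http://localhost:5984/books"

    def insert_token(doc_id, index, token):
        doc = requests.get("%s/%s" % (DB, doc_id)).json()  # fetch everything
        doc["tokens"].insert(index, token)                 # one-element edit
        # The PUT resends the entire token array, and must carry the _rev
        # returned by the GET or CouchDB answers 409 Conflict.
        requests.put("%s/%s" % (DB, doc_id), json=doc).raise_for_status()

    # A one-word edit ships hundreds of thousands of tokens over the wire:
    # insert_token("gen-1", 42, {"id": "tok-123", "data": "beginning"})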
>>>> Is there a more elegant way to store my data set in CouchDB?
>>>
>>> It sounds like you want to come up with a kind of index value that will
>>> prevent you from having to update all the documents (but update a
>>> subset indeed), and then use that value as a sorting bucket.
>>>
>>> For instance, in your book model, save the page number with the word,
>>> update all the other words on that page that come after it, and sort by
>>> page, then word order. But this type of idea could work in either
>>> relational databases or CouchDB. If pages are still too large, you
>>> could add paragraphs (or add chapters before pages).
>>
>> That is a good idea, but the problem with Bible translations in
>> particular is the issue of overlapping hierarchies: chapter and verse
>> don't always fall along the same divisions as section and paragraph. So
>> the data model I've been moving toward is standoff markup, where there
>> is a set of tokens (words, punctuation) for the entire book and then a
>> set of structures (paragraphs, verses, etc.) that refer to a start token
>> and an end token, so getting a structure means retrieving all the tokens
>> from start to end. The use of standoff markup and overlapping
>> hierarchies makes your idea of using sorting buckets not feasible, I
>> don't think. Thanks for the idea though!
>>
>>> Cheers,
>>>
>>> Dirkjan
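For reference, the sorting-bucket idea might look like the following as a CouchDB view (a sketch only; the design document, field names, and the "page" bucket are illustrative). Each token carries a coarse bucket plus a position within it, so an insertion renumbers only the tokens of its own page:

    import requests

    DB = "http://localhost:5984/books"

    design = {
        "_id": "_design/tokens",
        "views": {
            "by_page_position": {
                # Composite key: collates by page first, then by position.
                "map": "function(doc) { if (doc.type === 'token') "
                       "emit([doc.page, doc.position], doc.data); }"
            }
        }
    }
    requests.put(DB + "/_design/tokens", json=design).raise_for_status()

    # Tokens of page 12 in document order ({} collates after all numbers):
    rows = requests.get(
        DB + "/_design/tokens/_view/by_page_position",
        params={"startkey": "[12,0]", "endkey": "[12,{}]"},
    ).json()["rows"]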
>> On Wed, Nov 3, 2010 at 11:46 AM, Nils Breunese <[email protected]> wrote:
>>
>>> Weston Ruter wrote:
>>>
>>>> I am investigating alternative methods of storing the tokenized text
>>>> of a book in a database. [...] It appears that the existing approach I
>>>> took with the relational database is also what would be required by
>>>> CouchDB.
>>>
>>> That is correct. Storing the tokens of a book in an array in a single
>>> document would require retrieving, modifying and saving the complete
>>> document for a change. Storing the tokens as separate documents with an
>>> increasing ID would of course involve the same kind of updating as you
>>> are doing in your relational setup.
>>>
>>> It sounds like a linked-list kind of storage scenario, where every
>>> token has pointers to the previous and next token, might better fit
>>> your needs for reconstructing a book from the tokens.
>>
>> I had never thought about using a linked list for this application; good
>> idea. It would certainly speed up the update process, but it would make
>> retrieving all the tokens of a structure between a start token and an
>> end token very slow, as a separate query would be needed for each token
>> in the structure just to look up which token to retrieve next.
>>
>>>> Is there a more elegant way to store my data set in CouchDB?
>>>
>>> If I were to use CouchDB, I think I'd use a document per token. I'd
>>> test how expensive updating the sequence IDs is (using the HTTP bulk
>>> document API [0]) and, depending on how often sequence updates need to
>>> happen, I might switch to a linked-list kind of approach. (You could
>>> use the same in a relational database, of course.)
>>>
>>> Are you planning on storing more than just the tokens and their order?
>>> If not, I'm wondering what the use of storing a book as a list of
>>> tokens actually is. It sounds like a plain text file would do the job
>>> as well, but I'm sure there is a point. :o)
>>
>> As I mentioned above, metadata and related data are both going to be
>> externally attached to each token from various sources, so each token
>> needs to be referenceable by ID. This fact alone invalidates a
>> single-document approach, because parts of a document can't be linked
>> to, correct?
>>
>>>> Note that I am very new to CouchDB and am ignorant of a lot of its
>>>> features.
>>>
>>> The Definitive Guide [1] is a nice read.
>>
>> Thanks for the advice!
>>
>>> Nils.
>>>
>>> [0] http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API
>>> [1] http://guide.couchdb.org/
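The measurement Nils suggests could start from a sketch like this one: renumbering every token after a deletion in a single round trip through the bulk document API [0]. The "by_position" view and the token schema are assumptions on top of the one-document-per-token model:

    import requests

    DB = "http://localhost:5984/books"

    def shift_positions_after(old_position):
        # Fetch every token past the deleted one (assumes a view keyed on
        # doc.position, with the full documents included).
        rows = requests.get(
            DB + "/_design/tokens/_view/by_position",
            params={"startkey": str(old_position + 1),
                    "include_docs": "true"},
        ).json()["rows"]
        docs = [row["doc"] for row in rows]
        for doc in docs:
            doc["position"] -= 1  # the SQL "SET position = position - 1"
        # One POST instead of one PUT per token; each fetched doc already
        # carries the _id/_rev that CouchDB needs to apply the update.
        requests.post(DB + "/_bulk_docs",
                      json={"docs": docs}).raise_for_status()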
