Beware real numbers: https://issues.apache.org/jira/browse/COUCHDB-227

FB

On Wed, Nov 3, 2010 at 10:42 AM, Kevin R. Coombes <[email protected]> wrote:

> Why not avoid rewriting the subset at all?
>
> Use position as a real number instead of an integer. An insertion between
> positions 100 and 101 can be assigned the position 100.5. Since you can
> always find something between any two positions to make an insertion, you
> don't have to update anything else. For deletions, you just have to allow
> gaps; the real-number positions indicate this anyway. You recover a
> coherent section of the book (in CouchDB fashion) by specifying the
> startkey and endkey on a query based on the (real-number) position.
>
> Kevin
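A minimal Python sketch of the midpoint scheme (all names are illustrative) shows both the trick and the hazard in the ticket above: repeated insertion between the same two neighbours exhausts the precision of an IEEE-754 double.

    def position_between(before, after):
        """Pick a position strictly between two existing token positions."""
        mid = (before + after) / 2.0
        if mid == before or mid == after:
            # The gap has collapsed below double precision; at this point
            # the neighbouring tokens must be renumbered after all.
            raise ValueError("no representable position between "
                             "%r and %r" % (before, after))
        return mid

    print(position_between(100.0, 101.0))  # 100.5, as in the example above

    # Inserting repeatedly at the same spot halves the gap each time and
    # fails after roughly fifty rounds:
    lo, hi, rounds = 100.0, 101.0, 0
    try:
        while True:
            hi = position_between(lo, hi)
            rounds += 1
    except ValueError:
        print("gap exhausted after", rounds, "insertions")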
> On 11/3/2010 8:04 AM, Weston Ruter wrote:
>
>> Thanks a lot for the replies, Dirkjan and Nils. My replies are inline
>> below.
>>
>> On Wed, Nov 3, 2010 at 11:35 AM, Dirkjan Ochtman <[email protected]> wrote:
>>
>>> On Wed, Nov 3, 2010 at 11:16, Weston Ruter <[email protected]> wrote:
>>>
>>>> I am investigating alternative methods of storing the tokenized text
>>>> of a book in a database. The text is broken up into individual tokens
>>>> (e.g. word, punctuation mark, etc.), and each is assigned a separate
>>>> ID and exists as a separate object. There can be hundreds of thousands
>>>> of token objects. The existing method I've employed in a relational
>>>> database is to have a table Token(id, data, position), where position
>>>> is an integer used to ensure the tokens are rendered in the proper
>>>> document order. The obvious problem with this use of "position" is
>>>> with insertions and deletions, which cause an update to be necessary
>>>> on all subsequent tokens, and this is expensive, e.g. after deleting:
>>>>
>>>>     UPDATE Token SET position = position - 1
>>>>     WHERE position > old_token_position
>>>
>>> A "book" doesn't really sound like something that suffers from a lot
>>> of insertions and deletions in the middle...
>>
>> Specifically, I'm looking at books that are in constant flux, i.e. books
>> that are being edited. The application here is for Bible translations in
>> particular, where each word token needs to be keyed into other metadata:
>> a link to the source word, insertion datetime, translator, etc. Now that
>> I think of it, in order to be referenceable, each token would have to
>> exist as a separate document anyway, since parts of documents aren't
>> indexed by ID, I wouldn't think.
>>
>>>> I was hoping that, with CouchDB's support for documents containing
>>>> arrays, I could avoid an explicit position altogether and just rely on
>>>> the implicit position each object has in the context of where it lies
>>>> in the token array (this would also facilitate text revisions). In
>>>> this approach, however, it is critical that the entire document not
>>>> have to be re-loaded and re-saved when a change is made (as I imagine
>>>> this would be even slower than an SQL UPDATE); I was hoping that an
>>>> insertion or deletion could be done in a patch manner so that it could
>>>> be handled efficiently. But from asking my question on Twitter, it
>>>> appears that the existing approach I took with the relational database
>>>> is also what would be required by CouchDB.
>>>
>>> Yeah, CouchDB doesn't support patching documents, so you'd have to
>>> update the whole document. My gut feeling says you don't want a large
>>> document here.
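What "update the whole document" costs in the one-big-document model, as a rough Python sketch; the requests library and all URL and field names here are assumptions for illustration, not an established client:

    import requests

    DB = "http://localhost:5984/books"

    def insert_token(doc_id, index, token):
        doc = requests.get("%s/%s" % (DB, doc_id)).json()  # fetch everything
        doc["tokens"].insert(index, token)                 # one-element edit
        # The PUT resends the entire token array, and must carry the _rev
        # returned by the GET or CouchDB answers 409 Conflict.
        requests.put("%s/%s" % (DB, doc_id), json=doc).raise_for_status()

    # A one-word edit ships hundreds of thousands of tokens over the wire:
    # insert_token("gen-1", 42, {"id": "tok-123", "data": "beginning"})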
>>>> Is there a more elegant way to store my data set in CouchDB?
>>>
>>> It sounds like you want to come up with a kind of index value that will
>>> prevent you from having to update all the documents (but update a
>>> subset indeed), and then use that value as a sorting bucket.
>>>
>>> For instance, in your book model, save the page number with the word,
>>> update all the other words on that page that come after it, and sort by
>>> page, then word order. But this type of idea could work in either
>>> relational databases or CouchDB. If pages are still too large, you
>>> could add paragraphs (or add chapters before pages).
>>
>> That is a good idea, but the problem with Bible translations in
>> particular is the issue of overlapping hierarchies: chapter and verse
>> don't always fall along the same divisions as section and paragraph. So
>> the data model I've been moving toward is standoff markup, where there
>> is a set of tokens (words, punctuation) for the entire book and then a
>> set of structures (paragraphs, verses, etc.) that refer to a start token
>> and an end token, so getting a structure means retrieving all the tokens
>> from start to end. The use of standoff markup and overlapping
>> hierarchies makes your idea of using sorting buckets not feasible, I
>> don't think. Thanks for the idea though!
>>
>>> Cheers,
>>>
>>> Dirkjan
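For reference, the sorting-bucket idea might look like the following as a CouchDB view (a sketch only; the design document, field names, and the "page" bucket are illustrative). Each token carries a coarse bucket plus a position within it, so an insertion renumbers only the tokens of its own page:

    import requests

    DB = "http://localhost:5984/books"

    design = {
        "_id": "_design/tokens",
        "views": {
            "by_page_position": {
                # Composite key: collates by page first, then by position.
                "map": "function(doc) { if (doc.type === 'token') "
                       "emit([doc.page, doc.position], doc.data); }"
            }
        }
    }
    requests.put(DB + "/_design/tokens", json=design).raise_for_status()

    # Tokens of page 12 in document order ({} collates after all numbers):
    rows = requests.get(
        DB + "/_design/tokens/_view/by_page_position",
        params={"startkey": "[12,0]", "endkey": "[12,{}]"},
    ).json()["rows"]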
>> On Wed, Nov 3, 2010 at 11:46 AM, Nils Breunese <[email protected]> wrote:
>>
>>> Weston Ruter wrote:
>>>
>>>> I am investigating alternative methods of storing the tokenized text
>>>> of a book in a database. [...] It appears that the existing approach I
>>>> took with the relational database is also what would be required by
>>>> CouchDB.
>>>
>>> That is correct. Storing the tokens of a book in an array in a single
>>> document would require retrieving, modifying and saving the complete
>>> document for a change. Storing the tokens as separate documents with an
>>> increasing ID would of course involve the same kind of updating as you
>>> are doing in your relational setup.
>>>
>>> It sounds like a linked-list kind of storage scenario, where every
>>> token has pointers to the previous and next token, might better fit
>>> your needs for reconstructing a book from the tokens.
>>
>> I had never thought about using a linked list for this application; good
>> idea. It would certainly speed up the update process, but it would make
>> retrieving all the tokens of a structure between a start token and an
>> end token very slow, as a separate query would be needed for each token
>> in the structure just to look up which token to retrieve next.
>>
>>>> Is there a more elegant way to store my data set in CouchDB?
>>>
>>> If I were to use CouchDB, I think I'd use a document per token. I'd
>>> test how expensive updating the sequence IDs is (using the HTTP bulk
>>> document API [0]) and, depending on how often sequence updates need to
>>> happen, I might switch to a linked-list kind of approach. (You could
>>> use the same in a relational database, of course.)
>>>
>>> Are you planning on storing more than just the tokens and their order?
>>> If not, I'm wondering what the use of storing a book as a list of
>>> tokens actually is. It sounds like a plain text file would do the job
>>> as well, but I'm sure there is a point. :o)
>>
>> As I mentioned above, metadata and related data are both going to be
>> externally attached to each token from various sources, so each token
>> needs to be referenceable by ID. This fact alone invalidates a
>> single-document approach, because parts of a document can't be linked
>> to, correct?
>>
>>>> Note that I am very new to CouchDB and am ignorant of a lot of its
>>>> features.
>>>
>>> The Definitive Guide [1] is a nice read.
>>
>> Thanks for the advice!
>>
>>> Nils.
>>>
>>> [0] http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API
>>> [1] http://guide.couchdb.org/
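The measurement Nils suggests could start from a sketch like this one: renumbering every token after a deletion in a single round trip through the bulk document API [0]. The "by_position" view and the token schema are assumptions on top of the one-document-per-token model:

    import requests

    DB = "http://localhost:5984/books"

    def shift_positions_after(old_position):
        # Fetch every token past the deleted one (assumes a view keyed on
        # doc.position, with the full documents included).
        rows = requests.get(
            DB + "/_design/tokens/_view/by_position",
            params={"startkey": str(old_position + 1),
                    "include_docs": "true"},
        ).json()["rows"]
        docs = [row["doc"] for row in rows]
        for doc in docs:
            doc["position"] -= 1  # the SQL "SET position = position - 1"
        # One POST instead of one PUT per token; each fetched doc already
        # carries the _id/_rev that CouchDB needs to apply the update.
        requests.post(DB + "/_bulk_docs",
                      json={"docs": docs}).raise_for_status()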
