Re: [ANNOUNCE] Alexander Shorin joins the PMC
Congrats Alexander; a very passionate and intelligent engineer that CouchDB is lucky to have engaged. On Fri, Feb 21, 2014 at 1:08 PM, Noah Slater nsla...@apache.org wrote: Dear community, I am delighted to announce that Alexander Shorin joins the Apache CouchDB Project Management Committee today. Alexander has made outstanding, sustained contributions to the project. This appointment is an official acknowledgement of his position within the community, and our trust in his ability to provide oversight for the project. Everybody, please join me in congratulating Alexander! On behalf of the CouchDB PMC, -- Noah Slater https://twitter.com/nslater
[jira] [Commented] (COUCHDB-1754) Implement ubjson support
[ https://issues.apache.org/jira/browse/COUCHDB-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616229#comment-13616229 ] Riyad Kalla commented on COUCHDB-1754: -- There might be two interesting SOC projects in relation to UBJSON: 1. Supporting UBJSON over-the-wire (I assume that is what this ticket is for?) 2. Supporting UBJSON as the on-disk binary data format for the CouchDB DB file, allowing records to be streamed directly off of disk with no serialization step. At least this was the original intent of UBJSON. I know there are likely a boatload of details around #2 that I am not aware of that might make it an unlikely SOC project, but I just wanted to throw it out there in case anyone was interested. I've always been curious to see what kind of performance water-level Couch could hit if we could remove the serialization/deserialization steps. Implement ubjson support Key: COUCHDB-1754 URL: https://issues.apache.org/jira/browse/COUCHDB-1754 Project: CouchDB Issue Type: Improvement Components: HTTP Interface Reporter: Paul Joseph Davis Labels: gsoc2013, http Just remembered this thing. For a GSOC project it might prove useful as a "what would it take to support alternative data formats that are representable as JSON structures?" exploration. http://ubjson.org/ -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
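To make the over-the-wire idea above concrete, here is a rough sketch of UBJSON-style encoding in Python (my own illustration, not anything from the ticket), loosely following the draft spec's type-marker layout: 'i'/'I'/'l' for 8/16/32-bit big-endian integers, 'S' for length-prefixed strings, '{'/'}' for objects. Treat the exact markers and rules as assumptions against whatever draft of the spec is current:

import struct

def ub_int(n):
    # Pick the smallest UBJSON integer type that fits (sketch: int8/16/32 only).
    if -128 <= n <= 127:
        return b"i" + struct.pack(">b", n)
    if -32768 <= n <= 32767:
        return b"I" + struct.pack(">h", n)
    return b"l" + struct.pack(">i", n)

def ub_str(s):
    data = s.encode("utf-8")
    return b"S" + ub_int(len(data)) + data

def ub_object(d):
    # Sketch handles only int/str values; object keys are length-prefixed
    # strings with no leading 'S' marker.
    out = b"{"
    for k, v in d.items():
        kdata = k.encode("utf-8")
        out += ub_int(len(kdata)) + kdata
        out += ub_int(v) if isinstance(v, int) else ub_str(v)
    return out + b"}"

print(ub_object({"_id": "doc-1", "views": 42}).hex())

The point for #2 above is that every value is length-prefixed binary, so a reader can skip or stream records without ever running a JSON tokenizer.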
[jira] [Commented] (COUCHDB-1754) Implement ubjson support
[ https://issues.apache.org/jira/browse/COUCHDB-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616242#comment-13616242 ] Riyad Kalla commented on COUCHDB-1754: -- Fair point; I think we could profile Couch under load and get a rough idea of what serialization costs (and therefore what the perf payoff would be), but I certainly understand: if we are going to be in there hacking up the world, let's fix it for everything instead of just this one-off. Certainly agree with that. Implement ubjson support Key: COUCHDB-1754 URL: https://issues.apache.org/jira/browse/COUCHDB-1754 Project: CouchDB Issue Type: Improvement Components: HTTP Interface Reporter: Paul Joseph Davis Labels: gsoc2013, http Just remembered this thing. For a GSOC project it might prove useful as a "what would it take to support alternative data formats that are representable as JSON structures?" exploration. http://ubjson.org/ -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Unique instance IDs?
Robert, to your point, when would you copy an entire DB to another live server and *not* have the intention of the two being replicas of each other? The UUID would be akin to a social-security number for the docs, so if I have a doc A in Server1 with a SSN of 333-90-3231, and that same doc existed on another server, say Servers 2 and 3, I would *expect* that any doc claiming that social security number would be making a claim of identity with any other doc with the same SSN. The same way you might resolve a record for someone at the DMV along with the IRS to establish they are the same person. The same way two docs in two different locations with the same ID are an indicator of identity. Especially with a UUID (i.e. the chances of that equality existing by chance, as compared to a sequential value, are astronomically small). Seems like UUIDs are an excellent way to go for doc IDs if you need them (and not some more domain-specific ID). On Thu, Feb 2, 2012 at 8:43 AM, Kevin R. Coombes kevin.r.coom...@gmail.com wrote: I don't think it's such a slam-dunk, but maybe I'm guilty of a goal-tending violation. It depends on whether you intend to keep replicating the two copies to keep them synchronized. If you want that (and most of the time, I do), then I think it's an advantage to keep the same UUID. The (matching) internal UUIDs indicate that you think of these two things as the same abstract entity. The full URL distinguishes the physical copies. If, on the other hand, you expect the copies to evolve into something different over time, then I can see how you might want different UUIDs. But again, the full, easy-to-construct URL distinguishes them. As far as I can tell, the existing system still lets you have it both ways. On 2/2/2012 8:45 AM, Robert Newson wrote: ... until you copy the database (and its uuid) and have two databases with the same uuid. This has always been the slam-dunk argument against database uuids. B. On 2 February 2012 09:41, Kevin R. Coombes kevin.r.coom...@gmail.com wrote: For CouchDB, I think UUIDs are clearly the way to go. After all, given the UUID, database, and hostname, you can construct the desired URL directly by forming http://hostname:5984/database/UUID As Noah points out, if you used this entire URL as the identifier (by which I assume he means the _id field), then you would lose the ability to copy the document elsewhere. This would, of course, break replication completely. Keeping the UUIDs as they are gives the best of both worlds. Easy replication, and (as long as the database is hosted at the same place) an easy way for humans and programs to construct stable URIs or URLs that point to each document. -- Kevin On 1/22/2012 12:44 PM, Noah Slater wrote: Sorry to bump this old thread, but just going through my backlog. With regard to URLs, I think there is some confusion about the purpose of a URL here. If I write a cool essay, say, and I stick that up at nslater.org/my-cool-essay, then I can link to it from other places on the web using that address. I might also want to put my cool essay on Dropbox, or post it to Tumblr, or send it in an email. Now my cool essay has lots of URLs. Each one of them perfectly valid. I don't have to go and edit the original copy at nslater.org/my-cool-essay, because I am making copies of it. My cool essay is completely unaware of the URLs that are being used to point to it. And it doesn't care that many URLs point to it. Yes, URLs can be used as identifiers.
But when you do this, you tie the thing you're naming to the place you're hosting it. Sometimes that is useful, other times it will cripple you. There is nothing about URLs that requires you to do this. I would hazard a guess that 99% of URLs are de-coupled from the things they point to. WebArch is much more robust when the identity of the object is de-coupled from the URL. Look at Atom: the ID element is supposed to be a URL, but they recommend a non-dereferenceable format, precisely to decouple posts from the location you happen to be hosting them this month. Hey, if we're gonna use URLs, maybe we want to go down the same route? http://en.wikipedia.org/wiki/Tag_URI At this point, I'm not sure what they buy us over UUIDs. Thoughts? Thanks, N
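To put a number on the "astronomically small" claim earlier in the thread, a back-of-the-envelope birthday-bound estimate (my own sketch, assuming random version-4 UUIDs, which carry 122 random bits):

n = 10**9                    # a billion docs sharing one ID space
space = 2**122               # random bits in a version-4 UUID
p = n * (n - 1) / 2 / space  # birthday bound: ~n^2/2 pairs that could collide
print("%.1e" % p)            # ~9.4e-20

Even at a billion documents the odds of any two random UUIDs colliding are around one in ten quintillion, so a matching UUID is effectively proof of shared identity rather than coincidence.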
Re: How are key values stored in B+ tree nodes on-disk?
resending, Gmail is telling me this never sent... Randall and Paul, I appreciate the explanation and links to the exact spots in code. This helped a lot. I've been reading up on alternative database impls; InnoDB, for example, caps the key length at some upper bound and then truncates the key and puts the remainder in an overflow table; naturally this requires 1 more random I/O every time you match a long key. At the extreme case (where every key doesn't fit and parts of it are in the overflow table) I think you end up with the key-table idea you were toying with, Paul, or something related. Paul, I feel your pain re: alternative approaches to this. If you can do away with RANGE queries, moving to MD5 or SHA hashes (or something even faster like MurmurHash) is a nice way to get consistently sized keys, which gives you the advantage of paging in a block as raw bytes and only deserializing the 16 or 20-byte chunks in the positions of a standard binary search until you find what you want, instead of the whole node. I *really* want to find a way to make that work, but people seem to like RANGE queries. I asked this question on SO and a gentleman suggested a header at the beginning of the block that actually lists the offsets of all the keys in the node (sketched in code after this message). For node/block sizes of 32KB or smaller (which seem reasonable), you could safely get away with a 16-bit int for storing the offsets; assuming a typical key size, I imagine you wouldn't have more than 100 or 200 keys in a node; assume 150, that is 300 bytes worth of offsets at the beginning of the block. So we spend ~7% of the block (assuming a 4KB block size) providing offsets to help make key searching faster. If you were willing to cap blocks to no more than 255 keys you could use 8-bit offset values (though 8-bit offsets can only address 256 bytes into the block, which limits you to very small blocks). A limitation to this idea though is that when you need to overflow a single block because you are indexing so many or such large keys, you now end up with two blocks next to each other with 1 of them having no header. You can fix this by having a hard-set block size for the DB (say 4KB) that never changes and can never be configured. That way every block, everywhere, has a very tiny type header; maybe 8 bits. If the normal bit is set, then that block would have an offset list at the beginning of it before the key data. If the overflow bit was set, then that block is a continuation of the block before it, and right after the 8-bit block header is raw key data that needs to be considered. Just an idea, I am sure there are more elegant ways to solve this. I am more curious if there is something here... something interesting and worth considering, or if I am just enjoying the smell of my own farts over here ;) -R On Mon, Jan 23, 2012 at 1:39 AM, Paul Davis paul.joseph.da...@gmail.com wrote: On Mon, Jan 23, 2012 at 2:26 AM, Randall Leeds randall.le...@gmail.com wrote: On Sun, Jan 22, 2012 at 04:49, Riyad Kalla rka...@gmail.com wrote: In my ever-growing need to ask the most esoteric file-format questions possible, I am curious if anyone can point me in the direction of the answer here... Background - Consider the _id and _seq B+ indices whose change-paths are appended to the end of the Couch data file after every update.
Each of the nodes written has a basic structure of [key, value] where the key is the _id or _seq value being indexed, and the value is an offset pointer to the next node in the tree to seek to (or in the case of a leaf, an offset directly into the datafile where the updated record lives). General --- Efficiently searching the keys stored in a node of a B+ tree after it has been paged in requires (let's assume) that the values are in sorted order; then the keys can be binary-searched to find the next offset or block ID on-disk to jump to. To be able to efficiently jump around key values like that, they must all be of the same size. For example, let's say we store 133 keys per node (assume it is full). That means our first comparison after paging in that block should be at the 67th key. The only way I can jump to that position is to know every key value is, say, 16 bytes and then jump forward 67 * 16 = 1072 bytes before decoding the key from raw bytes and comparing. Question - Given the massive potential for document _id values to vary in size, it seems impossible to have a hard-coded key size that Couch utilizes here. Even something sufficiently large like 32 or 64 bytes would still run into situations where a longer key is being indexed or a *very short* key is being indexed and there are multiple magnitudes of wasted space in the index. The few solutions I've found to variable-length keys online require sequential processing of the keys in a node because the keys will be written like: [key length][key data][offset] This is certainly doable, but less than optimal.
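Here is a quick sketch of the offset-header idea floated in the reply above (entirely my own illustration, not CouchDB code; the exact layout and field sizes are assumptions): a block opens with a 16-bit key count and a table of 16-bit offsets, so variable-length keys can be binary-searched without a sequential scan.

import struct

def pack_block(keys):
    # Pack sorted variable-length keys into one block:
    # [u16 count][u16 offsets...][key bytes...], offsets relative to block start.
    count = len(keys)
    header_size = 2 + 2 * count
    offsets, body, pos = [], b"", header_size
    for k in keys:
        offsets.append(pos)
        body += k
        pos += len(k)
    return struct.pack(">H%dH" % count, count, *offsets) + body

def find(block, target):
    # Binary-search the offset table; key i spans offsets[i]..offsets[i+1].
    (count,) = struct.unpack_from(">H", block, 0)
    offsets = struct.unpack_from(">%dH" % count, block, 2)
    lo, hi = 0, count - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        stop = offsets[mid + 1] if mid + 1 < count else len(block)
        key = block[offsets[mid]:stop]
        if key == target:
            return mid
        if key < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return None

block = pack_block([b"apple", b"kiwi", b"pear", b"plum"])
print(find(block, b"pear"))  # 2

The search touches only the offset table and the few keys it actually compares, which is the same property fixed-size keys buy you, paid for with two bytes per key of header.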
How are key values stored in B+ tree nodes on-disk?
In my ever-growing need to ask the most esoteric file-format questions possible, I am curious if anyone can point me in the direction of the answer here... Background - Consider the _id and _seq B+ indices whose change-paths are appended to the end of the Couch data file after every update. Each of the nodes written has a basic structure of [key, value] where the key is the _id or _seq value being indexed, and the value is an offset pointer to the next node in the tree to seek to (or in the case of a leaf, an offset directly into the datafile where the updated record lives). General --- Efficiently searching the keys stored in a node of a B+ tree after it has been paged in requires (let's assume) that the values are in sorted order; then the keys can be binary-searched to find the next offset or block ID on-disk to jump to. To be able to efficiently jump around key values like that, they must all be of the same size. For example, let's say we store 133 keys per node (assume it is full). That means our first comparison after paging in that block should be at the 67th key. The only way I can jump to that position is to know every key value is, say, 16 bytes and then jump forward 67 * 16 = 1072 bytes before decoding the key from raw bytes and comparing. Question - Given the massive potential for document _id values to vary in size, it seems impossible to have a hard-coded key size that Couch utilizes here. Even something sufficiently large like 32 or 64 bytes would still run into situations where a longer key is being indexed or a *very short* key is being indexed and there are multiple magnitudes of wasted space in the index. The few solutions I've found to variable-length keys online require sequential processing of the keys in a node because the keys will be written like: [key length][key data][offset] This is certainly doable, but less than optimal. I am curious: how does CouchDB handle this? Thank you for any pointers. -R
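As a companion to the fixed-size-key arithmetic in the question above, a small sketch under the assumption of 16-byte hashed keys (per the MD5/SHA idea in the reply): the i-th key of a paged-in block always starts at byte i * 16, so a binary search decodes only the keys it actually compares, never the whole node.

import hashlib

KEY_SIZE = 16  # e.g. MD5 digests: every key is exactly 16 bytes

def probe(block, target):
    # Binary-search a block of fixed-size keys by index arithmetic alone.
    count = len(block) // KEY_SIZE
    lo, hi = 0, count - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        key = block[mid * KEY_SIZE:(mid + 1) * KEY_SIZE]
        if key == target:
            return mid
        if key < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return None

keys = sorted(hashlib.md5(b"doc-%d" % i).digest() for i in range(133))
block = b"".join(keys)
print(probe(block, keys[67]))  # 67 -- found in ~7 comparisons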
Re: RFC: Releasing 1.2.0
I know this discussion took place sometime after the last release; but was it decided which platform binaries would be provided for the upcoming 1.2 release to help ease adoption? -R On Mon, Jan 9, 2012 at 4:24 PM, Noah Slater nsla...@tumbolia.org wrote: If a bug is marked as fixed in X version, does that not provide this possibility? Other than that, are we ready to go? On Mon, Jan 9, 2012 at 10:19 PM, Filipe David Manana fdman...@apache.org wrote: On Mon, Jan 9, 2012 at 9:40 PM, till t...@php.net wrote: Just looked at CHANGES, a few questions: Hi Till, answers below. 1) Can anyone go into detail on the replicator? (E.g. where are the configuration options.) * https://issues.apache.org/jira/browse/COUCHDB-1024 * http://s.apache.org/KsY * http://wiki.apache.org/couchdb/Replication#New_features_introduced_in_CouchDB_1.2.0 There's more stuff out there I don't recall at the moment. 2) Can anyone mention how snappy/zlib impacts performance? * https://issues.apache.org/jira/browse/COUCHDB-1120 Most of the stuff is a little vague for my taste – maybe the Jira issues explain the additions better. If so, should they be linked always? Personally, I would like to link a list of jira tickets to each release. I don't recall if this possibility was ever discussed before. Till On Mon, Jan 9, 2012 at 10:34 PM, till t...@php.net wrote: Can haz release? :) On Fri, Jan 6, 2012 at 11:34 PM, Randall Leeds randall.le...@gmail.com wrote: On Fri, Jan 6, 2012 at 13:22, Robert Newson rnew...@apache.org wrote: I'm checking it now. Jan did his JIRA dance, so I think we're clean on that end. I'm looking into whether all the fixes got backported, etc, and I'll also read over the full 1.1.x-1.2.x diff. Just checked over my commits and everything appropriate is back-ported. Good to go from my end. +1 on calling the vote. B. On 6 January 2012 21:19, Noah Slater nsla...@tumbolia.org wrote: Hey chaps, Let's pull our finger out and get 1.2.0 shipped. Do we have everything in trunk that we want in 1.2.0, or are there any patches y'all are sitting on? Also, can you all check NEWS and CHANGES to make sure they are up to date? Thanks all, N -- Filipe David Manana, Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men.
Re: Understanding the CouchDB file format
Jan, Thank you and yes, I'd be happy to contribute this to the wiki. I made some edits early this morning after some feedback. If a few folks in-the-know could give it a quick read-through to make sure I didn't get anything wrong then I'm happy to work it up on the wiki as well (or send it along to the wiki's editor for inclusion). Just point the way. Best, Riyad On Thu, Dec 22, 2011 at 2:21 AM, Jan Lehnardt j...@apache.org wrote: Good writeup! Would you consider contributing it to the CouchDB Wiki? http://wiki.apache.org/couchdb/ Cheers Jan -- On Dec 21, 2011, at 21:28, Riyad Kalla wrote: Bob, Really appreciate the link; Rick has a handful of articles that helped a lot. Alongside all the CouchDB reading I've been looking at SSD-optimized data storage mechanisms and tried to coalesce all of this information into this post on Couch's file storage format: https://plus.google.com/u/0/107397941677313236670/posts/CyvwRcvh4vv It is uncanny how many things Couch seems to have gotten right with regard to existing storage systems and future flash-based storage systems. I'd appreciate any corrections, additions or feedback to the post from anyone interested. Best, R On Wed, Dec 21, 2011 at 12:53 PM, Robert Dionne dio...@dionne-associates.com wrote: I think this is largely correct, Riyad. I dug out an old article[1] by Rick Ho that you may also find helpful though it might be slightly dated. Generally the best performance will be had if the ids are sequential and updates are done in bulk. Write-heavy applications will eat up a lot of space and require compaction. At the leaf nodes what are stored are either full_doc_info records or doc_info records which store pointers to the data, so the main things that impact the branching at each level are the key size and, in the case of views, the sizes of the reductions, as these are stored with the intermediate nodes. All in all it works pretty well, but as always you need to test and evaluate it for your specific case to see what the limits are. Regards, Bob [1] http://horicky.blogspot.com/2008/10/couchdb-implementation.html On Dec 21, 2011, at 2:17 PM, Riyad Kalla wrote: Adding to this conversation, I found this set of slides by Chris explaining the append-only index update format: http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed Specifically slides 16, 17 and 18. Using this example tree, rewriting the updated path (in reverse order) appended to the end of the file makes sense... you see how index queries can simply read backwards from the end of the file and not only find the latest revisions of docs, but also every other doc that wasn't touched (it will just seek into the existing inner nodes of the b+ tree for searching). What I am hoping for clarification on is the following pain-point that I perceive with this approach: 1. In a sufficiently shallow B+ tree (like CouchDB), the paths themselves to elements are short (typically no more than 3 to 5 levels deep) as opposed to a trie or some other construct that would have much longer paths to elements. 2. Because the depth of the tree is so shallow, the breadth of it becomes large to compensate... more specifically, each internal node can have 100s, 1000s or more children. Using the example slides, consider the nodes [A...M] and [R...Z] -- in a good sized CouchDB database, those internal index nodes would have 100s (or more) elements in them pointing at deeper internal nodes that themselves had thousands of elements; instead of the 13 or so as implied by [A...M]. 3.
Looking at slides 17 and 18, where you see the direct B+ tree path to the updated node getting appended to the end of the file after the revision is written (leaf to root ordering: [J' M] -> [A M] -> [A Z]), it implies that those internal nodes with *all* their child elements are getting rewritten as well. In this example tree, it isn't such a big issue... but in a sufficiently large CouchDB database, these nodes denoted by [A...M] and [A...Z] could be quite large... I don't know the format of the node elements in the B+ tree, but it would be whatever the size of a node is times however many elements are contained at each level (1 for root, say 100 for level 2, 1000 for level 3 and 10,000 for level 4 -- there is a lot of hand-waving going on here, of course it depends on the size of the data store). Am I missing something or is CouchDB really rewriting that much index information between document revisions on every update? What was previously confusing me is I thought it was *only* rewriting a direct path to the updated revision, like [B][E][J'], and Couch was somehow patching in that updated path info to the B+ index at runtime. If Couch is rewriting entire node paths with all their elements then I am no longer confused about the B+ index updates, but am curious about the on-disk cost of this.
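To pin down the write cost being asked about here, a toy model of the append-only path copy (my own sketch in Python; the real structures are Erlang terms and store far more per node):

dbfile = []  # the "database file": a list of appended nodes

def append(node):
    dbfile.append(node)
    return len(dbfile) - 1   # the node's position, standing in for a byte offset

# Initial tree: a root over one inner node over two leaves.
leaf_ab = append({"type": "leaf",  "keys": ["a", "b"]})
leaf_jm = append({"type": "leaf",  "keys": ["j", "m"]})
inner   = append({"type": "inner", "children": [leaf_ab, leaf_jm]})
root    = append({"type": "root",  "children": [inner]})

# Update key "j": the leaf AND every ancestor are appended anew, each
# ancestor carrying ALL of its child pointers (leaf_ab is reused as-is).
leaf_jm2 = append({"type": "leaf",  "keys": ["j'", "m"]})
inner2   = append({"type": "inner", "children": [leaf_ab, leaf_jm2]})
root     = append({"type": "root",  "children": [inner2]})

print(len(dbfile))  # 7 nodes on disk for a 4-node tree after one update

With wide inner nodes (hundreds of children each), copying the full path per update is exactly why single-document insert loops balloon the file until compaction.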
Re: Understanding the CouchDB file format
On Thu, Dec 22, 2011 at 12:38 PM, Robert Newson rnew...@apache.org wrote: It reads well as an article but needs some polish before it could be a great wiki page. I suggest that if it does go up, it is clearly marked as a draft, and we all chip in to sculpt it into shape. Great idea. Agreed that it is nowhere near as clean/factual as it would need to be for an appropriate wiki/ref-doc-style entry, but if it can be massaged into that, it would hopefully make a great resource. There are a few parts of the article that are inaccurate (like the assertion we have good locality for the id and seq trees. If this were true we wouldn't have seen such a huge improvement in compaction by temporarily separating them). I'd look forward to more detail on this... it was my understanding the updates were appended out in the [doc rev][_id idx diff][_seq idx diff] format at the end of the data file. Sounds like I may have misunderstood that? The 'COMPLETE recreation' paragraph also strikes me as factually awry. I'd appreciate a detailed correction on this if it is wrong; all the digging I've done (in this thread and other partial resources) suggests that the path from the changed doc ref back to the root (including a copy of all internal nodes and all of their child references) is written so that reads can get back into the index from the tail of the data file quickly... specifically slides 17, 18 and 19 from this slide deck ( http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed) -- note that the interim nodes [A..M] and [A..Z] are rewritten (including any and all child pointers they contained). This is what I was referring to; I should either clean up my wording (confusing) or I got it wrong, in which case I'd appreciate any and all corrections. Thanks for the feedback and for helping move this forward, Robert. -Riyad
Re: Understanding the CouchDB file format
Randall, Spot on; we are on the same page now. I'll go through the post again tomorrow morning and reword it a bit so the thoughts aren't so fragmented. Best, Riyad P.S. Are there discussions anywhere on alternative file formats for the indices that the Couch community has considered in the past? (old JIRA or mailing list discussions) or has this file format been the only approach from the initial work started in 05/06? The reason I ask is because of the index fragmentation you mentioned that can occur over time... I'd be curious if an in-place-edited format for the indices, as separate file(s) from the append-only document revisions, would yield better performance over time. The splits would still be expensive, but you'd be able to pre-allocate your blocks for each node on each split to help a bit. Even leveraging an append-only, pre-allocated approach to the index might work, where split nodes are appended to the end of the index file with the full node size (e.g. 133 elements) pre-allocated and the index marked ready for compaction or something. So it would be a hybrid approach (a rough sketch of this follows this message)... when splits aren't needed, you do an in-place edit to the index on-disk. If a split is needed, you use a technique similar to the one now, where the affected node hierarchy is appended to the end of the file. You still run the risk of costly index rebuilds in the case of a failed in-place edit, but you wouldn't run the risk of any data loss... I am just wondering for large data sets if this would yield some significant benefits, putting it somewhere between Mongo and today's Couch, performance-wise. Just pie-in-the-sky thinking... I am sure the senior team here has talked this stuff to death years ago. My apologies if this is re-treading road covered in beaten horse bodies ;) I just find this category of problems interesting. On Thu, Dec 22, 2011 at 6:10 PM, Randall Leeds randall.le...@gmail.com wrote: On Thu, Dec 22, 2011 at 12:11, Riyad Kalla rka...@gmail.com wrote: On Thu, Dec 22, 2011 at 12:38 PM, Robert Newson rnew...@apache.org wrote: There are a few parts of the article that are inaccurate (like the assertion we have good locality for the id and seq trees. If this were true we wouldn't have seen such a huge improvement in compaction by temporarily separating them). I'd look forward to more detail on this... it was my understanding the updates were appended out in the [doc rev][_id idx diff][_seq idx diff] format at the end of the data file. Sounds like I may have misunderstood that? Riyad, as you pointed out earlier, all the inner nodes are rewritten up to the root. The two btrees are not written in parallel, though, which means that for deep trees all the updated nodes are written before the other tree's nodes are written. Also remember that the trees themselves end up pretty fragmented since older nodes that haven't changed are back toward the beginning of the file. In general, I'm not sure there's much that's useful to mention about locality in the trees. Also, updating these trees requires reading the old values, so there is still seeking that occurs (if the pages aren't cached by the OS). The 'COMPLETE recreation' paragraph also strikes me as factually awry.
I'd appreciate a detailed correction on this if it is wrong; all the digging I've done (in this thread and other partial resources) suggests that the path from the changed doc ref back to the root (including a copy of all internal nodes and all of their child references) is written so that reads can get back into the index from the tail of the data file quickly... specifically slides 17, 18 and 19 from this slide deck ( http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed) -- note that the interim nodes [A..M] and [A..Z] are rewritten (including any and all child pointers they contained). This is what I was referring to; I should either clean up my wording (confusing) or I got it wrong, in which case I'd appreciate any and all corrections. Right. It mostly seems a bit confusing to me. "it DOES NOT just rewrite the nodes pathing from the leaf to the node and ONLY the connections for that single document" That doesn't sound quite right, but I can tell what you're trying to say is accurate. If I'm right, you mean that every changed inner node is rewritten in its entirety rather than having a single pointer to the new child updated in place. Cheers. Thanks for taking the time to write this up. -Randall
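The hybrid idea in the P.S. above could look something like this (purely hypothetical; none of this is real or proposed CouchDB design): pre-allocated fixed-size node slots edited in place for ordinary updates, with a split simply appending one fresh slot instead of rewriting the whole path.

NODE_SIZE = 4096  # hypothetical hard-set block size

class SlottedIndex:
    # Fixed-size node slots edited in place; a split appends a new slot.
    def __init__(self):
        self.slots = []

    def alloc(self):
        self.slots.append(bytearray(NODE_SIZE))  # pre-allocated, zero-filled
        return len(self.slots) - 1

    def write(self, slot, payload):
        assert len(payload) <= NODE_SIZE         # fits: cheap in-place edit
        self.slots[slot][:len(payload)] = payload

idx = SlottedIndex()
node = idx.alloc()
idx.write(node, b"...keys and child pointers...")  # no append, no path copy
sibling = idx.alloc()    # a split appends one fresh pre-allocated slot
print(len(idx.slots))    # 2

The trade-off is exactly the one named in the P.S.: in-place edits lose the crash-safety that the append-only format gets for free, so a failed write means an index rebuild (though never document loss).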
Re: Understanding the CouchDB file format
Adding to this conversation, I found this set of slides by Chris explaining the append-only index update format: http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed Specifically slides 16, 17 and 18. Using this example tree, rewriting the updated path (in reverse order) appended to the end of the file makes sense... you see how index queries can simply read backwards from the end of the file and not only find the latest revisions of docs, but also every other doc that wasn't touched (it will just seek into the existing inner nodes of the b+ tree for searching). What I am hoping for clarification on is the following pain-point that I perceive with this approach: 1. In a sufficiently shallow B+ tree (like CouchDB), the paths themselves to elements are short (typically no more than 3 to 5 levels deep) as opposed to a trie or some other construct that would have much longer paths to elements. 2. Because the depth of the tree is so shallow, the breadth of it becomes large to compensate... more specifically, each internal node can have 100s, 1000s or more children. Using the example slides, consider the nodes [A...M] and [R...Z] -- in a good sized CouchDB database, those internal index nodes would have 100s (or more) elements in them pointing at deeper internal nodes that themselves had thousands of elements; instead of the 13 or so as implied by [A...M]. 3. Looking at slides 17 and 18, where you see the direct B+ tree path to the updated node getting appended to the end of the file after the revision is written (leaf to root ordering: [J' M] -> [A M] -> [A Z]), it implies that those internal nodes with *all* their child elements are getting rewritten as well. In this example tree, it isn't such a big issue... but in a sufficiently large CouchDB database, these nodes denoted by [A...M] and [A...Z] could be quite large... I don't know the format of the node elements in the B+ tree, but it would be whatever the size of a node is times however many elements are contained at each level (1 for root, say 100 for level 2, 1000 for level 3 and 10,000 for level 4 -- there is a lot of hand-waving going on here, of course it depends on the size of the data store). Am I missing something or is CouchDB really rewriting that much index information between document revisions on every update? What was previously confusing me is I thought it was *only* rewriting a direct path to the updated revision, like [B][E][J'], and Couch was somehow patching in that updated path info to the B+ index at runtime. If Couch is rewriting entire node paths with all their elements then I am no longer confused about the B+ index updates, but am curious about the on-disk cost of this. In my own rough insertion testing, that would explain why I see my collections absolutely explode in size until they are compacted (not using bulk insert, but intentionally doing single inserts for millions of docs to see what kind of cost the index path duplication would be like). Can anyone confirm/deny/correct this assessment? I want to make sure I am on the right track understanding this. Best wishes, Riyad On Tue, Dec 20, 2011 at 6:13 PM, Riyad Kalla rka...@gmail.com wrote: @Filipe - I was just not clear on how CouchDB operated; you and Robert cleared that up for me. Thank you. @Robert - The writeup is excellent so far (I am not familiar with Erlang, so there is a bit of stickiness there), thank you for taking the time to put this together!
At this point I am curious how the _id and _seq indices are read as their data is continually appended to the end of the data file in small diff-trees for every updated doc. If CouchDB kept all the indices in-memory and simply patched-in the updated paths at runtime (maybe something akin to memory-mapped indices in MongoDB) I would be fairly clear on the operation... but as I understand it, CouchDB keeps such a small memory footprint by doing no in-memory caching and relying on the intelligence of the OS and filesystem (and/or drives) to cache frequently accessed data. I am trying to understand the logic used by CouchDB to answer a query using the index once updates to the tree have been appended to the data file... for example, consider a CouchDB datastore like the one Filipe has... 10 million documents and let's say it is freshly compacted. If I send in a request to that Couch instance, it hits the header of the data file along with the index and walks the B+ tree to the leaf node, where it finds the offset into the data file where the actual doc lives... let's say 1,000,000 bytes away. These B+ trees are shallow, so it might look something like this: Level 1: 1 node, root node. Level 2: 100 nodes, inner child nodes Level 3: 10,000 nodes, inner child nodes Level 4: 1,000,000 leaf nodes... all with pointers to the data offsets in the data file. Now let's say I write 10 updates to documents in that file. There are 10 new revisions appended to the end of the data file.
Re: Understanding the CouchDB file format
Bob, Really appreciate the link; Rick has a handful of articles that helped a lot. Alongside all the CouchDB reading I've been looking at SSD-optimized data storage mechanisms and tried to coalesce all of this information into this post on Couch's file storage format: https://plus.google.com/u/0/107397941677313236670/posts/CyvwRcvh4vv It is uncanny how many things Couch seems to have gotten right with regard to existing storage systems and future flash-based storage systems. I'd appreciate any corrections, additions or feedback to the post from anyone interested. Best, R On Wed, Dec 21, 2011 at 12:53 PM, Robert Dionne dio...@dionne-associates.com wrote: I think this is largely correct, Riyad. I dug out an old article[1] by Rick Ho that you may also find helpful though it might be slightly dated. Generally the best performance will be had if the ids are sequential and updates are done in bulk. Write-heavy applications will eat up a lot of space and require compaction. At the leaf nodes what are stored are either full_doc_info records or doc_info records which store pointers to the data, so the main things that impact the branching at each level are the key size and, in the case of views, the sizes of the reductions, as these are stored with the intermediate nodes. All in all it works pretty well, but as always you need to test and evaluate it for your specific case to see what the limits are. Regards, Bob [1] http://horicky.blogspot.com/2008/10/couchdb-implementation.html On Dec 21, 2011, at 2:17 PM, Riyad Kalla wrote: Adding to this conversation, I found this set of slides by Chris explaining the append-only index update format: http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed Specifically slides 16, 17 and 18. Using this example tree, rewriting the updated path (in reverse order) appended to the end of the file makes sense... you see how index queries can simply read backwards from the end of the file and not only find the latest revisions of docs, but also every other doc that wasn't touched (it will just seek into the existing inner nodes of the b+ tree for searching). What I am hoping for clarification on is the following pain-point that I perceive with this approach: 1. In a sufficiently shallow B+ tree (like CouchDB), the paths themselves to elements are short (typically no more than 3 to 5 levels deep) as opposed to a trie or some other construct that would have much longer paths to elements. 2. Because the depth of the tree is so shallow, the breadth of it becomes large to compensate... more specifically, each internal node can have 100s, 1000s or more children. Using the example slides, consider the nodes [A...M] and [R...Z] -- in a good sized CouchDB database, those internal index nodes would have 100s (or more) elements in them pointing at deeper internal nodes that themselves had thousands of elements; instead of the 13 or so as implied by [A...M]. 3. Looking at slides 17 and 18, where you see the direct B+ tree path to the updated node getting appended to the end of the file after the revision is written (leaf to root ordering: [J' M] -> [A M] -> [A Z]), it implies that those internal nodes with *all* their child elements are getting rewritten as well. In this example tree, it isn't such a big issue... but in a sufficiently large CouchDB database, these nodes denoted by [A...M] and [A...Z] could be quite large...
I don't know the format of the node elements in the B+ tree, but it would be whatever the size of a node is times however many elements are contained at each level (1 for root, say 100 for level 2, 1000 for level 3 and 10,000 for level 4 -- there is a lot of hand-waving going on here, of course it depends on the size of the data store). Am I missing something or is CouchDB really rewriting that much index information between document revisions on every update? What was previously confusing me is I thought it was *only* rewriting a direct path to the updated revision, like [B][E][J'], and Couch was somehow patching in that updated path info to the B+ index at runtime. If Couch is rewriting entire node paths with all their elements then I am no longer confused about the B+ index updates, but am curious about the on-disk cost of this. In my own rough insertion testing, that would explain why I see my collections absolutely explode in size until they are compacted (not using bulk insert, but intentionally doing single inserts for millions of docs to see what kind of cost the index path duplication would be like). Can anyone confirm/deny/correct this assessment? I want to make sure I am on the right track understanding this. Best wishes, Riyad On Tue, Dec 20, 2011 at 6:13 PM, Riyad Kalla rka...@gmail.com wrote: @Filipe - I was just not clear on how CouchDB operated; you and Robert cleared that up for me
Re: Understanding the CouchDB file format
Thank you Robert, fixed. On Wed, Dec 21, 2011 at 1:42 PM, Robert Dionne dio...@dionne-associates.com wrote: Riyad, You're welcome. At a quick glance your post has one error: internal nodes do contain values (from the reductions). The appendix in the couchdb book also makes this error[1], which I've opened a ticket for. Cheers, Bob [1] https://github.com/oreilly/couchdb-guide/issues/450 On Dec 21, 2011, at 3:28 PM, Riyad Kalla wrote: Bob, Really appreciate the link; Rick has a handful of articles that helped a lot. Alongside all the CouchDB reading I've been looking at SSD-optimized data storage mechanisms and tried to coalesce all of this information into this post on Couch's file storage format: https://plus.google.com/u/0/107397941677313236670/posts/CyvwRcvh4vv It is uncanny how many things Couch seems to have gotten right with regard to existing storage systems and future flash-based storage systems. I'd appreciate any corrections, additions or feedback to the post from anyone interested. Best, R On Wed, Dec 21, 2011 at 12:53 PM, Robert Dionne dio...@dionne-associates.com wrote: I think this is largely correct, Riyad. I dug out an old article[1] by Rick Ho that you may also find helpful though it might be slightly dated. Generally the best performance will be had if the ids are sequential and updates are done in bulk. Write-heavy applications will eat up a lot of space and require compaction. At the leaf nodes what are stored are either full_doc_info records or doc_info records which store pointers to the data, so the main things that impact the branching at each level are the key size and, in the case of views, the sizes of the reductions, as these are stored with the intermediate nodes. All in all it works pretty well, but as always you need to test and evaluate it for your specific case to see what the limits are. Regards, Bob [1] http://horicky.blogspot.com/2008/10/couchdb-implementation.html On Dec 21, 2011, at 2:17 PM, Riyad Kalla wrote: Adding to this conversation, I found this set of slides by Chris explaining the append-only index update format: http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed Specifically slides 16, 17 and 18. Using this example tree, rewriting the updated path (in reverse order) appended to the end of the file makes sense... you see how index queries can simply read backwards from the end of the file and not only find the latest revisions of docs, but also every other doc that wasn't touched (it will just seek into the existing inner nodes of the b+ tree for searching). What I am hoping for clarification on is the following pain-point that I perceive with this approach: 1. In a sufficiently shallow B+ tree (like CouchDB), the paths themselves to elements are short (typically no more than 3 to 5 levels deep) as opposed to a trie or some other construct that would have much longer paths to elements. 2. Because the depth of the tree is so shallow, the breadth of it becomes large to compensate... more specifically, each internal node can have 100s, 1000s or more children. Using the example slides, consider the nodes [A...M] and [R...Z] -- in a good sized CouchDB database, those internal index nodes would have 100s (or more) elements in them pointing at deeper internal nodes that themselves had thousands of elements; instead of the 13 or so as implied by [A...M]. 3.
Looking at slides 17 and 18, where you see the direct B+ tree path to the updated node getting appended to the end of the file after the revision is written (leaf to root ordering: [J' M] -> [A M] -> [A Z]), it implies that those internal nodes with *all* their child elements are getting rewritten as well. In this example tree, it isn't such a big issue... but in a sufficiently large CouchDB database, these nodes denoted by [A...M] and [A...Z] could be quite large... I don't know the format of the node elements in the B+ tree, but it would be whatever the size of a node is times however many elements are contained at each level (1 for root, say 100 for level 2, 1000 for level 3 and 10,000 for level 4 -- there is a lot of hand-waving going on here, of course it depends on the size of the data store). Am I missing something or is CouchDB really rewriting that much index information between document revisions on every update? What was previously confusing me is I thought it was *only* rewriting a direct path to the updated revision, like [B][E][J'], and Couch was somehow patching in that updated path info to the B+ index at runtime. If Couch is rewriting entire node paths with all their elements then I am no longer confused about the B+ index updates, but am curious about the on-disk cost of this. In my own rough insertion testing, that would explain why I see my collections absolutely explode in size until they are compacted.
Understanding the CouchDB file format
I've been reading everything I can find on the CouchDB file format[1] and am getting bits and pieces here and there, but not a great, concrete, step-by-step explanation of the process. I'm clear on the use of B+ trees, and after reading a few papers on the benefits of log-structured file formats, I understand the benefits of inlining the B+ tree indices directly into the data file as well (locality + sequential I/O)... what I'm flummoxed about is how much of the B+ tree's index is rewritten after every modified document. Consider a CouchDB file that looks more or less like this: [idx/header][doc1, rev1][idx/header][doc1, rev2] After each revised doc is written and the b-tree root is rewritten after that, is that just a modified root node of the B+ tree or the entire B+ tree? The reason I ask is because regardless of the answer to my previous question, for a *huge* database with millions of records, that seems like an enormous amount of data to rewrite after every modification. Say the root node had a fan-out factor of 133; that would still be a lot of data to rewrite. I am certain I am missing the boat on this; if anyone can pull me out of the water and point me to dry land I'd appreciate it. Best, R [1] -- http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D -- http://horicky.blogspot.com/2008/10/couchdb-implementation.html -- http://blog.kodekabuki.com/post/132952897/couchdb-naked -- http://guide.couchdb.org/editions/1/en/btree.html -- http://ayende.com/blog/* (Over my head)
Re: Understanding the CouchDB file format
Filipe, Thank you for the reply. Maybe I am misunderstanding exactly what Couch is writing out; the docs I've read say that it rewrites the root node -- I can't tell if the docs mean the parent node of the child doc that was changed (as one of the b+ leaves), or if it means the direct path, from the root, to that individual doc... or if it means the *entire* index... In the case of even rewriting the single parent, with such a shallow tree, each internal node will have a huge fan of children; let's say 1-10k in a decent-sized data set. If you are seeing a few KB extra written out after each changed doc then that cannot be right... I almost assumed my understanding was wrong because the sheer volume of data would make performance abysmal if it was true. Given that... is it just the changed path, from the root to the new leaf, that is rewritten? That makes me all sorts of curious as to how Couch updates/searches the newly modified index with the small diff that is written out. Any pointers to reading that will help me dig down on this (even early bugs in JIRA?) would be appreciated. I've tried skimming back in 2007/08 on Damien's blog to see if he wrote about it in depth and so far haven't found anything as detailed as I am hoping for on this architecture. Best, Riyad On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana fdman...@apache.org wrote: On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla rka...@gmail.com wrote: I've been reading everything I can find on the CouchDB file format[1] and am getting bits and pieces here and there, but not a great, concrete, step-by-step explanation of the process. I'm clear on the use of B+ trees, and after reading a few papers on the benefits of log-structured file formats, I understand the benefits of inlining the B+ tree indices directly into the data file as well (locality + sequential I/O)... what I'm flummoxed about is how much of the B+ tree's index is rewritten after every modified document. Consider a CouchDB file that looks more or less like this: [idx/header][doc1, rev1][idx/header][doc1, rev2] After each revised doc is written and the b-tree root is rewritten after that, is that just a modified root node of the B+ tree or the entire B+ tree? The reason I ask is because regardless of the answer to my previous question, for a *huge* database with millions of records, that seems like an enormous amount of data to rewrite after every modification. Say the root node had a fan-out factor of 133; that would still be a lot of data to rewrite. Hi Riyad, Have you observed that in practice? Typically the depth of database btrees is not that high even for millions of documents. For example I have one around with about 10 million documents which doesn't have more than 5 or 6 levels if I recall correctly. So updating a doc, for that particular case, means rewriting 5 or 6 new nodes plus the document itself. Each node is normally not much bigger than 1.2KB. I once wrote a tool to analyze database files which reports btree depths, however it's not updated to work with recent changes on master/1.2.x such as snappy compression and btree sizes: https://github.com/fdmanana/couchfoo It should work with CouchDB 1.1 (and older) database files. I am certain I am missing the boat on this; if anyone can pull me out of the water and point me to dry land I'd appreciate it.
Best, R [1] -- http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D -- http://horicky.blogspot.com/2008/10/couchdb-implementation.html -- http://blog.kodekabuki.com/post/132952897/couchdb-naked -- http://guide.couchdb.org/editions/1/en/btree.html -- http://ayende.com/blog/* (Over my head) -- Filipe David Manana, Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men.
Re: Understanding the CouchDB file format
@Filipe - I was just not clear on how CouchDB operated; you and Robert cleared that up for me. Thank you. @Robert - The writeup is excellent so far (I am not familiar with Erlang, so there is a bit of stickiness there), thank you for taking the time to put this together! At this point I am curious how the _id and _seq indices are read as their data is continually appended to the end of the data file in small diff-trees for every updated doc. If CouchDB kept all the indices in-memory and simply patched-in the updated paths at runtime (maybe something akin to memory-mapped indices in MongoDB) I would be fairly clear on the operation... but as I understand it, CouchDB keeps such a small memory footprint by doing no in-memory caching and relying on the intelligence of the OS and filesystem (and/or drives) to cache frequently accessed data. I am trying to understand the logic used by CouchDB to answer a query using the index once updates to the tree have been appended to the data file... for example, consider a CouchDB datastore like the one Filipe has... 10 million documents and let's say it is freshly compacted. If I send in a request to that Couch instance, it hits the header of the data file along with the index and walks the B+ tree to the leaf node, where it finds the offset into the data file where the actual doc lives... let's say 1,000,000 bytes away. These B+ trees are shallow, so it might look something like this: Level 1: 1 node, root node. Level 2: 100 nodes, inner child nodes Level 3: 10,000 nodes, inner child nodes Level 4: 1,000,000 leaf nodes... all with pointers to the data offsets in the data file. Now let's say I write 10 updates to documents in that file. There are 10 new revisions appended to the end of the data file, *each one* separated by a rewritten B+ path to a leaf node with its new location at the end of the file. Each of those paths written between each doc revision (say roughly 2k like Filipe mentioned) are just 4-item paths... root -> level 1 -> level 2 -> level 3 -> level 4... showing the discrete path from the root to that individual updated doc. The intermediary levels (l1, 2, 3) are not fully flushed out with all the OTHER children from the original b+ tree index. [[ Is this correct so far? If not, please point out my mistakes... ]] Now I issue a query for a document that WAS NOT updated... this is where I get confused on the logic: this would mean I need to access the original B+ tree index at the root of the data file, because the revised B+ paths that are written between each of the updated doc revisions at the end of the file are not full indices. NOW consider I want to query for one of the changed docs... now I suddenly need to scan backwards from the data file's end to find the updated path to the new revision of that document. (obviously) this isn't what Couch is actually doing... it's doing something more elegant, I just can't figure out what or how and that is what I was hoping for help with. Much thanks guys, I know this is a heavy question to ask. Best wishes, R On Tue, Dec 20, 2011 at 1:35 PM, Robert Dionne dio...@dionne-associates.com wrote: Robert Dionne Computer Programmer dio...@dionne-associates.com 203.231.9961 On Dec 20, 2011, at 3:27 PM, Riyad Kalla wrote: Filipe, Thank you for the reply.
Maybe I am misunderstanding exactly what Couch is writing out; the docs I've read say that it rewrites the root node -- I can't tell if the docs mean the parent node of the child doc that was changed (as one of the b+ leaves), or if it means the direct path, from the root, to that individual doc... or if it means the *entire* index... In the case of even rewriting the single parent, with such a shallow tree, each internal node will have a huge fan of children; let's say 1-10k in a decent-sized data set. If you are seeing a few KB extra written out after each changed doc then that cannot be right... I almost assumed my understanding was wrong because the sheer volume of data would make performance abysmal if it was true. Given that... is it just the changed path, from the root to the new leaf, that is rewritten? Hi Riyad, You are correct, it's only the changed path. Interestingly I've just started to document all these internals[1] along with links to the code and other references available. Cheers, Bob [1] http://bdionne.github.com/couchdb/ That makes me all sorts of curious as to how Couch updates/searches the newly modified index with the small diff that is written out. Any pointers to reading that will help me dig down on this (even early bugs in JIRA?) would be appreciated. I've tried skimming back in 2007/08 on Damien's blog to see if he wrote about it in depth and so far haven't found anything as detailed as I am hoping for on this architecture. Best, Riyad On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana fdman...@apache.org wrote: On Tue, Dec 20
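The elegant bit Riyad is looking for in the message above is that nothing is ever patched: because every changed inner node is rewritten with ALL of its child pointers, each update leaves a complete, self-sufficient root at the tail of the file, and readers simply start from the newest root. A toy sketch of that read path (my own Python illustration standing in for the Erlang; separator keys and names are made up):

nodes = {}     # offset -> node, standing in for positions in the file
next_off = 0

def append(node):
    global next_off
    nodes[next_off] = node
    next_off += 1
    return next_off - 1

# Freshly "compacted" tree: inner nodes hold (max key, child offset) pairs.
leaf_ab = append(("leaf", {"a": "doc-a", "b": "doc-b"}))
leaf_jm = append(("leaf", {"j": "doc-j", "m": "doc-m"}))
root    = append(("inner", [("b", leaf_ab), ("m", leaf_jm)]))

# Update "j": append a new leaf and a new root; leaf_ab is reused as-is.
leaf_jm2 = append(("leaf", {"j": "doc-j-rev2", "m": "doc-m"}))
root     = append(("inner", [("b", leaf_ab), ("m", leaf_jm2)]))

def lookup(root_off, key):
    kind, body = nodes[root_off]
    while kind == "inner":
        # Follow the first child whose max key covers the search key; old,
        # untouched subtrees are reachable from the new root unchanged.
        child = next(off for mx, off in body if key <= mx)
        kind, body = nodes[child]
    return body.get(key)

print(lookup(root, "j"))  # doc-j-rev2 -- found via the newest root
print(lookup(root, "a"))  # doc-a -- the old leaf, never rewritten, still reachable

No backward scanning and no merging of "diff" paths is needed: the header at the end of the file points at the newest root, and every query, for changed and unchanged docs alike, walks down from there.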
Re: Fast JSON Parsing: UltraJSON / RapidJSON
Paul, you might be interested in this benchmark of 3 C++ JSON parsers as well (summary: libjson dominated the test by orders of magnitude): http://lijoblogs.blogspot.com/2011/11/comparison-and-benchmark-of-c-json.html On Mon, Nov 21, 2011 at 10:23 AM, Paul Davis paul.joseph.da...@gmail.com wrote: On Mon, Nov 21, 2011 at 11:11 AM, Riyad Kalla rka...@gmail.com wrote: I believe trunk now utilizes the C JSON parser for some nice speed improvements and thought this might be worth sharing. Over on the JSON spec list two devs were sharing attempts at writing the fastest JSON parsers in C and C++ respectively. UltraJSON (C) claims to have the fastest parsing speed at the moment: https://github.com/esnme/ultrajson RapidJSON (C++) is attempting to be the fastest as well: http://code.google.com/p/rapidjson/ I thought it interesting enough to share these announcements; I realize there is much more needed from a JSON parser than just sheer speed. -R Oooohhh, interesting. I should see how Jiffy compares. Thanks for the note.
Canonical's decision to move away from CouchDB
REF: http://www.h-online.com/open/news/item/Canonical-dropping-CouchDB-from-Ubuntu-One-1382809.html Lenton says Canonical has worked with the company behind CouchDB to make the database scale in a particular way. "Our situation is rather unique and we were unable to resolve some of the issues we came across," said Lenton, who pointed out they needed to scale to millions of users and scale down to a reasonable load on small client machines. Does anyone happen to know exactly what the blocker(s) were for Canonical? e.g. client load time on small mobile devices?
Fast JSON Parsing: UltraJSON / RapidJSON
I believe trunk now utilizes the C JSON parser for some nice speed improvements and thought this might be worth sharing. Over on the JSON spec list two devs were sharing attempts at writing the fastest JSON parsers in C and C++ respectively. UltraJSON (C) claims to have the fastest parsing speed at the moment: https://github.com/esnme/ultrajson RapidJSON (C++) is attempting to be the fastest as well: http://code.google.com/p/rapidjson/ I thought it interesting enough to share these announcements; I realize there is much more needed from a JSON parser than just sheer speed. -R
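For anyone who wants a feel for what these projects are measuring, here is a minimal parse-throughput micro-benchmark of the same general shape (a sketch only; Python's stdlib json stands in here, since I can't vouch for either library's exact harness or workload):

import json, timeit

# A synthetic workload: one ~30KB document of small records.
doc = json.dumps({"docs": [{"_id": "doc-%d" % i, "value": i} for i in range(1000)]})
runs = 200
secs = timeit.timeit(lambda: json.loads(doc), number=runs)
print("%.1f MB/s" % (len(doc) * runs / 1e6 / secs))

The shape of the input matters as much as the parser: lots of short strings stresses allocation, while big numeric arrays stress number scanning, so a single MB/s figure never tells the whole story.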
Very impressive CouchDB story.
This is an aside and also a "you guys rock" to the CouchDB dev team... I watched a talk by Benjamin Anderson at Meteor today: http://www.dataversity.net/archives/6714?t=1320768580 where he talked about their experience with running CouchDB (via Cloudant) for 2 years in production, scaling from what I imagine was 2 or 3 nodes to a 14-node cluster. In the last year alone their traffic has grown 5x while their Couch cluster has only grown 20% to cover that. Benjamin is straightforward about some of the shortcomings they have had, but the take-away from this that floored me, and is a testament to what you guys do... in all these years, after all these tens of terabytes of data, they have never lost data or had downtime. They just kept growing their cluster, adding nodes as needed, and got back to work. No world-ending events, total cluster failures or total rebuilds. Couch just happily chugged along, keeping track of data and smiling. (not discrediting any work Cloudant put in here to make that happen, I just don't know what it was). He also specifically called out how the backend API to Couch has (practically) not changed at all in the last 2 years, allowing them to focus their development efforts on customer-facing features and not spend time writing and rewriting backend persistence features as Couch grew and released new versions. To all of you committers that spend your nights and weekends on Couch, this is a glowing testament to the hard decisions you have made. Really great job guys. -Riyad
Re: Binary Downloads
I don't think that this is too big a problem. It is, it really is. Many of you are too close to the problem to see it, but the 10-second impression when you step into the Couch world (as compared to Mongo, Redis, Orient, Raven or even Cassandra) is... "where do I download and run this thing?" How many times have you guys answered "Well, version 1.1.2 isn't out yet, so what version are you really using?" And those aren't even the people that gave up and turned the other direction; those are the ones that actively stayed and dug their way over to Couchbase (either knowingly or unknowingly) and grabbed one of their binaries instead, and they still don't even know what version of Couch they are running. CouchDB needs pre-built binaries, ready to unzip and run exactly as Jan enumerated in his first post. On Sat, Nov 5, 2011 at 6:57 AM, Noah Slater nsla...@tumbolia.org wrote: A few thoughts: I don't think that this is too big a problem. CouchDB is a server tool, and by and large, the type of people who are going to be installing it are going to know how to from source, or are going to be using a system where the application is pre-packaged for them. You wouldn't expect Apache httpd to have one-click installers. Having said that, it's nice that MongoDB have a list of pre-built packages to download. But they don't have a rigorous community voting process in place that requires people to test and vote on each artefact. I think that's a weakness as well as a strength. Just to reiterate: the biggest problem in providing official platform-specific binary downloads is that these artefacts would need to be voted on by the community. Dave asks what needs to be in place for them to be official. Nothing more than a vote. And therein lies the problem: getting enough people, with the right hardware, to vote on all these things at the time of release. I don't have any problem with people, individuals or organisations, who want to build CouchDB for different platforms and make those packages available to our users. On one level, this is exactly what Debian and RedHat, for example, are doing already. I also have no problem linking through to unofficial packages from our downloads page, with the appropriate disclaimers, if it makes it easier for our users. I don't care how these packages are built, and I don't think there should be a submissions process. We'd just link through to any trustworthy places that promised to keep a reasonably up-to-date package archive available. But let's bear in mind that for each platform, packaging CouchDB is a very specific process. We cannot possibly hope to get that kind of experience within the core committer team. We can't even hope to get all that experience on the mailing lists. So we're left with two options: either build a set of generic packages that don't fit that well into the specifics of each platform, or commit ourselves to working with, enabling, and promoting specialised efforts by downstream teams. Or at all. MongoDB can do this because they don't use a community process for releases. From what I understand, Jan wants a sort of generic package. A directory that has a self-contained CouchDB that you can run. I guess this could be useful, in the same way CouchDBX was useful. You get to play around with CouchDB real quick. But it's not useful for much more than that, in my opinion. I certainly wouldn't advise anyone to use this for anything serious. There's no integration with the host operating system, and that's a pretty egregious problem for a database server.
Jason covers this in his response, I think. A CouchDB that lives in a little sandbox like that is a neutered CouchDB. Fun for playing with, but you can't do anything serious with it. And maybe that has a place in our ecosystem, but it should come with the appropriate warnings and delineation. Speaking as the guy who built the first CouchDB package for Debian, I can tell you that building these things requires a lot of specialist, and in some cases, arcane knowledge. It also requires a lot of care. I made sure that when you upgrade from one version to another, it would not trash your previous data. Building a proper package is full of this sort of stuff. Any system that promises to automate the process for you is a quick-fix. Not something I would ever recommend. Retooling the source to build proper Erlang applications, or any other similar proposal, does not solve our problem. The problem isn't the lack of technical means to build these packages. The problem lies in being able to build them in the first place (given a lack of various infrastructure available to the build team) and a lack of people to vote on them at the point of release. If there is any community knowledge in Build CouchDB that we could move upstream, and bake into our configure script or Makefiles, I would be totally behind that effort. The whole point of configure is to
Re: High latency (40ms) and low request rate (10 rps) under windows
Hmm, I see what you mean. If a single connection isn't pegging any of the hardware and multiple connections are an order of magnitude faster, I am not sure what would be keeping the single connection from performing faster. Maybe someone else can hop in with ideas? R On Oct 26, 2011, at 4:32 PM, Konstantin Cherkasoff k.cherkas...@gmail.com wrote: Hi! What happens when you set delayed_commits to true? With delayed_commits I got about 150 rps, so delayed_commits mode is 15 times faster. With that off you are requiring an fsync (IIRC) after every value written and are then at the mercy of your disk's write performance (but not in the sense of burst-write, as these are individual, disparate write requests coming in). Yes, I understand that many small writes and fsyncs can be a bottleneck in this case. But I found that increasing the number of concurrent requests led to an increase in the request rate. For example (100 concurrent requests): ab -k -t 10 -c 100 -n 100 -p ab.json -T application/json http://localhost:5984/bench shows about 300 RPS (writes and fsyncs) on the same hardware. So I suppose that 10 requests per second actually is not the limit for this hard disk, and there may be some other problem. Also, is this build Couch Single Server from Couchbase or some other Windows build of Couch? I tried Couchbase 1.2.0 and this build: https://github.com/downloads/dch/couchdb/setup-couchdb-1.1.0+COUCHDB-1152_otp_R14B03.exe -- Konstantin
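A side note on the concurrency observation above: batching is another way to amortize the per-write fsync. CouchDB's _bulk_docs endpoint commits many documents in (roughly) one go; below is a sketch in Python, where the database name "bench" and the document bodies are made-up examples and only the endpoint itself is CouchDB's real API.

    import json
    import urllib.request

    # Write 1000 docs in a single _bulk_docs request instead of 1000 POSTs,
    # amortizing the commit cost. "bench" and the doc contents are made up.
    docs = [{"value": "test", "n": i} for i in range(1000)]
    req = urllib.request.Request(
        "http://localhost:5984/bench/_bulk_docs",
        data=json.dumps({"docs": docs}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        results = json.loads(resp.read())
    print("stored %d docs in one request" % len(results))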
Re: High latency (40ms) and low request rate (10 rps) under windows
Konstantin, What happens when you set delayed_commits to true? With that off you are requiring an fsync (IIRC) after every value written and are then at the mercy of your disk's write performance (but not in the sense of burst-write, as these are individual, disparate write requests coming in). Also, is this build Couch Single Server from Couchbase or some other Windows build of Couch? -R On Tue, Oct 25, 2011 at 4:03 PM, Konstantin Cherkasoff k.cherkas...@gmail.com wrote: *** crossposted to u...@couchdb.apache.org Hi! I installed CouchDB 1.1.0 on Windows (Windows 7, 32 bit) and tried to test performance using ab.exe (Apache Benchmark, http://httpd.apache.org/docs/2.2/programs/ab.html): ab.exe -k -c 1 -n 1000 -p ab.json -T application/json http://localhost:5984/bench where ab.json contains the simplest document {"value": "test"}. This utility does 1000 POSTs with no concurrency (-c 1) in a single connection (-k for keepalive). On my laptop (it's NOT a low-cost netbook) I got 10 requests per second. So it is 0.1 second per single request. And CPU and HDD utilization is actually ZERO. I was just wondering what exactly is CouchDB doing all this time (0.1 second) with a single little record? As the utilization of CPU and HDD is 0%, I believe that they are not the bottleneck. So where is the bottleneck? P.S. All tests I did with delayed_commits = false. And I tried socket_options = [{nodelay, true}] -- Konstantin Cherkasoff
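For anyone reproducing the comparison: on CouchDB 1.x, delayed_commits can be flipped at runtime through the _config API (no restart needed), so the same ab run can be repeated under both settings. A minimal sketch follows, assuming a local server with no admin password; keep in mind delayed_commits = true trades durability for throughput.

    import json
    import urllib.request

    def set_delayed_commits(enabled):
        # PUT the new value to the 1.x runtime config API; the response
        # body is the previous value. Assumes no admin credentials needed.
        req = urllib.request.Request(
            "http://localhost:5984/_config/couchdb/delayed_commits",
            data=json.dumps("true" if enabled else "false").encode("utf-8"),
            method="PUT",
        )
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode().strip()

    print("previous value:", set_delayed_commits(True))
    # ...then re-run: ab -k -c 1 -n 1000 -p ab.json -T application/json ...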
Re: status of universal binary json in couchdb
I'd be happy to sponsor getting this work into CouchDB proper if anyone was interested. On Tue, Oct 11, 2011 at 10:59 PM, Dominic Tarr dominic.t...@gmail.com wrote: okay, interesting. that sounds a bit more difficult than the parser! but nevertheless the right approach. I'll admit now that I do not know erlang yet. I've been wanting to get into it though, and implementing ubj seemed both interesting and useful. refactoring couchdb, perhaps more challenging. good to know the lie of the land, thanks for your answer! cheers, Dominic On Wed, Oct 12, 2011 at 3:43 PM, Paul Davis paul.joseph.da...@gmail.com wrote: On Tue, Oct 11, 2011 at 11:17 PM, Dominic Tarr dominic.t...@gmail.com wrote: I read here https://plus.google.com/107397941677313236670/posts/LFBB233PKQ1 that UBJ was under consideration? The spec for ubj looks quite good, but there are not many implementations available yet. I am considering writing some in (node)js and erlang. Would support for ubj be added if there were supporting libraries available? I have searched the archives and not found any discussion on this yet. cheers, Dominic. Dominic, We've discussed the possibility (possibility, mind you) of adding it. As you note, it's quite young and hasn't reached widespread adoption, and it's always hard to guess about the future. The current status is basically that it'd be interesting to see a patch to CouchDB that abstracts the content generation for the JSON responses such that alternative content types could be implemented, along with a patch that uses that API for JSON and UBJSON. Then we'd need to look into the performance characteristics and end-user support (i.e., perhaps it could be a plugin). I won't sugar-coat it. It'll be a tough sell to get something other than JSON into CouchDB as an officially supported Content-Type. It'll take quite a bit of work marshaling the patch through the technical and community aspects of such a change. That said, if you or anyone else wants to move forward with it, I would propose a rough outline. First, a patch that abstracts the JSON generation into a generic API as well as the JSON reference implementation of that API. Second, a robust system for choosing the correct content-type based on HTTP Accept headers. And third, an alternative implementation using UBJSON. I'll restate that this would be a significant amount of work and there's always the possibility that the end result doesn't include alternative content-types in trunk. (Assuming the abstract API was solid I don't see us not accepting that patch, but I make no promises :) Other notes: There was a discussion here: http://mail-archives.apache.org/mod_mbox/couchdb-dev/201109.mbox/%3CCABn9xAGK0WbPhpkbxyKQo9N1KEsm=rx81v7fvdargpjxjnc...@mail.gmail.com%3E Also, if you're interested in writing an Erlang version, I'd be interested in a patch to Jiffy [1]. I've been meaning to sit down and write it but it's not a priority. [1] https://github.com/davisp/jiffy
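To make Paul's second step (content-type selection) concrete, here is a toy q-value-aware Accept-header negotiator, sketched in Python. None of this is CouchDB code, and "application/ubjson" is a placeholder MIME type, not an official registration.

    # Toy Accept negotiation; real-world parsing has more corner cases
    # (wildcard subtypes, malformed parameters) than shown here.
    SUPPORTED = ("application/json", "application/ubjson")

    def negotiate(accept_header):
        candidates = []
        for part in accept_header.split(","):
            pieces = part.split(";")
            mime, q = pieces[0].strip(), 1.0
            for param in pieces[1:]:
                name, _, value = param.strip().partition("=")
                if name == "q":
                    try:
                        q = float(value)
                    except ValueError:
                        q = 0.0
            candidates.append((q, mime))
        for q, mime in sorted(candidates, reverse=True):
            if q <= 0:
                continue
            if mime in SUPPORTED:
                return mime
            if mime in ("*/*", "application/*"):
                return SUPPORTED[0]  # default to JSON
        return SUPPORTED[0]

    # negotiate("application/ubjson;q=0.9, text/html")
    # -> "application/ubjson"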
[jira] [Commented] (COUCHDB-1302) Fix couchjs
[ https://issues.apache.org/jira/browse/COUCHDB-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13121871#comment-13121871 ] Riyad Kalla commented on COUCHDB-1302: -- Now is the perfect opportunity to make this change; once 1.2/2.0 goes out the door it likely couldn't be safely addressed again until 3.0, and the uptake in CouchDB (from what I see at least) seems to be on the rapid rise. Making this change in 1.2/2.0 becomes an education issue, so maybe some questions around how to be *really* helpful explaining the situation to the user, e.g. errors in the server log when the problem is encountered, with exactly how to work around it, pointing at a Wiki page with clarification explaining why; maybe warnings during startup when the server does a quick scan for the issue (that can be disabled if the user knows they have a good setup). Sure it will be painful, but trying to monkey-patch this until 3.0 (or beyond) will be much more painful and have a lot more negative impact by that time. I like Jason's suggestion of some quick-patch tool that might ship with the install that attempts the fix automatically, or at least gives the users enough tools and help/information so as to get them from Point A to B as quickly as possible. Fix couchjs --- Key: COUCHDB-1302 URL: https://issues.apache.org/jira/browse/COUCHDB-1302 Project: CouchDB Issue Type: Improvement Components: JavaScript View Server Affects Versions: 1.1.1, 1.2, 1.3 Reporter: Paul Joseph Davis Priority: Blocker Figure out why some spidermonkeys have an error when doing: eval(function(){}) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Universal Binary JSON in CouchDB
Hey Randall, This is something that Paul and I discussed on IRC. The way UBJ is written out looks something like this ([] blocks are just for readability): [o][2] [s][4][name] [s][3][bob] [s][3][age] [i][31] Couch can easily prepend or append its own dynamic content in a reply. If it wants to prepend some information after the object header, the header would need to be stored and manipulated by Couch separately. For example, if I upload the doc above, Couch would want to take that root object header of [o][2] and change it to [o][4] before storing it, because of the additions of _id and _rev. Actually, this could be as simple as storing a rootObjectCount and having Couch dynamically generate the root every time. 'o' represents object containers with <= 254 elements (1 byte for length) and 'O' represents object containers with up to 2.1 billion elements (4-byte int). If Couch did that, any request coming into the server might look like this: - client request -- (server loads root object count) - server writes back object header: [o][4] -- (server calculates dynamic data) - server writes back dynamic content - server streams raw record data straight off disk to client (no deserialization) -- (OPT: server calculates dynamic data) -- OPT: server streams dynamic data appended Thoughts? Best, Riyad P.S. There is support in the spec for unbounded container types for when Couch doesn't know how much it is streaming back, but that isn't necessary for retrieving stored docs (though it could be handy when responding to view queries and other requests whose length is not known in advance). On Tue, Oct 4, 2011 at 12:02 PM, Randall Leeds randall.le...@gmail.com wrote: Hey, Thanks for this thread. I've been interested in ways to reduce the work from disk to client as well. Unfortunately, the metadata inside the document objects is variable based on query parameters (_attachments, _revisions, _revs_info...) so the server needs to decode the disk binary anyway. I would say this is something we should carefully consider for a 2.0 API. I know that, for simplicity, many people really like having the underscore-prefixed attributes mixed in right alongside the document data, but a future API that separated these could really make things fly. -Randall On Wed, Sep 28, 2011 at 22:25, Benoit Chesneau bchesn...@gmail.com wrote: On Thursday, September 29, 2011, Riyad Kalla rka...@gmail.com wrote: DISCLAIMER: This looks long, but reads quickly (I hope). If you are in a rush, just check the last 2 sections and see if it sounds interesting. Hi everybody. I am new to the list, but a big fan of Couch, and I have been working on something I wanted to share with the group. My apologies if this isn't the right venue or list etiquette... I wasn't really sure where to start with this conversation. Background = With the help of the JSON spec community I've been finalizing a universal, binary JSON format specification that offers 1:1 compatibility with JSON. The full spec is here (http://ubjson.org/) and the quick list of types is here (http://ubjson.org/type-reference/). Differences with existing specs and "Why?" are all addressed on the site in the first few sections. The goal of the specification was first to maintain 1:1 compatibility with JSON (no custom data structures - like what caused BSON to be rejected in Issue #702), secondly to be as simple to work with as regular JSON (no complex data structures or encoding/decoding algorithms to implement) and lastly, it had to be smaller than compacted JSON and faster to generate and parse.
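To make the byte layout above concrete, here is a minimal Python encoder covering just the short forms in Riyad's example ('o', 's', and 'i'). Treating 'i' as a single signed byte is an assumption of this sketch, and the long 'O'/'S' forms are omitted; ubjson.org is the authority on the marker definitions.

    import struct

    def encode_ubjson(obj):
        # Short object form: 'o' marker + 1-byte element count (<= 254).
        if isinstance(obj, dict):
            out = b"o" + struct.pack(">B", len(obj))
            for key, value in obj.items():
                out += encode_ubjson(key) + encode_ubjson(value)
            return out
        # Short string form: 's' marker + 1-byte length + UTF-8 bytes.
        if isinstance(obj, str):
            data = obj.encode("utf-8")
            return b"s" + struct.pack(">B", len(data)) + data
        # Small integer: 'i' marker + 1 signed byte (an assumption here).
        if isinstance(obj, int):
            return b"i" + struct.pack(">b", obj)
        raise TypeError("type not covered by this sketch: %r" % type(obj))

    # The example doc from the mail encodes to 20 bytes:
    # encode_ubjson({"name": "bob", "age": 31})

The header-rewrite trick Riyad describes falls out of this layout: since the root count is a single byte right after the 'o' marker, the server can patch just that byte (or regenerate the 2-byte header from a stored count) before streaming the rest of the record untouched.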
Using a test doc that I see Filipe reference in a few of his issues (http://friendpaste.com/qdfyId8w1C5vkxROc5Thf) I get the following compression: * Compacted JSON: 3,861 bytes * Univ. Binary JSON: 3,056 bytes (20% smaller) In some other sample data (e.g. jvm-serializers sample data) I see a 27% compression, with a typical compression range of 20-30%. While these compression levels are average, the data is written out in an unmolested format that is optimized for read speed (no scanning for null terminators) and criminally simple to work with. (win-win) I added more clarifying information about compression characteristics in the Size Requirements section of the spec for anyone interested. Motivation == I've been following the discussions surrounding a native binary JSON format for the core CouchDB file (Issue #1092) which transformed into keeping the format and utilizing Google's Snappy (Issue #1120) to provide what looks to be roughly a 40-50% reduction in file size at the cost of running the compression/decompression on every read/write. I realize in light of the HTTP transport and JSON encoding/decoding cycle
Re: Universal Binary JSON in CouchDB
Tim's work is certainly the catalyst for my excitement about the potential here for Couch. As Paul pointed out, the correct discussion to have at this point is really about "do we support a binary format for responses, and if so, which one?" That discussion could go on for an eternity with everyone voting for their favorite (Protobuf, Smile, MessagePack, etc.). The only reason I bring the disk store format discussion into this conversation is to offer a hat-tip to a future where a binary response format selected now may dovetail nicely with alternative binary disk formats, enabling the stream-directly-from-disk scenario. If we were to hypothetically remove the possibility of the on-disk format ever changing, then I suppose the decision of binary response format just becomes an issue of "Which one is fast and easy to generate?" -R On Tue, Oct 4, 2011 at 12:49 PM, Ladislav Thon ladi...@gmail.com wrote: That said, the ubjson spec is starting to look reasonable and capable to be an alternative content-type produced by CouchDB. If someone were to write a patch I'd review it quite enthusiastically. Just FYI, there's experimental support for MessagePack by Tim Anglade: https://github.com/timanglade/couchdb/commits/msgpack I thought it might be interesting in this debate... Tim says it improves performance quite a bit: http://blog.cloudant.com/optimizing-couchdb-calls-by-99-percent/ (Tim, if you're reading this, thanks for the excellent talk!) LT
Universal Binary JSON in CouchDB
DISCLAIMER: This looks long, but reads quickly (I hope). If you are in a rush, just check the last 2 sections and see if it sounds interesting. Hi everybody. I am new to the list, but a big fan of Couch, and I have been working on something I wanted to share with the group. My apologies if this isn't the right venue or list etiquette... I wasn't really sure where to start with this conversation. Background = With the help of the JSON spec community I've been finalizing a universal, binary JSON format specification that offers 1:1 compatibility with JSON. The full spec is here (http://ubjson.org/) and the quick list of types is here (http://ubjson.org/type-reference/). Differences with existing specs and "Why?" are all addressed on the site in the first few sections. The goal of the specification was first to maintain 1:1 compatibility with JSON (no custom data structures - like what caused BSON to be rejected in Issue #702), secondly to be as simple to work with as regular JSON (no complex data structures or encoding/decoding algorithms to implement) and lastly, it had to be smaller than compacted JSON and faster to generate and parse. Using a test doc that I see Filipe reference in a few of his issues (http://friendpaste.com/qdfyId8w1C5vkxROc5Thf) I get the following compression: * Compacted JSON: 3,861 bytes * Univ. Binary JSON: 3,056 bytes (20% smaller) In some other sample data (e.g. jvm-serializers sample data) I see a 27% compression, with a typical compression range of 20-30%. While these compression levels are average, the data is written out in an unmolested format that is optimized for read speed (no scanning for null terminators) and criminally simple to work with. (win-win) I added more clarifying information about compression characteristics in the Size Requirements section of the spec for anyone interested. Motivation == I've been following the discussions surrounding a native binary JSON format for the core CouchDB file (Issue #1092) which transformed into keeping the format and utilizing Google's Snappy (Issue #1120) to provide what looks to be roughly a 40-50% reduction in file size at the cost of running the compression/decompression on every read/write. I realize in light of the HTTP transport and JSON encoding/decoding cycle in CouchDB, the Snappy compression cycles are a very small part of the total time the server spends working. I found this all interesting, but like I said, I realized up to this point that Snappy wasn't any form of bottleneck and the big compression wins server side were great, so I had nothing to contribute to the conversation. Catalyst == This past week I watched Tim Anglade's presentation (http://goo.gl/LLucD) and started to foam at the mouth when I saw his slides where he skipped the JSON encode/decode cycle server-side and just generated straight from binary on disk into MessagePack and got some phenomenal speedups from the server: http://i.imgscalr.com/XKqXiLusT.png I pinged Tim to see what the chances of adding Univ Binary JSON support were and he seemed amenable to the idea as long as I could hand him an Erlang or Ruby impl (unfortunately, I am not familiar with either). ah-HA!
moment == Today it occurred to me that if CouchDB were able to (at the cost of 20% more disk space than it is using with Snappy enabled, but still 20% *less* than before Snappy was integrated) use the Universal Binary JSON format as its native storage format AND support for serving replies using the same format was added (a la Tim's work), this would allow CouchDB to (theoretically) reply to queries by pulling bytes off disk (or memory) and immediately streaming them back to the caller with no intermediary step at all (no Snappy decompress, no Erlang decode, no JSON encode). Given that the Univ Binary JSON spec is standard, easy to parse and simple to convert back to JSON, adding support for it seemed more consistent with Couch's motto of ease and simplicity than, say, MessagePack or Protobuf, which provide better compression but at the cost of more complex formats and data types that have no analog in JSON. I don't know the intricacies of Couch's internals; if that is wrong and some Erlang manipulation of the data would still be required, I believe it would still be faster to pull the data off disk in the Univ Binary JSON format, decode to Erlang native types and then reply, while skipping the Snappy decompression step. If it *would* be possible to stream it back untouched directly from disk, that seems like an enhancement that could potentially speed up CouchDB by at least an order of magnitude. Conclusion === I would appreciate any feedback on this idea from you guys with a lot more knowledge of the internals. I have no problem if this is a horrible idea and never going to happen; I just wanted to try and contribute something back. Thank you all for reading.
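As a rough, non-scientific illustration of the step this proposal eliminates, the sketch below compares a decode-plus-re-encode round trip against a raw byte pass-through in Python; the document is made up, the absolute numbers are meaningless, and only the ratio is suggestive.

    import json
    import timeit

    # Stand-in for bytes already stored on disk in the wire format.
    doc = {"name": "bob", "age": 31, "tags": ["a", "b", "c"] * 200}
    stored = json.dumps(doc).encode("utf-8")

    # Today's path: decode the stored bytes, then re-encode for the client.
    roundtrip = timeit.timeit(
        lambda: json.dumps(json.loads(stored)), number=10_000)
    # Proposed path: ship the stored bytes as-is.
    passthrough = timeit.timeit(lambda: bytes(stored), number=10_000)
    print(f"decode+encode: {roundtrip:.3f}s  pass-through: {passthrough:.3f}s")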