Re: [ANNOUNCE] Alexander Shorin joins the PMC

2014-02-22 Thread Riyad Kalla
Congrats Alexander; a very passionate and intelligent engineer that CouchDB
is lucky to have engaged.


On Fri, Feb 21, 2014 at 1:08 PM, Noah Slater nsla...@apache.org wrote:

 Dear community,

 I am delighted to announce that Alexander Shorin joins the Apache
 CouchDB Project Management Committee today.

 Alexander has made outstanding, sustained contributions to the
 project. This appointment is an official acknowledgement of his
 position within the community, and our trust in his ability to provide
 oversight for the project.

 Everybody, please join me in congratulating Alexander!

 On behalf of the CouchDB PMC,

 --
 Noah Slater
 https://twitter.com/nslater



[jira] [Commented] (COUCHDB-1754) Implement ubjson support

2013-03-28 Thread Riyad Kalla (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616229#comment-13616229
 ] 

Riyad Kalla commented on COUCHDB-1754:
--

There might be two interesting SOC projects in relation to UBJSON:

1. Supporting UBJSON over-the-wire (I assume that is what this ticket is for?)
2. Supporting UBJSON as the on-disk binary data format for the CouchDB DB file, 
allowing records to be streamed directly off of disk with no serialization 
step. At least this was the original intent of UBJSON.

I know there are likely a boatload of details around #2 that I am not aware of 
that might make it an unlikely SOC project, but I just wanted to throw it out 
there in case anyone was interested. I've always been curious to see what kind 
of performance water-level Couch could hit if we could remove the 
serialization/deserialization steps.
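
To make the over-the-wire idea concrete, here is a tiny sketch of what a
UBJSON-style encoding looks like for a couple of scalar types. This is a
hypothetical helper based on my reading of the draft spec at ubjson.org, not
anything that exists in Couch:

def ub_int8(n):
    # 'i' marker followed by one signed byte (per my reading of the draft spec)
    return b"i" + n.to_bytes(1, "big", signed=True)

def ub_str(s):
    # 'S' marker, a length (kept to int8 here for brevity), then the UTF-8 bytes
    data = s.encode("utf-8")
    return b"S" + ub_int8(len(data)) + data

def ub_array(encoded_items):
    # '[' ... ']' with every element carrying its own type marker
    return b"[" + b"".join(encoded_items) + b"]"

# ["doc-1", 42] becomes a self-describing binary stream that could, in theory,
# be handed to a client or scanned off disk without a text-parse step.
print(ub_array([ub_str("doc-1"), ub_int8(42)]).hex(" "))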

 Implement ubjson support
 

 Key: COUCHDB-1754
 URL: https://issues.apache.org/jira/browse/COUCHDB-1754
 Project: CouchDB
  Issue Type: Improvement
  Components: HTTP Interface
Reporter: Paul Joseph Davis
  Labels: gsoc2013, http

 Just remembered this thing. For a GSOC project it might prove useful as a 
 "what would it take to support alternative data formats that are 
 representable as JSON structures".
 http://ubjson.org/

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (COUCHDB-1754) Implement ubjson support

2013-03-28 Thread Riyad Kalla (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616242#comment-13616242
 ] 

Riyad Kalla commented on COUCHDB-1754:
--

Fair point; I think we could profile Couch under load and get a rough idea of 
what the serialization costs are (as far as a perf payoff goes), but I certainly 
understand the position that if we are going to be in there hacking up the 
world, let's fix it for everything instead of just this one-off.

Certainly agree with that.

 Implement ubjson support
 

 Key: COUCHDB-1754
 URL: https://issues.apache.org/jira/browse/COUCHDB-1754
 Project: CouchDB
  Issue Type: Improvement
  Components: HTTP Interface
Reporter: Paul Joseph Davis
  Labels: gsoc2013, http

 Just remembered this thing. For a GSOC project it might prove useful as a 
 "what would it take to support alternative data formats that are 
 representable as JSON structures".
 http://ubjson.org/

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Unique instance IDs?

2012-02-02 Thread Riyad Kalla
Robert, to your point, when would you copy an entire DB to another live
server and *not* intend the two to be replicas of each other?

The UUID would be akin to a social security number for the docs: if I
have a doc A on Server1 with an SSN of 333-90-3231, and that same doc
existed on other servers, say Servers 2 and 3, I would *expect* any doc
claiming that SSN to be making a claim of identity with every other doc
carrying the same number.

The same way you might resolve a record for someone at the DMV along with
the IRS to establish they are the same person.

The same way two docs in two different locations with the same ID are an
indicator of identity. Especially with a UUID (i.e. the chances of that
equality existing by chance as compared to a sequential value are
astronomically small).

Seems like UUIDs are an excellent way to go for doc IDs if you need them
(and not some more domain-specific ID).

On Thu, Feb 2, 2012 at 8:43 AM, Kevin R. Coombes
kevin.r.coom...@gmail.com wrote:

 I don't think it's such a slam-dunk, but maybe I'm guilty of a
 goal-tending violation.

 It depends on whether you intend to keep replicating the two copies to
 keep them synchronized.  If you want that (and most of the time, I do),
 then I think it's an advantage to keep the same UUID.  The (matching)
 internal UUIDs indicates that you think of these two things as the same
 abstract entity.  The full URL distinguishes the physical copies.

 If on the other hand, you expect the copies to evolve into something
 different over time, then I can see how you might want different UUIDs.
  But again, the full, easy-to-construct URL distinguishes them.

 As far as I can tell, the existing system still lets you have it both
 ways


 On 2/2/2012 8:45 AM, Robert Newson wrote:

 ... until you copy the database (and its uuid) and have two databases
 with the same uuid. This has always been the slam-dunk argument
 against database uuid's.

 B.

 On 2 February 2012 09:41, Kevin R. 
 Coombes kevin.r.coombes@gmail.com
  wrote:

 For CouchDB, I think UUIDs are clearly the way to go.  After all, given
 the
 UUID, database,  and hostname, you can construct the desired URL
 directly by
 forming
http://hostname:5984/database/UUID
 As Noah points out, if you used this entire URL as the identifier (by
 which
 I assume he means the _id field), then you would lose the ability to copy
 the document elsewhere.  This would, of course, break replication
 completely.

 Keeping the UUIDs as they are gives the best of both worlds.  Easy
 replication, and (as long as the database is hosted at the same place) an
 easy way for humans and programs to construct stable URIs or URLs that
 point
 to each document.

-- Kevin


 On 1/22/2012 12:44 PM, Noah Slater wrote:

 Sorry to bump this old thread, but just going through my backlog.

 With regard to URLs, I think there is some confusion about the purpose
 of
 a
 URL here.

 If I write a cool essay, say, and I stick that up at
 nslater.org/my-cool-essay, then I can link to it from other places on
 the
 web using that address. I might also want to put my cool essay on
 Dropbox,
 or post it to Tumblr, or send it in an email. Now my cool essay has lots
 of
 URLs. Each one of them perfectly valid. I don't have to go and edit the
 original copy at nslater.org/my-cool-essay, because I am making copies
 of
 it. My cool essay is completely unaware of the URLs that are being used
 to
 point to it. And it doesn't care that many URLs point to it.

 Yes, URLs can be used as identifiers. But when you do this, you tie the
 thing you're naming to the place you're hosting it. Sometimes that is
 useful, other times it will cripple you. There is nothing about URLs
 that
 requires you to do this. I would hazard a guess that 99% of URLs are
 de-coupled from the things they point to. WebArch is much more robust
 when
 the identity of the object is de-coupled from the URL. Look at Atom, the
 ID
 element is supposed to be a URL, but they recommend a non-dereferencable
 format, precisely to decouple posts from the location you happen to be
 hosting them this month.

 Hey, if we're gonna use URLs, maybe we want to go down the same route?

 http://en.wikipedia.org/wiki/Tag_URI


 At this point, I'm not sure what they buy us over UUIDs.

 Thoughts?

 Thanks,

 N




Re: How are key values stored in B+ tree nodes on-disk?

2012-01-23 Thread Riyad Kalla
resending, Gmail is telling me this never sent...


Randall and Paul, I appreciate the explanation and links to the exact spots
in code. This helped a lot.

I've been reading up on alternative database impls; InnoDB, for example,
caps the key length at some upper bound, truncates the key, and puts the
remainder in an overflow table; naturally this requires one more random
I/O every time you match a long key. In the extreme case (where no key fits
entirely and parts of every key live in the overflow table) I think you end
up with the key-table idea you were toying with, Paul, or something related.

Paul, I feel your pain re: alternative approaches to this. If you can do
away with RANGE queries, moving to MD5 or SHA hashes (or something even
faster like MurmurHash) is a nice way to get consistently sized keys, which
gives you the advantage of paging in a block as raw bytes and only
deserializing the 16- or 20-byte chunks at the positions a standard binary
search visits, instead of deserializing the whole node.
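
Roughly what I have in mind, as a sketch (a hypothetical node layout, not
Couch's actual on-disk format): with fixed 16-byte MD5 keys a node is just a
flat byte array, and the binary search never has to decode the keys it skips
over.

import hashlib

KEY_SIZE = 16  # MD5 digest length

def build_node(doc_ids):
    # A node is just N sorted, fixed-size hashed keys packed back to back.
    return b"".join(sorted(hashlib.md5(d.encode()).digest() for d in doc_ids))

def find(node, doc_id):
    # Classic binary search; each probe slices out exactly one 16-byte key.
    target = hashlib.md5(doc_id.encode()).digest()
    lo, hi = 0, len(node) // KEY_SIZE
    while lo < hi:
        mid = (lo + hi) // 2
        key = node[mid * KEY_SIZE:(mid + 1) * KEY_SIZE]
        if key < target:
            lo = mid + 1
        elif key > target:
            hi = mid
        else:
            return mid  # slot index; a real node would carry an offset per slot
    return None

node = build_node(f"doc-{i}" for i in range(133))
assert find(node, "doc-42") is not None

The price is exactly the one above: hashing destroys key ordering, so RANGE
queries are gone.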

I *really* want to find a way to make that work, but people seem to like
RANGE queries.

I asked this question on SO and a gentleman suggested a header at the
beginning of the block that lists the offsets of all the keys in the node.
For node/block sizes of 32KB or smaller (which seem reasonable), you could
safely get away with a 16-bit int for storing each offset; assuming a typical
key size, I imagine you wouldn't have more than 100 or 200 keys in a node;
assume 150, and that is 300 bytes worth of offsets at the beginning of the
block. So we spend roughly 7% (assuming a 4,096-byte block size) of the
block providing offsets to help make key searching faster. If you were
willing to cap blocks at no more than 255 keys you could use 8-bit offset
values (meaning any key smaller than 8 bytes would get you in trouble).

A limitation of this idea, though, is that when you need to overflow a single
block because you are indexing so many (or such large) keys, you end up with
two blocks next to each other, one of them having no header.

You can fix this by having a hard-set block size for the DB (say 4KB) that
never changes and can never be configured. That way every block,
everywhere, has a very tiny type header; maybe 8 bits. If the "normal" bit
is set, then that block has an offset list at the beginning of it
before the key data. If the "overflow" bit is set, then that block is a
continuation of the block before it, and right after the 8-bit block header
is raw key data that needs to be considered.
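
To make the shape of that concrete, here is a rough sketch (purely
illustrative layout and field sizes, not a proposal for Couch's actual
format) of a fixed 4KB block carrying a 1-byte type header and a 16-bit
offset directory, which keeps binary search possible even with
variable-length keys:

import struct

BLOCK_SIZE = 4096
TYPE_NORMAL = 0x01    # block starts with an offset directory
TYPE_OVERFLOW = 0x02  # continuation block: raw key data only (not shown here)

def pack_block(keys):
    # Layout: [type:1][count:2][(count+1) x offset:2][key bytes...]
    # Offsets point at each sorted key; the extra final offset marks the end
    # of the key data, so key i spans offsets[i]..offsets[i+1].
    keys = sorted(keys)
    pos = 1 + 2 + 2 * (len(keys) + 1)
    offsets = []
    for k in keys:
        offsets.append(pos)
        pos += len(k)
    offsets.append(pos)
    block = struct.pack("<BH", TYPE_NORMAL, len(keys))
    block += struct.pack("<%dH" % len(offsets), *offsets) + b"".join(keys)
    assert len(block) <= BLOCK_SIZE
    return block.ljust(BLOCK_SIZE, b"\x00")

def search(block, target):
    # Binary search that jumps straight to each key via the offset directory.
    kind, count = struct.unpack_from("<BH", block, 0)
    assert kind == TYPE_NORMAL
    offsets = struct.unpack_from("<%dH" % (count + 1), block, 3)
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        key = block[offsets[mid]:offsets[mid + 1]]
        if key < target:
            lo = mid + 1
        elif key > target:
            hi = mid
        else:
            return mid
    return None

block = pack_block([b"apple", b"kiwi", b"pear", b"a-much-longer-document-id"])
assert search(block, b"kiwi") is not None

The directory costs 2 bytes per key plus a few bytes of header, which is the
~300-byte / ~7% overhead mentioned above for ~150 keys in a 4KB block.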

Just an idea, I am sure there are more elegant ways to solve this. I am
more curious if there is something here... something interesting and worth
considering, or if I am just enjoying the smell of my own farts over here ;)

-R

On Mon, Jan 23, 2012 at 1:39 AM, Paul Davis paul.joseph.da...@gmail.com wrote:

 On Mon, Jan 23, 2012 at 2:26 AM, Randall Leeds randall.le...@gmail.com
 wrote:
  On Sun, Jan 22, 2012 at 04:49, Riyad Kalla rka...@gmail.com wrote:
  In my ever-growing need to ask the most esoteric file-format questions
  possible, I am curious if anyone can point me in the direction of the
  answer here...
 
  Background
  -
  Consider the _id and _seq B+ indices whose change-paths are appended to
 the
  end of the Couch data file after every update. Each of the nodes written
  have a basic structure of [key, value] where the key is the _id or
 _seq
  value being indexed, and the value is an offset pointer to the next
 node
  in the tree to seek to (or in the case of a leaf, an offset directly
 into
  the datafile where the updated record lives).
 
  General
  ---
  Efficiently searching the keys stored in a node of a B+ tree after it
 has
  been paged in require (let's assume) that the values are in sorted
 order;
  then the keys can be binary-searched to find the next offset or block ID
  on-disk to jump to.
 
  To be able to efficiently jump around key values like that, they must
 all
  be of the same size. For example, let's say we store 133 keys per node
  (assume it is full). That means our first comparison after paging in
 that
  block should be at the 67th key. The only way I can jump to that
 position
  is to know every key value is say 16 bytes and then jump forward 67 *
 16 =
  1072 bytes before decoding the key from raw bytes and comparing.
 
  Question
  -
  Given the massive potential for document _id values to vary in size, it
  seems impossible to have a hard-coded key size that couch utilizes
 here.
  Even something sufficiently large like 32 or 64 bytes would still run
 into
  situations where a longer key was being indexed or a *very short* key is
  being indexed and there are multiple magnitudes of wasted space in the
  index.
 
  The few solutions I've found to variable-length keys online require a
  sequential processing of the keys in a node because the keys will be
  written like:
  [key length][key data][offset]
 
  This is certainly doable

How are key values stored in B+ tree nodes on-disk?

2012-01-22 Thread Riyad Kalla
In my ever-growing need to ask the most esoteric file-format questions
possible, I am curious if anyone can point me in the direction of the
answer here...

Background
-
Consider the _id and _seq B+ indices whose change-paths are appended to the
end of the Couch data file after every update. Each of the nodes written
has a basic structure of [key, value], where the key is the _id or _seq
value being indexed, and the value is an offset pointer to the next node
in the tree to seek to (or, in the case of a leaf, an offset directly into
the datafile where the updated record lives).

General
---
Efficiently searching the keys stored in a node of a B+ tree after it has
been paged in requires (let's assume) that the values are in sorted order;
then the keys can be binary-searched to find the next offset or block ID
on disk to jump to.

To be able to efficiently jump around key values like that, they must all
be of the same size. For example, let's say we store 133 keys per node
(assume it is full). That means our first comparison after paging in that
block should be at the 67th key. The only way I can jump to that position
is to know every key value is, say, 16 bytes and then jump forward 67 * 16 =
1072 bytes before decoding the key from raw bytes and comparing.

Question
-
Given the massive potential for document _id values to vary in size, it
seems impossible to have a hard-coded key size that Couch utilizes here.
Even something sufficiently large like 32 or 64 bytes would still run into
situations where a longer key is being indexed, or a *very short* key is
being indexed and there are orders of magnitude of wasted space in the
index.

The few solutions I've found to variable-length keys online require a
sequential processing of the keys in a node because the keys will be
written like:
[key length][key data][offset]

This is certainly doable, but less than optimal.
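
For contrast, here is a sketch of what reading such a node sequentially looks
like (hypothetical [u16 length][key][u64 offset] entries, not Couch's real
encoding); every lookup has to walk and decode entries front to back instead
of binary searching:

import struct

def pack_node(entries):
    # entries: (key bytes, child/file offset) pairs, assumed already sorted
    out = b""
    for key, child in entries:
        out += struct.pack("<H", len(key)) + key + struct.pack("<Q", child)
    return out

def scan_node(buf, target):
    # Walk [key_len:2][key bytes][child_offset:8] entries one by one;
    # every key before the match has to be decoded and compared.
    pos = 0
    while pos < len(buf):
        (key_len,) = struct.unpack_from("<H", buf, pos)
        pos += 2
        key = buf[pos:pos + key_len]
        pos += key_len
        (child,) = struct.unpack_from("<Q", buf, pos)
        pos += 8
        if key == target:
            return child
    return None

node = pack_node([(b"alpha", 100), (b"omega", 200)])
assert scan_node(node, b"omega") == 200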


I am curious: how does CouchDB handle this?

Thank you for any pointers.

-R


Re: RFC: Releasing 1.2.0

2012-01-09 Thread Riyad Kalla
I know this discussion took place sometime after the last release, but was
it decided which platform binaries would be provided for the upcoming 1.2
release to help ease adoption?

-R

On Mon, Jan 9, 2012 at 4:24 PM, Noah Slater nsla...@tumbolia.org wrote:

 If a bug is marked as fixed in X version, does that not provide
 this possibility?

 Other than that, are we ready to go?

 On Mon, Jan 9, 2012 at 10:19 PM, Filipe David Manana fdman...@apache.org
 wrote:

  On Mon, Jan 9, 2012 at 9:40 PM, till t...@php.net wrote:
   Just looked at CHANGES, a few questions:
 
  Hi Till, answers below.
 
  
   1) Can anyone go into detail on the replicator? (E.g. where are the
   configuration options.)
 
  * https://issues.apache.org/jira/browse/COUCHDB-1024
  * http://s.apache.org/KsY
  *
 
 http://wiki.apache.org/couchdb/Replication#New_features_introduced_in_CouchDB_1.2.0
 
  There's more stuff out there I don't recall at the moment.
 
  
   2) Can anyone mention how snappy/zlib impacts performance?
 
  * https://issues.apache.org/jira/browse/COUCHDB-1120
 
  
   Most of the stuff is a little vague for my taste – maybe the Jira
 issues
   explain the additions better. If so, should they be linked always?
 
  Personally, I would like to link a list of jira tickets to each release.
  I don't recall if this possibility was ever discussed before.
 
  
   Till
  
   On Mon, Jan 9, 2012 at 10:34 PM, till t...@php.net wrote:
  
   Can haz release? :)
  
  
   On Fri, Jan 6, 2012 at 11:34 PM, Randall Leeds 
 randall.le...@gmail.com
  wrote:
  
   On Fri, Jan 6, 2012 at 13:22, Robert Newson rnew...@apache.org
  wrote:
I'm checking it now. Jan did his JIRA dance, so I think we're clean
  on
that end. I'm looking into whether all the fixes got backported,
 etc,
and I'll also read over the full 1.1.x-1.2.x diff.
  
   Just checked over my commits and everything appropriate is
 back-ported.
  
   Good to go from my end. +1 on calling the vote.
  
   
B.
   
On 6 January 2012 21:19, Noah Slater nsla...@tumbolia.org wrote:
Hey chaps,
   
Let's pull our finger out and get 1.2.0 shipped.
   
Do we have everything in trunk that we want in 1.2.0, or are there
  any
patches y'all are sitting on?
   
Also, can you all check NEWS and CHANGES to make sure they are up
 to
   date?
   
Thanks all,
   
N
  
  
  
 
 
 
  --
  Filipe David Manana,
 
  Reasonable men adapt themselves to the world.
   Unreasonable men adapt the world to themselves.
   That's why all progress depends on unreasonable men.
 



Re: Understanding the CouchDB file format

2011-12-22 Thread Riyad Kalla
Jan,

Thank you and yes, I'd be happy to contribute this to the wiki.

I made some edits early this morning after some feedback. If a few folks
in-the-know could give it a quick read-through to make sure I didn't get
anything wrong then I'm happy to work it up on the wiki as well (or send it
along to the wiki's editor for inclusion).

Just point the way.

Best,
Riyad

On Thu, Dec 22, 2011 at 2:21 AM, Jan Lehnardt j...@apache.org wrote:

 Good writeup! Would you consider contributing it to the CouchDB Wiki?

  http://wiki.apache.org/couchdb/

 Cheers
 Jan
 --

 On Dec 21, 2011, at 21:28 , Riyad Kalla wrote:

  Bob,
 
  Really appreciate the link; Rick has a handful of articles that helped a
  lot.
 
  Along side all the CouchDB reading I've been looking at SSD-optimized
 data
  storage mechanisms and tried to coalesce all of this information into
 this
  post on Couch's file storage format:
  https://plus.google.com/u/0/107397941677313236670/posts/CyvwRcvh4vv
 
  It is uncanny how many things Couch seems to have gotten right with
 regard
  to existing storage systems and future flash-based storage systems. I'd
  appreciate any corrections, additions or feedback to the post for anyone
  interested.
 
  Best,
  R
 
  On Wed, Dec 21, 2011 at 12:53 PM, Robert Dionne 
  dio...@dionne-associates.com wrote:
 
  I think this is largely correct Riyad, I dug out an old article[1] by
 Rick
  Ho that you may also find helpful though it might be slightly dated.
  Generally the best performance will be had if the ids are sequential and
  updates are done in bulk. Write heavy applications will eat up a lot of
  space and require compaction. At the leaf nodes what are stored are
 either
  full_doc_info records or doc_info records which store pointers to the
 data
  so the main thing that impacts the branching at each level are the key
 size
  and in the case of views the sizes of the reductions as these are stored
  with the intermediate nodes.
 
  All in all it works pretty well but as always you need to test and
  evaluate it for you specific case to see what the limits are.
 
  Regards,
 
  Bob
 
 
  [1] http://horicky.blogspot.com/2008/10/couchdb-implementation.html
 
 
 
 
  On Dec 21, 2011, at 2:17 PM, Riyad Kalla wrote:
 
  Adding to this conversation, I found this set of slides by Chris
  explaining
  the append-only index update format:
  http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed
 
  Specifically slides 16, 17 and 18.
 
  Using this example tree, rewriting the updated path (in reverse order)
  appended to the end of the file makes sense... you see how index
 queries
  can simply read backwards from the end of the file and not only find
 the
  latest revisions of docs, but also every other doc that wasn't touched
  (it
  will just seek into the existing inner nodes of the b+ tree for
  searching).
 
  What I am hoping for clarification on is the following pain-point that
 I
  perceive with this approach:
 
  1. In a sufficiently shallow B+ tree (like CouchDB), the paths
 themselves
  to elements are short (typically no more than 3 to 5 levels deep) as
  opposed to a trie or some other construct that would have much longer
  paths
  to elements.
 
  2. Because the depth of the tree is so shallow, the breadth of it
 becomes
  large to compensate... more specifically, each internal node can have
  100s,
  1000s or more children. Using the example slides, consider the nodes
  [A...M] and [R...Z] -- in a good sized CouchDB database, those internal
  index nodes would have 100s (or more) elements in them pointing at
 deeper
  internal nodes that themselves had thousands of elements; instead of
 the
  13
  or so as implied by [A...M].
 
  3. Looking at slide 17 and 18, where you see the direct B+ tree path to
  the
  update node getting appended to the end of the file after the revision
 is
  written (leaf to root ordering: [J' M] - [A M] - [A Z]) it implies
 that
  those internal nodes with *all* their child elements are getting
  rewritten
  as well.
 
  In this example tree, it is isn't such a big issue... but in a
  sufficiently
  large CouchDB database, these nodes denoted by [A...M] and [A...Z]
 could
  be
  quite large... I don't know the format of the node elements in the B+
  tree,
  but it would be whatever the size of a node is times however many
  elements
  are contained at each level (1 for root, say 100 for level 2, 1000 for
  level 3 and 10,000 for level 4 -- there is a lot of hand-waving going
 on
  here, of course it depends on the size of the data store).
 
  Am I missing something or is CouchDB really rewriting that much index
  information between document revisions on every update?
 
  What was previously confusing me is I thought it was *only* rewriting a
  direct path to the updated revision, like [B][E][J'] and Couch was
  some-how patching in that updated path info to the B+ index at runtime.
 
  If couch is rewriting entire node paths with all their elements then I
 am

Re: Understanding the CouchDB file format

2011-12-22 Thread Riyad Kalla
On Thu, Dec 22, 2011 at 12:38 PM, Robert Newson rnew...@apache.org wrote:

 It reads well as an article but needs some polish before it could be a
 great wiki page. I suggest that if it does go up, it is clearly marked
 as a draft, and we all chip in to sculpt it into shape.


Great idea. Agreed that it is nowhere near as clean/factual as it would need
to be for a proper wiki/ref-doc-style entry, but if it can be massaged into
that shape, it would hopefully make a great resource.


 There are a
 few parts of the article that are inaccurate (like the assertion we
 have good locality for the id and seq trees. If this were true we
 wouldn't have seen such a huge improvement in compaction by
 temporarily separating them).


I'd look forward to more detail on this... it was my understanding the
updates were appended out in the [doc rev][_id idx diff][_seq idx diff]
format at the end of the data file. Sounds like I may have misunderstood
that?


 The 'COMPETE recreation' paragraph also
 strikes me as factually awry.


I'd appreciate a detailed correction on this if it is wrong; all the
digging I've done (in this thread and other partial resources) suggests
that the path from the changed doc ref back to the root (including a copy
of all internal nodes and all of their child references) is written out so
that the index can be read back quickly from the tail of the data file...
specifically slides 17, 18 and 19 from this slide deck (
http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed) -- note that
the interim nodes [A..M] and [A..Z] are rewritten (including any and all
child pointers they contained).

This is what I was referring to; either my wording is confusing and I should
clean it up, or I got it wrong, in which case I'd appreciate any and all
corrections.

Thanks for helping move this forward and the feedback Robert.

-Riyad


Re: Understanding the CouchDB file format

2011-12-22 Thread Riyad Kalla
Randall,

Spot on; we are on the same page now. I'll go through the post again
tomorrow morning and reword it a bit so the thoughts aren't so fragmented.

Best,
Riyad

P.S. Are there discussions anywhere on alternative file formats for the
indices that the Couch community has considered in the past? (old JIRA or
mailing list discussions) or has this file format been the only approach
from the initial work started in 05/06?

The reason I ask is because of the index fragmentation you mentioned that
can occur over time... I'd be curious if an in-place-edited format for the
indices as separate file(s) from the append-only document revisions would
yield better performance over time. The splits would still be expensive,
but you'd be able to pre-allocate your blocks for each node on each split
to help a bit. Even leveraging an append-only, pre-allocated approach to
the index might work where split nodes are appended to the end of the index
file with the full node size (e.g. 133 elements) pre-allocated and the
index marked ready for compaction or something.

So it would be a hybrid approach... when splits aren't needed, you do an
in-place edit to the index on disk. If a split is needed, you use a
technique similar to the current one, where the affected node hierarchy is
appended to the end of the file.

You still run the risk of costly index rebuilds in the case of a failed
in-place edit, but you wouldn't run the risk of any data loss... I am just
wondering, for large data sets, whether this would yield some significant
benefits, putting Couch somewhere between Mongo and today's Couch
(performance-wise).
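
A toy, in-memory stand-in for that hybrid idea (purely a sketch of the
decision logic, with made-up names and none of the real durability concerns):
inserts that fit edit the node's slot in place, and only a split appends new
pre-allocated slots to the tail of the index file.

NODE_CAPACITY = 133  # keys per node, as in the examples earlier in the thread

class ToyIndexFile:
    # Each node lives in a fixed-size slot so it can be rewritten in place;
    # a split appends two freshly allocated slots instead.
    def __init__(self):
        self.slots = []              # slot id -> sorted key list
        self.slots_appended = 0      # file growth since the last compaction

    def append_slot(self, keys):
        self.slots.append(sorted(keys))
        self.slots_appended += 1
        return len(self.slots) - 1

    def insert(self, slot_id, key):
        keys = self.slots[slot_id]
        if len(keys) < NODE_CAPACITY:    # cheap path: in-place edit, no growth
            keys.append(key)
            keys.sort()
            return [slot_id]
        merged = sorted(keys + [key])    # split path: append two new slots and
        mid = len(merged) // 2           # leave the old slot for compaction
        return [self.append_slot(merged[:mid]), self.append_slot(merged[mid:])]

idx = ToyIndexFile()
root = idx.append_slot([])
for i in range(NODE_CAPACITY):
    idx.insert(root, i)                  # all in-place edits
print(idx.slots_appended)                # 1 -- only the original root slot
print(idx.insert(root, NODE_CAPACITY))   # the 134th key forces a split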

Just pie-in-the-sky thinking... I am sure the senior team here has talked
this stuff to death years ago. My apologies if this is re-treading road
covered in beaten horse bodies ;)

I just find this category of problems interesting.


On Thu, Dec 22, 2011 at 6:10 PM, Randall Leeds randall.le...@gmail.com wrote:

 On Thu, Dec 22, 2011 at 12:11, Riyad Kalla rka...@gmail.com wrote:
  On Thu, Dec 22, 2011 at 12:38 PM, Robert Newson rnew...@apache.org
 wrote:
 
  There are a
  few parts of the article that are inaccurate (like the assertion we
  have good locality for the id and seq trees. If this were true we
  wouldn't have seen such a huge improvement in compaction by
  temporarily separating them).
 
 
  I'd look forward to more detail on this... it was my understanding the
  updates were appended out in the [doc rev][_id idx diff][_seq idx diff]
  format at the end of the data file. Sounds like I may have misunderstood
  that?
 

 Riyad, as you pointed out earlier, all the inner nodes are rewritten
 up to the root. The two btrees are not written in parallel, though,
 which means that for deep trees all the updated nodes are written
 before the other tree's nodes are written. Also remember that the
 trees themselves end up pretty fragmented since older nodes that
 haven't changed are back toward the beginning of the file. In general,
 I'm not sure there's much that's useful to mention about locality in
 the trees. Also, updating these trees requires reading the old values,
 so there is still seeking that occurs (if the pages aren't cached by
 the OS).

 
  The 'COMPETE recreation' paragraph also
  strikes me as factually awry.
 
 
  I'd appreciate a detailed correction on this if it is wrong; all the
  digging I've done (in this thread and other partial resources) suggests
  that the path from the changed doc ref back to the root (including a copy
  of all internal nodes and all of their child references) is written so as
  being able to read-back into the index from the tail of the data file
  quickly... specifically slides 17, 18 and 19 from this slidedeck (
  http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed) -- note
 that
  the interim nodes [A..M] and [A..Z] are rewritten (including any and all
  child pointers they contained).
 
  This is what I was referring to; I should either clean up my wording
  (confusing) or I got it wrong in which case I'd appreciate any and all
  corrections.

 Right. It mostly seems a bit confusing to me.
 it DOES NOT just rewrite the nodes pathing from the leaf to the node
 and ONLY the connections for that single document
 That doesn't sound quite right, but I can tell what you're trying to
 say is accurate. If I'm right, you mean that every changed inner node
 is rewritten in its entirety rather than having a single pointer to
 the new child updated in place.

 Cheers. Thanks for taking the time to write this up.

 -Randall



Re: Understanding the CouchDB file format

2011-12-21 Thread Riyad Kalla
Adding to this conversation, I found this set of slides by Chris explaining
the append-only index update format:
http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed

Specifically slides 16, 17 and 18.

Using this example tree, rewriting the updated path (in reverse order)
appended to the end of the file makes sense... you see how index queries
can simply read backwards from the end of the file and not only find the
latest revisions of docs, but also every other doc that wasn't touched (it
will just seek into the existing inner nodes of the b+ tree for searching).

What I am hoping for clarification on is the following pain-point that I
perceive with this approach:

1. In a sufficiently shallow B+ tree (like CouchDB), the paths themselves
to elements are short (typically no more than 3 to 5 levels deep) as
opposed to a trie or some other construct that would have much longer paths
to elements.

2. Because the depth of the tree is so shallow, the breadth of it becomes
large to compensate... more specifically, each internal node can have 100s,
1000s or more children. Using the example slides, consider the nodes
[A...M] and [R...Z] -- in a good sized CouchDB database, those internal
index nodes would have 100s (or more) elements in them pointing at deeper
internal nodes that themselves had thousands of elements; instead of the 13
or so as implied by [A...M].

3. Looking at slides 17 and 18, where you see the direct B+ tree path to the
updated node getting appended to the end of the file after the revision is
written (leaf-to-root ordering: [J' M] -> [A M] -> [A Z]), it implies that
those internal nodes with *all* their child elements are getting rewritten
as well.

In this example tree, it isn't such a big issue... but in a sufficiently
large CouchDB database, these nodes denoted by [A...M] and [A...Z] could be
quite large... I don't know the format of the node elements in the B+ tree,
but it would be whatever the size of a node is times however many elements
are contained at each level (1 for root, say 100 for level 2, 1000 for
level 3 and 10,000 for level 4 -- there is a lot of hand-waving going on
here, of course it depends on the size of the data store).

Am I missing something or is CouchDB really rewriting that much index
information between document revisions on every update?

What was previously confusing me is that I thought it was *only* rewriting a
direct path to the updated revision, like [B][E][J'], and Couch was
somehow patching that updated path info into the B+ index at runtime.

If couch is rewriting entire node paths with all their elements then I am
no longer confused about the B+ index updates, but am curious about the
on-disk cost of this.
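
To put a very rough number on that cost (back-of-the-envelope only, with
assumed entry sizes and fan-out, not measured CouchDB figures):

# Assume a 4-level path is rewritten per update and every rewritten node
# carries ~100 (key, child pointer) entries; all sizes below are guesses.
FANOUT = 100            # entries per rewritten node (assumed)
DEPTH = 4               # nodes on the root-to-leaf path (assumed)
KEY_BYTES = 16
POINTER_BYTES = 8

node_bytes = FANOUT * (KEY_BYTES + POINTER_BYTES)
per_update = DEPTH * node_bytes
print(f"~{node_bytes} B per node, ~{per_update} B of index appended per update")
# i.e. a handful of KB per doc update -- the same order as the couple of KB
# per node Filipe mentions, and nowhere near rewriting a whole level's breadth.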

In my own rough insertion testing, that would explain why I see my
collections absolutely explode in size until they are compacted (not using
bulk insert, but intentionally doing single inserts for millions of docs to
see what kind of cost the index path duplication would be like).

Can anyone confirm/deny/correct this assessment? I want to make sure I am
on the right track understanding this.

Best wishes,
Riyad

On Tue, Dec 20, 2011 at 6:13 PM, Riyad Kalla rka...@gmail.com wrote:

 @Filipe - I was just not clear on how CouchDB operated; you and Robert
 cleared that up for me. Thank you.

 @Robert - The writeup is excellent so far (I am not familiar with erlang,
 so there is a bit of stickiness there), thank you for taking the time to
 put this together!

 At this point I am curious how the _id and _seq indices are read as their
 data is continually appended to the end of the data file in small
 diff-trees for every updated doc.

 If CouchDB kept all the indices in-memory and simply patched-in the
 updated paths at runtime (maybe something akin to memory-mapped indices in
 MongoDB) I would be fairly clear on the operation... but as I understand
 it, CouchDB keeps such a small memory footprint by doing no in-memory
 caching and relying on the intelligence of the OS and filesystem (and/or
 drives) to cache frequently accessed data.

 I am trying to understand the logic used by CouchDB to answer a query
 using the index once updates to the tree have been appended to the data
 file... for example, consider a CouchDB datastore like the one Filipe
 has... 10 million documents and let's say it is freshly compacted.

 If I send in a request to that Couch instance, it hits the header of the
 data file along with the index and walks the B+ tree to the leaf node,
 where it finds the offset into the data file where the actual doc lives...
 let's say 1,000,000 bytes away.

 These B+ trees are shallow, so it might look something like this:

 Level 1: 1 node, root node.
 Level 2: 100 nodes, inner child nodes
 Level 3: 10,000 nodes, inner child nodes
 Level 4: 1,000,000, leaf nodes... all with pointers to the data offsets in
 the data file.

 Now let's say I write 10 updates to documents in that file. There are 10
 new revisions appended to the end of the data

Re: Understanding the CouchDB file format

2011-12-21 Thread Riyad Kalla
Bob,

Really appreciate the link; Rick has a handful of articles that helped a
lot.

Alongside all the CouchDB reading I've been looking at SSD-optimized data
storage mechanisms and tried to coalesce all of this information into this
post on Couch's file storage format:
https://plus.google.com/u/0/107397941677313236670/posts/CyvwRcvh4vv

It is uncanny how many things Couch seems to have gotten right with regard
to existing storage systems and future flash-based storage systems. I'd
appreciate any corrections, additions or feedback to the post for anyone
interested.

Best,
R

On Wed, Dec 21, 2011 at 12:53 PM, Robert Dionne 
dio...@dionne-associates.com wrote:

 I think this is largely correct Riyad, I dug out an old article[1] by Rick
 Ho that you may also find helpful though it might be slightly dated.
 Generally the best performance will be had if the ids are sequential and
 updates are done in bulk. Write heavy applications will eat up a lot of
 space and require compaction. At the leaf nodes what are stored are either
 full_doc_info records or doc_info records which store pointers to the data
 so the main thing that impacts the branching at each level are the key size
 and in the case of views the sizes of the reductions as these are stored
 with the intermediate nodes.

 All in all it works pretty well but as always you need to test and
 evaluate it for you specific case to see what the limits are.

 Regards,

 Bob


 [1] http://horicky.blogspot.com/2008/10/couchdb-implementation.html




 On Dec 21, 2011, at 2:17 PM, Riyad Kalla wrote:

  Adding to this conversation, I found this set of slides by Chris
 explaining
  the append-only index update format:
  http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed
 
  Specifically slides 16, 17 and 18.
 
  Using this example tree, rewriting the updated path (in reverse order)
  appended to the end of the file makes sense... you see how index queries
  can simply read backwards from the end of the file and not only find the
  latest revisions of docs, but also every other doc that wasn't touched
 (it
  will just seek into the existing inner nodes of the b+ tree for
 searching).
 
  What I am hoping for clarification on is the following pain-point that I
  perceive with this approach:
 
  1. In a sufficiently shallow B+ tree (like CouchDB), the paths themselves
  to elements are short (typically no more than 3 to 5 levels deep) as
  opposed to a trie or some other construct that would have much longer
 paths
  to elements.
 
  2. Because the depth of the tree is so shallow, the breadth of it becomes
  large to compensate... more specifically, each internal node can have
 100s,
  1000s or more children. Using the example slides, consider the nodes
  [A...M] and [R...Z] -- in a good sized CouchDB database, those internal
  index nodes would have 100s (or more) elements in them pointing at deeper
  internal nodes that themselves had thousands of elements; instead of the
 13
  or so as implied by [A...M].
 
  3. Looking at slide 17 and 18, where you see the direct B+ tree path to
 the
  update node getting appended to the end of the file after the revision is
  written (leaf to root ordering: [J' M] - [A M] - [A Z]) it implies that
  those internal nodes with *all* their child elements are getting
 rewritten
  as well.
 
  In this example tree, it is isn't such a big issue... but in a
 sufficiently
  large CouchDB database, these nodes denoted by [A...M] and [A...Z] could
 be
  quite large... I don't know the format of the node elements in the B+
 tree,
  but it would be whatever the size of a node is times however many
 elements
  are contained at each level (1 for root, say 100 for level 2, 1000 for
  level 3 and 10,000 for level 4 -- there is a lot of hand-waving going on
  here, of course it depends on the size of the data store).
 
  Am I missing something or is CouchDB really rewriting that much index
  information between document revisions on every update?
 
  What was previously confusing me is I thought it was *only* rewriting a
  direct path to the updated revision, like [B][E][J'] and Couch was
  some-how patching in that updated path info to the B+ index at runtime.
 
  If couch is rewriting entire node paths with all their elements then I am
  no longer confused about the B+ index updates, but am curious about the
  on-disk cost of this.
 
  In my own rough insertion testing, that would explain why I see my
  collections absolutely explode in size until they are compacted (not
 using
  bulk insert, but intentionally doing single inserts for a million(s) of
  docs to see what kind of cost the index path duplication would be like).
 
  Can anyone confirm/deny/correct this assessment? I want to make sure I am
  on the right track understanding this.
 
  Best wishes,
  Riyad
 
  On Tue, Dec 20, 2011 at 6:13 PM, Riyad Kalla rka...@gmail.com wrote:
 
  @Filipe - I was just not clear on how CouchDB operated; you and Robert
  cleared that up for me

Re: Understanding the CouchDB file format

2011-12-21 Thread Riyad Kalla
Thank you Robert, fixed.

On Wed, Dec 21, 2011 at 1:42 PM, Robert Dionne dio...@dionne-associates.com
 wrote:

 Riyad,

 Your welcome. At a quick glance your post has one error, internal nodes do
 contain values (from the reductions). The appendix in the couchdb book also
 makes this error[1] which I've opened a ticket for.

 Cheers,

 Bob


 [1] https://github.com/oreilly/couchdb-guide/issues/450




 On Dec 21, 2011, at 3:28 PM, Riyad Kalla wrote:

  Bob,
 
  Really appreciate the link; Rick has a handful of articles that helped a
  lot.
 
  Along side all the CouchDB reading I've been looking at SSD-optimized
 data
  storage mechanisms and tried to coalesce all of this information into
 this
  post on Couch's file storage format:
  https://plus.google.com/u/0/107397941677313236670/posts/CyvwRcvh4vv
 
  It is uncanny how many things Couch seems to have gotten right with
 regard
  to existing storage systems and future flash-based storage systems. I'd
  appreciate any corrections, additions or feedback to the post for anyone
  interested.
 
  Best,
  R
 
  On Wed, Dec 21, 2011 at 12:53 PM, Robert Dionne 
  dio...@dionne-associates.com wrote:
 
  I think this is largely correct Riyad, I dug out an old article[1] by
 Rick
  Ho that you may also find helpful though it might be slightly dated.
  Generally the best performance will be had if the ids are sequential and
  updates are done in bulk. Write heavy applications will eat up a lot of
  space and require compaction. At the leaf nodes what are stored are
 either
  full_doc_info records or doc_info records which store pointers to the
 data
  so the main thing that impacts the branching at each level are the key
 size
  and in the case of views the sizes of the reductions as these are stored
  with the intermediate nodes.
 
  All in all it works pretty well but as always you need to test and
  evaluate it for you specific case to see what the limits are.
 
  Regards,
 
  Bob
 
 
  [1] http://horicky.blogspot.com/2008/10/couchdb-implementation.html
 
 
 
 
  On Dec 21, 2011, at 2:17 PM, Riyad Kalla wrote:
 
  Adding to this conversation, I found this set of slides by Chris
  explaining
  the append-only index update format:
  http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed
 
  Specifically slides 16, 17 and 18.
 
  Using this example tree, rewriting the updated path (in reverse order)
  appended to the end of the file makes sense... you see how index
 queries
  can simply read backwards from the end of the file and not only find
 the
  latest revisions of docs, but also every other doc that wasn't touched
  (it
  will just seek into the existing inner nodes of the b+ tree for
  searching).
 
  What I am hoping for clarification on is the following pain-point that
 I
  perceive with this approach:
 
  1. In a sufficiently shallow B+ tree (like CouchDB), the paths
 themselves
  to elements are short (typically no more than 3 to 5 levels deep) as
  opposed to a trie or some other construct that would have much longer
  paths
  to elements.
 
  2. Because the depth of the tree is so shallow, the breadth of it
 becomes
  large to compensate... more specifically, each internal node can have
  100s,
  1000s or more children. Using the example slides, consider the nodes
  [A...M] and [R...Z] -- in a good sized CouchDB database, those internal
  index nodes would have 100s (or more) elements in them pointing at
 deeper
  internal nodes that themselves had thousands of elements; instead of
 the
  13
  or so as implied by [A...M].
 
  3. Looking at slide 17 and 18, where you see the direct B+ tree path to
  the
  update node getting appended to the end of the file after the revision
 is
  written (leaf to root ordering: [J' M] - [A M] - [A Z]) it implies
 that
  those internal nodes with *all* their child elements are getting
  rewritten
  as well.
 
  In this example tree, it is isn't such a big issue... but in a
  sufficiently
  large CouchDB database, these nodes denoted by [A...M] and [A...Z]
 could
  be
  quite large... I don't know the format of the node elements in the B+
  tree,
  but it would be whatever the size of a node is times however many
  elements
  are contained at each level (1 for root, say 100 for level 2, 1000 for
  level 3 and 10,000 for level 4 -- there is a lot of hand-waving going
 on
  here, of course it depends on the size of the data store).
 
  Am I missing something or is CouchDB really rewriting that much index
  information between document revisions on every update?
 
  What was previously confusing me is I thought it was *only* rewriting a
  direct path to the updated revision, like [B][E][J'] and Couch was
  some-how patching in that updated path info to the B+ index at runtime.
 
  If couch is rewriting entire node paths with all their elements then I
 am
  no longer confused about the B+ index updates, but am curious about the
  on-disk cost of this.
 
  In my own rough insertion testing, that would explain why I see

Understanding the CouchDB file format

2011-12-20 Thread Riyad Kalla
I've been reading everything I can find on the CouchDB file format[1] and
am getting bits and pieces here and there, but not a great, concrete,
step-by-step explanation of the process.

I'm clear on the use of B+ trees and after reading a few papers on the
benefits of log-structured file formats, I understand the benefits of
inlining the B+ tree indices directly into the data file as well (locality
+ sequential I/O)... what I'm flummoxed about is how much of the B+ tree's
index is rewritten after every modified document.

Consider a CouchDB file that looks more or less like this:

[idx/header][doc1, rev1][idx/header][doc1, rev2]

After each revised doc is written and the b-tree root is rewritten after
that, is that just a modified root node of the B+ tree or the entire B+
tree?

The reason I ask is because, regardless of the answer to my previous
question, for a *huge* database with millions of records, that seems like
an enormous amount of data to rewrite after every modification. Say the
root node had a fanning factor of 133; that would still be a lot of data to
rewrite.

I am certain I am missing the boat on this; if anyone can pull me out of
the water and point me to dry land I'd appreciate it.

Best,
R



[1]
-- 
http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D
-- http://horicky.blogspot.com/2008/10/couchdb-implementation.html
-- http://blog.kodekabuki.com/post/132952897/couchdb-naked
-- http://guide.couchdb.org/editions/1/en/btree.html
-- http://ayende.com/blog/* (Over my head)


Re: Understanding the CouchDB file format

2011-12-20 Thread Riyad Kalla
Filipe,

Thank you for the reply.

Maybe I am misunderstanding exactly what couch is writing out; the docs
I've read say that it rewrites the root node -- I can't tell if the docs
mean the parent node of the child doc that was changed (as one of the b+
leaves) or if it means the direct path, from the root, to that individual
doc... or if it means the *entire* index...

In the case of even rewriting the single parent, with such a shallow tree,
each internal node will have a huge fan of children; let's say 1-10k in a
decent sized data set.

If you are seeing a few K of extra data written out after each changed doc
then that cannot be right... I almost assumed my understanding was wrong
because the sheer volume of data would make performance abysmal if it was true.

Given that... is it just the changed path, from the root to the new leaf
that is rewritten? That makes me all sorts of curious as to how Couch
updates/searches the new modified index with the small diff that is written
out.

Any pointers to reading that will help me dig down on this (even early bugs
in JIRA?) would be appreciated. I've tried skimming back through 2007/08 on
Damien's blog to see if he wrote about it in depth, and so far haven't found
anything as detailed as I am hoping for on this architecture.

Best,
Riyad

On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana fdman...@apache.org wrote:

 On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla rka...@gmail.com wrote:
  I've been reading everything I can find on the CouchDB file format[1] and
  am getting bits and pieces here and there, but not a great, concrete,
  step-by-step explanation of the process.
 
  I'm clear on the use of B+ trees and after reading a few papers on the
  benefits of log-structured file formats, I understand the benefits of
  inlining the B+ tree indices directly into the data file as well
 (locality
  + sequential I/O)... what I'm flummoxed about is how much of the B+
 tree's
  index is rewritten after every modified document.
 
  Consider a CouchDB file that looks more or less like this:
 
  [idx/header][doc1, rev1][idx/header][doc1, rev2]
 
  After each revised doc is written and the b-tree root is rewritten
 after
  that, is that just a modified root node of the B+ tree or the entire B+
  tree?
 
  The reason I ask is because regardless of the answer to my previous
  question, for a *huge* database will millions of records, that seems like
  an enormous amount of data to rewrite after every modification. Say the
  root node had a fanning factor of 133; that would still be alot of data
 to
  rewrite.

 Hi Riyad,

 Have you observed that in practice?

 Typically the depth of database btrees is not that high even for
 millions of documents. For example I have one around with about 10
 million documents which doesn't have more than 5 or 6 levels if I
 recall correctly.

 So updating a doc, for that particular case, means rewriting 5 or 6
 new nodes plus the document itself. Each node is normally not much
 bigger than 1.2Kb.

 I've written once a tool to analyze database files which reports btree
 depths, however it's not updated to work with recent changes on
 master/1.2.x such as snappy compression and btree sizes:

 https://github.com/fdmanana/couchfoo

 It should work with CouchDB 1.1 (and older) database files.

 
  I am certain I am missing the boat on this; if anyone can pull me out of
  the water and point me to dry land I'd appreciate it.
 
  Best,
  R
 
 
 
  [1]
  --
 
 http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D
  -- http://horicky.blogspot.com/2008/10/couchdb-implementation.html
  -- http://blog.kodekabuki.com/post/132952897/couchdb-naked
  -- http://guide.couchdb.org/editions/1/en/btree.html
  -- http://ayende.com/blog/* (Over my head)



 --
 Filipe David Manana,

 Reasonable men adapt themselves to the world.
  Unreasonable men adapt the world to themselves.
  That's why all progress depends on unreasonable men.



Re: Understanding the CouchDB file format

2011-12-20 Thread Riyad Kalla
@Filipe - I was just not clear on how CouchDB operated; you and Robert
cleared that up for me. Thank you.

@Robert - The writeup is excellent so far (I am not familiar with erlang,
so there is a bit of stickiness there), thank you for taking the time to
put this together!

At this point I am curious how the _id and _seq indices are read as their
data is continually appended to the end of the data file in small
diff-trees for every updated doc.

If CouchDB kept all the indices in-memory and simply patched-in the updated
paths at runtime (maybe something akin to memory-mapped indices in MongoDB)
I would be fairly clear on the operation... but as I understand it, CouchDB
keeps such a small memory footprint by doing no in-memory caching and
relying on the intelligence of the OS and filesystem (and/or drives) to
cache frequently accessed data.

I am trying to understand the logic used by CouchDB to answer a query using
the index once updates to the tree have been appended to the data file...
for example, consider a CouchDB datastore like the one Filipe has... 10
million documents and let's say it is freshly compacted.

If I send in a request to that Couch instance, it hits the header of the
data file along with the index and walks the B+ tree to the leaf node,
where it finds the offset into the data file where the actual doc lives...
let's say 1,000,000 bytes away.

These B+ trees are shallow, so it might look something like this:

Level 1: 1 node, root node.
Level 2: 100 nodes, inner child nodes
Level 3: 10,000 nodes, inner child nodes
Level 4: 1,000,000 leaf nodes... all with pointers to the data offsets in
the data file.

Now let's say I write 10 updates to documents in that file. There are 10
new revisions appended to the end of the data file, *each one* separated by
a rewritten B+ path to a leaf node with its new location at the end of the
file. Each of those paths written between doc revisions (say roughly 2k,
like Filipe mentioned) is just a 4-item path... root -> level1 -> level2 ->
level3 -> level4... showing the discrete path from the root to that
individual updated doc. The intermediary levels (l1, 2, 3) are not fully
flushed out with all the OTHER children from the original B+ tree index.

[[ is this correct so far? If not, please point out my mistakes... ]]

Now I issue a query for a document that WAS NOT updated...

*** this is where I get confused on the logic ***

this would mean I need to access the original B+ tree index at the root of
the data file, because the revised B+ paths that are written between each
of the updated doc revisions at the end of the file are not full indices.

NOW consider I want to query for one of the changed docs... now I suddenly
need to scan backwards from the data file's end to find the updated path to
the new revision of that document.

(obviously) this isn't what Couch is actually doing... it's doing something
more elegant, I just can't figure out what or how and that is what I was
hoping for help with.
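
My best guess at the shape of the answer, as a toy sketch (in-memory
stand-ins, not CouchDB's actual code): each appended path contains freshly
written copies of the inner nodes, and those copies point by file offset at
both their new children and all of their old, untouched children, so the
newest root alone can answer queries for changed and unchanged docs alike.

file_nodes = {}   # file offset -> node; a stand-in for the append-only file
next_offset = 0

def append(node):
    global next_offset
    file_nodes[next_offset] = node
    next_offset += 1
    return next_offset - 1

# initial tree: two leaves and a root that references them by offset
leaf_ab = append({"kind": "leaf", "docs": {"a": "a-rev1", "b": "b-rev1"}})
leaf_cd = append({"kind": "leaf", "docs": {"c": "c-rev1", "d": "d-rev1"}})
root_v1 = append({"kind": "inner", "children": [leaf_ab, leaf_cd]})

# update doc "c": append a new leaf and a new root; leaf_ab is reused untouched
leaf_cd2 = append({"kind": "leaf", "docs": {"c": "c-rev2", "d": "d-rev1"}})
root_v2 = append({"kind": "inner", "children": [leaf_ab, leaf_cd2]})

def lookup(root_offset, doc_id):
    node = file_nodes[root_offset]
    while node["kind"] == "inner":
        # a real node would choose the child by key range; hard-wired split here
        node = file_nodes[node["children"][0 if doc_id < "c" else 1]]
    return node["docs"].get(doc_id)

latest_root = max(file_nodes)     # stand-in for "find the newest root near EOF"
print(lookup(latest_root, "a"))   # a-rev1 -- unchanged doc served via old leaf
print(lookup(latest_root, "c"))   # c-rev2 -- updated doc served via new leaf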

Much thanks guys, I know this is a heavy question to ask.

Best wishes,
R

On Tue, Dec 20, 2011 at 1:35 PM, Robert Dionne dio...@dionne-associates.com
 wrote:


 Robert Dionne
 Computer Programmer
 dio...@dionne-associates.com
 203.231.9961




 On Dec 20, 2011, at 3:27 PM, Riyad Kalla wrote:

  Filipe,
 
  Thank you for the reply.
 
  Maybe I am misunderstanding exactly what couch is writing out; the docs
  I've read say that it rewrites the root node -- I can't tell if the
 docs
  mean the parent node of the child doc that was changed (as one of the b+
  leaves) or if it means the direct path, from the root, to that individual
  doc... or if it means the *entire* index...
 
  In the case of even rewriting the single parent, with such a shallow
 tree,
  each internal leaf will have a huge fan of nodes; let's say 1-10k in a
  decent sized data set.
 
  If you are seeing a few K of extra written out after each changed doc
 then
  that cannot be write... I almost assumed my understanding was wrong
 because
  the sheer volume of data would make performance abysmal if it was true.
 
  Given that... is it just the changed path, from the root to the new leaf
  that is rewritten?

 Hi Riyad,

 You are correct, it's only the changed path. Interestingly I've just
 started to document all these internals[1] along with links to the code and
 other references available.

 Cheers,

 Bob


 [1] http://bdionne.github.com/couchdb/

  That makes me all sorts of curious as to how Couch
  updates/searches the new modified index with the small diff that is
 written
  out.
 
  Any pointers to reading that will help me dig down on this (even early
 bugs
  in JIRA?) would be appreciated. I've tried skimming back in 2007/08 on
  Damien's blog to see if it wrote about it in depth and so far haven't
 found
  anything as detailed as I am hoping for on this architecture.
 
  Best,
  Riyad
 
  On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana 
 fdman...@apache.orgwrote:
 
  On Tue, Dec 20

Re: Fast JSON Parsing: UltraJSON / RapidJSON

2011-11-22 Thread Riyad Kalla
Paul, you might be interested as well in this benchmark of 3 C++ JSON
parsers (summary: libjson dominated the test by orders of magnitude):

http://lijoblogs.blogspot.com/2011/11/comparison-and-benchmark-of-c-json.html

On Mon, Nov 21, 2011 at 10:23 AM, Paul Davis paul.joseph.da...@gmail.com wrote:

 On Mon, Nov 21, 2011 at 11:11 AM, Riyad Kalla rka...@gmail.com wrote:
  I believe trunk now utilizes the C JSON parser for some nice speed
  improvements and thought this might be worth sharing.
 
  Over on the JSON spec list two devs were sharing attempts at writing the
  fastest JSON parsers in C and C++ respectively.
 
  UltraJSON (C) claims to have the fastest parsing speed at the moment:
  https://github.com/esnme/ultrajson
  RapidJSON (C++) is attempting to be the fastest as well:
  http://code.google.com/p/rapidjson/
 
  I thought it interesting enough to share these announcements; I realize
  there is much more needed from a JSON parser than just sheer speed.
 
  -R
 

 Oooohhh, interesting. I should see how Jiffy compares. Thanks for the note.



Canonical's decision to move away from CouchDB

2011-11-22 Thread Riyad Kalla
REF:
http://www.h-online.com/open/news/item/Canonical-dropping-CouchDB-from-Ubuntu-One-1382809.html

 Lenton says Canonical has worked with the company behind CouchDB to make
 the database scale in a particular way. Our situation is rather unique and
 we were unable to resolve some of the issues we came across, said Lenton,
 who pointed out they needed to scale to millions of users and scale down
 to a reasonable load on small client machines.

Does anyone happen to know exactly what the blocker(s) were for Canonical?
e.g. client load time on small mobile devices?


Fast JSON Parsing: UltraJSON / RapidJSON

2011-11-21 Thread Riyad Kalla
I believe trunk now utilizes the C JSON parser for some nice speed
improvements and thought this might be worth sharing.

Over on the JSON spec list two devs were sharing attempts at writing the
fastest JSON parsers in C and C++ respectively.

UltraJSON (C) claims to have the fastest parsing speed at the moment:
https://github.com/esnme/ultrajson
RapidJSON (C++) is attempting to be the fastest as well:
http://code.google.com/p/rapidjson/

I thought it interesting enough to share these announcements; I realize
there is much more needed from a JSON parser than just sheer speed.

-R


Very impressive CouchDB story.

2011-11-08 Thread Riyad Kalla
This is an aside and also a you guys rock to the CouchDB dev team... I
watched a talk by Benjamin Anderson at Meteor today:
http://www.dataversity.net/archives/6714?t=1320768580

where he talked about their experience with running CouchDB (via Cloudant)
for 2 years in production, scaling from what I imagine was 2 or 3 nodes to
a 14-node cluster. In the last year alone their traffic has grown 5x while
their Couch cluster has only grown 20% to cover that.

Benjamin is straightforward about some of the shortcomings they have had,
but the take-away from this that floored me, and is a testament to what you
guys do: in all these years, across all these tens of terabytes of data,
they have never lost data or had downtime.

They just kept growing their cluster, adding nodes as needed and got back
to work.

No world-ending events, total cluster failures or total rebuilds. Couch
just happily chugged along, keeping track of data and smiling. (Not
discrediting any work Cloudant put in here to make that happen, I just
don't know what it was.)

He also specifically called out how the backend API to Couch has
(practically) not changed at all in the last 2 years, allowing them to
focus their development efforts on customer-facing features and not spend
time writing and rewriting backend persistence features as Couch grew and
released new versions.

To all of you committers that spend your nights and weekends on Couch, this
is a glowing testament to the hard decisions you have made.

Really great job guys.

-Riyad


Re: Binary Downloads

2011-11-06 Thread Riyad Kalla

 I don't think that this is too big a problem.


It is, it really is.

Many of you are too close to the problem to see it, but the 10-second
impression when you step into the Couch world (as compared to Mongo, Redis,
Orient, Raven or even Cassandra) is ... where do I download and run this
thing?

How many times have you guys answered "Well, version 1.1.2 isn't out yet, so
what version are you really using?"

And those aren't even the people that gave up and turned the other
direction; those are the ones that actively stayed, dug their way over to
Couchbase (either knowingly or unknowingly), grabbed one of their binaries
instead, and still don't even know what version of Couch they are running.

CouchDB needs pre-built binaries, ready to unzip and run exactly as Jan
enumerated in his first post.

On Sat, Nov 5, 2011 at 6:57 AM, Noah Slater nsla...@tumbolia.org wrote:

 A few thoughts:

 I don't think that this is too big a problem. CouchDB is a server
 tool, and by and large, the type of people who are going to be
 installing it are going to know how to install it from source, or are going
 to be using a system where the application is pre-packaged for them. You
 wouldn't expect Apache httpd to have one-click installers. Having said
 that, it's nice that MongoDB have a list of pre-built packages to
 download. But they don't have a rigorous community voting process in
 place that requires people to test and vote on each artefact.  I think
 that's a weakness as well as a strength.

 Just to reiterate. The biggest problem in providing official platform
 specific binary downloads is that these artefacts would need to be
 voted on by the community. Dave asks what needs to be in place for
 them to be official. Nothing more than a vote. And therein lies the
 problem. Getting enough people, with the right hardware, to vote on
 all these things at the time of release.

 I don't have any problem with people, individuals or organisations,
 who want to build CouchDB for different platforms and make those
 packages available to our users. On one level, this is exactly what
 Debian and RedHat, for example, are doing already.

 I also have no problem linking through to unofficial packages from our
 downloads page, with the appropriate disclaimers, if it makes it
 easier for our users. I don't care how these packages are built, and I
 don't think there should be a submissions process. We'd just link
 through to any trust-worthy places that promised to keep a reasonably
 up-to-date package archive available.

 But let's bear in mind that for each platform, packaging CouchDB is a
 very specific process. We cannot possibly hope to get that kind of
 experience within the core committer team. We can't even hope to get
 all that experience on the mailing lists. So we're left with two
 options. Either build a set of generic packages that don't fit that
 well into the specifics of each platform, or commit ourselves to
 working with, enabling, and promoting specialised efforts by
 downstream teams. Or not at all.

 MongoDB can do this because they don't use a community process for
 releases.

 From what I understand, Jan wants a sort of generic package. A
 directory that has a self-contained CouchDB that you can run. I guess
 this could be useful, in the same way CouchDBX was useful. You get to
 play around with CouchDB real quick. But it's not useful for much more
 than that, in my opinion. I certainly wouldn't advise anyone to use
 this for anything serious. There's no integration with the host
 operating system, and that's a pretty egregious problem for a database
 server. Jason covers this in his response I think. A CouchDB that
 lives in a little sandbox like that is a neutered CouchDB. Fun for
 playing with, but you can't do anything serious with it. And maybe
 that has a place in our ecosystem, but it should come with the
 appropriate warnings and delineation.

 Speaking as the guy who built the first CouchDB package for Debian, I
 can tell you that building these things requires a lot of specialist,
 and in some case, arcane knowledge. It also requires a lot of care. I
 made sure that when you upgrade from one version to another, it would
 not trash your previous data. Building a proper package is full of
 this sort of stuff. Any system that promises to automate the process
 for you is a quick-fix. Not something I would ever recommend.

 Retooling the source to build proper Erlang applications, or any other
 similar proposal, does not solve our problem. The problem isn't the
 lack of technical means to build these packages. The problem lies in
 being able to build them in the first place (given a lack of various
 infrastructure available to the build team) and a lack of people to
 vote on them at the point of release.

 If there is any community knowledge in Build CouchDB that we could
 move upstream, and bake into our configure script or Makefiles, I
 would be totally behind that effort. The whole point of configure is
 to 

Re: High latency (40ms) and low request rate (10 rps) under windows

2011-10-26 Thread Riyad Kalla
Hmm, I see what you mean. If a single connection isn't pegging any of the
hardware and multiple connections are an order of magnitude faster, I am not
sure what would be keeping the single connection from performing faster.

Maybe someone else can hop in with ideas? 

R

On Oct 26, 2011, at 4:32 PM, Konstantin Cherkasoff k.cherkas...@gmail.com 
wrote:

 Hi!
 
 What happens when you set delayed_commits to true?
 With delayed_commits I got about 150 rps so delayed_commits mode is 15
 times faster.
 
 With that off you are requiring an fsync (iirc) after every value written
 and then are at the mercy of your disk's write performance (but not in the
 sense of burst-writes, as these are individual, disparate write requests
 coming in).
 
 Yes, I understand that many small writes and fsyncs can be a
 bottleneck in this case.
 But I found that increasing the number of concurrent requests leads to
 an increase in the request rate.
 For example (100 concurrent requests):
 ab -k -t 10 -c 100 -n 100 -p ab.json -T application/json 
 http://localhost:5984/bench
 shows about 300 RPS (writes and fsyncs) on the same hardware.
 
 So, I suppose that 10 requests per second actually is not the limit
 for this hard disk.
 And there may be some other problem.
 
 Also, is this build Couch Single Server from Couchbase or some other Windows 
 build of Couch?
 I tried Couchbase 1.2.0 and
 this build 
 https://github.com/downloads/dch/couchdb/setup-couchdb-1.1.0+COUCHDB-1152_otp_R14B03.exe
 
 
 --
 Konstantin


Re: High latency (40ms) and low request rate (10 rps) under windows

2011-10-25 Thread Riyad Kalla
Konstantin,

What happens when you set delayed_commits to true? With that off you are
requiring an fsync (iirc) after every value written and then are at the
mercy of your disk's write performance (but not in the sense of burst-writes,
as these are individual, disparate write requests coming in).
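
To make that cost concrete, here is a small illustration (plain Python,
nothing CouchDB-specific, and the absolute numbers depend entirely on the
disk): writing 1,000 tiny records with an fsync after every record versus one
fsync at the end. On most disks the per-record-fsync path is one to two
orders of magnitude slower, which is roughly the kind of gap you'd expect
between per-write commits and delayed_commits.

    import os, tempfile, time

    def write_records(n, fsync_each):
        # Append n tiny records to a temp file, syncing per record or once at the end.
        fd, path = tempfile.mkstemp()
        start = time.time()
        with os.fdopen(fd, "wb") as f:
            for _ in range(n):
                f.write(b'{"value": "test"}\n')
                f.flush()
                if fsync_each:
                    os.fsync(f.fileno())   # force this record to stable storage now
            if not fsync_each:
                os.fsync(f.fileno())       # one sync for the whole batch
        os.unlink(path)
        return time.time() - start

    print("fsync per record:", write_records(1000, True), "seconds")
    print("one fsync at end:", write_records(1000, False), "seconds")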

Also, is this build Couch Single Server from Couchbase or some other Windows
build of Couch?

-R

On Tue, Oct 25, 2011 at 4:03 PM, Konstantin Cherkasoff 
k.cherkas...@gmail.com wrote:

 *** crossposted to u...@couchdb.apache.org

 Hi!

 I installed CouchDB 1.1.0 on Windows (Windows 7, 32 bit) and tried to
 test performance using ab.exe (Apache Benchmark,
 http://httpd.apache.org/docs/2.2/programs/ab.html):
 ab.exe -k -c 1 -n 1000 -p ab.json -T application/json
 http://localhost:5984/bench
 where ab.json contains the simplest document {value: test}.
 This utility does 1000 POSTs with no concurrency (-c 1) on a single
 connection (-k for keepalive).
 On my laptop (it's NOT a low-cost netbook) I got 10 requests per
 second. So it is 0.1 seconds per single request. And CPU and HDD
 utilization is actually ZERO.
 I was just wondering what exactly CouchDB is doing all this time (0.1
 seconds) with a single little record? As the utilization of CPU and HDD is
 0%, I believe they are not the bottleneck.

 So where is bottleneck?


 P.S. All tests I did with delayed_commits = false. And I tried
 socket_options = [{nodelay, true}]

 --
 Konstantin Cherkasoff



Re: status of universal binary json in couchdb

2011-10-13 Thread Riyad Kalla
I'd be happy to sponsor getting this work into CouchDB proper if anyone was
interested.

On Tue, Oct 11, 2011 at 10:59 PM, Dominic Tarr dominic.t...@gmail.comwrote:

 okay, interesting.

 that sounds a bit more difficult than the parser! but nevertheless the
 right
 approach.

 I'll admit now that I do not know erlang yet. I've been wanting to get into
 it though, and implementing ubj seemed both interesting and useful.
 Refactoring couchdb, perhaps more challenging.

 good to know the lie of the land, thanks for your answer!

 cheers, Dominic



 On Wed, Oct 12, 2011 at 3:43 PM, Paul Davis paul.joseph.da...@gmail.com
 wrote:

  On Tue, Oct 11, 2011 at 11:17 PM, Dominic Tarr dominic.t...@gmail.com
  wrote:
   I read here
  https://plus.google.com/107397941677313236670/posts/LFBB233PKQ1
   that
   UBJ was under consideration?
  
   The spec for ubj looks quite good, but there are not many
 implementations
   available yet.
   I am considering writing some in (node)js and erlang.
  
   would support for ubj be added if there where supporting libraries
   available.
  
   I have searched the archives and not found any discussion on this yet.
  
   cheers, Dominic.
  
 
  Dominic,
 
  We've discussed the possibility (possibility mind you) of adding it.
  As you note it's quite young and hasn't reached widespread adoption,
  and it's always hard to guess about the future.
 
  The current status is basically that it'd be interesting to see a
  patch to CouchDB that abstracts the content generation for the JSON
  responses such that alternative content types could be implemented
  along with a patch that uses that API for JSON and UBJSON. Then we'd
  need to look into the performance characteristics and end-user support
  (ie, perhaps it could be a plugin).
 
  I won't sugar coat it. It'll be a tough sell to get something other
  than JSON into CouchDB as an officially supported Content-Type. It'll
  take quite a bit of work marshaling the patch through the technical
  and community aspects of such a change.
 
  That said if you or anyone else wants to move forward with it, I would
  propose a rough outline. First, a patch that abstracts the JSON
  generation into a generic API as well as the JSON reference
  implementation of that API. Second, a robust system for choosing the
  correct content-type based on HTTP Accept headers. And third, an
  alternative implementation using UBJSON.
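
(For illustration only: a tiny, hypothetical sketch of that first abstraction
step, in Python rather than Erlang and with only naive Accept parsing. The
idea is that responses go through a registry of encoders keyed by content
type, so a UBJSON module could register itself without the HTTP layer knowing
anything about it; none of these names are CouchDB APIs.)

    import json

    # Hypothetical pluggable-encoder registry; names are illustrative only.
    ENCODERS = {
        "application/json": lambda doc: json.dumps(doc).encode("utf-8"),
        # "application/ubjson": encode_ubjson,  # a UBJSON module would register here
    }

    def negotiate(accept_header, default="application/json"):
        # Naive Accept parsing: first recognised media type wins; real code
        # would honour q-values and wildcards.
        for part in (accept_header or "").split(","):
            mime = part.split(";")[0].strip()
            if mime in ENCODERS:
                return mime
        return default

    def encode_response(doc, accept_header):
        content_type = negotiate(accept_header)
        return content_type, ENCODERS[content_type](doc)

    # encode_response({"ok": True}, "application/ubjson, application/json")
    # -> ("application/json", b'{"ok": true}') until a UBJSON encoder is registered.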
 
  I'll restate that this would be a significant amount of work and
  there's always the possibility that the end result doesn't include
  alternative content-types in trunk. (Assuming the abstract API was
  solid I don't see us not accepting that patch, but I make no promises
  :)
 
  Other notes:
 
  There was a discussion here:
 
 
 
 http://mail-archives.apache.org/mod_mbox/couchdb-dev/201109.mbox/%3CCABn9xAGK0WbPhpkbxyKQo9N1KEsm=rx81v7fvdargpjxjnc...@mail.gmail.com%3E
 
  Also, if you're interested in writing an Erlang version, I'd be
  interested in a patch to Jiffy [1]. I've been meaning to sit down and
  write it but its not a priority.
 
  [1] https://github.com/davisp/jiffy
 



[jira] [Commented] (COUCHDB-1302) Fix couchjs

2011-10-06 Thread Riyad Kalla (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13121871#comment-13121871
 ] 

Riyad Kalla commented on COUCHDB-1302:
--

Now is the perfect opportunity to make this change, once 1.2/2.0 goes out the 
door it likely couldn't be safely addressed again until 3.0 and the uptake in 
CouchDB (from what I see at least) seems to be on the rapid rise.

Making this change in 1.2/2.0 becomes an education issue, so maybe the
questions to ask are around how to be *really* helpful in explaining the
situation to the user: e.g. errors in the server log when the problem is
encountered, saying exactly how to work around it and pointing at a Wiki page
explaining why; maybe warnings during startup when the server does a quick
scan for the issue (which can be disabled if the user knows they have a good
setup).

Sure it will be painful, but trying to monkey-patch this until 3.0 (or beyond) 
will be much more painful and have a lot more negative impact by that time.

I like Jason's suggestion of some quick-patch tool that might ship with the 
install that attempts the fix automatically or at least gives the users enough 
tools and help/information so as to get them from Point A to B as quickly as 
possible.

 Fix couchjs
 ---

 Key: COUCHDB-1302
 URL: https://issues.apache.org/jira/browse/COUCHDB-1302
 Project: CouchDB
  Issue Type: Improvement
  Components: JavaScript View Server
Affects Versions: 1.1.1, 1.2, 1.3
Reporter: Paul Joseph Davis
Priority: Blocker

 Figure out why some spidermonkeys have an error when doing: 
 eval(function(){})

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: Universal Binary JSON in CouchDB

2011-10-04 Thread Riyad Kalla
Hey Randall,

This is something that Paul and I discussed on IRC. The way UBJ is written
out looks something like this ([] blocks are just for readability):
[o][2]
  [s][4][name][s][3][bob]
  [s][3][age][i][31]

Couch can easily prepend or append its own dynamic content in a reply. If it
wants to prepend some information after the object header, the header would
need to be stored and manipulated by couch separately.

For example, if I upload the doc above, Couch would want to take that root
object header of:
[o][2]

and change it to:
[o][4]

before storing it because of the additions of _id and _rev. Actually this
could be as simple as storing a rootObjectCount and having couch dynamically
generate the root every time.

'o' represents object containers with <= 254 elements (1 byte for length)
and 'O' represents object containers with up to 2.1 billion elements (4 byte
int).
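
For the curious, here is a rough sketch of emitting that small-object form
(Python; the helper names are mine and the byte widths follow the example
above rather than any particular draft of the spec), including the "bump the
root count" trick for prepending fields like _id and _rev:

    import struct

    def ub_str(s):
        data = s.encode("utf-8")
        assert len(data) <= 254            # small-string form only, for illustration
        return b"s" + bytes([len(data)]) + data

    def ub_int(n):
        # Illustrative 2-byte big-endian int, matching the [i][31] example above.
        return b"i" + struct.pack(">h", n)

    def ub_small_object(pairs):
        out = b"o" + bytes([len(pairs)])   # 'o' header carries the element count
        for key, encoded_value in pairs:
            out += ub_str(key) + encoded_value
        return out

    doc = ub_small_object([("name", ub_str("bob")), ("age", ub_int(31))])

    # The "rewrite the root header" idea: same body bytes, count bumped by the
    # number of dynamic fields (_id, _rev) the server would prepend.
    patched_header = b"o" + bytes([doc[1] + 2])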

If couch did that any request coming into the server might look like this:
- client request
-- (server loads root object count)
- server writes back object header: [o][4]
-- (server calculates dynamic data)
- server writes back dynamic content
- server streams raw record data straight off disk to client (no
deserialization)
-- (OPT: server calculates dynamic data)
-- OPT: server streams dynamic data appended

Thoughts?

Best,
Riyad

P.S. There is support in the spec for unbounded container types when couch
doesn't know how much it is streaming back, but that isn't necessary for
retrieving stored docs (but could be handy when responding to view queries
and other requests whose length is not known in advance)

On Tue, Oct 4, 2011 at 12:02 PM, Randall Leeds randall.le...@gmail.comwrote:

 Hey,

 Thanks for this thread.

 I've been interested in ways to reduce the work from disk to client as
 well.
 Unfortunately, the metadata inside the document objects is variable based
 on
 query parameters (_attachments, _revisions, _revs_info...) so the server
 needs to decode the disk binary anyway.

 I would say this is something we should carefully consider for a 2.0 api. I
 know that, for simplicity, many people really like having the underscore
 prefixed attributes mixed in right alongside the document data, but a
 future
 API that separated these could really make things fly.

 -Randall

 On Wed, Sep 28, 2011 at 22:25, Benoit Chesneau bchesn...@gmail.com
 wrote:

  On Thursday, September 29, 2011, Riyad Kalla rka...@gmail.com wrote:
   DISCLAIMER: This looks long, but reads quickly (I hope). If you are in
 a
   rush,
   just check the last 2 sections and see if it sounds interesting.
  
  
   Hi everybody. I am new to the list, but a big fan of Couch and I have
  been
   working
   on something I wanted to share with the group.
  
   My appologies if this isn't the right venue or list ediquette... I
 wasn't
   really
   sure where to start with this conversation.
  
  
   Background
   =
   With the help of the JSON spec community I've been finalizing a
  universal,
   binary JSON format specification that offers 1:1 compatibility with
 JSON.
  
   The full spec is here (http://ubjson.org/) and the quick list of types
  is
   here
   (http://ubjson.org/type-reference/). Differences with existing specs
 and
   Why are
   all addressed on the site in the first few sections.
  
   The goal of the specification was first to maintain 1:1 compatibility
  with
   JSON
   (no custom data structures - like what caused BSON to be rejected in
  Issue
   #702),
   secondly to be as simple to work with as regular JSON (no complex data
   structures or
   encoding/decoding algorithms to implement) and lastly, it had to be
  smaller
   than
   compacted JSON and faster to generate and parse.
  
   Using a test doc that I see Filipe reference in a few of his issues
   (http://friendpaste.com/qdfyId8w1C5vkxROc5Thf) I get the following
   compression:
  
   * Compacted JSON: 3,861 bytes
   * Univ. Binary JSON: 3,056 bytes (20% smaller)
  
   In some other sample data (e.g. jvm-serializers sample data) I see a
 27%
   compression
   with a typical compression range of 20-30%.
  
   While these compression levels are average, the data is written out in
 an
   unmolested
   format that is optimized for read speed (no scanning for null
  terminators)
   and criminally
   simple to work with. (win-win)
  
   I added more clarifying information about compression characteristics
 in
  the
   Size Requirements
   section of the spec for anyone interested.
  
  
   Motivation
   ==
   I've been following the discussions surround a native binary JSON
 format
  for
   the core
   CouchDB file (Issue #1092) which transformed into keeping the format
 and
   utilizing
   Google's Snappy (Issue #1120) to provide what looks to be roughly a
  40-50%
   reduction in file
   size at the cost of running the compression/decompression on every
   read/write.
  
   I realize in light of the HTTP transport and JSON encoding/decoding
 cycle

Re: Universal Binary JSON in CouchDB

2011-10-04 Thread Riyad Kalla
Tim's work is certainly the catalyst for my excitement about the potential
here for Couch.

As Paul pointed out, the correct discussion to have at this point is really
"do we support a binary format for responses, and if so, which one?" That
discussion could go on for an eternity, with everyone voting for their
favorite (Protobuf, Smile, MessagePack, etc.).

The only reason I bring up the disk store format discussion into this
conversion is to offer a hat-tip to a future where a binary response format
selected now may dovetail nicely with alternative binary disk formats,
enabling the stream-directly-from-disk scenario.

If we were to hypothetically remove the possibility of the on-disk format
ever changing, then I suppose the decision of binary response format just
becomes an issue of "Which one is fast and easy to generate?"

-R

On Tue, Oct 4, 2011 at 12:49 PM, Ladislav Thon ladi...@gmail.com wrote:

 
  That said, the ubjson spec is starting to look reasonable and capable
  to be an alternative content-type produced by CouchDB. If someone were
  to write a patch I'd review it quite enthusiastically.


 Just FYI, there's experimental support for MessagePack by Tim Anglade:
 https://github.com/timanglade/couchdb/commits/msgpack I thought it might
 be
 interesting in this debate... Tim says it improves performance quite a bit:
 http://blog.cloudant.com/optimizing-couchdb-calls-by-99-percent/ (Tim, if
 you're reading this, thanks for the excellent talk!)

 LT



Universal Binary JSON in CouchDB

2011-09-28 Thread Riyad Kalla
DISCLAIMER: This looks long, but reads quickly (I hope). If you are in a
rush,
just check the last 2 sections and see if it sounds interesting.


Hi everybody. I am new to the list, but a big fan of Couch and I have been
working
on something I wanted to share with the group.

My apologies if this isn't the right venue or list etiquette... I wasn't
really sure where to start with this conversation.


Background
=
With the help of the JSON spec community I've been finalizing a universal,
binary JSON format specification that offers 1:1 compatibility with JSON.

The full spec is here (http://ubjson.org/) and the quick list of types is
here
(http://ubjson.org/type-reference/). Differences with existing specs, and the
"Why", are all addressed on the site in the first few sections.

The goal of the specification was first to maintain 1:1 compatibility with
JSON
(no custom data structures - like what caused BSON to be rejected in Issue
#702),
secondly to be as simple to work with as regular JSON (no complex data
structures or
encoding/decoding algorithms to implement) and lastly, it had to be smaller
than
compacted JSON and faster to generate and parse.

Using a test doc that I see Filipe reference in a few of his issues
(http://friendpaste.com/qdfyId8w1C5vkxROc5Thf) I get the following
compression:

* Compacted JSON: 3,861 bytes
* Univ. Binary JSON: 3,056 bytes (20% smaller)

In some other sample data (e.g. jvm-serializers sample data) I see a 27%
compression
with a typical compression range of 20-30%.

While these compression levels are average, the data is written out in an
unmolested
format that is optimized for read speed (no scanning for null terminators)
and criminally
simple to work with. (win-win)

I added more clarifying information about compression characteristics in the
Size Requirements
section of the spec for anyone interested.


Motivation
==
I've been following the discussions surrounding a native binary JSON format for
the core
CouchDB file (Issue #1092) which transformed into keeping the format and
utilizing
Google's Snappy (Issue #1120) to provide what looks to be roughly a 40-50%
reduction in file
size at the cost of running the compression/decompression on every
read/write.

I realize in light of the HTTP transport and JSON encoding/decoding cycle in
CouchDB, the
Snappy compression cycles are a very small part of the total time the server
spends working.

I found this all interesting, but like I said, I realized up to this point
that Snappy
wasn't any form of bottleneck and the big compression wins server side were
great so I had
nothing to contribute to the conversation.


Catalyst
==
This past week I watched Tim Anglade's presentation (http://goo.gl/LLucD)
and started to
foam at the mouth when I saw his slides where he skipped the JSON
encode/decode cycle
server-side and just generated straight from binary on disk into MessagePack
and got
some phenomenal speedups from the server:
http://i.imgscalr.com/XKqXiLusT.png

I pinged Tim to see what the chances of adding Univ Binary JSON support were,
and he seemed amenable to the idea as long as I could hand him an Erlang or
Ruby impl (unfortunately, I am not familiar with either).


ah-HA! moment
==
Today it occurred to me that if CouchDB were able to (at the cost of 20%
more disk space
than it is using with Snappy enabled, but still 20% *less* than before
Snappy was integrated)
use the Universal Binary JSON format as its native storage format AND
support for serving replies
using the same format was added (a-la Tim's work), this would allow CouchDB
to (theoretically)
reply to queries by pulling bytes off disk (or memory) and immediately
streaming them back to
the caller with no intermediary step at all (no Snappy decompress, no Erlang
decode, no JSON encode).

Given that the Univ Binary JSON spec is standard, easy to parse and simple
to convert back to JSON, adding support for it seemed more consistent with
Couch's motto of ease and simplicity than, say, MessagePack or Protobuf,
which provide better compression but at the cost of more complex formats and
data types that have no analog in JSON.

I don't know the intricacies of Couch's internals; if that is wrong and some
Erlang manipulation of the data would still be required, I believe it would
still be faster to pull the data off disk in the Univ Binary JSON format,
decode to Erlang native types and then reply, while skipping the Snappy
decompression step.

If it *would* be possible to stream it back un-touched directly from disk,
that seems like
an enhancement that could potentially speed up CouchDB by at least an order
of magnitude.
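
To make the idea concrete, here is a minimal, hypothetical sketch of that
reply path (Python; nothing here reflects CouchDB's actual internals),
assuming the stored body keeps its binary-JSON encoding on disk and the
server tracks the root element count separately: regenerate the root header,
write the dynamic fields, then stream the stored bytes untouched.

    import io

    def stream_reply(write, stored_count, dynamic_fields, body_chunks):
        # Hypothetical reply path: no decode/re-encode of the stored document body.
        write(b"o" + bytes([stored_count + len(dynamic_fields)]))  # regenerated header
        for encoded_pair in dynamic_fields:      # pre-encoded _id / _rev pairs
            write(encoded_pair)
        for chunk in body_chunks:                # raw bytes straight off disk
            write(chunk)

    out = io.BytesIO()
    stream_reply(out.write,
                 stored_count=2,                               # name + age
                 dynamic_fields=[b"s\x03_id" + b"s\x03abc",
                                 b"s\x04_rev" + b"s\x051-xyz"],
                 body_chunks=[b"s\x04name" + b"s\x03bob",
                              b"s\x03age" + b"i\x00\x1f"])
    # out.getvalue() now starts with b"o\x04" followed by the four key/value pairs.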


Conclusion
===
I would appreciate any feedback on this idea from you guys with a lot more
knowledge of
the internals.

I have no problem if this is a horrible idea and never going to happen, I
just wanted to try
and contribute something back.


Thank you all for reading.