[jira] Commented: (COUCHDB-1092) Storing documents bodies as raw JSON binaries instead of serialized JSON terms
[ https://issues.apache.org/jira/browse/COUCHDB-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008434#comment-13008434 ] Filipe Manana commented on COUCHDB-1092:

Ok, I tested Paul's branch and here are the results I got:

# View generation

$ time curl http://localhost:5984/branch_floats_zip_db/_design/test/_view/simple?limit=1
{"total_rows":10,"offset":0,"rows":[
{"id":"6649-c7d2-45d0-b926-86a39e4db4d0","key":null,"value":"2fQUbzRUax4A"}
]}
real    5m54.197s
user    0m0.000s
sys     0m0.012s

$ rm -fr couchdb/tmp/lib/.branch_floats_zip_db_design
$ echo 3 > /proc/sys/vm/drop_caches
$ time curl http://localhost:5984/branch_floats_zip_db/_design/test/_view/simple?limit=1
{"total_rows":10,"offset":0,"rows":[
{"id":"6649-c7d2-45d0-b926-86a39e4db4d0","key":null,"value":"2fQUbzRUax4A"}
]}
real    5m52.879s
user    0m0.004s
sys     0m0.016s

$ rm -fr couchdb/tmp/lib/.branch_floats_zip_db_design
$ echo 3 > /proc/sys/vm/drop_caches
$ time curl http://localhost:5984/branch_floats_zip_db/_design/test/_view/simple?limit=1
{"total_rows":10,"offset":0,"rows":[
{"id":"6649-c7d2-45d0-b926-86a39e4db4d0","key":null,"value":"2fQUbzRUax4A"}
]}
real    5m59.737s
user    0m0.000s
sys     0m0.016s

With my original branch this takes about 4 minutes; with trunk it takes about 15 minutes.

I also ran relaximation several times, with delayed_commits set to false, and here are 2 of those runs:

http://graphs.mikeal.couchone.com/#/graph/698bf36b6c64dbd19aa2bef63400cfd7
http://graphs.mikeal.couchone.com/#/graph/698bf36b6c64dbd19aa2bef63400d306

I think it's easy to see that both reads and writes are worse than trunk.

Also, apart from the vhosts and 173-os-daemon-cfg-register.t etap tests, I don't get any tests failing. I think both are known to be failing in trunk, as several people have the same issue.

I mentioned several times that this branch still has some TODOs and is not finished; I said so in the very first comment, so I thought I was clear about that.
Namely, I plan on removing the functions couch_doc:from_json_obj and couch_doc:to_json_obj so that only the new ones I added are used, since they are also there to guarantee correct behaviour. Part of the reason to present a not yet finished and fully polished patch was to get feedback and stimulate the community. Also, I haven't had my last questions answered by Paul:

1) Paul claims his jsonsplice addition helps prevent invalid document bodies from being written to disk, while what I see is that his addition validates only on the document read path.

2) Besides that, our current code doesn't ensure that when someone calls couch_db:update_doc or couch_db:update_docs, the document bodies are EJSON objects. I have given an example in a CouchDB trunk shell that shows this: http://friendpaste.com/3h2IgFF1RXvwxpDGiMDOdS Therefore, this issue is not introduced by this patch/branch.

3) These document update functions still accept EJSON document bodies like before, and I added hooks to make sure they get converted to JSON binaries. Therefore neither users nor developers using the couch_db API need to change their code at all.

Two new etap tests were added, exclusively dedicated to the new functions that were added to the couch_doc module. I intended to remove the tests 030-doc-from-json.t and 031-doc-to-json.t since they test the old functions couch_doc:from_json_obj and couch_doc:to_json_obj, which I planned to remove from couch_doc, as I pointed out earlier in this comment.

Adam's suggestion is not yet integrated either. I hope my comments and points are clearer now.

Storing documents bodies as raw JSON binaries instead of serialized JSON terms
--
Key: COUCHDB-1092
URL: https://issues.apache.org/jira/browse/COUCHDB-1092
Project: CouchDB
Issue Type: Improvement
Components: Database Core
Reporter: Filipe Manana
Assignee: Filipe Manana

Currently we store documents as Erlang serialized (via the term_to_binary/1 BIF) EJSON.
The proposed patch changes the database file format so that instead of storing serialized EJSON document bodies, it stores raw JSON binaries. The github branch is at: https://github.com/fdmanana/couchdb/tree/raw_json_docs

Advantages:

* what we write to disk is much smaller - a raw JSON binary can easily get up to 50% smaller (at least according to the tests I did)
* when serving documents to a client we no longer need to JSON encode the document body read from the disk - this applies to individual document requests, view queries with ?include_docs=true, pull and push replications, and possibly other use cases. We just grab its body and prepend the _id, _rev and all the necessary metadata fields (this is via simple Erlang binary operations)
* we avoid the EJSON term copying between request handlers and the db updater processes, between the work queues and the view updater process, between replicator processes, etc
* before sending a document to the JavaScript view server, we no longer need to convert it from EJSON to JSON

The changes done to the document write workflow are minimal - after JSON decoding the document's JSON into EJSON and removing the top-level metadata fields (_id, _rev, etc), it JSON encodes the resulting EJSON body into a binary. This consumes CPU of course, but it brings 2 advantages: 1) we avoid the EJSON copy between the request process and the database updater process - for any realistic document size (4kb or more) this can be very expensive, especially when there are many nested structures (lists inside objects inside lists, etc) 2) before writing anything to the file, we do a term_to_binary([Len, Md5, TheThingToWrite]) and then write the result to the file. A term_to_binary call with a binary as the input is very fast compared to a term_to_binary call with EJSON as input (or some other nested structure). I think both compensate for the JSON encoding done after separating metadata fields from non-metadata fields.

The following relaximation graph, for documents with sizes of 4Kb, shows a significant performance increase both for writes and reads - especially reads.

http://graphs.mikeal.couchone.com/#/graph/698bf36b6c64dbd19aa2bef63400b94f

I've also made a few tests to see how much the improvement is when querying a view, for the first time, without ?stale=ok. The size difference of the databases (after compaction) is also very significant - this change can reduce the size by at least 50% in common cases. The test databases were created in an instance built from that experimental branch. Then they were replicated into a CouchDB instance built from the current trunk. At the end both databases were compacted (to fairly compare their final sizes). The databases contain the following view:

{
  "_id": "_design/test",
  "language": "javascript",
  "views": {
    "simple": {
      "map": "function(doc) { emit(doc.float1, doc.strings[1]); }"
    }
  }
}

## Database with 500 000 docs of 2.5Kb each

Document template is at: https://github.com/fdmanana/couchdb/blob/raw_json_docs/doc_2_5k.json

Sizes (branch vs trunk):

$ du -m couchdb/tmp/lib/disk_json_test.couch
1996 couchdb/tmp/lib/disk_json_test.couch
$ du -m couchdb-trunk/tmp/lib/disk_ejson_test.couch
2693 couchdb-trunk/tmp/lib/disk_ejson_test.couch

Time, from a user's perspective, to build the view index from scratch:

$ time curl http://localhost:5984/disk_json_test/_design/test/_view/simple?limit=1
{"total_rows":50,"offset":0,"rows":[
{"id":"076a-c1ae-4999-b508-c03f4d0620c5","key":null,"value":"wfxuF3N8XEK6"}
]}
real 6m6.740s
user 0m0.016s
sys 0m0.008s

$ time curl http://localhost:5985/disk_ejson_test/_design/test/_view/simple?limit=1
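The metadata splicing described above can be illustrated outside Erlang. Below is a minimal Python sketch (not CouchDB code; all function names are invented) of the two halves of the idea: stripping the top-level underscore-prefixed metadata fields on write, and splicing them back onto the stored raw JSON binary on read using byte operations only, without re-parsing the body:

```python
import json

def strip_metadata(doc_json: bytes) -> tuple[dict, bytes]:
    """Write path: decode once, remove top-level _-prefixed metadata
    fields, and re-encode the remaining body as a raw JSON binary."""
    obj = json.loads(doc_json)
    meta = {k: v for k, v in obj.items() if k.startswith("_")}
    body = {k: v for k, v in obj.items() if not k.startswith("_")}
    return meta, json.dumps(body, separators=(",", ":")).encode()

def splice_metadata(meta: dict, body: bytes) -> bytes:
    """Read path: prepend metadata to the stored raw JSON body using
    byte operations only -- the body itself is never re-parsed."""
    meta_json = json.dumps(meta, separators=(",", ":")).encode()
    if body == b"{}":                      # empty body: metadata alone
        return meta_json
    # drop meta's closing '}' and body's opening '{', join with a comma
    return meta_json[:-1] + b"," + body[1:]

meta, body = strip_metadata(b'{"_id":"doc1","_rev":"1-abc","a":1}')
roundtrip = splice_metadata(meta, body)
```

This mirrors the concern raised later in the thread: the splice step blindly trusts that `body` is a well-formed JSON object, which is exactly why validation at write time matters.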
[jira] Commented: (COUCHDB-1092) Storing documents bodies as raw JSON binaries instead of serialized JSON terms
[ https://issues.apache.org/jira/browse/COUCHDB-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008488#comment-13008488 ] Paul Joseph Davis commented on COUCHDB-1092:

1) I never claimed that this was a write-time check or intended it to be such. My concern is that doc_to_json would be capable of producing invalid JSON. I would also like to point out that your example of writing an invalid EJSON body would currently trigger the error I desire. By doing the binary manipulations at the JSON level we lose that stringency.

2) I would be more than happy to add a validation step to couch_db_updater to ensure that docs are valid before writing.

3) The issue that I'm gesticulating at is that this opens us up to the possibility of emitting invalid JSON by way of really wonky bit twiddling on binaries of uncertain status. The reason I say these are uncertain is that this API is so porous. If it were a closed API, meaning people were required to use calls like couch_doc:set_body(Doc, NewBody) that could validate all data going in and out at the primary junctures, then that'd be fine, but that patch would need to refactor large swaths of code, I imagine.

To be more specific, I'm not tied to the JSON splicer method. The concern I want to see addressed is avoiding the requirement that we rely on JSON data being specifically formatted while exposing that value as editable to client code. Expounding on my comment above, if the #doc{} record definition were module-scope only and we forced all modifications to go through an API, then I think we could avoid re-validation, but that sounds like it'd involve more code changes.
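The closed-API idea Paul describes (every body mutation goes through a validating accessor, so the stored binary can be trusted without re-validation on read) can be sketched roughly as follows. This is a Python illustration of the concept only; the class and method names echo the hypothetical couch_doc:set_body/2 and are not real CouchDB API:

```python
import json

class Document:
    """Doc whose raw JSON body is only mutable through a validating
    setter, so downstream code can trust the bytes without re-checking."""
    __slots__ = ("_body",)

    def __init__(self, body: bytes = b"{}"):
        self._body = b"{}"
        self.set_body(body)

    def set_body(self, new_body: bytes) -> None:
        # validate at the single write juncture: must parse as a JSON object
        parsed = json.loads(new_body)
        if not isinstance(parsed, dict):
            raise ValueError("document body must be a JSON object")
        self._body = new_body

    def body(self) -> bytes:
        return self._body

doc = Document(b'{"a": 1}')
try:
    doc.set_body(b'[1, 2]')      # a JSON array is rejected
    rejected = False
except ValueError:
    rejected = True
```

The design point is that validation cost is paid once per mutation rather than once per read, which only holds if no code path can bypass the setter, hence Paul's suggestion to make the #doc{} record module-private.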
[jira] Updated: (COUCHDB-994) Crash after compacting large views
[ https://issues.apache.org/jira/browse/COUCHDB-994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam Kocoloski updated COUCHDB-994:
---
Skill Level: Committers Level (Medium to Hard)

Crash after compacting large views
--
Key: COUCHDB-994
URL: https://issues.apache.org/jira/browse/COUCHDB-994
Project: CouchDB
Issue Type: Bug
Affects Versions: 1.0.2
Environment: CentOS 5 64-bit VM with 2 CPUs and 4G RAM running Erlang R14B and configured to use the 64-bit js-devel libraries.
URL: http://svn.apache.org/repos/asf/couchdb/branches/1.0.x
Repository Root: http://svn.apache.org/repos/asf
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
Revision: 1050680
Reporter: Bob Clary
Attachments: couch_errors.txt, couch_errors_2.txt

The database has over 9 million records. Several of the views are relatively dense in that they emit a key for most documents. The views are successfully created initially, but with relatively large sizes, from 20 to 95G. When attempting to compact them, the server will crash upon completion of the compaction. This does not occur with the released 1.0.1 version but does with the 1.0.x svn version. I'll attach example logs. Unfortunately they are level error and may not have enough information.

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (COUCHDB-994) Crash after compacting large views
[ https://issues.apache.org/jira/browse/COUCHDB-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008516#comment-13008516 ] Adam Kocoloski commented on COUCHDB-994:

I had a chance to delve into this and I think there's a very real bug here. The problem is two-fold:

1) The header for the compacted index is written much later than it could be.
2) We don't try to use the .compact index storage if the primary storage is missing.

We can fix the first problem if we simply write a header in the view compactor process before we send the compact_done message to the view group. The current system doesn't write the header until the view group processes the 'delayed_commit' message that it sends to itself when it switches the file over. If the group process closes at any point in the interim, we're going to reset.

The second problem seems a bit tricky on the face of it, but I think it will work out OK. View group compaction, unlike database compaction, is not all that easy to resume. We don't have access to detailed sequence numbers for each piece of the tree; all we have is a single current_seq for the entire group. But that's alright. I think all we need to do is implement the fix for #1, and then change the view group process to check for the presence of a .compact file if the primary storage is missing. Then one of 3 things can happen:

1) no .compact file, so we create a new file and index from scratch
2) .compact file is partially written, but if we seek backwards to find the header it will still have current_seq = 0, so we'll index from scratch
3) .compact file is fully written and has a valid current_seq. We rename it and have a successful recovery.

The second case can potentially block the view group for a long period of time as it scans backwards through the file. The third case is obviously the one we have to be the most careful about, particularly when it comes to database deletion.
It looks to me like couch_view:do_reset_indexes/2 takes care of .compact files as well as primary storage, so I don't think adding recovery changed our behavior at all there.
CouchDB exceptions
Been seeing this on our production CouchDBs (1.0.2) sporadically. We are using the _changes feed, background view indexing and automatic compaction.

Uncaught error in HTTP request: {exit,
    {noproc,
        {gen_server,call,
            [<0.1478.0>, {pread_iolist,290916}, infinity]}}}

Stacktrace: [{gen_server,call,3},
    {couch_file,pread_iolist,2},
    {couch_file,pread_binary,2},
    {couch_file,pread_term,2},
    {couch_db,make_doc,5},
    {couch_db,open_doc_int,3},
    {couch_db,open_doc,3},
    {couch_changes,'-make_filter_fun/4-lc$^4/1-3-',2}]

Not reproducible yet, but it seems that compacting while there are active _changes listeners triggers this. After the exception the _changes listeners are disconnected; they then connect back and everything goes back to normal. beam itself holds up, though last night it terminated with no logs, nothing. Just poof.

Any ideas?

Thanks,
K.
---
http://blitz.io
http://twitter.com/pcapr
Re: CouchDB exceptions
Ah, I think the issue is that while we are folding the by-sequence btree, we are not checking whether the database file has changed. So if compaction finishes before the btree fold does, we reach that error. I can't see right now any other situation, involving _changes, that might cause that issue.

On Fri, Mar 18, 2011 at 5:40 PM, kowsik kow...@gmail.com wrote:
 Been seeing this on our production CouchDBs (1.0.2) sporadically. We are using the _changes feed, background view indexing and automatic compaction. [...]

--
Filipe David Manana, fdman...@gmail.com, fdman...@apache.org

"Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men."
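The fix implied by the reply (detect that the underlying file changed mid-fold and recover, rather than crash with noproc) amounts to a catch-and-reopen loop around the fold. A toy Python model of that control flow, with every name invented for illustration:

```python
class StaleFileError(Exception):
    """Raised when the file backing a handle was swapped by compaction."""

class DbHandle:
    """Toy stand-in for a database file handle; 'generation' models
    which on-disk file the handle points at."""
    def __init__(self, generation):
        self.generation = generation

def fold_changes(open_db, process_row, rows):
    """Fold over change rows, reopening the db if compaction swapped
    the file mid-fold instead of propagating the error."""
    db = open_db()
    out = []
    for row in rows:
        try:
            out.append(process_row(db, row))
        except StaleFileError:
            db = open_db()          # compaction swapped the file: reopen
            out.append(process_row(db, row))
    return out

# Simulate compaction finishing in the middle of a fold.
current = {"gen": 0}

def open_db():
    return DbHandle(current["gen"])

def process_row(db, row):
    if row == 2 and current["gen"] == 0:
        current["gen"] = 1          # compaction completes mid-fold
    if db.generation < current["gen"]:
        raise StaleFileError()
    return row * 2

result = fold_changes(open_db, process_row, [1, 2, 3])
```

The real change would live in the Erlang _changes fold (couch_changes/couch_db); the point of the sketch is only that the fold must treat "file gone" as a signal to re-acquire the handle and resume, not as a fatal condition.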
[jira] Commented: (COUCHDB-1092) Storing documents bodies as raw JSON binaries instead of serialized JSON terms
[ https://issues.apache.org/jira/browse/COUCHDB-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008555#comment-13008555 ] Randall Leeds commented on COUCHDB-1092:

I love a good bike shed more than most, but I've stayed pretty quiet since my first comment because I wanted to think hard about what Paul was saying. In the end, I agree with the last comment. I would be happy to trust the md5 and not validate on the way out _only_ so long as we close the API for manipulating docs and validate on the way in. Paul, if I understand correctly, this sort of change should make you rest easy.

The internal API change would mean more code refactoring, but we shouldn't be afraid of that. The agile way forward, if people agree that this solution is prudent, would be to commit to trunk and open a blocking ticket to close down the document body API before release. Trunk is trunk, let's iterate on it. We haven't even shipped 1.1 yet! We could even branch a feature-frozen trunk for 1.2 and drop this on trunk targeted for 1.3. I'd love to see the 1.2 cycle stay short, and in general to have more frequent releases. It's something I feel we talk about a lot, but then we sit around and comment on tickets like this without taking the dive and committing. I don't mean that to sound like a rant.
Re: [jira] Commented: (COUCHDB-1092) Storing documents bodies as raw JSON binaries instead of serialized JSON terms
On Mar 18, 2011, at 2:08 PM, Randall Leeds (JIRA) wrote:
 I love a good bike shed more than most, but I've stayed pretty quiet since my first comment because I wanted to think hard about what Paul was saying. In the end, I agree with the last comment. I would be happy to trust the md5 and not validate on the way out _only_ so long as we close the API for manipulating docs and validate on the way in. Paul, if I understand correctly, this sort of change should make you rest easy.

I've also been watching this thread with no comment, but would +1 your proposal if I understand it correctly. I think the main concern is summarized in Paul's last post (Paul, tell me to shut up if I'm wrong):

 The concern I want to see addressed is avoiding the requirement that we rely on JSON data being specifically formatted while exposing that value as editable to client code.
 -- davisp

Essentially, the code isn't architected properly to support this change without adding the risk of data corruption, and any amount of that is bad. Your proposal, Randall, is to go forward with it subject to the constraint that more refactoring is done to clean up the APIs before it's published. If so, then I'd say go for it. More frequent releases and more progress would be valuable. I've seen a lot of forks and good ideas on github and would love to see more of it on trunk, e.g. Paul's btree cleanup.
[jira] Commented: (COUCHDB-1092) Storing documents bodies as raw JSON binaries instead of serialized JSON terms
[ https://issues.apache.org/jira/browse/COUCHDB-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008570#comment-13008570 ] Paul Joseph Davis commented on COUCHDB-1092:

@Randall and @Bon-on-dev,

You've both summarized my concerns and my thoughts on how to address them. I'd be a bit hesitant to commit the non-private-API version to trunk with the expectation that it gets fixed, because those sorts of things have a habit of never getting resolved. Though I wouldn't argue too forcefully against it if everyone is on board with that approach.
[jira] Commented: (COUCHDB-1092) Storing documents bodies as raw JSON binaries instead of serialized JSON terms
[ https://issues.apache.org/jira/browse/COUCHDB-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13008573#comment-13008573 ] Benoit Chesneau commented on COUCHDB-1092:

Maybe there could be a new branch for that? Also, I would prefer correctness over optimisation at this level. Even if we don't support the Erlang API (though I wonder what that means in Erlang), it's better to make sure everything is correct at the lowest level possible.

Storing documents bodies as raw JSON binaries instead of serialized JSON terms
Key: COUCHDB-1092
URL: https://issues.apache.org/jira/browse/COUCHDB-1092
Project: CouchDB
Issue Type: Improvement
Components: Database Core
Reporter: Filipe Manana
Assignee: Filipe Manana

Currently we store documents as Erlang-serialized (via the term_to_binary/1 BIF) EJSON. The proposed patch changes the database file format so that, instead of storing serialized EJSON document bodies, it stores raw JSON binaries. The GitHub branch is at: https://github.com/fdmanana/couchdb/tree/raw_json_docs

Advantages:
* What we write to disk is much smaller - a raw JSON binary can easily be up to 50% smaller (at least according to the tests I did).
* When serving documents to a client we no longer need to JSON-encode the document body read from disk - this applies to individual document requests, view queries with ?include_docs=true, pull and push replications, and possibly other use cases. We just grab the body and prepend the _id, _rev and all the other necessary metadata fields (via simple Erlang binary operations).
* We avoid the EJSON term copying between request handlers and the db updater processes, between the work queues and the view updater process, between replicator processes, etc.
* Before sending a document to the JavaScript view server, we no longer need to convert it from EJSON to JSON.

The changes to the document write workflow are minimal: after JSON-decoding the document into EJSON and removing the top-level metadata fields (_id, _rev, etc.), it JSON-encodes the resulting EJSON body into a binary. This consumes CPU, of course, but it brings two advantages:
1) We avoid the EJSON copy between the request process and the database updater process - for any realistic document size (4 KB or more) this can be very expensive, especially when there are many nested structures (lists inside objects inside lists, etc.).
2) Before writing anything to the file, we do a term_to_binary([Len, Md5, TheThingToWrite]) and then write the result to the file. A term_to_binary call with a binary as input is very fast compared to a term_to_binary call with EJSON (or some other nested structure) as input.

I think both compensate for the JSON encoding done after separating the metadata fields from the non-metadata fields. The following relaximation graph, for documents with sizes of 4 KB, shows a significant performance increase for both writes and reads - especially reads.
http://graphs.mikeal.couchone.com/#/graph/698bf36b6c64dbd19aa2bef63400b94f

I've also made a few tests to see how large the improvement is when querying a view for the first time, without ?stale=ok. The size difference between the databases (after compaction) is also very significant - this change can reduce the size by at least 50% in common cases. The test databases were created in an instance built from that experimental branch.
Then they were replicated into a CouchDB instance built from the current trunk. At the end, both databases were compacted (to fairly compare their final sizes). The databases contain the following view:

{
  "_id": "_design/test",
  "language": "javascript",
  "views": {
    "simple": {
      "map": "function(doc) { emit(doc.float1, doc.strings[1]); }"
    }
  }
}

## Database with 500 000 docs of 2.5Kb each

Document template is at: https://github.com/fdmanana/couchdb/blob/raw_json_docs/doc_2_5k.json

Sizes (branch vs trunk):

$ du -m couchdb/tmp/lib/disk_json_test.couch
1996    couchdb/tmp/lib/disk_json_test.couch
$ du -m couchdb-trunk/tmp/lib/disk_ejson_test.couch
2693    couchdb-trunk/tmp/lib/disk_ejson_test.couch

Time, from a user's perspective, to build the view index from scratch:

$ time curl http://localhost:5984/disk_json_test/_design/test/_view/simple?limit=1
{"total_rows":50,"offset":0,"rows":[
{"id":"076a-c1ae-4999-b508-c03f4d0620c5","key":null,"value":"wfxuF3N8XEK6"}
]}
real    6m6.740s
user    0m0.016s
sys     0m0.008s

$ time curl http://localhost:5985/disk_ejson_test/_design/test/_view/simple?limit=1
{"total_rows":50,"offset":0,"rows":[
{"id":"076a-c1ae-4999-b508-c03f4d0620c5","key":null,"value":"wfxuF3N8XEK6"}
]}
real    15m41.439s
user    0m0.012s
sys     0m0.012s
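The splice-instead-of-re-encode idea described in this comment can be sketched outside Erlang. Below is a hypothetical Python stand-in (the names store_body and serve_doc are mine, not CouchDB's): the body is kept as a raw JSON binary, and _id/_rev are prepended with plain byte operations when serving, instead of re-encoding the whole document.

```python
import json

# Hypothetical sketch (Python stand-in for the Erlang code): on write we
# strip the metadata fields and keep the remaining body as a raw JSON
# binary; on read we splice _id/_rev back in with byte operations only.

META_FIELDS = ("_id", "_rev")

def store_body(doc_json: bytes) -> tuple[dict, bytes]:
    """Decode once, split off metadata, re-encode only the body."""
    doc = json.loads(doc_json)
    meta = {k: doc.pop(k) for k in META_FIELDS if k in doc}
    return meta, json.dumps(doc, separators=(",", ":")).encode()

def serve_doc(meta: dict, body: bytes) -> bytes:
    """Prepend metadata to the stored raw JSON body without re-encoding it."""
    if not meta:
        return body
    prefix = b",".join(
        b'"%s":%s' % (k.encode(), json.dumps(v).encode()) for k, v in meta.items()
    )
    if body == b"{}":
        return b"{" + prefix + b"}"
    # Drop the body's opening brace and splice the metadata in front of it.
    return b"{" + prefix + b"," + body[1:]
```

The real code does the splice with Erlang binary concatenation, which is also why term_to_binary over the resulting flat binary is much cheaper than over a deeply nested EJSON term.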
[jira] Commented: (COUCHDB-1092) Storing documents bodies as raw JSON binaries instead of serialized JSON terms
[ https://issues.apache.org/jira/browse/COUCHDB-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13008576#comment-13008576 ] Randall Leeds commented on COUCHDB-1092:

I'm +0 on a new branch for this. On one hand, that's a pretty good way to handle iterating on a feature. On the other hand, I think it's pretty clear we love the performance and space savings we're seeing, and I think putting it on trunk is a good way to commit (pun intended) to following through. We don't release with blocking issues in JIRA, so if it were on trunk and blocking a release, I would have little fear of it languishing.
[jira] Commented: (COUCHDB-1092) Storing documents bodies as raw JSON binaries instead of serialized JSON terms
[ https://issues.apache.org/jira/browse/COUCHDB-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13008578#comment-13008578 ] Benoit Chesneau commented on COUCHDB-1092:

I'm +1 for using trunk as trunk (that should be another ticket), but at some point we should all agree on one policy. Recent commits tend to show that some here prefer using tickets with patches.
Re: CouchDB exceptions
That was quick! Filed in JIRA: https://issues.apache.org/jira/browse/COUCHDB-1093

Thanks Filipe,
K.
---
http://blitz.io
http://twitter.com/pcapr

On Fri, Mar 18, 2011 at 10:54 AM, Filipe David Manana fdman...@apache.org wrote:

Ah, I think the issue is that while we are folding the by-sequence btree, we are not checking whether the database file changed. So if compaction finishes before the btree fold finishes, we reach that error. I can't see right now any other situation, involving _changes, that might cause that issue.

On Fri, Mar 18, 2011 at 5:40 PM, kowsik kow...@gmail.com wrote:

Been seeing this on our production CouchDBs (1.0.2) sporadically. We are using the _changes feed, background view indexing and automatic compaction.

Uncaught error in HTTP request: {exit,
    {noproc,
        {gen_server,call,
            [<0.1478.0>, {pread_iolist,290916}, infinity]}}}

Stacktrace: [{gen_server,call,3},
    {couch_file,pread_iolist,2},
    {couch_file,pread_binary,2},
    {couch_file,pread_term,2},
    {couch_db,make_doc,5},
    {couch_db,open_doc_int,3},
    {couch_db,open_doc,3},
    {couch_changes,'-make_filter_fun/4-lc$^4/1-3-',2}]

Not reproducible yet, but compacting while there are active _changes listeners seems to trigger it. After the exception the _changes listeners are disconnected; they then reconnect and everything goes back to normal. beam itself holds up, though last night it terminated with no logs, nothing. Just poof. Any ideas?

Thanks, K.
---
http://blitz.io
http://twitter.com/pcapr

--
Filipe David Manana, fdman...@gmail.com, fdman...@apache.org
"Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men."
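Filipe's diagnosis above (the by-sequence btree fold keeps reading from the pre-compaction file, whose couch_file process dies when compaction swaps files, hence the noproc exit) can be modeled with a small sketch. This is a hypothetical Python model, not CouchDB code: the reader's handle goes stale when compaction bumps the file generation, and a reopen-and-retry loop recovers, mirroring how the disconnected _changes listeners reconnect and continue normally.

```python
# Hypothetical model of the race: a long fold reads through a handle
# opened before compaction; when compaction swaps the file, reads
# through the stale handle fail and the reader must reopen and retry.

class FileSwapped(Exception):
    """Raised when the underlying database file was replaced mid-fold."""

class Db:
    def __init__(self):
        self.generation = 0  # bumped each time compaction swaps the file

    def compact(self):
        self.generation += 1

    def pread(self, opened_at: int) -> str:
        # A read through a handle from an older generation fails, like a
        # gen_server call to a couch_file process that no longer exists.
        if opened_at != self.generation:
            raise FileSwapped("file process died (noproc)")
        return "doc"

def read_with_retry(db: Db, opened_at: int, retries: int = 1) -> str:
    for _ in range(retries + 1):
        try:
            return db.pread(opened_at)
        except FileSwapped:
            opened_at = db.generation  # reopen the current file and retry
    raise RuntimeError("still failing after retries")
```

The fix hinted at in the thread is the server-side equivalent: check during the fold whether the file changed, rather than relying on clients to reconnect.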
[jira] Created: (COUCHDB-1093) Exceptions related to _changes + compact
Exceptions related to _changes + compact
Key: COUCHDB-1093
URL: https://issues.apache.org/jira/browse/COUCHDB-1093
Project: CouchDB
Issue Type: Bug
Components: Database Core
Affects Versions: 1.0.2
Environment: I don't believe this is OS- and/or hardware-related, but I'm running on a Red Hat 32-bit Linux kernel
Reporter: kowsik

From the last thread on the dev mailing list:

On Fri, Mar 18, 2011 at 10:54 AM, Filipe David Manana fdman...@apache.org wrote:

Ah, I think the issue is that while we are folding the by-sequence btree, we are not checking whether the database file changed. So if compaction finishes before the btree fold finishes, we reach that error. I can't see right now any other situation, involving _changes, that might cause that issue.

On Fri, Mar 18, 2011 at 5:40 PM, kowsik kow...@gmail.com wrote:

Been seeing this on our production CouchDBs (1.0.2) sporadically. We are using the _changes feed, background view indexing and automatic compaction.

Uncaught error in HTTP request: {exit,
    {noproc,
        {gen_server,call,
            [<0.1478.0>, {pread_iolist,290916}, infinity]}}}

Stacktrace: [{gen_server,call,3},
    {couch_file,pread_iolist,2},
    {couch_file,pread_binary,2},
    {couch_file,pread_term,2},
    {couch_db,make_doc,5},
    {couch_db,open_doc_int,3},
    {couch_db,open_doc,3},
    {couch_changes,'-make_filter_fun/4-lc$^4/1-3-',2}]

Not reproducible yet, but compacting while there are active _changes listeners seems to trigger it. After the exception the _changes listeners are disconnected; they then reconnect and everything goes back to normal. beam itself holds up, though last night it terminated with no logs, nothing. Just poof. Any ideas?

Thanks, K.
---
http://blitz.io
http://twitter.com/pcapr

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (COUCHDB-867) Add http handlers for root files with special meanings, such as crossdomain.xml.
[ https://issues.apache.org/jira/browse/COUCHDB-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13008653#comment-13008653 ] edward de jong commented on COUCHDB-867:

I am trying to get Flash to connect to my database, and desperately need this crossdomain.xml file to be served up. Can somebody please tell me how to create the local.ini file, and where that file is supposed to go? I don't see a local.ini file on my machine...

Add http handlers for root files with special meanings, such as crossdomain.xml.
Key: COUCHDB-867
URL: https://issues.apache.org/jira/browse/COUCHDB-867
Project: CouchDB
Issue Type: Improvement
Components: HTTP Interface
Affects Versions: 1.0.1
Reporter: Eric Desgranges
Attachments: handle_file_req.diff

Some files at the root level of a website have a special meaning, such as favicon.ico storing the favorite icon, which is processed correctly in the [httpd_global_handlers] section of the ini file with this instruction:

favicon.ico = {couch_httpd_misc_handlers, handle_favicon_req, "../share/couchdb/www"}

But this is the only one handled, while other files, which are critical when accessing the CouchDB server from Flash, Flex, Silverlight..., are missing:
- crossdomain.xml (this one should be a top-priority fix!)
- clientaccesspolicy.xml - see http://msdn.microsoft.com/en-us/library/cc838250%28v=VS.95%29.aspx#crossdomain_communication

And there's also robots.txt, to prevent search engines from accessing some files / directories.
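For the question above, a sketch of what local.ini could look like, with hedges: CouchDB reads default.ini and then local.ini from its configuration directory (commonly /etc/couchdb/ or /usr/local/etc/couchdb/, next to default.ini); if local.ini does not exist you can create it there. The favicon.ico line is the one quoted in this issue; the crossdomain.xml line is speculative and only works if something like the attached handle_file_req.diff is applied, since stock CouchDB 1.0.x has no such handler.

```ini
; local.ini - goes in CouchDB's config directory, next to default.ini
; (commonly /etc/couchdb/ or /usr/local/etc/couchdb/).
[httpd_global_handlers]
; This handler exists in stock CouchDB (quoted in this issue):
favicon.ico = {couch_httpd_misc_handlers, handle_favicon_req, "../share/couchdb/www"}
; Hypothetical - requires a file handler such as the attached patch:
; crossdomain.xml = {couch_httpd_misc_handlers, handle_file_req, "../share/couchdb/www"}
```

Settings in local.ini override default.ini and survive upgrades, which is why site-specific handler lines belong there.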