[jira] Commented: (COUCHDB-1092) Storing documents bodies as raw JSON binaries instead of serialized JSON terms
[ https://issues.apache.org/jira/browse/COUCHDB-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008434#comment-13008434 ] Filipe Manana commented on COUCHDB-1092:

Ok, I tested Paul's branch and here are the results I got:

# View generation

$ time curl http://localhost:5984/branch_floats_zip_db/_design/test/_view/simple?limit=1
{"total_rows":10,"offset":0,"rows":[
{"id":"6649-c7d2-45d0-b926-86a39e4db4d0","key":null,"value":"2fQUbzRUax4A"}
]}
real    5m54.197s
user    0m0.000s
sys     0m0.012s

$ rm -fr couchdb/tmp/lib/.branch_floats_zip_db_design
$ echo 3 > /proc/sys/vm/drop_caches
$ time curl http://localhost:5984/branch_floats_zip_db/_design/test/_view/simple?limit=1
{"total_rows":10,"offset":0,"rows":[
{"id":"6649-c7d2-45d0-b926-86a39e4db4d0","key":null,"value":"2fQUbzRUax4A"}
]}
real    5m52.879s
user    0m0.004s
sys     0m0.016s

$ rm -fr couchdb/tmp/lib/.branch_floats_zip_db_design
$ echo 3 > /proc/sys/vm/drop_caches
$ time curl http://localhost:5984/branch_floats_zip_db/_design/test/_view/simple?limit=1
{"total_rows":10,"offset":0,"rows":[
{"id":"6649-c7d2-45d0-b926-86a39e4db4d0","key":null,"value":"2fQUbzRUax4A"}
]}
real    5m59.737s
user    0m0.000s
sys     0m0.016s

With my original branch this takes about 4 minutes; with trunk it takes about 15 minutes.

I also ran relaximation several times, with delayed_commits set to false, and here are 2 of those runs:

http://graphs.mikeal.couchone.com/#/graph/698bf36b6c64dbd19aa2bef63400cfd7
http://graphs.mikeal.couchone.com/#/graph/698bf36b6c64dbd19aa2bef63400d306

I think it's easy to see that both reads and writes are worse than trunk.

Also, apart from the vhosts and 173-os-daemon-cfg-register.t etap tests, I don't get any tests failing. I think both are known to be failing in trunk, as several people have the same issue.

I mentioned several times that this branch still has some TODOs and is not finished; I said so in the very first comment, so I thought I was clear about that.
Namely, I plan on removing the functions couch_doc:from_json_obj and couch_doc:to_json_obj so that only the new ones I added are used, since they are also there to guarantee correct behaviour. Part of the reason to present a not yet finished and fully polished patch was to get feedback and stimulate the community. Also, I haven't had my last questions answered by Paul:

1) Paul claims his jsonsplice addition helps prevent invalid document bodies from being written to disk, while what I see is that his addition validates only on the document read path.

2) Besides that, our current code doesn't ensure that when someone calls couch_db:update_doc or couch_db:update_docs, the document bodies are EJSON objects. I have given an example in a CouchDB trunk shell that shows this: http://friendpaste.com/3h2IgFF1RXvwxpDGiMDOdS Therefore, this issue is not introduced by this patch/branch.

3) These document update functions still accept EJSON document bodies like before, and I added hooks to make sure they get converted to JSON binaries. Therefore neither users nor developers using the couch_db API need to change their code at all.

Two new etap tests were added, exclusively dedicated to the new functions that were added to the couch_doc module. I intended to remove the tests 030-doc-from-json.t and 031-doc-to-json.t since they test the old functions couch_doc:from_json_obj and couch_doc:to_json_obj, which I planned to remove from couch_doc, as I pointed out earlier in this comment.

Adam's suggestion is not yet integrated either. I hope my comments and points are clearer now.

Storing documents bodies as raw JSON binaries instead of serialized JSON terms
--
Key: COUCHDB-1092
URL: https://issues.apache.org/jira/browse/COUCHDB-1092
Project: CouchDB
Issue Type: Improvement
Components: Database Core
Reporter: Filipe Manana
Assignee: Filipe Manana

Currently we store documents as Erlang serialized (via the term_to_binary/1 BIF) EJSON.
The proposed patch changes the database file format so that instead of storing serialized EJSON document bodies, it stores raw JSON binaries. The github branch is at: https://github.com/fdmanana/couchdb/tree/raw_json_docs

Advantages:

* what we write to disk is much smaller - a raw JSON binary can easily get up to 50% smaller (at least according to the tests I did)
* when serving documents to a client we no longer need to JSON encode the document body read from the disk - this applies to individual document requests, view queries with ?include_docs=true, pull and push replications, and possibly other use cases. We just grab its body and prepend the _id, _rev and all the necessary metadata fields (this is via simple Erlang binary operations)
* we avoid the EJSON term copying between request handlers and the db updater processes, between the work queues and the view updater process, between replicator processes, etc
* before sending a document to the JavaScript view server, we no longer need to convert it from EJSON to JSON

The changes done to the document write workflow are minimal - after JSON decoding the document's JSON into EJSON and removing the top-level metadata fields (_id, _rev, etc), it JSON encodes the resulting EJSON body into a binary. This consumes CPU of course, but it brings 2 advantages: 1) we avoid the EJSON copy between the request process and the database updater process - for any realistic document size (4kb or more) this can be very expensive, especially when there are many nested structures (lists inside objects inside lists, etc) 2) before writing anything to the file, we do a term_to_binary([Len, Md5, TheThingToWrite]) and then write the result to the file. A term_to_binary call with a binary as the input is very fast compared to a term_to_binary call with EJSON as input (or some other nested structure). I think both compensate for the JSON encoding done after separating metadata fields from non-metadata fields.

The following relaximation graph, for documents with sizes of 4Kb, shows a significant performance increase both for writes and reads - especially reads.

http://graphs.mikeal.couchone.com/#/graph/698bf36b6c64dbd19aa2bef63400b94f

I've also made a few tests to see how much the improvement is when querying a view, for the first time, without ?stale=ok. The size difference of the databases (after compaction) is also very significant - this change can reduce the size by at least 50% in common cases. The test databases were created in an instance built from that experimental branch. Then they were replicated into a CouchDB instance built from the current trunk. At the end both databases were compacted (to fairly compare their final sizes). The databases contain the following view:

{
  "_id": "_design/test",
  "language": "javascript",
  "views": {
    "simple": {
      "map": "function(doc) { emit(doc.float1, doc.strings[1]); }"
    }
  }
}

## Database with 500 000 docs of 2.5Kb each

Document template is at: https://github.com/fdmanana/couchdb/blob/raw_json_docs/doc_2_5k.json

Sizes (branch vs trunk):

$ du -m couchdb/tmp/lib/disk_json_test.couch
1996 couchdb/tmp/lib/disk_json_test.couch
$ du -m couchdb-trunk/tmp/lib/disk_ejson_test.couch
2693 couchdb-trunk/tmp/lib/disk_ejson_test.couch

Time, from a user's perspective, to build the view index from scratch:

$ time curl http://localhost:5984/disk_json_test/_design/test/_view/simple?limit=1
{"total_rows":50,"offset":0,"rows":[
{"id":"076a-c1ae-4999-b508-c03f4d0620c5","key":null,"value":"wfxuF3N8XEK6"}
]}
real 6m6.740s
user 0m0.016s
sys 0m0.008s

$ time curl http://localhost:5985/disk_ejson_test/_design/test/_view/simple?limit=1
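The metadata splicing described above can be illustrated outside Erlang. Below is a minimal Python sketch (not CouchDB code; all function names are invented) of the two halves of the idea: stripping the top-level underscore-prefixed metadata fields on write, and splicing them back onto the stored raw JSON binary on read using byte operations only, without re-parsing the body:

```python
import json

def strip_metadata(doc_json: bytes) -> tuple[dict, bytes]:
    """Write path: decode once, remove top-level _-prefixed metadata
    fields, and re-encode the remaining body as a raw JSON binary."""
    obj = json.loads(doc_json)
    meta = {k: v for k, v in obj.items() if k.startswith("_")}
    body = {k: v for k, v in obj.items() if not k.startswith("_")}
    return meta, json.dumps(body, separators=(",", ":")).encode()

def splice_metadata(meta: dict, body: bytes) -> bytes:
    """Read path: prepend metadata to the stored raw JSON body using
    byte operations only -- the body itself is never re-parsed."""
    meta_json = json.dumps(meta, separators=(",", ":")).encode()
    if body == b"{}":                      # empty body: metadata alone
        return meta_json
    # drop meta's closing '}' and body's opening '{', join with a comma
    return meta_json[:-1] + b"," + body[1:]

meta, body = strip_metadata(b'{"_id":"doc1","_rev":"1-abc","a":1}')
roundtrip = splice_metadata(meta, body)
```

This mirrors the concern raised later in the thread: the splice step blindly trusts that `body` is a well-formed JSON object, which is exactly why validation at write time matters.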
[jira] Commented: (COUCHDB-1092) Storing documents bodies as raw JSON binaries instead of serialized JSON terms
[ https://issues.apache.org/jira/browse/COUCHDB-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008488#comment-13008488 ] Paul Joseph Davis commented on COUCHDB-1092:

1) I never claimed that this was a write-time check or intended it to be such. My concern is that doc_to_json would be capable of producing invalid JSON. I would also like to point out that your example of writing an invalid EJSON body would currently trigger the error I desire. By doing the binary manipulations at the JSON level we lose that stringency.

2) I would be more than happy to add a validation step to couch_db_updater to ensure that docs are valid before writing.

3) The issue that I'm gesticulating at is that this opens us up to the possibility of emitting invalid JSON by way of really wonky bit twiddling on binaries of uncertain status. The reason I say these are uncertain is that this API is so porous. If it were a closed API, meaning people were required to use calls like couch_doc:set_body(Doc, NewBody) that could validate all data going in and out at the primary junctures, then that'd be fine, but that patch would need to refactor large swaths of code, I imagine.

To be more specific, I'm not tied to the JSON splicer method. The concern I want to see addressed is avoiding the requirement that we rely on JSON data being specifically formatted while exposing that value as editable to client code. Expounding on my comment above, if the #doc{} record definition were module-scope only and we forced all modifications to go through an API, then I think we could avoid re-validation, but that sounds like it'd involve more code changes.
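The closed-API idea Paul describes (every body mutation goes through a validating accessor, so the stored binary can be trusted without re-validation on read) can be sketched roughly as follows. This is a Python illustration of the concept only; the class and method names echo the hypothetical couch_doc:set_body/2 and are not real CouchDB API:

```python
import json

class Document:
    """Doc whose raw JSON body is only mutable through a validating
    setter, so downstream code can trust the bytes without re-checking."""
    __slots__ = ("_body",)

    def __init__(self, body: bytes = b"{}"):
        self._body = b"{}"
        self.set_body(body)

    def set_body(self, new_body: bytes) -> None:
        # validate at the single write juncture: must parse as a JSON object
        parsed = json.loads(new_body)
        if not isinstance(parsed, dict):
            raise ValueError("document body must be a JSON object")
        self._body = new_body

    def body(self) -> bytes:
        return self._body

doc = Document(b'{"a": 1}')
try:
    doc.set_body(b'[1, 2]')      # a JSON array is rejected
    rejected = False
except ValueError:
    rejected = True
```

The design point is that validation cost is paid once per mutation rather than once per read, which only holds if no code path can bypass the setter, hence Paul's suggestion to make the #doc{} record module-private.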
[jira] Updated: (COUCHDB-994) Crash after compacting large views
[ https://issues.apache.org/jira/browse/COUCHDB-994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam Kocoloski updated COUCHDB-994:
---
Skill Level: Committers Level (Medium to Hard)

Crash after compacting large views
--
Key: COUCHDB-994
URL: https://issues.apache.org/jira/browse/COUCHDB-994
Project: CouchDB
Issue Type: Bug
Affects Versions: 1.0.2
Environment: CentOS 5 64-bit VM with 2 CPUs and 4G RAM running Erlang R14B and configured to use the 64-bit js-devel libraries.
URL: http://svn.apache.org/repos/asf/couchdb/branches/1.0.x
Repository Root: http://svn.apache.org/repos/asf
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
Revision: 1050680
Reporter: Bob Clary
Attachments: couch_errors.txt, couch_errors_2.txt

The database has over 9 million records. Several of the views are relatively dense in that they emit a key for most documents. The views are successfully created initially, but with relatively large sizes, from 20 to 95G. When attempting to compact them, the server will crash upon completion of the compaction. This does not occur with the released 1.0.1 version but does with the 1.0.x svn version. I'll attach example logs. Unfortunately they are level error and may not have enough information.

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (COUCHDB-994) Crash after compacting large views
[ https://issues.apache.org/jira/browse/COUCHDB-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008516#comment-13008516 ] Adam Kocoloski commented on COUCHDB-994:

I had a chance to delve into this and I think there's a very real bug here. The problem is two-fold:

1) The header for the compacted index is written much later than it could be.
2) We don't try to use the .compact index storage if the primary storage is missing.

We can fix the first problem if we simply write a header in the view compactor process before we send the compact_done message to the view group. The current system doesn't write the header until the view group processes the 'delayed_commit' message that it sends to itself when it switches the file over. If the group process closes at any point in the interim, we're going to reset.

The second problem seems a bit tricky on the face of it, but I think it will work out OK. View group compaction, unlike database compaction, is not all that easy to resume. We don't have access to detailed sequence numbers for each piece of the tree; all we have is a single current_seq for the entire group. But that's alright. I think all we need to do is implement the fix for #1, and then change the view group process to check for the presence of a .compact file if the primary storage is missing. Then one of 3 things can happen:

1) no .compact file, so we create a new file and index from scratch
2) .compact file is partially written, but if we seek backwards to find the header it will still have current_seq = 0, so we'll index from scratch
3) .compact file is fully written and has a valid current_seq. We rename it and have a successful recovery.

The second case can potentially block the view group for a long period of time as it scans backwards through the file. The third case is obviously the one we have to be the most careful about, particularly when it comes to database deletion.
It looks to me like couch_view:do_reset_indexes/2 takes care of .compact files as well as primary storage, so I don't think adding recovery changed our behavior at all there.
CouchDB exceptions
Been seeing this on our production CouchDBs (1.0.2) sporadically. We are using the _changes feed, background view indexing and automatic compaction.

Uncaught error in HTTP request: {exit,
    {noproc,
        {gen_server,call,
            [<0.1478.0>, {pread_iolist,290916}, infinity]}}}

Stacktrace: [{gen_server,call,3},
    {couch_file,pread_iolist,2},
    {couch_file,pread_binary,2},
    {couch_file,pread_term,2},
    {couch_db,make_doc,5},
    {couch_db,open_doc_int,3},
    {couch_db,open_doc,3},
    {couch_changes,'-make_filter_fun/4-lc$^4/1-3-',2}]

Not reproducible yet, but it seems that compacting while there are active _changes listeners triggers this. After the exception the _changes listeners are disconnected; they then connect back and everything goes back to normal. beam itself holds up, though last night it terminated with no logs, nothing. Just poof.

Any ideas?

Thanks,
K.
---
http://blitz.io
http://twitter.com/pcapr
Re: CouchDB exceptions
Ah, I think the issue is that while we are folding the by-sequence btree, we are not checking whether the database file has changed. So if compaction finishes before the btree fold does, we reach that error. I can't see right now any other situation, involving _changes, that might cause that issue.

On Fri, Mar 18, 2011 at 5:40 PM, kowsik kow...@gmail.com wrote:
 Been seeing this on our production CouchDBs (1.0.2) sporadically. We are using the _changes feed, background view indexing and automatic compaction. [...]

--
Filipe David Manana, fdman...@gmail.com, fdman...@apache.org

"Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men."
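The fix implied by the reply (detect that the underlying file changed mid-fold and recover, rather than crash with noproc) amounts to a catch-and-reopen loop around the fold. A toy Python model of that control flow, with every name invented for illustration:

```python
class StaleFileError(Exception):
    """Raised when the file backing a handle was swapped by compaction."""

class DbHandle:
    """Toy stand-in for a database file handle; 'generation' models
    which on-disk file the handle points at."""
    def __init__(self, generation):
        self.generation = generation

def fold_changes(open_db, process_row, rows):
    """Fold over change rows, reopening the db if compaction swapped
    the file mid-fold instead of propagating the error."""
    db = open_db()
    out = []
    for row in rows:
        try:
            out.append(process_row(db, row))
        except StaleFileError:
            db = open_db()          # compaction swapped the file: reopen
            out.append(process_row(db, row))
    return out

# Simulate compaction finishing in the middle of a fold.
current = {"gen": 0}

def open_db():
    return DbHandle(current["gen"])

def process_row(db, row):
    if row == 2 and current["gen"] == 0:
        current["gen"] = 1          # compaction completes mid-fold
    if db.generation < current["gen"]:
        raise StaleFileError()
    return row * 2

result = fold_changes(open_db, process_row, [1, 2, 3])
```

The real change would live in the Erlang _changes fold (couch_changes/couch_db); the point of the sketch is only that the fold must treat "file gone" as a signal to re-acquire the handle and resume, not as a fatal condition.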
[jira] Commented: (COUCHDB-1092) Storing documents bodies as raw JSON binaries instead of serialized JSON terms
[ https://issues.apache.org/jira/browse/COUCHDB-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008555#comment-13008555 ] Randall Leeds commented on COUCHDB-1092:

I love a good bike shed more than most, but I've stayed pretty quiet since my first comment because I wanted to think hard about what Paul was saying. In the end, I agree with the last comment. I would be happy to trust the md5 and not validate on the way out _only_ so long as we close the API for manipulating docs and validate on the way in. Paul, if I understand correctly, this sort of change should make you rest easy.

The internal API change would mean more code refactoring, but we shouldn't be afraid of that. The agile way forward, if people agree that this solution is prudent, would be to commit to trunk and open a blocking ticket to close down the document body API before release. Trunk is trunk, let's iterate on it. We haven't even shipped 1.1 yet! We could even branch a feature-frozen trunk for 1.2 and drop this on trunk targeted for 1.3. I'd love to see the 1.2 cycle stay short, and in general to have more frequent releases. It's something I feel we talk about a lot, but then we sit around and comment on tickets like this without taking the dive and committing. I don't mean that to sound like a rant.
Re: [jira] Commented: (COUCHDB-1092) Storing documents bodies as raw JSON binaries instead of serialized JSON terms
On Mar 18, 2011, at 2:08 PM, Randall Leeds (JIRA) wrote:
 I love a good bike shed more than most, but I've stayed pretty quiet since my first comment because I wanted to think hard about what Paul was saying. In the end, I agree with the last comment. I would be happy to trust the md5 and not validate on the way out _only_ so long as we close the API for manipulating docs and validate on the way in. Paul, if I understand correctly, this sort of change should make you rest easy.

I've also been watching this thread with no comment, but would +1 your proposal if I understand it correctly. I think the main concern is summarized in Paul's last post (Paul, tell me to shut up if I'm wrong):

 The concern I want to see addressed is avoiding the requirement that we rely on JSON data being specifically formatted while exposing that value as editable to client code.
 -- davisp

Essentially, the code isn't architected properly to support this change without adding the risk of data corruption, and any amount of that is bad. Your proposal, Randall, is to go forward with it subject to the constraint that more refactoring is done to clean up the APIs before it's published. If so, then I'd say go for it. More frequent releases and more progress would be valuable. I've seen a lot of forks and good ideas on github and would love to see more of it on trunk, e.g. Paul's btree cleanup.
[jira] Commented: (COUCHDB-1092) Storing documents bodies as raw JSON binaries instead of serialized JSON terms
[ https://issues.apache.org/jira/browse/COUCHDB-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008570#comment-13008570 ] Paul Joseph Davis commented on COUCHDB-1092:

@Randall and @Bon-on-dev,

You've both summarized my concerns and my thoughts on how to address them. I'd be a bit hesitant to commit the non-private-API version to trunk with the expectation that it gets fixed, because those sorts of things have a habit of never getting resolved. Though I wouldn't argue too forcefully against it if everyone is on board with that approach.
[jira] Commented: (COUCHDB-1092) Storing documents bodies as raw JSON binaries instead of serialized JSON terms
[ https://issues.apache.org/jira/browse/COUCHDB-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13008573#comment-13008573 ] Benoit Chesneau commented on COUCHDB-1092:

Maybe there could be a new branch for that? Also, I would prefer correctness over optimisation at this level. Even if we don't support the Erlang API (though I wonder what that means in Erlang), it's better to make sure everything is correct at the lowest level possible.

Storing documents bodies as raw JSON binaries instead of serialized JSON terms
Key: COUCHDB-1092
URL: https://issues.apache.org/jira/browse/COUCHDB-1092
Project: CouchDB
Issue Type: Improvement
Components: Database Core
Reporter: Filipe Manana
Assignee: Filipe Manana

Currently we store documents as Erlang-serialized (via the term_to_binary/1 BIF) EJSON. The proposed patch changes the database file format so that, instead of storing serialized EJSON document bodies, it stores raw JSON binaries. The GitHub branch is at: https://github.com/fdmanana/couchdb/tree/raw_json_docs

Advantages:
* What we write to disk is much smaller - a raw JSON binary can easily be up to 50% smaller (at least according to the tests I did).
* When serving documents to a client we no longer need to JSON-encode the document body read from disk - this applies to individual document requests, view queries with ?include_docs=true, pull and push replications, and possibly other use cases. We just grab the body and prepend the _id, _rev and all the other necessary metadata fields (via simple Erlang binary operations).
* We avoid the EJSON term copying between request handlers and the db updater processes, between the work queues and the view updater process, between replicator processes, etc.
* Before sending a document to the JavaScript view server, we no longer need to convert it from EJSON to JSON.

The changes to the document write workflow are minimal: after JSON-decoding the document into EJSON and removing the top-level metadata fields (_id, _rev, etc.), it JSON-encodes the resulting EJSON body into a binary. This consumes CPU, of course, but it brings two advantages:
1) We avoid the EJSON copy between the request process and the database updater process - for any realistic document size (4 KB or more) this can be very expensive, especially when there are many nested structures (lists inside objects inside lists, etc.).
2) Before writing anything to the file, we do a term_to_binary([Len, Md5, TheThingToWrite]) and then write the result to the file. A term_to_binary call with a binary as input is very fast compared to a term_to_binary call with EJSON (or some other nested structure) as input.

I think both compensate for the JSON encoding done after separating the metadata fields from the non-metadata fields. The following relaximation graph, for documents with sizes of 4 KB, shows a significant performance increase for both writes and reads - especially reads.
http://graphs.mikeal.couchone.com/#/graph/698bf36b6c64dbd19aa2bef63400b94f

I've also made a few tests to see how large the improvement is when querying a view for the first time, without ?stale=ok. The size difference between the databases (after compaction) is also very significant - this change can reduce the size by at least 50% in common cases. The test databases were created in an instance built from that experimental branch.
Then they were replicated into a CouchDB instance built from the current trunk. At the end, both databases were compacted (to fairly compare their final sizes). The databases contain the following view:

{
  "_id": "_design/test",
  "language": "javascript",
  "views": {
    "simple": {
      "map": "function(doc) { emit(doc.float1, doc.strings[1]); }"
    }
  }
}

## Database with 500 000 docs of 2.5Kb each

Document template is at: https://github.com/fdmanana/couchdb/blob/raw_json_docs/doc_2_5k.json

Sizes (branch vs trunk):

$ du -m couchdb/tmp/lib/disk_json_test.couch
1996    couchdb/tmp/lib/disk_json_test.couch
$ du -m couchdb-trunk/tmp/lib/disk_ejson_test.couch
2693    couchdb-trunk/tmp/lib/disk_ejson_test.couch

Time, from a user's perspective, to build the view index from scratch:

$ time curl http://localhost:5984/disk_json_test/_design/test/_view/simple?limit=1
{"total_rows":50,"offset":0,"rows":[
{"id":"076a-c1ae-4999-b508-c03f4d0620c5","key":null,"value":"wfxuF3N8XEK6"}
]}
real    6m6.740s
user    0m0.016s
sys     0m0.008s

$ time curl http://localhost:5985/disk_ejson_test/_design/test/_view/simple?limit=1
{"total_rows":50,"offset":0,"rows":[
{"id":"076a-c1ae-4999-b508-c03f4d0620c5","key":null,"value":"wfxuF3N8XEK6"}
]}
real    15m41.439s
user    0m0.012s
sys     0m0.012s
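The splice-instead-of-re-encode idea described in this comment can be sketched outside Erlang. Below is a hypothetical Python stand-in (the names store_body and serve_doc are mine, not CouchDB's): the body is kept as a raw JSON binary, and _id/_rev are prepended with plain byte operations when serving, instead of re-encoding the whole document.

```python
import json

# Hypothetical sketch (Python stand-in for the Erlang code): on write we
# strip the metadata fields and keep the remaining body as a raw JSON
# binary; on read we splice _id/_rev back in with byte operations only.

META_FIELDS = ("_id", "_rev")

def store_body(doc_json: bytes) -> tuple[dict, bytes]:
    """Decode once, split off metadata, re-encode only the body."""
    doc = json.loads(doc_json)
    meta = {k: doc.pop(k) for k in META_FIELDS if k in doc}
    return meta, json.dumps(doc, separators=(",", ":")).encode()

def serve_doc(meta: dict, body: bytes) -> bytes:
    """Prepend metadata to the stored raw JSON body without re-encoding it."""
    if not meta:
        return body
    prefix = b",".join(
        b'"%s":%s' % (k.encode(), json.dumps(v).encode()) for k, v in meta.items()
    )
    if body == b"{}":
        return b"{" + prefix + b"}"
    # Drop the body's opening brace and splice the metadata in front of it.
    return b"{" + prefix + b"," + body[1:]
```

The real code does the splice with Erlang binary concatenation, which is also why term_to_binary over the resulting flat binary is much cheaper than over a deeply nested EJSON term.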
[jira] Commented: (COUCHDB-1092) Storing documents bodies as raw JSON binaries instead of serialized JSON terms
[ https://issues.apache.org/jira/browse/COUCHDB-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13008576#comment-13008576 ] Randall Leeds commented on COUCHDB-1092:

I'm +0 on a new branch for this. On one hand, that's a pretty good way to handle iterating on a feature. On the other hand, I think it's pretty clear we love the performance and space savings we're seeing, and I think putting it on trunk is a good way to commit (pun intended) to following through. We don't release with blocking issues in JIRA, so if it were on trunk and blocking a release, I would have little fear of it languishing.
[jira] Commented: (COUCHDB-1092) Storing documents bodies as raw JSON binaries instead of serialized JSON terms
[ https://issues.apache.org/jira/browse/COUCHDB-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13008578#comment-13008578 ] Benoit Chesneau commented on COUCHDB-1092:

I'm +1 for using trunk as trunk (that should be another ticket), but at some point we should all agree on one policy. Recent commits tend to show that some here prefer using tickets with patches.
Re: CouchDB exceptions
That was quick! Filed in JIRA: https://issues.apache.org/jira/browse/COUCHDB-1093

Thanks Filipe,
K.
---
http://blitz.io
http://twitter.com/pcapr

On Fri, Mar 18, 2011 at 10:54 AM, Filipe David Manana fdman...@apache.org wrote:

Ah, I think the issue is that while we are folding the by-sequence btree, we are not checking whether the database file changed. So if compaction finishes before the btree fold finishes, we reach that error. I can't see right now any other situation, involving _changes, that might cause that issue.

On Fri, Mar 18, 2011 at 5:40 PM, kowsik kow...@gmail.com wrote:

Been seeing this on our production CouchDBs (1.0.2) sporadically. We are using the _changes feed, background view indexing and automatic compaction.

Uncaught error in HTTP request: {exit,
    {noproc,
        {gen_server,call,
            [<0.1478.0>, {pread_iolist,290916}, infinity]}}}

Stacktrace: [{gen_server,call,3},
    {couch_file,pread_iolist,2},
    {couch_file,pread_binary,2},
    {couch_file,pread_term,2},
    {couch_db,make_doc,5},
    {couch_db,open_doc_int,3},
    {couch_db,open_doc,3},
    {couch_changes,'-make_filter_fun/4-lc$^4/1-3-',2}]

Not reproducible yet, but compacting while there are active _changes listeners seems to trigger it. After the exception the _changes listeners are disconnected; they then reconnect and everything goes back to normal. beam itself holds up, though last night it terminated with no logs, nothing. Just poof. Any ideas?

Thanks, K.
---
http://blitz.io
http://twitter.com/pcapr

--
Filipe David Manana, fdman...@gmail.com, fdman...@apache.org
"Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men."
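Filipe's diagnosis above (the by-sequence btree fold keeps reading from the pre-compaction file, whose couch_file process dies when compaction swaps files, hence the noproc exit) can be modeled with a small sketch. This is a hypothetical Python model, not CouchDB code: the reader's handle goes stale when compaction bumps the file generation, and a reopen-and-retry loop recovers, mirroring how the disconnected _changes listeners reconnect and continue normally.

```python
# Hypothetical model of the race: a long fold reads through a handle
# opened before compaction; when compaction swaps the file, reads
# through the stale handle fail and the reader must reopen and retry.

class FileSwapped(Exception):
    """Raised when the underlying database file was replaced mid-fold."""

class Db:
    def __init__(self):
        self.generation = 0  # bumped each time compaction swaps the file

    def compact(self):
        self.generation += 1

    def pread(self, opened_at: int) -> str:
        # A read through a handle from an older generation fails, like a
        # gen_server call to a couch_file process that no longer exists.
        if opened_at != self.generation:
            raise FileSwapped("file process died (noproc)")
        return "doc"

def read_with_retry(db: Db, opened_at: int, retries: int = 1) -> str:
    for _ in range(retries + 1):
        try:
            return db.pread(opened_at)
        except FileSwapped:
            opened_at = db.generation  # reopen the current file and retry
    raise RuntimeError("still failing after retries")
```

The fix hinted at in the thread is the server-side equivalent: check during the fold whether the file changed, rather than relying on clients to reconnect.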
[jira] Created: (COUCHDB-1093) Exceptions related to _changes + compact
Exceptions related to _changes + compact
Key: COUCHDB-1093
URL: https://issues.apache.org/jira/browse/COUCHDB-1093
Project: CouchDB
Issue Type: Bug
Components: Database Core
Affects Versions: 1.0.2
Environment: I don't believe this is OS- and/or hardware-related, but I'm running on a Red Hat 32-bit Linux kernel
Reporter: kowsik

From the last thread on the dev mailing list:

On Fri, Mar 18, 2011 at 10:54 AM, Filipe David Manana fdman...@apache.org wrote:

Ah, I think the issue is that while we are folding the by-sequence btree, we are not checking whether the database file changed. So if compaction finishes before the btree fold finishes, we reach that error. I can't see right now any other situation, involving _changes, that might cause that issue.

On Fri, Mar 18, 2011 at 5:40 PM, kowsik kow...@gmail.com wrote:

Been seeing this on our production CouchDBs (1.0.2) sporadically. We are using the _changes feed, background view indexing and automatic compaction.

Uncaught error in HTTP request: {exit,
    {noproc,
        {gen_server,call,
            [<0.1478.0>, {pread_iolist,290916}, infinity]}}}

Stacktrace: [{gen_server,call,3},
    {couch_file,pread_iolist,2},
    {couch_file,pread_binary,2},
    {couch_file,pread_term,2},
    {couch_db,make_doc,5},
    {couch_db,open_doc_int,3},
    {couch_db,open_doc,3},
    {couch_changes,'-make_filter_fun/4-lc$^4/1-3-',2}]

Not reproducible yet, but compacting while there are active _changes listeners seems to trigger it. After the exception the _changes listeners are disconnected; they then reconnect and everything goes back to normal. beam itself holds up, though last night it terminated with no logs, nothing. Just poof. Any ideas?

Thanks, K.
---
http://blitz.io
http://twitter.com/pcapr

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (COUCHDB-867) Add http handlers for root files with special meanings, such as crossdomain.xml.
[ https://issues.apache.org/jira/browse/COUCHDB-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13008653#comment-13008653 ] edward de jong commented on COUCHDB-867:

I am trying to get Flash to connect to my database, and desperately need this crossdomain.xml file to be served up. Can somebody please tell me how to create the local.ini file, and where that file is supposed to go? I don't see a local.ini file on my machine...

Add http handlers for root files with special meanings, such as crossdomain.xml.
Key: COUCHDB-867
URL: https://issues.apache.org/jira/browse/COUCHDB-867
Project: CouchDB
Issue Type: Improvement
Components: HTTP Interface
Affects Versions: 1.0.1
Reporter: Eric Desgranges
Attachments: handle_file_req.diff

Some files at the root level of a website have a special meaning, such as favicon.ico storing the favorite icon, which is processed correctly in the [httpd_global_handlers] section of the ini file with this instruction:

favicon.ico = {couch_httpd_misc_handlers, handle_favicon_req, "../share/couchdb/www"}

But this is the only one handled, while other files, which are critical when accessing the CouchDB server from Flash, Flex, Silverlight..., are missing:
- crossdomain.xml (this one should be a top-priority fix!)
- clientaccesspolicy.xml - see http://msdn.microsoft.com/en-us/library/cc838250%28v=VS.95%29.aspx#crossdomain_communication

And there's also robots.txt, to prevent search engines from accessing some files / directories.
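For the question above, a sketch of what local.ini could look like, with hedges: CouchDB reads default.ini and then local.ini from its configuration directory (commonly /etc/couchdb/ or /usr/local/etc/couchdb/, next to default.ini); if local.ini does not exist you can create it there. The favicon.ico line is the one quoted in this issue; the crossdomain.xml line is speculative and only works if something like the attached handle_file_req.diff is applied, since stock CouchDB 1.0.x has no such handler.

```ini
; local.ini - goes in CouchDB's config directory, next to default.ini
; (commonly /etc/couchdb/ or /usr/local/etc/couchdb/).
[httpd_global_handlers]
; This handler exists in stock CouchDB (quoted in this issue):
favicon.ico = {couch_httpd_misc_handlers, handle_favicon_req, "../share/couchdb/www"}
; Hypothetical - requires a file handler such as the attached patch:
; crossdomain.xml = {couch_httpd_misc_handlers, handle_file_req, "../share/couchdb/www"}
```

Settings in local.ini override default.ini and survive upgrades, which is why site-specific handler lines belong there.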