Re: Using Google code review?
Consider Crucible? It integrates with Jira and is free for OSS projects: http://www.atlassian.com/software/crucible/pricing.jsp ("Crucible is free for use by official non-profit organisations, charities and open source projects.")

On Wed, Jan 13, 2010 at 12:53 PM, Noah Slater nsla...@apache.org wrote: Hey, do you think there's any way to integrate Google code review with JIRA? Here's an example I just plucked from the front page: http://codereview.appspot.com/186119/show Thoughts? Noah
[jira] Commented: (COUCHDB-620) Generating views is extremely slow - makes CouchDB hard to use with non-trivial number of docs
[ https://issues.apache.org/jira/browse/COUCHDB-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799769#action_12799769 ] Paul Joseph Davis commented on COUCHDB-620: ---

The error reporting issue is that if you've got four docs in the pipeline and the process dies, then it's hard to tell which document caused the error. And generally retrying will just cause another error.

Generating views is extremely slow - makes CouchDB hard to use with non-trivial number of docs -- Key: COUCHDB-620 URL: https://issues.apache.org/jira/browse/COUCHDB-620 Project: CouchDB Issue Type: Improvement Components: Infrastructure Affects Versions: 0.10 Environment: Ubuntu 9.10 64 bit, CouchDB 0.10 Reporter: Roger Binns Assignee: Damien Katz Attachments: pipelining.jpg

Generating views is extremely slow. For example, adding 10 million documents takes less than 10 minutes, but generating some simple views on the same docs takes over 4 hours. Using top you can see that CouchDB (erlang) and couchjs between them cannot even saturate a single CPU, let alone the I/O system. Under ideal conditions performance should be limited by CPU, disk or memory. This implies that the processes are doing simple things in lockstep, accumulating latencies in each process as well as in the communication between them, which when multiplied by the number of documents can amount to a lot. Some suggestions:

* Run as many couchjs instances as there are processor cores and scatter work amongst them
* Have some sort of pipelining on the erlang side so that the moment the first byte of response is received from couchjs, the data is sent for the next request (the JSON conversion, HTTP headers etc. should all have been assembled already) to reduce latencies. Do whatever is most similar in couchjs (eg use separate threads to read requests, process them and write responses).
* Use the equivalent of HTTP pipelining when talking to couchjs so that it always has a doc ready to work on rather than having to transmit an entire response and then wait for erlang to think and provide an entire new request

A simple test of success is to have a database with a million or so documents with a trivial view and have view creation max out the CPU, memory or disk.

Some things in CouchDB make this a particularly nasty problem. View data is not replicated, so replicating documents can lead the view data by a large margin on the recipient database. This can lead to inconsistencies. You also can't expect users to then wait minutes (or hours) for a request to complete because the view generation got that far behind. (My own plan now is to not use replication and instead create the database file on another CouchDB instance and then rsync the binary database file over instead!)

Although stale=ok is available, you still have no idea if the response will be quick or take however long view generation does. (Sure, I could add some sort of timeout and complicate the code, but then what value do I pick? If I have a user waiting I want an answer ASAP, or I have to give them some horrible error message. Taking a long wait and then giving a timeout is even worse!) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
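A minimal sketch of suggestions 1 and 3 above: keep several view servers busy by fanning document batches out to a pool instead of a single request/response lockstep. Threads stand in for the separate couchjs OS processes a real deployment would use, and the map function is an illustrative assumption, not CouchDB's actual view-server protocol.

```python
from concurrent.futures import ThreadPoolExecutor

def map_doc(doc):
    # Stand-in for a view's map function; CouchDB would run the
    # user-supplied JavaScript inside couchjs instead.
    return [(doc["_id"], doc["value"])]

def build_view(docs, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps every worker's queue full, so no worker sits
        # idle waiting for a full per-document round trip; results come
        # back in input order, which a view index build relies on.
        results = pool.map(map_doc, docs)
        return [row for rows in results for row in rows]

docs = [{"_id": str(i), "value": i * 2} for i in range(100)]
view_rows = build_view(docs)
```

The key property is that the producer never blocks on a single outstanding request, which is exactly the lockstep latency the suggestions above are attacking.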
[jira] Commented: (COUCHDB-583) storing attachments in compressed form and serving them in compressed form if accepted by the client
[ https://issues.apache.org/jira/browse/COUCHDB-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799829#action_12799829 ] Paul Joseph Davis commented on COUCHDB-583: ---

Just some quick thoughts reading through the diff:

I'm not a fan of the file containing a list of compressible types. There are too many types, and that will just make the configuration hard. Not to mention that exposing an entirely new API endpoint to work with those types is also needlessly complex. I'd prefer to see an automatic test that tries to compress the first 4K or so of an attachment and uses a heuristic to determine whether it compressed enough to justify compressing the entire attachment. If that's not doable, the compressible-type system should be integrated into the current configuration mechanism.

For testing from Firefox it might be best to expose an "attachment is stored in compressed form" attribute in the _attachments member.

Passing around the Y and N binaries as a flag for an attachment being compressed is un-erlangy. true and false atoms would be better.

Test code does not belong in couch_httpd.erl. Is there something I'm missing on why we need to leak couch_util:gzip* functions into couch_httpd_db.erl instead of putting all of that logic into couch_stream.erl?

Is there nothing in mochiweb to handle Accept-Encoding parsing?

Instead of naming tests test1 - test17 with comments above each test, just use descriptive test names. It might help to group related tests as well so that tests are easier to find. Data in the etap tests shouldn't be stored inline when it's that big. Create data files and use the test helpers to reference the filenames and read from disk.
storing attachments in compressed form and serving them in compressed form if accepted by the client Key: COUCHDB-583 URL: https://issues.apache.org/jira/browse/COUCHDB-583 Project: CouchDB Issue Type: New Feature Components: Database Core, HTTP Interface Environment: CouchDB trunk Reporter: Filipe Manana Attachments: couchdb-583-trunk-3rd-try.patch, couchdb-583-trunk-4th-try-trunk.patch, couchdb-583-trunk-5th-try.patch, couchdb-583-trunk-6th-try.patch, couchdb-583-trunk-7th-try.patch, couchdb-583-trunk-8th-try.patch, couchdb-583-trunk-9th-try.patch, jira-couchdb-583-1st-try-trunk.patch, jira-couchdb-583-2nd-try-trunk.patch

This feature allows Couch to gzip-compress attachments as they are being received and store them in compressed form. When a client asks to download an attachment (e.g. GET somedb/somedoc/attachment.txt), the attachment is sent in compressed form if the client's HTTP request has gzip specified as a valid transfer encoding for the response (using the HTTP header Accept-Encoding). Otherwise Couch decompresses the attachment before sending it back to the client. Attachments are compressed only if their MIME type matches one of those listed in a separate config file. The compression level is also configurable in the default.ini file.

This follows Damien's suggestion from 30 November: "Perhaps we need a separate user editable ini file to specify compressable or non-compressable files (would probably be too big for the regular ini file). What do other web servers do? Also, a potential optimization is to compress the file while writing to disk, and serve the compressed bytes directly to clients that can handle it, and decompressed for those that can't. For compressable types, it's a win for both disk IO for reads and writes, and CPU on read."

Patch attached.
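The "compress the first 4K and decide from a heuristic" idea from the review above can be sketched in a few lines. The 4096-byte sample size and the 0.9 ratio threshold are illustrative assumptions, not values from the patch or from CouchDB.

```python
import os
import zlib

SAMPLE_SIZE = 4096
RATIO_THRESHOLD = 0.9  # keep compression only if the sample shrinks by >10%

def worth_compressing(data: bytes) -> bool:
    # Compress only a prefix of the attachment; if that prefix barely
    # shrinks, the rest (same kind of content) probably won't either.
    sample = data[:SAMPLE_SIZE]
    if not sample:
        return False
    compressed = zlib.compress(sample, 6)
    return len(compressed) / len(sample) < RATIO_THRESHOLD

compressible = worth_compressing(b"abc 123, " * 1000)   # repetitive text
incompressible = worth_compressing(os.urandom(8192))    # random bytes
```

This avoids any MIME-type list or extra API endpoint: already-compressed formats (JPEG, ZIP) look like random bytes and fail the test, while text-like payloads pass it.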
[jira] Created: (COUCHDB-623) File format for views is space and time inefficient - use a better one
File format for views is space and time inefficient - use a better one -- Key: COUCHDB-623 URL: https://issues.apache.org/jira/browse/COUCHDB-623 Project: CouchDB Issue Type: Improvement Components: Database Core Affects Versions: 0.10 Reporter: Roger Binns

This was discussed on the dev mailing list over the last few days and is noted here so it isn't forgotten. The main database file format is optimised for data integrity - not losing or mangling documents - and rightly so. That same append-only format is also used for views, where it is a poor fit. The more random the ordering of data supplied, the larger the btree. The larger the keys (in bytes), the larger the btree. As an example, my 2GB of raw JSON data turns into a 3.9GB CouchDB database but a 27GB view file (before compacting to 900MB). Since views are not replicated, this requires a disproportionate amount of disk space on each receiving server (not to mention I/O load).

The format also affects view generation performance. By loading my documents into CouchDB ordered by the most emitted value in views I was able to reduce load time from 75 minutes to 40 minutes, with the view file size being 15GB instead of 27GB, but still very distant from the 900MB post compaction.

Views are a performance enhancement. They save you from having to visit every document when doing some queries. The data within a view is generated, and hence the only consequence of losing view data is a performance one; the view can be regenerated anyway. Consequently the file format should be one that is optimised for performance and size. The only integrity feature needed is the ability to tell that the view is potentially corrupt (eg the power failed while it was being generated/updated).
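A rough cost model can show why random-order input bloats an append-only btree file: every modified leaf forces its whole leaf-to-root path to be re-appended, and sorted input lets many consecutive inserts share one rewritten path while random input rarely does. The fanout, node size, and batching factors below are illustrative assumptions, not CouchDB's real parameters.

```python
import math

FANOUT = 100
NODE_BYTES = 4096

def appended_bytes(n_keys, inserts_per_path):
    # Approximate tree height for n_keys with the given fanout.
    height = max(1, round(math.log(n_keys) / math.log(FANOUT)))
    # Each "path write" appends one node per level of the tree.
    paths = math.ceil(n_keys / inserts_per_path)
    return paths * height * NODE_BYTES

# Random order: nearly every insert lands in a different leaf,
# so nearly every insert appends a fresh path.
random_cost = appended_bytes(1_000_000, inserts_per_path=1)

# Sorted order: a leaf fills completely before the cursor moves on,
# so up to FANOUT inserts share one appended path.
sorted_cost = appended_bytes(1_000_000, inserts_per_path=FANOUT)
```

Under these assumptions the random case appends about 12GB against roughly 120MB for the sorted case, the same order-of-magnitude gap as the 27GB vs 900MB-after-compaction figures reported above.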
Re: openid 1.1 authentication handler
Hi. As suggested by chris on the user list, it could be interesting to integrate the openid handler in the couch. I guess before going any further there should be some discussion here and I'll probably need some suggestions about code guidelines and prepping the makefiles. I should also mention that the goal is to have the couch work both as openid client and endpoint. cheers On Wed, Jan 13, 2010 at 6:03 PM, Chris Anderson jch...@apache.org wrote: On Wed, Jan 13, 2010 at 9:20 AM, Matteo Caprari matteo.capr...@gmail.com wrote: Hi. I've released an authentication handler that adds support for authenticating with openid 1.1. It works but needs to be stressed a bit. Source and readme: http://github.com/mcaprari/couchdb-openid blogged (copied the readme): http://caprazzi.net/posts/openid-authentication-handler-for-couchdb/ This looks really cool. If you want to work on getting it into CouchDB (might require some cleanup) you should bring it up on the dev list, and put the patch into Jira: http://issues.apache.org/jira/browse/COUCHDB Chris cheers -- :Matteo Caprari matteo.capr...@gmail.com -- Chris Anderson http://jchrisa.net http://couch.io -- :Matteo Caprari matteo.capr...@gmail.com
[jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one
[ https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799864#action_12799864 ] Chris Anderson commented on COUCHDB-623:

It's worth noting that another advantage to using the storage btrees is the MVCC guarantees. This means that a slow client can take its sweet time to traverse the view index and is not affected by ongoing writes or deletes. This is crucial for the consistency guarantees views make. It is not very hard to create alternate view index systems (like CouchDB-Lounge), and the overhead of running as an external is negligible. One fine way to prototype a view system that optimizes for different things would be as an external.

File format for views is space and time inefficient - use a better one -- Key: COUCHDB-623 URL: https://issues.apache.org/jira/browse/COUCHDB-623 (description quoted above)
[jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one
[ https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799878#action_12799878 ] Roger Binns commented on COUCHDB-623: -

What are the consistency guarantees that views make? I can't find any documentation about them anywhere! (There is plenty about the main db, but nothing about views.) I can't see any that you can make, as the view data is derived from the documents and the documents can be changed at any point. For example, while the first row of a view is being returned, the corresponding document could have been deleted. The slow client example can also lead to inconsistent data - for example, it may update a document on one connection and then access the view on a second connection and, due to timing, end up with the view not including that document.

The only consistency guarantee I can see is that if you do not add/change/delete the documents for the period shortly before and then during view retrieval, until the view is completely retrieved, then the view will reflect the documents correctly at that time. If there is any form of concurrency between the documents and the views then there cannot be guarantees unless CouchDB introduced a transaction system.

I do see how the append-only btree/MVCC format makes the view retrieval code easier to write, but users of CouchDB do not care how hard the code is to write :-)

File format for views is space and time inefficient - use a better one -- Key: COUCHDB-623 URL: https://issues.apache.org/jira/browse/COUCHDB-623 (description quoted above)
[jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one
[ https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799891#action_12799891 ] Paul Joseph Davis commented on COUCHDB-623: ---

The consistency guarantee refers to the file format: it provides the same on-disk consistency guarantees as the main database file (ie, tail-append MVCC style). It's not a reference to figuring out the sync between the main db and the view. As you point out, doing things like querying with stale=ok can give you a view result that does not reflect the most recent changes to the database, or reflects changes from other clients, etc.

File format for views is space and time inefficient - use a better one -- Key: COUCHDB-623 URL: https://issues.apache.org/jira/browse/COUCHDB-623 (description quoted above)
[jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one
[ https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799896#action_12799896 ] Adam Kocoloski commented on COUCHDB-623:

I believe by consistency guarantees Chris meant that a view request uses a single snapshot of the view index for the entire response. Even if documents are changed in the interim, and even if someone else has triggered a view update, your response will still accurately reflect the state of the DB at a single moment in time.

File format for views is space and time inefficient - use a better one -- Key: COUCHDB-623 URL: https://issues.apache.org/jira/browse/COUCHDB-623 (description quoted above)
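The snapshot behaviour Adam describes can be illustrated with a tiny append-only store: a reader pins the root it saw at request time, and later writes only append new roots, so the reader's traversal never changes underneath it. This is a hypothetical structure for illustration, not CouchDB's actual file layout.

```python
class AppendOnlyStore:
    def __init__(self):
        self.roots = [{}]  # each write appends a new immutable root

    def write(self, key, value):
        # Copy-on-write: never mutate an existing root in place.
        new_root = dict(self.roots[-1])
        new_root[key] = value
        self.roots.append(new_root)

    def snapshot(self):
        # A reader pins this root for its entire (possibly slow) scan.
        return self.roots[-1]

store = AppendOnlyStore()
store.write("a", 1)
snap = store.snapshot()   # slow client starts reading here
store.write("a", 2)       # concurrent update
store.write("b", 3)       # concurrent insert
```

After the two concurrent writes, `snap` still reads {"a": 1}: the response reflects the state of the store at a single moment, which is the consistency guarantee under discussion.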
Re: [jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one
On Wed, Jan 13, 2010 at 11:34 AM, Adam Kocoloski (JIRA) j...@apache.org wrote: I believe by consistency guarantees Chris meant that a view request uses a single snapshot of the view index for the entire response. Even if documents are changed in the interim, and even if someone else has triggered a view update, your response will still accurately reflect the state of the DB at a single moment in time.

Thanks Adam, that's exactly what I'm talking about.

-- Chris Anderson http://jchrisa.net http://couch.io
[jira] Closed: (COUCHDB-623) File format for views is space and time inefficient - use a better one
[ https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Damien Katz closed COUCHDB-623. --- Resolution: Invalid Assignee: Damien Katz

Closing as Invalid: this has no objective criteria for being resolved.

File format for views is space and time inefficient - use a better one -- Key: COUCHDB-623 URL: https://issues.apache.org/jira/browse/COUCHDB-623 (description quoted above)
Re: [jira] Created: (COUCHDB-620) Generating views is extremely slow - makes CouchDB hard to use with non-trivial number of docs
Let's have this discussion on the dev mailing list. That's what it's for. -Damien

On Jan 10, 2010, at 9:27 PM, Roger Binns (JIRA) wrote: Generating views is extremely slow - makes CouchDB hard to use with non-trivial number of docs -- Key: COUCHDB-620 URL: https://issues.apache.org/jira/browse/COUCHDB-620 (description quoted above)
[jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one
[ https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12799916#action_12799916 ] Roger Binns commented on COUCHDB-623: - Not again Damien :-) Simple criteria - the size of the view file should be proportionate to the data in a view on initial generation. If you want raw numbers, the view file should be no larger than double the sum of JSON encoded key, value and _id for each row. The current multiplier is 15 to 27 times as much which is ludicrous. Even post compactation the file is a little on the large side. And because the view results are not replicated, the overhead has to be incurred on every machine that replication happens to. Or put another way, if people are planning on deploying CouchDB how much space would you advise them to provision? When I started, the answer for 10million documents/2.5GB of raw JSON is 72GB: 23GB for DB, another 21GB for the compacted version, 27+GB for view file, another 1+GB for compacted view file By shortening ids to 4 bytes instead of 16 we get: 4GB for DB, another 4GB for compacted, 27GB for view file, another 1GB for compacted view file By being able to sort my documents to be ordered by the most commonly emitted view key: 4GB for DB, another 4GB for compacted, 15GB for view file, another 1GB for compacted view file Since the view/DB coexists at the same time as the compaction you need space for both simultaneously. 10 million documents/2GB of data is not something that makes any existing database system sweat. File format for views is space and time inefficient - use a better one -- Key: COUCHDB-623 URL: https://issues.apache.org/jira/browse/COUCHDB-623 Project: CouchDB Issue Type: Improvement Components: Database Core Affects Versions: 0.10 Reporter: Roger Binns Assignee: Damien Katz This was discussed on the dev mailing list over the last few days and noted here so it isn't forgotten. 
The main database file format is optimised for data integrity - not losing or mangling documents - and rightly so. That same append-only format is also used for views, where it is a poor fit. The more random the ordering of data supplied, the larger the btree. The larger the keys (in bytes), the larger the btree. As an example, my 2GB of raw JSON data turns into a 3.9GB CouchDB database but a 27GB view file (before compacting to 900MB). Since views are not replicated, this requires a disproportionate amount of disk space on each receiving server (not to mention I/O load). The format also affects view generation performance. By loading my documents into CouchDB ordered by the most emitted value in views, I was able to reduce load time from 75 minutes to 40 minutes, with the view file size being 15GB instead of 27GB, but still very distant from the 900MB post compaction. Views are a performance enhancement. They save you from having to visit every document when doing some queries. The data within a view is generated, and hence the only consequence of losing view data is a performance one; the view can be regenerated anyway. Consequently the file format should be one that is optimised for performance and size. The only integrity feature needed is the ability to tell that the view is potentially corrupt (eg the power failed while it was being generated/updated). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
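Roger's size complaint comes down to the copy-on-write behaviour of an append-only btree: every insert rewrites a root-to-leaf path, so random-order input produces a file far larger than the live data it indexes. A toy cost model sketches the shape of the problem; all constants here (fanout, node size, row size) are invented for illustration and are not CouchDB's actual on-disk layout:

```python
def appendonly_file_size(n_rows, row_bytes=100, fanout=100, node_bytes=4096):
    """Return (file_size, live_size) in bytes after n_rows random-order inserts
    into an append-only (copy-on-write) btree, under illustrative assumptions."""
    # Tree height: smallest h with fanout**h >= n_rows (integer math, no float log).
    height, capacity = 1, fanout
    while capacity < n_rows:
        capacity *= fanout
        height += 1
    file_size = n_rows * height * node_bytes   # one rewritten root-to-leaf path per insert
    live_size = n_rows * row_bytes             # roughly what compaction would keep
    return file_size, live_size

file_size, live_size = appendonly_file_size(1_000_000)
print(f"file ~{file_size / 2**30:.1f} GiB vs live ~{live_size / 2**30:.2f} GiB")
```

Even this crude model reproduces the pattern in the ticket: the file grows with inserts times path size, while the post-compaction size tracks only the live rows.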
[jira] Commented: (COUCHDB-583) storing attachments in compressed form and serving them in compressed form if accepted by the client
[ https://issues.apache.org/jira/browse/COUCHDB-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12799922#action_12799922 ] Filipe Manana commented on COUCHDB-583: --- Hi Paul, thanks for your feedback. Passing around the Y and N binaries as a flag for an attachment being compressed is un-erlangy. true and false atoms would be better. Well, this was mostly because I read somewhere in Armstrong's book that binaries are preferred (more efficient) for IO operations (network, disk storage). But I agree, using true / false atoms is more readable. Is there nothing in mochiweb to handle accept-encoding parsing? I don't think so, at least in the mochiweb included with couch. It's probably better to move these accept-encoding parsing functions, and the respective test functions, into the mochiweb sources. I'll get back to work and enhance the patch following your remarks. cheers storing attachments in compressed form and serving them in compressed form if accepted by the client Key: COUCHDB-583 URL: https://issues.apache.org/jira/browse/COUCHDB-583 Project: CouchDB Issue Type: New Feature Components: Database Core, HTTP Interface Environment: CouchDB trunk Reporter: Filipe Manana Attachments: couchdb-583-trunk-3rd-try.patch, couchdb-583-trunk-4th-try-trunk.patch, couchdb-583-trunk-5th-try.patch, couchdb-583-trunk-6th-try.patch, couchdb-583-trunk-7th-try.patch, couchdb-583-trunk-8th-try.patch, couchdb-583-trunk-9th-try.patch, jira-couchdb-583-1st-try-trunk.patch, jira-couchdb-583-2nd-try-trunk.patch This feature allows Couch to gzip compress attachments as they are being received and store them in compressed form. When a client asks to download an attachment (e.g. GET somedb/somedoc/attachment.txt), the attachment is sent in compressed form if the client's http request has gzip specified as a valid transfer encoding for the response (using the http header Accept-Encoding).
Otherwise couch decompresses the attachment before sending it back to the client. Attachments are compressed only if their MIME type matches one of those listed in a separate config file. Compression level is also configurable in the default.ini file. This follows Damien's suggestion from 30 November: Perhaps we need a separate user editable ini file to specify compressible or non-compressible files (would probably be too big for the regular ini file). What do other web servers do? Also, a potential optimization is to compress the file while writing to disk, and serve the compressed bytes directly to clients that can handle it, and decompressed for those that can't. For compressible types, it's a win for both disk IO for reads and writes, and CPU on read. Patch attached.
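The described behaviour (compress once at write time, pass the stored bytes straight through when the client advertises gzip, decompress for clients that don't) can be sketched as follows. The names `store_attachment` and `fetch_attachment`, and the dict-backed store, are hypothetical stand-ins, not CouchDB's API:

```python
import gzip

_store = {}  # stands in for the database file

def store_attachment(name, data: bytes):
    _store[name] = gzip.compress(data)        # compress once, at write time

def fetch_attachment(name, accept_encoding: str):
    """Return (body, content_encoding) for the stored attachment."""
    blob = _store[name]
    # Crude Accept-Encoding parse: split on commas, drop any ;q= parameters.
    encodings = {e.split(";")[0].strip().lower() for e in accept_encoding.split(",")}
    if "gzip" in encodings:
        return blob, "gzip"                   # serve stored bytes as-is: no CPU spent
    return gzip.decompress(blob), "identity"  # fall back for older clients

store_attachment("attachment.txt", b"hello " * 1000)
body, enc = fetch_attachment("attachment.txt", "gzip, deflate")
print(enc, len(body))
```

The win Damien describes falls out of the first branch: for gzip-capable clients, reads touch fewer disk bytes and skip compression entirely.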
[jira] Commented: (COUCHDB-583) storing attachments in compressed form and serving them in compressed form if accepted by the client
[ https://issues.apache.org/jira/browse/COUCHDB-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12799937#action_12799937 ] Damien Katz commented on COUCHDB-583: - I haven't looked at the patch, but I agree with most of Paul's comments, except for figuring out when to compress files. Lots of compressed files might have uncompressed headers in the file, leading to unnecessary compression. MP3s with id3v2 tags immediately come to mind.
[jira] Commented: (COUCHDB-583) storing attachments in compressed form and serving them in compressed form if accepted by the client
[ https://issues.apache.org/jira/browse/COUCHDB-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12799947#action_12799947 ] Filipe Manana commented on COUCHDB-583: --- Hum, let's open a vote :) 1) use a heuristic, as suggested by Paul 2) or a file listing the mime types worth compressing 3) some other alternative? cheers
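Option 1, Paul's heuristic, could look something like this: trial-compress a small prefix and only store the attachment compressed if the sample actually shrinks. The sample size and threshold below are made-up numbers, not anything agreed in the thread:

```python
import zlib

def worth_compressing(data: bytes, sample=4096, threshold=0.9):
    """Heuristic: compress the first `sample` bytes and keep compression
    only if the result is smaller than `threshold` times the sample."""
    head = data[:sample]
    if not head:
        return False
    return len(zlib.compress(head)) < threshold * len(head)
```

Damien's objection above is precisely the weak point of this sketch: a file whose first few KiB are uncompressed metadata (an MP3 with a large id3v2 tag, say) looks compressible even though the bulk of it is not.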
[jira] Commented: (COUCHDB-583) storing attachments in compressed form and serving them in compressed form if accepted by the client
[ https://issues.apache.org/jira/browse/COUCHDB-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12799948#action_12799948 ] Paul Joseph Davis commented on COUCHDB-583: --- Hrm, 4KiB of headers even? That is a good point though. But I'd still be quite hesitant to make it a whitelist of content types to compress. Unless maybe we allowed text/* or similar. Or perhaps it should be a blacklist that could do the * match?
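A blacklist with `*` patterns, as Paul floats above, is cheap to implement; Python's `fnmatch` gives the glob semantics for free. The pattern list here is only an example, not a proposed default:

```python
from fnmatch import fnmatch

# Example blacklist: media and archive types that are already compressed.
BLACKLIST = ["image/*", "video/*", "audio/*", "application/zip", "application/x-gzip"]

def compressible(mime_type: str) -> bool:
    """True if the MIME type is not matched by any blacklist glob."""
    mt = mime_type.split(";")[0].strip().lower()   # drop parameters like charset=
    return not any(fnmatch(mt, pattern) for pattern in BLACKLIST)
```

A blacklist fails open (unknown types get compressed), which suits text-heavy workloads; the whitelist Paul is hesitant about would fail closed instead.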
[jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one
[ https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12799965#action_12799965 ] Roger Binns commented on COUCHDB-623: - The view consistency stuff is a red herring. If you are not making changes to the DB then any file format will work and give consistent results. If you are making changes to the docs then no scheme short of transactions/locking will ensure that the view is consistent with the documents. It will always be possible for documents to be referenced by the view that are not in the DB and for documents to be in the DB that are not in the view. I see no point in trying to even make the view consistent with a point in time while DB changes are happening since it gives no performance efficiency nor any space efficiency - in fact the extreme opposites. The point of views is to give me information fast that I could only otherwise obtain by visiting all the documents. That is what they should be optimized for.
Re: [jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one
On Wed, Jan 13, 2010 at 2:11 PM, Roger Binns (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12799965#action_12799965 ] Roger Binns commented on COUCHDB-623: - The view consistency stuff is a red herring. If you are not making changes to the DB then any file format will work and give consistent results. If you are making changes to the docs then no scheme short of transactions/locking will ensure that the view is consistent with the documents. It will always be possible for documents to be referenced by the view that are not in the DB and for documents to be in the DB that are not in the view. I see no point in trying to even make the view consistent with a point in time while DB changes are happening since it gives no performance efficiency nor any space efficiency - in fact the extreme opposites. The point of views is to give me information fast that I could only otherwise obtain by visiting all the documents. That is what they should be optimized for. The current views are optimized for your red herring. Where it actually matters is the ability to give transactional information about things like bank account balances. See http://books.couchdb.org/relax/reference/recipes for Banking. Without MVCC views, there's no way to query accurately at all when inserts are underway (short of blocking reads during writes). If you need something with less consistency, you are encouraged to wrap your own indexing system around couchdb's map reduce runtime, or even build your own runtime. Has anyone used Hadoop as an external yet?
Chris -- Chris Anderson http://jchrisa.net http://couch.io
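The banking recipe Chris references works because each transfer is a single document whose map rows appear (or disappear) atomically, so a reduced balance always reflects whole transfers. A plain-Python sketch of that map/reduce shape (not CouchDB's actual view engine; the document fields are illustrative):

```python
from collections import defaultdict

transfers = [
    {"_id": "t1", "from": "alice", "to": "bob", "amount": 50},
    {"_id": "t2", "from": "bob", "to": "carol", "amount": 20},
]

def map_fn(doc):
    # One transfer document emits a debit row and a credit row together.
    yield doc["from"], -doc["amount"]
    yield doc["to"], doc["amount"]

def reduce_fn(rows):
    # Sum the deltas per account; in CouchDB this would be a grouped reduce.
    balances = defaultdict(int)
    for account, delta in rows:
        balances[account] += delta
    return dict(balances)

print(reduce_fn(row for doc in transfers for row in map_fn(doc)))
```

The consistency argument hinges on the two yields per document being indexed as a unit: a reader can never see the debit half of a transfer without its credit half.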
Re: [jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one
Chris Anderson wrote: see: http://books.couchdb.org/relax/reference/recipes for Banking without MVCC views, there's no way to query accurately at all when inserts are underway (short of blocking reads during writes). I am afraid I do not understand what you are saying. Sure, the scheme listed in the book makes sense, but only if a transaction maps exactly to one document (which I guess is the point). Even then I still don't see the relevance. Things would only break down if the view returned partial information (eg if a single document caused two view rows to be emitted but only one of those was returned). BTW views do not return the update_seq, so as an end user you still do not know how up to date it is. The file format does not need to protect each view row, but does need to do so for the main database where the unit is a document. For example the view file format could use an atomic unit of 10,000 documents' view output or some number of megabytes. That unit can still be regenerated if something bad happened (a rare circumstance such as untimely power failure). If you need something with less consistency, you are encouraged to wrap your own indexing system around couchdb's map reduce runtime, or even build your own runtime. I am becoming very tempted to just dump CouchDB for SQLite with a trivial REST front end, since it appears that CouchDB is just not capable of handling 10 million documents/2GB of data in anything resembling a sensible amount of disk space or compute time for the foreseeable future. Roger
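Roger's "atomic unit" idea needs only corruption detection, not per-row durability. A sketch of what such a batch format could look like, with invented framing (a length and CRC32 prefix per batch, so a torn write is detectable and only that batch needs regenerating):

```python
import json
import struct
import zlib

def write_batches(fh, rows, batch_size=10_000):
    """Write view rows in batches, each framed as: 4-byte length, 4-byte CRC32, payload."""
    for i in range(0, len(rows), batch_size):
        payload = json.dumps(rows[i:i + batch_size]).encode()
        fh.write(struct.pack(">II", len(payload), zlib.crc32(payload)))
        fh.write(payload)

def read_batches(fh):
    """Yield batches of rows; raise on the first checksum mismatch or torn write."""
    while True:
        header = fh.read(8)
        if len(header) < 8:
            return                      # clean EOF (or a truncated header at the tail)
        size, crc = struct.unpack(">II", header)
        payload = fh.read(size)
        if len(payload) < size or zlib.crc32(payload) != crc:
            raise ValueError("corrupt batch: regenerate from here")
        yield json.loads(payload)
```

Nothing here survives a power failure mid-batch, which is exactly the trade Roger proposes: the reader notices, and the generator rebuilds from the last good unit instead of paying append-only btree overhead on every row.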
Objective criteria
[I wrote a personal query to Damien which he asked me to repeat here.] Both 620 and 623 were closed by Damien because they lacked objective criteria. For both tickets there are criteria I consider objective :-) There are also suggestions on how to address the issues. Consequently my query is: if the criteria are not objective enough, does Damien not agree with them, not care about the underlying issues, think that a 10 million document/2GB raw JSON data set is outside the scope of what CouchDB should cope with, want this stuff in the wiki, etc? Roger
Re: [jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one
On Wed, Jan 13, 2010 at 2:54 PM, Roger Binns rog...@rogerbinns.com wrote: Chris Anderson wrote: see: http://books.couchdb.org/relax/reference/recipes for Banking without MVCC views, there's no way to query accurately at all when inserts are underway (short of blocking reads during writes). I am afraid I do not understand what you are saying. Sure the scheme listed in the book makes sense, but only if a transaction maps exactly to one document (which I guess is the point). Even then I still don't see the relevance. Things would only break down if the view returned partial information (eg if a single document caused two view rows to be emitted but only one of those was returned.) BTW views do not return the update_seq so as an end user you still do not know how up to date it is. If that would help, there are I think people working on an update_seq patch for views. The file format does not need to protect each view row, but does need to do so for the main database where the unit is a document. A reduce giving a balance for a particular account could be affected by documents being inserted anywhere in the db. The current map reduce system guarantees that the balance returned reflects a consistent snapshot of the database, even if other operations are ongoing. (eg a given transfer will appear consistently, even if those same accounts are undergoing concurrent operations for other transfers.) We can't atomically prevent overdrafts, but how many banks do that, anyway? For example the view file format could use an atomic unit of 10,000 documents' view output or some number of megabytes. That unit can still be regenerated if something bad happened (a rare circumstance such as untimely power failure). There are alternate storage systems which either use locking or avoid any consistency guarantees at all.
If you need something with less consistency, you are encouraged to wrap your own indexing system around couchdb's map reduce runtime, or even build your own runtime. I am becoming very tempted to just dump CouchDB for SQLite with a trivial REST front end, since it appears that CouchDB is just not capable of handling 10million documents/2GB of data in anything resembling a sensible amount of disk space or compute time for the foreseeable future. It sounds like Couch does just fine if you run compaction. Perhaps we should recommend view compaction more aggressively. Chris -- Chris Anderson http://jchrisa.net http://couch.io
Re: Objective criteria
Chris Anderson wrote: The ticketing system should be for smaller scope issues, I think. I see it more as a don't forget about this, plus somewhere for others to say this also affects me or here is additional information/angles. Obviously there is a fine line between that kind of thing and a discussion. My big concern is that the issue was hashed out here over a few days, then the thread goes dead and the issue is forgotten. A JIRA report of open issues should be a todo list of bugs to fix and improvements to make. Optimizing the view server is an agreed goal of the community. Maybe in people's heads, but it wasn't written down anywhere such as the tracker or the roadmap. In fact the front page of couchdb.org claims that Erlang allows for the CouchDB design to be scalable, and the overview page makes an efficient claim in the last sentence. The current implementation is neither of these. Probably the best way to help is to take a look at all the work Damien's done in trunk (the pipelining) and perhaps the parallel writers optimization he has. BTW I have been using trunk for over a week. It is better than the 0.10 I was using before, but not that much of an improvement. And changes in the way I generate some of my data have hurt me again (I can either order for _id or for view keys but not both at the same time), so my initial DB has now gone from 4GB to 15GB (I optimized for views). We could really use a way to take the benchmarks you ran, and put them into the buildbot. Sadly I can't do it with my real data because it belongs to someone else. However I hereby commit to produce a representative benchmark that is substantially similar in performance and data within the next two weeks. (Also note that there is nothing special about what I am doing - anyone with similar numbers of documents has similar issues.) I'm hoping that more can be done about the size issues soon too.
(I think that addressing the size issues will help a lot since it will require way less CPU and I/O to produce and use smaller files.) Roger
Re: Objective criteria
I have been putting together some stuff that seems pertinent to this discussion. I'm working on a performance suite that tests a variety of concurrent performance scenarios. I have the client code written but I'm still working on the automated build/test code. Once that is finished I plan to do some GitHub integration and some charting. The idea here is to chart the performance differences between a GitHub branch at a certain commit compared to the performance of the latest release and the latest trunk. If someone has an idea of how they might increase performance, they could point this tool at their GitHub branch and reference the differences in performance between their code and the latest release and trunk. I'll send another email once I have some pretty graphs to show off :) -Mikeal On Wed, Jan 13, 2010 at 5:19 PM, Damien Katz dam...@apache.org wrote: On Jan 13, 2010, at 3:37 PM, Roger Binns wrote: Chris Anderson wrote: The ticketing system should be for smaller scope issues, I think. I see it more as a don't forget about this plus somewhere for others to say this also affects me or here is additional information/angles. Obviously there is a fine line between that kind of thing and a discussion. My big concern is that the issue was hashed out here over a few days then the thread goes dead, and the issue is forgotten. A JIRA report of open issues should be a todo list of bugs to fix and improvements to make. That's fine, the issue is that bugs saying It's too slow is always true for someone. Many people find the view indexing performance just fine, many do not. Since CouchDB makes no performance or size guarantees, you can't call the general performance a bug. Unless you have a specific bug to fix, or enhancement to make, don't use JIRA. Use the dev mailing list to make your case, see if someone will produce a patch or find a measurable bottleneck that can be addressed.
JIRA is not the place for discussions about the design of CouchDB components. Neither is IRC. Also, if you want something without view performance problems but similar to CouchDB, you should look at MongoDB. -Damien
Re: Objective criteria
Damien Katz wrote: That's fine, the issue is that bugs saying It's too slow is always true for someone. I did give specific numbers - ie 10 million documents, 2GB of JSON data etc - and the amount of time taken as well as space. I doubt you'd find anyone who considers 4 hours or 27GB to be reasonable numbers for that :-) Many people find the view indexing performance just fine, many do not. True. For example one of my other projects currently has 100 documents and I have no issue with any part of CouchDB for that. What isn't clear is a statement of reasonable expectations - should I be able to handle 10 million documents in CouchDB? Will it ever handle that? Do you as the project leader care about that? Everything has a sweet spot and I am not asking you to make 10 million documents be encompassed by the sweet spot, but clearly if you never intend for CouchDB to handle that much data then I need to go elsewhere. Since CouchDB makes no performance or size guarantees, How about publishing some? Not guarantees, but rather some expectations. For example, if someone has 1GB of JSON data in 1 million documents, what would be an expectation of size? The bugs can then be about substantial divergences from that. Unless you have specific bug to fix, or enhancement to make, don't use JIRA. The issues you closed listed specific enhancements (pipelining, multiple instances, different file format etc). I do acknowledge that I didn't supply code, but I can't do everything :-) All my personal projects are open source - it isn't like I am trying to take and never give. Also, if you want something without view performance problems but similar to CouchDB, you should look at MongoDB. I did research the alternatives I could find. CouchDB is the only solution that was designed for replication (and hence offline working, occasional disconnection, any topology for replication etc).
CouchDB is also the only one that allows indices/views on data that is computed, rather than just statically extracting a particular value from the docs. (That can be worked around by computing the values and shoving them into the docs, but it is less elegant.) Other than that, MongoDB seemed to be the nicest.

But I really want CouchDB to take over the world. The concepts are right. The replication point of view is right, etc. Not handling millions of documents in a reasonable amount of space and time is not right IMHO, but I still don't know what the project's opinion is.

Roger
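For readers unfamiliar with what "views on computed data" means: a CouchDB view's map function is ordinary JavaScript that can emit any derived key. A minimal sketch (the doc shape and function name are hypothetical, and emit() is stubbed here since the real one is provided by the view server):

```javascript
// emit() is normally supplied by the CouchDB view server; stubbed here so
// the sketch runs standalone.
const rows = [];
function emit(key, value) { rows.push([key, value]); }

// Hypothetical doc shape (not from the thread): a file-metadata document.
function byExtensionAndSize(doc) {
  if (doc.path && typeof doc.bytes === "number") {
    // The index key is computed from the path, not stored in the doc.
    const ext = doc.path.includes(".") ? doc.path.split(".").pop() : "";
    emit([ext, doc.bytes], null);
  }
}

byExtensionAndSize({ path: "/home/roger/notes.txt", bytes: 2048 });
console.log(rows); // one row keyed by ["txt", 2048]
```

In a real design document the function body would live under `views.<name>.map`; the point is that the derived key never has to be duplicated into the document itself.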
Re: Objective criteria
Chris Anderson wrote:
> I know your data is on the large side and CouchDB doesn't auto-cluster

Ah, the clue! I don't think my data is large by any measure (10 million docs, 2GB of JSON). SQLite (note *lite* in the name) doesn't break a sweat with it. It occupies only 40% of a DVD. Picking a random low-end machine from Dell shows that they ship with a minimum of 1GB of RAM and would ideally like you to buy 2GB. Something that fits in the RAM of a $350 Dell is not what I would consider large! The data fits in my machine's RAM four times over. Can you even buy USB sticks or SD cards these days smaller than 2GB? You could fit 15 copies of my data plus an operating system on the smallest SSD drives.

GMail's initial quota, however many years ago, was 1GB. Keith Packard's email is half a million messages and 5GB of data - http://keithp.com/blogs/notmuch/ My machine has 350,000 files and directories (excluding backups, which duplicate many of those several times over). That is a similar order of magnitude to my data set (and several times larger if counting backups). (Note I am just talking about constructing a database of file and directory names, information about them, etc. - not the contents.)

My deployment plans are the opposite of clustering and partitioning, since my data set is so small! I wanted to put a copy of CouchDB on each and every server and have them replicate to each other, rather than using dedicated networked data servers. If this kind of (trivial!) size means clustering, partitioning, etc., then CouchDB is not remotely appropriate for my circumstances, and probably not for people recording file and email databases. I only wish there were documentation somewhere saying what normal sizes are for CouchDB and what to expect at them.
Roger