Re: Using Google code review?

2010-01-13 Thread Robert Newson
Consider Crucible? It integrates with Jira and is free for OSS projects.

http://www.atlassian.com/software/crucible/pricing.jsp

Crucible is free for use by official non-profit organisations,
charities and open source projects.

On Wed, Jan 13, 2010 at 12:53 PM, Noah Slater nsla...@apache.org wrote:
 Hey,

 Do you think there's any way to integrate Google code review with JIRA?

 Here's an example I just plucked from the front page:

        http://codereview.appspot.com/186119/show

 Thoughts?

 Noah



[jira] Commented: (COUCHDB-620) Generating views is extremely slow - makes CouchDB hard to use with non-trivial number of docs

2010-01-13 Thread Paul Joseph Davis (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799769#action_12799769
 ] 

Paul Joseph Davis commented on COUCHDB-620:
---

The error reporting issue is that if you've got four docs in the pipeline and
the process dies, it's hard to tell which document caused the error. And
generally retrying will just cause another error.

 Generating views is extremely slow - makes CouchDB hard to use with 
 non-trivial number of docs
 --

 Key: COUCHDB-620
 URL: https://issues.apache.org/jira/browse/COUCHDB-620
 Project: CouchDB
  Issue Type: Improvement
  Components: Infrastructure
Affects Versions: 0.10
 Environment: Ubuntu 9.10 64 bit, CouchDB 0.10
Reporter: Roger Binns
Assignee: Damien Katz
 Attachments: pipelining.jpg


 Generating views is extremely slow.  For example adding 10 million documents
 takes less than 10 minutes, but generating some simple views on the same docs
 takes over 4 hours.

 Using top you can see that CouchDB (erlang) and couchjs between them cannot
 even saturate a single CPU, let alone the I/O system.  Under ideal conditions
 performance should be limited by CPU, disk or memory.  This implies that the
 processes are doing simple things in lockstep, accumulating latencies in each
 process and in the communication between them, which, multiplied by the
 number of documents, can amount to a lot.
 Some suggestions:
 * Run as many couchjs instances as there are processor cores and scatter work
 amongst them (see the sketch after this list)
 * Have some sort of pipelining in the erlang so that the moment the first 
 byte of response is received from couchjs the data is sent for the next 
 request (the JSON conversion, HTTP headers etc should all have been assembled 
 already) to reduce latencies.  Do whatever is most similar in couchjs (eg use 
 separate threads to read requests, process them and write responses).
 * Use the equivalent of HTTP pipelining when talking to couchjs so that it 
 always has a doc ready to work on rather than having to transmit an entire 
 response and then wait for erlang to think and provide an entire new request
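
 To make the first suggestion concrete, here is a minimal sketch (not
 CouchDB's actual code) of the classic Erlang scatter/gather idiom: spawn a
 worker per item and collect the results in order.

     %% Apply F to every element of L in parallel and return the results
     %% in the original order.  A real dispatcher would cap the number of
     %% workers at the number of couchjs instances rather than spawning
     %% one per item.
     pmap(F, L) ->
         Parent = self(),
         Pids = [spawn_link(fun() -> Parent ! {self(), F(X)} end) || X <- L],
         [receive {Pid, Result} -> Result end || Pid <- Pids].
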
 A simple test of success is to have a database with a million or so documents 
 with a trivial view and have view creation max out the CPU, memory or disk.
 Some things in CouchDB make this a particularly nasty problem.  View data is 
 not replicated so replicating documents can lead the view data by a large 
 margin on the recipient database.  This can lead to inconsistencies.  You 
 also can't expect users to then wait minutes (or hours) for a request to 
 complete because the view generation got that far behind.  (My own plans now 
 are to not use replication and instead create the database file on another 
 couchdb instance and then rsync the binary database file over instead!)
 Although stale=ok is available, you still have no idea if the response will 
 be quick or take however long view generation does.  (Sure I could add some 
 sort of timeout and complicate the code but then what value do I pick?  If I 
 have a user waiting I want an answer ASAP or I have to give them some 
 horrible error message.  Taking a long wait and then giving a timeout is even 
 worse!)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (COUCHDB-583) storing attachments in compressed form and serving them in compressed form if accepted by the client

2010-01-13 Thread Paul Joseph Davis (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799829#action_12799829
 ] 

Paul Joseph Davis commented on COUCHDB-583:
---

Just some quick thoughts reading through the diff:

I'm not a fan of the file containing a list of compressible types. There are
too many types, which will just make that configuration hard to maintain. Not
to mention that exposing an entirely new API endpoint to work with those types
is also needlessly complex.

I'd prefer to see an automatic test that tries to compress the first 4K or so
of an attachment and uses a heuristic to determine whether it compressed enough
to justify compressing the entire attachment. If that's not doable, the
compressible type system should be integrated into the current configuration
mechanism.
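
A minimal sketch of that heuristic, assuming gzip via the stdlib zlib module
and an arbitrary 90% ratio threshold (both the sample size and the threshold
would need tuning):

    %% Compress a sample of the attachment; only gzip the whole thing
    %% if the sample shrank below the threshold.
    maybe_compress(<<>>) ->
        {false, <<>>};
    maybe_compress(Att) when is_binary(Att) ->
        SampleSize = lists:min([4096, byte_size(Att)]),
        <<Sample:SampleSize/binary, _/binary>> = Att,
        Ratio = byte_size(zlib:gzip(Sample)) / SampleSize,
        case Ratio < 0.9 of
            true  -> {true, zlib:gzip(Att)};  % worth it, store compressed
            false -> {false, Att}             % not worth it, store as-is
        end.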

For testing from Firefox it might be best to expose an "attachment is stored in
compressed form" attribute in the _attachments member.

Passing around the "Y" and "N" binaries as a flag for an attachment
being compressed is un-Erlangy. The true and false atoms would be better.
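
For example (the #att record and its field names are assumed here, purely for
illustration):

    -record(att, {data, compressed = false}).

    %% Idiomatic: a boolean atom, not a <<"Y">>/<<"N">> binary.
    is_compressed(#att{compressed = Compressed}) when is_boolean(Compressed) ->
        Compressed.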

Test code does not belong in couch_httpd.erl.

Is there something I'm missing on why we need to leak couch_util:gzip* 
functions into couch_httpd_db.erl instead of putting all of that logic into 
couch_stream.erl?

Is there nothing in mochiweb to handle accept-encoding parsing?
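
If not, the parsing is small enough to sketch with just the stdlib (this
ignores q-values, which a complete implementation would have to honour):

    %% Does an Accept-Encoding value such as "gzip, deflate;q=0.5"
    %% allow gzip? Each element is split off its parameters and trimmed.
    accepts_gzip(undefined) ->
        false;
    accepts_gzip(Header) ->
        Encodings = [string:strip(hd(string:tokens(Elem, ";")))
                     || Elem <- string:tokens(Header, ",")],
        lists:member("gzip", Encodings).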

Instead of naming tests test1 - test17 with comments above each test, just use
descriptive test names. It might help to group related tests as well so that
tests are easier to find.

Data in the etap tests shouldn't be stored inline when it's that big. Create
data files and use the test helpers to reference the filenames and read from
disk.

 storing attachments in compressed form and serving them in compressed form if 
 accepted by the client
 

 Key: COUCHDB-583
 URL: https://issues.apache.org/jira/browse/COUCHDB-583
 Project: CouchDB
  Issue Type: New Feature
  Components: Database Core, HTTP Interface
 Environment: CouchDB trunk
Reporter: Filipe Manana
 Attachments: couchdb-583-trunk-3rd-try.patch, 
 couchdb-583-trunk-4th-try-trunk.patch, couchdb-583-trunk-5th-try.patch, 
 couchdb-583-trunk-6th-try.patch, couchdb-583-trunk-7th-try.patch, 
 couchdb-583-trunk-8th-try.patch, couchdb-583-trunk-9th-try.patch, 
 jira-couchdb-583-1st-try-trunk.patch, jira-couchdb-583-2nd-try-trunk.patch


 This feature allows Couch to gzip compress attachments as they are being 
 received and store them in compressed form.
 When a client asks to download an attachment (e.g. GET
 somedb/somedoc/attachment.txt), the attachment is sent in compressed form if
 the client's http request has gzip specified as a valid transfer encoding for
 the response (using the http header Accept-Encoding). Otherwise couch
 decompresses the attachment before sending it back to the client.
 Attachments are compressed only if their MIME type matches one of those 
 listed in a separate config file. Compression level is also configurable in 
 the default.ini file.
 This follows Damien's suggestion from 30 November:
 "Perhaps we need a separate user editable ini file to specify compressable or
 non-compressable files (would probably be too big for the regular ini file).
 What do other web servers do?"
 Also, a potential optimization is to compress the file while writing to disk, 
 and serve the compressed bytes directly to clients that can handle it, and 
 decompressed for those that can't. For compressable types, it's a win for 
 both disk IO for reads and writes, and CPU on read.
 Patch attached.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (COUCHDB-623) File format for views is space and time inefficient - use a better one

2010-01-13 Thread Roger Binns (JIRA)
File format for views is space and time inefficient - use a better one
--

 Key: COUCHDB-623
 URL: https://issues.apache.org/jira/browse/COUCHDB-623
 Project: CouchDB
  Issue Type: Improvement
  Components: Database Core
Affects Versions: 0.10
Reporter: Roger Binns


This was discussed on the dev mailing list over the last few days and noted 
here so it isn't forgotten.

The main database file format is optimised for data integrity - not losing or 
mangling documents - and rightly so.

That same append-only format is also used for views, where it is a poor fit.
The more random the ordering of the data supplied, the larger the btree.  The
larger the keys (in bytes), the larger the btree.  As an example, my 2GB of raw
JSON data turns into a 3.9GB CouchDB database but a 27GB view file (before
compacting to 900MB).  Since views are not replicated, this requires a
disproportionate amount of disk space on each receiving server (not to mention
I/O load).  The format also affects view generation performance.  By loading my
documents into CouchDB ordered by the most commonly emitted view value I was
able to reduce load time from 75 minutes to 40 minutes, with the view file size
being 15GB instead of 27GB, but still very distant from the 900MB post
compaction.

Views are a performance enhancement.  They save you from having to visit every
document when doing some queries.  The data within a view is generated, and
hence the only consequence of losing view data is a performance one; the view
can be regenerated anyway.  Consequently the file format should be one that is
optimised for performance and size.  The only integrity feature needed is the
ability to tell that the view is potentially corrupt (eg the power failed
while it was being generated/updated).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: openid 1.1 authentication handler

2010-01-13 Thread Matteo Caprari
Hi.

As suggested by Chris on the user list, it could be interesting to
integrate the openid handler into the couch.

I guess before going any further there should be some discussion here
and I'll probably need some suggestions about code guidelines and
prepping the makefiles.

I should also mention that the goal is to have the couch work both as
openid client and endpoint.

cheers

On Wed, Jan 13, 2010 at 6:03 PM, Chris Anderson jch...@apache.org wrote:
 On Wed, Jan 13, 2010 at 9:20 AM, Matteo Caprari
 matteo.capr...@gmail.com wrote:
 Hi.

 I've released an authentication handler that adds support for
 authenticating with openid 1.1.
 It works but needs to be stressed a bit.

 Source and readme:
 http://github.com/mcaprari/couchdb-openid

 blogged (copied the readme):
 http://caprazzi.net/posts/openid-authentication-handler-for-couchdb/

 This looks really cool. If you want to work on getting it into CouchDB
 (might require some cleanup) you should bring it up on the dev list,
 and put the patch into Jira:

 http://issues.apache.org/jira/browse/COUCHDB

 Chris


 cheers
 --
 :Matteo Caprari
 matteo.capr...@gmail.com




 --
 Chris Anderson
 http://jchrisa.net
 http://couch.io




-- 
:Matteo Caprari
matteo.capr...@gmail.com


[jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one

2010-01-13 Thread Chris Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799864#action_12799864
 ] 

Chris Anderson commented on COUCHDB-623:


It's worth noting that another advantage of using the storage btrees is the
MVCC guarantees. This means that a slow client can take its sweet time to
traverse the view index, and is not affected by ongoing writes or deletes.

This is crucial for the consistency guarantees views make.

It is not very hard to create alternate view index systems (like 
CouchDB-Lounge) and the overhead of running as an external is negligible. One 
fine way to prototype a view system that optimizes for different things would 
be as an external.
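
For reference, the _external protocol is line-based JSON over stdio, so a
prototype is tiny. A minimal sketch (request parsing elided; a real handler
would decode the request with something like mochijson2):

    %% Read one JSON request per line from stdin, write one JSON
    %% response per line to stdout, until CouchDB closes the pipe.
    loop() ->
        case io:get_line("") of
            eof ->
                ok;
            _RequestJson ->
                io:format("{\"code\": 200, \"json\": {\"ok\": true}}~n"),
                loop()
        end.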


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one

2010-01-13 Thread Roger Binns (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799878#action_12799878
 ] 

Roger Binns commented on COUCHDB-623:
-

What are the consistency guarantees that views make?  I can't find any 
documentation about it anywhere!  (There is plenty about the main db, but 
nothing about views.)

I can't see any guarantees that you can make, as the view data is derived from
the documents and the documents can be changed at any point.  For example,
while the first row of a view is being returned, the corresponding document
could have been deleted.  The slow client example can also lead to inconsistent
data - for example, a client may update a document on one connection, then
access the view on a second connection and, due to timing, end up with the
view not including that document.

The only consistency guarantee I can see is that if you do not
add/change/delete documents from shortly before view retrieval until the view
is completely retrieved, then the view will reflect the documents correctly at
that time.  If there is any form of concurrency between document changes and
view reads then there cannot be guarantees unless CouchDB introduced a
transaction system.

I do see how the append-only btree/MVCC format makes the view retrieval code
easier to write, but users of CouchDB do not care how hard the code is to write
:-)


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one

2010-01-13 Thread Paul Joseph Davis (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799891#action_12799891
 ] 

Paul Joseph Davis commented on COUCHDB-623:
---

The consistency guarantee refers to the file format: it provides on-disk
consistency guarantees the same as the main database file does (ie, tail-append
MVCC style). It's not a reference to figuring out the sync between the main db
and the view. As you point out, doing things like querying with stale=ok can
give you a view result that does not reflect the most recent changes to the
database, or reflects changes from other clients, etc etc.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one

2010-01-13 Thread Adam Kocoloski (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799896#action_12799896
 ] 

Adam Kocoloski commented on COUCHDB-623:


I believe by consistency guarantees Chris meant that a view request uses a 
single snapshot of the view index for the entire response.  Even if documents 
are changed in the interim, and even if someone else has triggered a view 
update, your response will still accurately reflect the state of the DB at a 
single moment in time.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one

2010-01-13 Thread Chris Anderson
On Wed, Jan 13, 2010 at 11:34 AM, Adam Kocoloski (JIRA) j...@apache.org wrote:

    [ 
 https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799896#action_12799896
  ]

 Adam Kocoloski commented on COUCHDB-623:
 

 I believe by consistency guarantees Chris meant that a view request uses a 
 single snapshot of the view index for the entire response.  Even if documents 
 are changed in the interim, and even if someone else has triggered a view 
 update, your response will still accurately reflect the state of the DB at a 
 single moment in time.

Thanks Adam, that's exactly what I'm talking about.







-- 
Chris Anderson
http://jchrisa.net
http://couch.io


[jira] Closed: (COUCHDB-623) File format for views is space and time inefficient - use a better one

2010-01-13 Thread Damien Katz (JIRA)

 [ 
https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damien Katz closed COUCHDB-623.
---

Resolution: Invalid
  Assignee: Damien Katz

Closing as Invalid; this has no objective criteria for being resolved.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [jira] Created: (COUCHDB-620) Generating views is extremely slow - makes CouchDB hard to use with non-trivial number of docs

2010-01-13 Thread Damien Katz
Let's have this discussion on the dev mailing list. That's what it's for.

-Damien


 



[jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one

2010-01-13 Thread Roger Binns (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799916#action_12799916
 ] 

Roger Binns commented on COUCHDB-623:
-

Not again Damien :-)

Simple criteria - the size of the view file should be proportionate to the
data in a view on initial generation.  If you want raw numbers, the view file
should be no larger than double the sum of the JSON-encoded key, value and _id
for each row.
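
Expressed as code, the proposed bound is just (a sketch; rows are assumed to
be already JSON-encoded binaries):

    %% Upper bound on an acceptable view file size for a set of rows.
    max_view_size(Rows) ->
        2 * lists:sum([byte_size(Id) + byte_size(Key) + byte_size(Value)
                       || {Id, Key, Value} <- Rows]).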

The current multiplier is 15 to 27 times as much, which is ludicrous.  Even
post compaction the file is a little on the large side.  And because the view
results are not replicated, the overhead has to be incurred on every machine
that replication happens to.

Or put another way: if people are planning on deploying CouchDB, how much
space would you advise them to provision?

When I started, the answer for 10 million documents/2.5GB of raw JSON was 72GB:

  23GB for DB, another 21GB for the compacted version, 27+GB for view file, 
another 1+GB for compacted view file

By shortening ids to 4 bytes instead of 16 we get:

  4GB for DB, another 4GB for compacted, 27GB for view file, another 1GB for 
compacted view file

By being able to sort my documents to be ordered by the most commonly emitted 
view key:
 
  4GB for DB, another 4GB for compacted, 15GB for view file, another 1GB for 
compacted view file

Since the view/DB coexists with its compacted copy during compaction, you need
space for both simultaneously.  10 million documents/2GB of data is not
something that makes any existing database system sweat.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (COUCHDB-583) storing attachments in compressed form and serving them in compressed form if accepted by the client

2010-01-13 Thread Filipe Manana (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799922#action_12799922
 ] 

Filipe Manana commented on COUCHDB-583:
---

Hi Paul,

thanks for your feedback.

"Passing around the Y and N binaries as a flag for an attachment
being compressed is un-erlangy. true and false atoms would be better."

Well, this was mostly because I read somewhere in Armstrong's book that 
binaries are preferred (more efficient) for IO operations (network, disk 
storage). But I agree, using true / false atoms is more readable.

"Is there nothing in mochiweb to handle accept-encoding parsing?"

I don't think so, at least not in the mochiweb included with couch. It's
probably better to move these accept-encoding parsing functions, and the
respective test functions, into the mochiweb sources.

I'll get back to work and enhance the patch following your remarks.

cheers


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (COUCHDB-583) storing attachments in compressed form and serving them in compressed form if accepted by the client

2010-01-13 Thread Damien Katz (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799937#action_12799937
 ] 

Damien Katz commented on COUCHDB-583:
-

I haven't looked at the patch, but I agree with most of Paul's comments, except
for figuring out when to compress files. Lots of compressed files might have
uncompressed headers in the file, leading to unnecessary compression. MP3s with
id3v2 tags immediately come to mind.
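
If the heuristic route is taken anyway, sampling from past the head of the
file would sidestep that (a sketch; the 64KB skip is an arbitrary assumption,
and id3v2 tags can be larger):

    %% Take the 4KB after the first 64KB when the attachment is large
    %% enough, so uncompressed headers don't skew the test.
    sample(Bin) when byte_size(Bin) >= 69632 ->
        <<_:65536/binary, Sample:4096/binary, _/binary>> = Bin,
        Sample;
    sample(Bin) ->
        Bin.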


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (COUCHDB-583) storing attachments in compressed form and serving them in compressed form if accepted by the client

2010-01-13 Thread Filipe Manana (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799947#action_12799947
 ] 

Filipe Manana commented on COUCHDB-583:
---

Hum, 

Let's open a vote :)

1) use a heuristic, as suggested by Paul

2) or a file listing the mime types worth compressing

3) some other alternative?

cheers


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (COUCHDB-583) storing attachments in compressed form and serving them in compressed form if accepted by the client

2010-01-13 Thread Paul Joseph Davis (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799948#action_12799948
 ] 

Paul Joseph Davis commented on COUCHDB-583:
---

Hrm, 4KiB of headers even? That is a good point though. But I'd still be quite
hesitant to make it a whitelist of content types to compress. Unless maybe we
allowed text/* or similar. Or perhaps it should be a blacklist that could do
the * match?
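
The * match itself is only a few lines (a sketch; the pattern syntax is an
assumption):

    %% A trailing-* glob: "audio/*" matches "audio/mpeg"; anything
    %% without a * must match exactly.
    mime_matches("*", _Type) ->
        true;
    mime_matches(Pattern, Type) ->
        case lists:reverse(Pattern) of
            "*/" ++ RevPrefix ->
                lists:prefix(lists:reverse(RevPrefix) ++ "/", Type);
            _ ->
                Pattern =:= Type
        end.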


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one

2010-01-13 Thread Roger Binns (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799965#action_12799965
 ] 

Roger Binns commented on COUCHDB-623:
-

The view consistency stuff is a red herring.  If you are not making changes to 
the DB then any file format will work and give consistent results.

If you are making changes to the docs then no scheme short of 
transactions/locking will ensure that the view is consistent with the 
documents.  It will always be possible for documents to be referenced by the 
view that are not in the DB and for documents to be in the DB that are not in 
the view.  I see no point in trying to even make the view consistent with a 
point in time while DB changes are happening since it gives no performance 
efficiency nor any space efficiency - in fact the extreme opposites.

The point of views is to give me information fast that I could only otherwise 
obtain by visiting all the documents.  That is what they should be optimized 
for.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one

2010-01-13 Thread Chris Anderson
On Wed, Jan 13, 2010 at 2:11 PM, Roger Binns (JIRA) j...@apache.org wrote:

    [ 
 https://issues.apache.org/jira/browse/COUCHDB-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799965#action_12799965
  ]

 Roger Binns commented on COUCHDB-623:
 -

 The view consistency stuff is a red herring.  If you are not making changes 
 to the DB then any file format will work and give consistent results.

 If you are making changes to the docs then no scheme short of 
 transactions/locking will ensure that the view is consistent with the 
 documents.  It will always be possible for documents to be referenced by the 
 view that are not in the DB and for documents to be in the DB that are not in 
 the view.  I see no point in trying to even make the view consistent with a 
 point in time while DB changes are happening since it gives no performance 
 efficiency nor any space efficiency - in fact the extreme opposites.

 The point of views is to give me information fast that I could only otherwise 
 obtain by visiting all the documents.  That is what they should be optimized 
 for.

The current views are optimized for your red herring. Where it actually
matters is the ability to give transactional information about things like
bank account balances.

see: http://books.couchdb.org/relax/reference/recipes for the Banking recipe

without MVCC views, there's no way to query accurately at all when
inserts are underway (short of blocking reads during writes).

If you need something with less consistency, you are encouraged to
wrap your own indexing system around couchdb's map reduce runtime, or
even build your own runtime.

has anyone used Hadoop as an external yet?

Chris








-- 
Chris Anderson
http://jchrisa.net
http://couch.io


Re: [jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one

2010-01-13 Thread Roger Binns
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Chris Anderson wrote:
 see: http://books.couchdb.org/relax/reference/recipes for Banking
 
 without MVCC views, there's no way to query accurately at all when
 inserts are underway (short of blocking reads during writes).

I am afraid I do not understand what you are saying.  Sure the scheme listed
in the book makes sense, but only if a transaction maps exactly to one
document (which I guess is the point).  Even then I still don't see the
relevance.  Things would only break down if the view returned partial
information (eg if a single document caused two view rows to be emitted but
only one of those was returned).  BTW views do not return the update_seq, so
as an end user you still do not know how up to date it is.

The file format does not need to protect each view row, but does need to do
so for the main database where the unit is a document.

For example the view file format could use an atomic unit of 10,000
documents' view output or some number of megabytes.  That unit can still be
regenerated if something bad happens (a rare circumstance such as an untimely
power failure).

 If you need something with less consistency, you are encouraged to
 wrap your own indexing system around couchdb's map reduce runtime, or
 even build your own runtime.

I am becoming very tempted to just dump CouchDB for SQLite with a trivial
REST front end, since it appears that CouchDB is just not capable of handling
10 million documents/2GB of data in anything resembling a sensible amount of
disk space or compute time for the foreseeable future.

Roger
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAktOTyUACgkQmOOfHg372QRoZwCgqMCpYfZT3aHYXGMfqfMzXpk6
1UIAoN+CV+wtsyOW8Ndiq7c/qM5Qt4+Y
=7gg2
-END PGP SIGNATURE-



Objective criteria

2010-01-13 Thread Roger Binns
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

[I wrote a personal query to Damien which he asked me to repeat here.]

Both 620 and 623 were closed by Damien because they lacked objective
criteria.  For both tickets there are criteria I consider objective :-)
There are also suggestions on how to address the issues.

Consequently my query is: if the criteria are not objective enough, does
Damien not agree with them, not care about the underlying issues, think that a
10 million document/2GB raw JSON data set is outside the scope of what CouchDB
should cope with, want this stuff in the wiki, etc?

Roger
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAktOUP4ACgkQmOOfHg372QS/hQCfSP9Edy+wrZRFwItFmDD3mNcN
yyIAn2z9XvJigm2xKk/r4CgAUqZp1t/i
=JMLG
-END PGP SIGNATURE-



Re: [jira] Commented: (COUCHDB-623) File format for views is space and time inefficient - use a better one

2010-01-13 Thread Chris Anderson
On Wed, Jan 13, 2010 at 2:54 PM, Roger Binns rog...@rogerbinns.com wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 Chris Anderson wrote:
 see: http://books.couchdb.org/relax/reference/recipes for Banking

 without MVCC views, there's no way to query accurately at all when
 inserts are underway (short of blocking reads during writes).

 I am afraid I do not understand what you are saying.  Sure the scheme listed
 in the book makes sense, but only if a transaction maps exactly to one
 document (which I guess is the point).  Even then I still don't see the
 relevance.  Things would only break down if the view returned partial
 information (eg if a single document caused two view rows to be emitted but
 only one of those was returned).  BTW views do not return the update_seq, so
 as an end user you still do not know how up to date it is.

If that would help, I think there are people working on an update_seq
patch for views.


 The file format does not need to protect each view row, but does need to do
 so for the main database where the unit is a document.

A reduce giving a balance for a particular account could be affected
by documents being inserted anywhere in the db. The current map reduce
system guarantees that the balance returned reflects a consistent
snapshot of the database, even if other operations are ongoing. (eg a
given transfer will appear consistently, even if those same accounts
are undergoing concurrent operations for other transfers.)
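
As a concrete sketch (hypothetical names again - a "bank" db whose
"balance" view maps each transfer to signed amounts per account with a sum
reduce), a reader asking for balances sees one consistent snapshot of the
index:

import json, urllib.request

# Query the (hypothetical) balance view with group=true to collapse the
# emitted rows into one balance per account.  The whole response is
# computed against a single snapshot of the view index, so a transfer's
# debit and credit rows are either both included or both absent.
url = ("http://localhost:5984/bank/_design/accounts"
       "/_view/balance?group=true")
with urllib.request.urlopen(url) as resp:
    for row in json.load(resp)["rows"]:
        print(row["key"], row["value"])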

We can't atomically prevent overdrafts, but how many banks do that, anyway?


 For example the view file format could use an atomic unit of 10,000
 documents' view output or some number of megabytes.  That unit can still be
 regenerated if something bad happens (a rare circumstance such as an
 untimely power failure).

There are alternate storage systems which either use locking or make no
consistency guarantees at all.


 If you need something with less consistency, you are encouraged to
 wrap your own indexing system around couchdb's map reduce runtime, or
 even build your own runtime.

 I am becoming very tempted to just dump CouchDB for SQLite with a trivial
 REST front end, since it appears that CouchDB is just not capable of
 handling 10 million documents/2GB of data in anything resembling a sensible
 amount of disk space or compute time for the foreseeable future.


It sounds like Couch does just fine if you run compaction. Perhaps we
should recommend view compaction more aggressively.
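
Something along these lines, for instance (hypothetical db and design doc
names; the request returns immediately and compaction runs in the
background):

import http.client

# Trigger view compaction for design doc "accounts" in database "bank"
# (hypothetical names).  The server answers 202 Accepted and compacts
# the view file in the background.
conn = http.client.HTTPConnection("localhost", 5984)
conn.request("POST", "/bank/_compact/accounts", "",
             {"Content-Type": "application/json"})
resp = conn.getresponse()
print(resp.status, resp.read())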

Chris

 Roger





-- 
Chris Anderson
http://jchrisa.net
http://couch.io


Re: Objective criteria

2010-01-13 Thread Roger Binns

Chris Anderson wrote:
 The ticketing system should be for smaller scope issues, I think.

I see it more as a "don't forget about this" plus somewhere for others to
say "this also affects me" or "here is additional information/angles".
Obviously there is a fine line between that kind of thing and a discussion.
My big concern is that the issue was hashed out here over a few days, then
the thread goes dead, and the issue is forgotten.  A JIRA report of open
issues should be a todo list of bugs to fix and improvements to make.

 Optimizing the view server is an agreed goal of the community.

Maybe in people's heads, but it wasn't written down anywhere such as the
tracker or the roadmap.  In fact the front page of couchdb.org claims that
Erlang allows for the CouchDB design to be scalable, and the overview page
makes an "efficient" claim in its last sentence.  The current implementation
is neither of these.

 Probably the best way to help is to take a look at all the work
 Damien's done in trunk (the pipelining) and perhaps the parallel
 writers optimization he has. 

BTW I have been using trunk for over a week.  It is better than the 0.10 I
was using before, but not that much of an improvement.  And changes in the
way I generate some of my data have hurt me again (I can order either by
_id or by view keys, but not both at the same time), so my initial DB has
now gone from 4GB to 15GB (I optimized for views).
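
For illustration, this is the kind of loader I mean (a sketch with a
hypothetical doc shape, not my real code) - sorting by _id keeps the by-id
B-tree appends mostly sequential, but says nothing about the order the view
engine will see the derived keys in:

import json, http.client

# Hypothetical bulk load: zero-padded sequential _ids make inserts into
# the by-id B-tree mostly append-only, but the view keys derived from
# these docs can still arrive in an order that bloats the view file.
docs = [{"_id": "%010d" % i, "key": i % 97} for i in range(10000)]
docs.sort(key=lambda d: d["_id"])  # a no-op here, but the point in general

conn = http.client.HTTPConnection("localhost", 5984)
conn.request("POST", "/mydb/_bulk_docs", json.dumps({"docs": docs}),
             {"Content-Type": "application/json"})
print(conn.getresponse().status)  # expect 201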

 We could really use a way to take the
 benchmarks you ran, and put them into the buildbot.

Sadly I can't do it with my real data because it belongs to someone else.
However I hereby commit to produce a representative benchmark that is
substantially similar in performance and data within the next two weeks.
(Also note that there is nothing special about what I am doing - anyone with
similar numbers of documents has similar issues.)

I'm hoping that more can be done about the size issues soon too.  (I think
that addressing the size issues will help a lot since it will require way
less CPU and I/O to produce and use smaller files.)

Roger



Re: Objective criteria

2010-01-13 Thread Mikeal Rogers
I have been putting together some stuff that seems pertinent to this
discussion.

I'm working on a performance suite that tests a variety of concurrent
performance scenarios.

I have the client code written but I'm still working on the automated
build/test code. Once that is finished I plan to do some GitHub integration
and some charting.

The idea here is to chart the performance differences between a GitHub
branch at a certain commit compared to the performance of the latest release
and the latest trunk.

If someone has an idea of how they might increase performance they could
point this tool at their GitHub branch and reference the differences in
performance between their code and the latest release and trunk.
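
The client side is not much more than this kind of thing - a stripped-down
sketch (hypothetical db name and doc shape, far simpler than the real
suite) that times a batch of concurrent writers:

import json, time, threading, urllib.request

# Minimal sketch of one scenario: 10 concurrent writers, 100 docs each,
# against a (hypothetical) "bench" database that already exists.
DB = "http://localhost:5984/bench"

def writer(worker, count):
    for i in range(count):
        doc = json.dumps({"worker": worker, "seq": i}).encode()
        req = urllib.request.Request(
            DB, data=doc, headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).read()

threads = [threading.Thread(target=writer, args=(w, 100)) for w in range(10)]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
print("1000 docs in %.2fs (%.0f docs/s)" % (elapsed, 1000 / elapsed))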

I'll send another email once I have some pretty graphs to show off :)

-Mikeal

On Wed, Jan 13, 2010 at 5:19 PM, Damien Katz dam...@apache.org wrote:


 On Jan 13, 2010, at 3:37 PM, Roger Binns wrote:

 
  Chris Anderson wrote:
  The ticketing system should be for smaller scope issues, I think.
 
  I see it more as a "don't forget about this" plus somewhere for others to
  say "this also affects me" or "here is additional information/angles".
  Obviously there is a fine line between that kind of thing and a discussion.
  My big concern is that the issue was hashed out here over a few days, then
  the thread goes dead, and the issue is forgotten.  A JIRA report of open
  issues should be a todo list of bugs to fix and improvements to make.

 That's fine, the issue is that bugs saying "it's too slow" are always true
 for someone.  Many people find the view indexing performance just fine;
 many do not.

 Since CouchDB makes no performance or size guarantees, you can't call the
 general performance a bug.  Unless you have a specific bug to fix, or an
 enhancement to make, don't use JIRA.  Use the dev mailing list to make your
 case, and see if someone will produce a patch or find a measurable
 bottleneck that can be addressed.  JIRA is not the place for discussions
 about the design of CouchDB components.  Neither is IRC.

 Also, if you want something without view performance problems but similar
 to CouchDB, you should look at MongoDB.

 -Damien


 
  Optimizing the view server is an agreed goal of the community.
 
  Maybe in people's heads, but it wasn't written down anywhere such as the
  tracker or the roadmap.  In fact the front page of couchdb.org claims that
  Erlang allows for the CouchDB design to be scalable, and the overview page
  makes an "efficient" claim in its last sentence.  The current
  implementation is neither of these.
 
  Probably the best way to help is to take a look at all the work
  Damien's done in trunk (the pipelining) and perhaps the parallel
  writers optimization he has.
 
  BTW I have been using trunk for over a week.  It is better than the 0.10 I
  was using before, but not that much of an improvement.  And changes in the
  way I generate some of my data have hurt me again (I can order either by
  _id or by view keys, but not both at the same time), so my initial DB has
  now gone from 4GB to 15GB (I optimized for views).
 
  We could really use a way to take the
  benchmarks you ran, and put them into the buildbot.
 
  Sadly I can't do it with my real data because it belongs to someone else.
  However I hereby commit to produce a representative benchmark that is
  substantially similar in performance and data within the next two weeks.
  (Also note that there is nothing special about what I am doing - anyone
  with similar numbers of documents has similar issues.)
 
  I'm hoping that more can be done about the size issues soon too.  (I think
  that addressing the size issues will help a lot since it will require way
  less CPU and I/O to produce and use smaller files.)
 
  Roger
 




Re: Objective criteria

2010-01-13 Thread Roger Binns

Damien Katz wrote:
 That's fine, the issue is that bugs saying "it's too slow" are always true
 for someone.

I did give specific numbers - ie 10 million documents, 2GB of JSON data etc
and the amount of time taken as well as space.  I doubt you'd find anyone
that considers 4 hours or 27GB to be reasonable numbers for that :-)

 Many people find the view indexing performance just fine; many do not.

True.  For example one of my other projects currently has 100 documents and
I have no issue with any part of CouchDB for that.

What isn't clear is a statement of reasonable expectations - should I be
able to handle 10 million documents in CouchDB?  Will it ever handle that?
Do you as the project leader care about that?  Everything has a sweet spot,
and I am not asking you to make 10 million documents be encompassed by the
sweet spot, but clearly if you never intend for CouchDB to handle that much
data then I need to go elsewhere.

 Since CouchDB makes no performance or size guarantees,

How about publishing some?  Not guarantees, but rather some expectations.
For example if someone has 1GB of JSON data in 1 million documents, what
would be an expectation of size?  The bugs can then be about substantial
divergences from that.

 Unless you have a specific bug to fix, or an enhancement to make, don't use JIRA.

The issues you closed listed specific enhancements (pipelining, multiple
instances, a different file format etc).  I do acknowledge that I didn't
supply code, but I can't do everything :-)  All my personal projects are
open source - it isn't like I am trying to take and never give.

 Also, if you want something without view performance problems but similar to 
 CouchDB, you should look at MongoDB.

I did research the alternatives I could find.  CouchDB is the only solution
that was designed for replication (and hence offline working, occasional
disconnection, any topology for replication etc).  CouchDB is also the only
one that allows for indices/views on data that is calculated rather than
just extracting a particular value statically from the docs.  (That can be
worked around by calculating values and shoving them into the docs, but it
is less elegant.)

Other than that, MongoDB seemed to be the nicest.  But I really want
CouchDB to take over the world.  The concepts are right.  The replication
point of view is right, etc.  Its not handling millions of documents in a
reasonable amount of space and time is not right IMHO, but I still don't
know what the project opinion is.

Roger



Re: Objective criteria

2010-01-13 Thread Roger Binns

Chris Anderson wrote:
 I know your data is on the large side and CouchDB doesn't auto-cluster

Ah, the clue!  I don't think my data is large by any measure (10 million
docs, 2GB of JSON).  SQLite (note *lite* in the name) doesn't break a
sweat.  It only occupies 40% of a DVD.  Picking a random low end machine
from Dell shows that they ship with a minimum of 1GB of RAM and ideally
want you to buy 2GB.  Something that fits in the RAM of a $350 machine from
Dell is not what I would consider large!  The data fits in my machine's RAM
4 times over.  Can you even buy USB sticks or SD cards these days smaller
than 2GB?  You could fit 15 copies of my data and an operating system in
the smallest SSD drives.

GMail's initial quota however many years ago was 1GB.  Keith Packard's email
is half a million messages but 5GB of data - http://keithp.com/blogs/notmuch/

My machine has 350,000 files and directories (excluding backups, which
duplicate many of those multiple times over).  This is a similar order of
magnitude to my data set (and several times larger if counting backups).
(Note I am just talking about what you would get if you constructed a
database of file and directory names, information about them etc - not the
contents.)

My deployment plans are the opposite of clustering and partitioning, as my
data set is so small!  I wanted to put a copy of CouchDB on each and every
server and have them replicate to each other, rather than use dedicated
networked data servers.
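
ie on each box, a small cron job doing something like this for every peer
(hypothetical host names; one-shot replication shown, which blocks until it
completes):

import json, http.client

# Hypothetical mesh: each server pulls the "mydb" database from every
# peer.  One-shot replication; run it periodically from each box.
PEERS = ["server-a.example.com", "server-b.example.com"]

for peer in PEERS:
    conn = http.client.HTTPConnection("localhost", 5984)
    body = json.dumps({"source": "http://%s:5984/mydb" % peer,
                       "target": "mydb"})
    conn.request("POST", "/_replicate", body,
                 {"Content-Type": "application/json"})
    resp = conn.getresponse()
    print(peer, resp.status)
    resp.read()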

If this kind of (trivial!) size means clustering, partitioning etc then
CouchDB is not remotely appropriate for my circumstances, and probably not
for people recording file and email databases.  I only wish there was
documentation somewhere saying what normal sizes are for CouchDB and what
the expectations for them are.

Roger