[jira] Commented: (COUCHDB-620) Generating views is extremely slow - makes CouchDB hard to use with non-trivial number of docs

2010-01-20 Thread James Marca (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803089#action_12803089
 ] 

James Marca commented on COUCHDB-620:
-

I too am interested in getting view generation to go faster.  I have 2 
databases, one with 20 million documents, another with around 10 million.  The 
documents are large (hourly raw data dumps) and the view is tedious (5 minute 
averages of raw data), but generation is taking way too long.  I've been 
running the process for a few days and am only at 20% done on the smaller, and 
10% on the larger one.  

In testing things, I set up the 10 million document db as 12 smaller dbs, one 
per month.  There the performance was slow, but it took a weekend as I recall, 
not more.  At the time I was generating the views on all dbs at once.

I am running a very recent git version of couchdb.

Sorry, I don't know erlang at all so I can't submit patches.  I would be 
willing to learn enough erlang to recode my view, but haven't had time yet.


> Generating views is extremely slow - makes CouchDB hard to use with 
> non-trivial number of docs
> --
>
> Key: COUCHDB-620
> URL: https://issues.apache.org/jira/browse/COUCHDB-620
> Project: CouchDB
> Issue Type: Improvement
> Components: Infrastructure
> Affects Versions: 0.10
> Environment: Ubuntu 9.10 64 bit, CouchDB 0.10
> Reporter: Roger Binns
> Assignee: Damien Katz
> Attachments: pipelining.jpg
>
>
> Generating views is extremely slow.  For example adding 10 million documents 
> takes less than 10 minutes but generating some simple views on the same docs 
> takes over 4 hours.
> Using top you can see that CouchDB (erlang) and couchjs between them cannot 
> even saturate a single CPU let alone the I/O system.  Under ideal conditions 
> performance should be limited by cpu, disk or memory.  This implies that the 
> processes are doing simple things in lockstep accumulating latencies in each 
> process as well as the communication between them which when multiplied by 
> the number of documents can amount to a lot.
> Some suggestions:
> * Run as many couchjs instances as there are processor cores and scatter work 
> amongst them
> * Have some sort of pipelining in the erlang so that the moment the first 
> byte of response is received from couchjs the data is sent for the next 
> request (the JSON conversion, HTTP headers etc should all have been assembled 
> already) to reduce latencies.  Do whatever is most similar in couchjs (eg use 
> separate threads to read requests, process them and write responses).
> * Use the equivalent of HTTP pipelining when talking to couchjs so that it 
> always has a doc ready to work on rather than having to transmit an entire 
> response and then wait for erlang to think and provide an entire new request.
> A simple test of success is to have a database with a million or so documents 
> with a trivial view and have view creation max out the CPU, memory or disk.
> Some things in CouchDB make this a particularly nasty problem.  View data is 
> not replicated so replicating documents can lead the view data by a large 
> margin on the recipient database.  This can lead to inconsistencies.  You 
> also can't expect users to then wait minutes (or hours) for a request to 
> complete because the view generation got that far behind.  (My own plans now 
> are to not use replication and instead create the database file on another 
> couchdb instance and then rsync the binary database file over instead!)
> Although stale=ok is available, you still have no idea if the response will 
> be quick or take however long view generation does.  (Sure I could add some 
> sort of timeout and complicate the code but then what value do I pick?  If I 
> have a user waiting I want an answer ASAP or I have to give them some 
> horrible error message.  Taking a long wait and then giving a timeout is even 
> worse!)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (COUCHDB-620) Generating views is extremely slow - makes CouchDB hard to use with non-trivial number of docs

2010-01-14 Thread Brian Candler (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800145#action_12800145
 ] 

Brian Candler commented on COUCHDB-620:
---

> The error reporting issue is that if you've got four docs in the pipeline, and 
> the process dies, then it's hard to tell which document caused the error.

Well, I guess it is on the 'push' side. But if you update your pointers on the 
'pull' side, and you've had two documents back, then the error must be in the 
third.
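
A minimal sketch of that pull-side bookkeeping, in Python for illustration only. The function name and the plain list of ids are assumptions, not CouchDB code; the one thing it relies on is that the view server answers strictly in request order.

```python
def suspect_document(sent_ids, complete_responses):
    """Return the id of the document being processed when the server died.

    sent_ids: ids pushed down the pipe, in send order.
    complete_responses: number of whole responses read back before the crash.
    Assumes responses come back strictly in request order.
    """
    if complete_responses >= len(sent_ids):
        return None  # everything that was sent got an answer
    return sent_ids[complete_responses]


# Four docs in the pipeline, two responses back: the third is the suspect.
print(suspect_document(["a", "b", "c", "d"], 2))  # prints: c
```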

> And generally retrying will just cause another error.

If the view server *crashes* when fed document X, then I think document X 
should be retried - i.e. this was probably an intermittent server error.

But if processing document X raised an *exception* (i.e. map function cannot 
handle the content) then I'd have thought that couchjs should catch, serialise 
and return the exception. That would allow the document to be skipped cleanly.
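
The exception case above might look like this on the view-server side. This is a Python sketch for illustration only: couchjs itself is JavaScript, and the `run_map` name and the `{"ok": ...}` / `{"error": ...}` shapes are assumptions made for this example, not the real protocol.

```python
import json


def run_map(map_fn, doc):
    """Apply map_fn to doc and report an exception as data instead of dying."""
    try:
        rows = list(map_fn(doc))
        return json.dumps({"ok": rows})
    except Exception as exc:
        # Hypothetical error shape, chosen for this sketch only.
        return json.dumps({
            "error": type(exc).__name__,
            "reason": str(exc),
            "id": doc.get("_id"),
        })


def by_name(doc):
    # Example map function: raises KeyError when "name" is missing.
    yield [doc["name"], 1]


print(run_map(by_name, {"_id": "x", "name": "n"}))
print(run_map(by_name, {"_id": "y"}))
```

With a reply like the second one, the caller can log the document id, skip it, and carry on, which a hard crash does not allow.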





[jira] Commented: (COUCHDB-620) Generating views is extremely slow - makes CouchDB hard to use with non-trivial number of docs

2010-01-13 Thread Paul Joseph Davis (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799769#action_12799769
 ] 

Paul Joseph Davis commented on COUCHDB-620:
---

The error reporting issue is that if you've got four docs in the pipeline, and 
the process dies, then it's hard to tell which document caused the error. And 
generally retrying will just cause another error.




[jira] Commented: (COUCHDB-620) Generating views is extremely slow - makes CouchDB hard to use with non-trivial number of docs

2010-01-12 Thread Brian Candler (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799586#action_12799586
 ] 

Brian Candler commented on COUCHDB-620:
---

Like Paul says: I am not proposing any change to the couchjs protocol, nor to 
allow out-of-order returning of responses from couchjs.

Just this: that the core takes the next 3 (say) documents to be processed, 
stuffs them down the socket to couchjs, then sends another one each time a 
whole document response is received.

The couchjs view server is completely unaware of this, since it runs lock-step 
(read a request, emit response, read request, emit response). It's just that 
when it next comes to read a request, there will be one waiting for it already.

The same as HTTP pipelining, in other words.

I don't see any particular problem with error handling. If you've not received 
a complete response for document X, then you don't update the view pointer so 
you'll try again next time.
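
The send-three-then-one-per-response scheme described above could be sketched as follows. This is Python for illustration, not the Erlang in CouchDB; `LockstepServer` is a made-up in-memory stand-in for couchjs, and `pipelined_map` and `window` are hypothetical names.

```python
import json
from collections import deque


class LockstepServer:
    """Stand-in for a line-based view server: one request in, one response
    out, always in order. Requests sent early simply wait in its inbox."""

    def __init__(self, map_fn):
        self._inbox = deque()
        self._map_fn = map_fn

    def send(self, line):
        self._inbox.append(line)

    def recv(self):
        doc = json.loads(self._inbox.popleft())
        return json.dumps(self._map_fn(doc))


def pipelined_map(server, docs, window=3):
    """Keep up to `window` requests outstanding; send one more each time a
    whole response has been read. Returns (results, max_in_flight)."""
    pending = deque()   # ids sent but not yet answered, in send order
    results = []
    max_in_flight = 0
    for doc in docs:
        if len(pending) == window:
            # Window full: read one complete response before sending more.
            results.append((pending.popleft(), json.loads(server.recv())))
        server.send(json.dumps(doc))
        pending.append(doc["_id"])
        max_in_flight = max(max_in_flight, len(pending))
    while pending:
        # Drain the responses still outstanding.
        results.append((pending.popleft(), json.loads(server.recv())))
    return results, max_in_flight


docs = [{"_id": str(i), "n": i} for i in range(5)]
server = LockstepServer(lambda d: [[d["n"] * 2, None]])
results, peak = pipelined_map(server, docs)
print(peak, [doc_id for doc_id, _ in results])
```

The server side is unchanged: it still reads one request and emits one response. Only the caller's view pointer logic has to know that a response belongs to the oldest pending id.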




[jira] Commented: (COUCHDB-620) Generating views is extremely slow - makes CouchDB hard to use with non-trivial number of docs

2010-01-12 Thread Roger Binns (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799411#action_12799411
 ] 

Roger Binns commented on COUCHDB-620:
-

The latency in a non-pipelined implementation really adds up.  For example an 
additional 1 millisecond latency adds almost 3 hours to generation time with 
a 10 million document database.  Have a millisecond here, a millisecond there 
and pretty soon you are measuring generation times in hours  :-)
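
Working the figure through (a small Python sketch; the function name is only for illustration): in a lock-step loop the extra latency is paid once per document, so the added time is the document count times the latency. For 10 million documents and 1 ms that is 10,000 seconds, a little under 3 hours.

```python
def added_time_seconds(doc_count, extra_latency_ms):
    """Extra wall-clock time when each document pays the latency once."""
    return doc_count * extra_latency_ms / 1000.0


extra = added_time_seconds(10_000_000, 1)
print(extra, "seconds")                 # prints: 10000.0 seconds
print(round(extra / 3600, 2), "hours")  # prints: 2.78 hours
```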

Since couchjs is not threaded I don't see any way for it to answer requests in 
a different order than sent.  (OK, you can do it with some sort of internal 
state machine and non-blocking I/O like Python's Twisted but I'm pretty sure 
couchjs is not doing that either.)

The only complication with pipelining is error handling.  For example there may 
be 5 requests in the pipeline when the couchjs process crashes.  Any 
unanswered requests would then need to be resubmitted to a freshly spawned 
couchjs.
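
A rough sketch of that resubmission step, in Python for illustration only. `FakeServer`, `ServerCrashed`, `process_with_restart` and `max_restarts` are all hypothetical names; the fake server just doubles each request so the result is easy to check.

```python
from collections import deque


class ServerCrashed(Exception):
    """Raised by recv() when the (fake) view server has died."""


class FakeServer:
    """In-memory stand-in for a view server that may die after N answers."""

    def __init__(self, crash_after=None):
        self.inbox = deque()
        self.crash_after = crash_after
        self.answered = 0

    def send(self, req):
        self.inbox.append(req)

    def recv(self):
        if self.crash_after is not None and self.answered == self.crash_after:
            raise ServerCrashed()
        self.answered += 1
        return self.inbox.popleft() * 2


def process_with_restart(spawn, requests, window=5, max_restarts=3):
    """Pipeline requests; on a crash, spawn a fresh server and resubmit
    every request that was sent but not yet answered."""
    server = spawn()
    todo = deque(requests)
    pending = deque()   # sent but unanswered, in send order
    results = []
    restarts = 0
    while todo or pending:
        while todo and len(pending) < window:
            req = todo.popleft()
            server.send(req)
            pending.append(req)
        try:
            results.append(server.recv())
            pending.popleft()
        except ServerCrashed:
            restarts += 1
            if restarts > max_restarts:
                raise  # give up rather than retry forever
            server = spawn()
            for req in pending:
                server.send(req)
    return results
```

The restart cap matters: if one request reliably kills the server, resubmitting it would otherwise loop forever.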

(BTW Brian 110% CPU consumption has nothing to do with afterburners.  Strictly 
speaking I meant core not CPU.  It just means multiple threads at the OS level 
were running and aggregate consumption between them amounted to 110% of a 
single core.  Or in other words CouchDB/beam.smp consumed 27.5% of the total 
compute resources that were available - 4 cores in one CPU.  CouchDB also seems 
to avoid using more than 3% of my available RAM.)




Re: [jira] Commented: (COUCHDB-620) Generating views is extremely slow - makes CouchDB hard to use with non-trivial number of docs

2010-01-12 Thread Chris Anderson
On Tue, Jan 12, 2010 at 8:25 AM, Adam Kocoloski (JIRA)  wrote:
>
>    [ 
> https://issues.apache.org/jira/browse/COUCHDB-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799251#action_12799251
>  ]
>
> Adam Kocoloski commented on COUCHDB-620:
> 
>
> Brian is spot-on here regarding the next steps.  I haven't checked -- does 
> couchjs support pipelining?
>

The couchjs view server protocol is strictly line based. Each line is
parsed to JSON after it is received. Then the computation is done, and
a line of JSON is returned.
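
For illustration, a lock-step loop of that shape might look like the Python sketch below. It only shows the read-a-line, compute, write-a-line rhythm; the message contents and the `serve` and `handle` names are made up for the example and are not the actual couchjs commands.

```python
import io
import json


def serve(inp, out, handle):
    """Read one JSON line, compute, write one JSON line, repeat."""
    for line in inp:
        line = line.strip()
        if not line:
            continue
        request = json.loads(line)
        response = handle(request)
        out.write(json.dumps(response) + "\n")
        out.flush()


# In-memory streams stand in for the pipe between erlang and the server.
requests = io.StringIO('{"n": 1}\n{"n": 2}\n')
responses = io.StringIO()
serve(requests, responses, lambda req: {"double": req["n"] * 2})
print(responses.getvalue(), end="")
```

Because each response is a single complete line, a caller can tell exactly when one request has finished without any extra framing.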

Since the couchjs server is single threaded it hasn't made much sense
to make the protocol more complex.

I think the best way to optimize here would be to have erlang open
more couchjs processes. I'm not sure how challenging this would be, or
how much actual benefit we'd see. Probably depends on workload and
server configuration.

Chris




-- 
Chris Anderson
http://jchrisa.net
http://couch.io


[jira] Commented: (COUCHDB-620) Generating views is extremely slow - makes CouchDB hard to use with non-trivial number of docs

2010-01-12 Thread Adam Kocoloski (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799251#action_12799251
 ] 

Adam Kocoloski commented on COUCHDB-620:


Brian is spot-on here regarding the next steps.  I haven't checked -- does 
couchjs support pipelining?




[jira] Commented: (COUCHDB-620) Generating views is extremely slow - makes CouchDB hard to use with non-trivial number of docs

2010-01-12 Thread Brian Candler (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799129#action_12799129
 ] 

Brian Candler commented on COUCHDB-620:
---

"the CouchDB process took between 70% and 130% of a CPU, usually in the 110% 
range"

That looks like one CPU fully used for CouchDB (with afterburners to get to 
110% :-)

"The couchjs process was hovering around 25% of a CPU" - which suggests that 
couchjs is waiting on couchdb to issue it with more work, so parallelising 
couchjs wouldn't help.

For comparison, have you tried the erlang view server? That would eliminate 
JSON serialisation/deserialisation overhead, plus much of the message-passing 
overhead. It would be very useful to have this comparison on your large dataset.

If measurement shows that the json serialisation overhead is large, maybe 
there's a fairly simple improvement: in couch_os_process.erl, make writejson 
and readjson execute in separate erlang processes, so they can execute on 
another core. You would need to pipeline requests to the view server to get the 
full benefit though.


> Generating views is extremely slow - makes CouchDB hard to use with 
> non-trivial number of docs
> --
>
> Key: COUCHDB-620
> URL: https://issues.apache.org/jira/browse/COUCHDB-620
> Project: CouchDB
>  Issue Type: Improvement
>  Components: Infrastructure
>Affects Versions: 0.10
> Environment: Ubuntu 9.10 64 bit, CouchDB 0.10
>Reporter: Roger Binns
>Assignee: Damien Katz
>
> Generating views is extremely slow.  For example adding 10 million documents 
> takes less than 10 minutes but generating some simple views on the same docs 
> takes over 4 hours.
> Using top you can see that CouchDB (erlang) and couchjs between them cannot 
> even saturate a single CPU let alone the I/O system.  Under ideal conditions 
> performance should be limited by cpu, disk or memory.  This implies that the 
> processes are doing simple things in lockstep accumulating latencies in each 
> process as well as the communication between them which when multiplied by 
> the number of documents can amount to a lot.
> Some suggestions:
> * Run as many couchjs instances as there are processor cores and scatter work 
> amongst them
> * Have some sort of pipelining in the erlang so that the moment the first 
> byte of response is received from couchjs the data is sent for the next 
> request (the JSON conversion, HTTP headers etc should all have been assembled 
> already) to reduce latencies.  Do whatever is most similar in couchjs (eg use 
> separate threads to read requests, process them and write responses).
> * Use the equivalent of HTTP pipelining when talking to couchjs so that it 
> always has a doc ready to work on rather than having to transmit an entire 
> response and then wait for erlang to think and provide an entire new request
> A simple test of success is to have a database with a million or so documents 
> with a trivial view and have view creation max out the CPU,. memory or disk.
> Some things in CouchDB make this a particularly nasty problem.  View data is 
> not replicated so replicating documents can lead the view data by a large 
> margin on the recipient database.  This can lead to inconsistencies.  You 
> also can't expect users to then wait minutes (or hours) for a request to 
> complete because the view generation got that far behind.  (My own plans now 
> are to not use replication and instead create the database file on another 
> couchdb instance and then rsync the binary database file over instead!)
> Although stale=ok is available, you still have no idea if the response will 
> be quick or take however long view generation does.  (Sure I could add some 
> sort of timeout and complicate the code but then what value do I pick?  If I 
> have a user waiting I want an answer ASAP or I have to give them some 
> horrible error message.  Taking a long wait and then giving a timeout is even 
> worse!)
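
The pipelining suggestion in the description above can be illustrated with a minimal sketch.  This is not CouchDB or couchjs code: map_doc, build_lockstep and build_pipelined are hypothetical names, and a Python thread with a bounded queue stands in for the erlang side reading ahead while the view server works.

```python
import queue
import threading


def map_doc(doc):
    # Stand-in for the work couchjs does: emit (key, value) rows for one doc.
    return [(doc["type"], 1)]


def build_lockstep(docs):
    # One doc at a time: the reader waits while the mapper works, and vice versa.
    rows = []
    for doc in docs:
        rows.extend(map_doc(doc))
    return rows


def build_pipelined(docs, depth=64):
    # A bounded queue lets the reader stay up to `depth` docs ahead of the
    # mapper, so the mapper always has a doc ready to work on.
    pending = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def reader():
        for doc in docs:
            pending.put(doc)
        pending.put(done)

    feeder = threading.Thread(target=reader)
    feeder.start()
    rows = []
    while True:
        doc = pending.get()
        if doc is done:
            break
        rows.extend(map_doc(doc))
    feeder.join()
    return rows


docs = [{"_id": str(i), "type": "a" if i % 2 else "b"} for i in range(1000)]
assert build_pipelined(docs) == build_lockstep(docs)
```

Both functions produce the same rows in the same order; the only difference is whether the next document is already waiting when the mapper finishes the current one.  The queue depth of 64 is an arbitrary illustrative value.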

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (COUCHDB-620) Generating views is extremely slow - makes CouchDB hard to use with non-trivial number of docs

2010-01-11 Thread Roger Binns (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798962#action_12798962
 ] 

Roger Binns commented on COUCHDB-620:
-

The objective criterion is that the CPU and/or I/O is saturated, or at least 
close to saturation.  (Doing anything less is effectively adding gratuitous 
delays.  Operating systems can easily prioritize tasks once saturation has been 
achieved, and that is a better way of dealing with the issue.  
For example, would you add gratuitous delays to a C compiler so it has less 
impact on a machine?)

I tried current svn trunk to see what the state is with my data set (10 million 
documents).

Generating the view now takes about 75 minutes, whereas before it took 4 hours. 
 (The machine was also upgraded this weekend from two to four cores, 2.8GHz 
speed to 3.6GHz and double the disk bandwidth - new drives and striping, so the 
numbers are not strictly comparable.)

During the view generation the CouchDB process took between 70% and 130% of a 
CPU, usually in the 110% range.  The couchjs process was hovering around 25% of 
a CPU.  Using iostat I could see between 5 and 30% of disk utilization, usually 
closer to 30%.  (iostat showed there was still plenty of disk access available.)

Quite simply, view generation still takes a very long time but is somewhat more 
tolerable.  About 2 1/2 cores of my machine sat idle during this time, while the 
disk was idle 66% of the time.  Trunk is consequently an improvement but nowhere 
near as good as it could be.
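
As a rough check of the figures above, the idle capacity can be worked out from the typical top readings.  The helper name idle_cores is made up for this illustration, and the inputs are the "usually" values quoted in this comment rather than new measurements.

```python
def idle_cores(total_cores, *cpu_shares):
    # Each share is a fraction of one core as reported by top (1.10 == 110%).
    return total_cores - sum(cpu_shares)


cores = 4
couchdb_share = 1.10  # "usually in the 110% range"
couchjs_share = 0.25  # "hovering around 25% of a CPU"

# About 2.65 of the 4 cores are unused, in line with "2 1/2 cores sat idle".
print(round(idle_cores(cores, couchdb_share, couchjs_share), 2))

# 4 hours down to 75 minutes is a 3.2x speedup, but on faster hardware,
# so the two runs are not strictly comparable.
print(round(240 / 75, 1))
```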

I think this ticket should be re-opened, with the word 'extremely' removed.

To understand why I care so much, my views are my document access.  That is how 
the documents are found - look for strings in the view.  If the string is not 
found then the document may as well not exist.  Except the documents do 
reference each other, so while view generation is happening it is possible to 
have inconsistencies - the view claims a document doesn't exist while another, 
linked document says that it does.

And since views are generated on each machine after replication I can't incur 
the generation overhead on one machine and then replicate the results.



[jira] Commented: (COUCHDB-620) Generating views is extremely slow - makes CouchDB hard to use with non-trivial number of docs

2010-01-11 Thread Adam Kocoloski (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798736#action_12798736
 ] 

Adam Kocoloski commented on COUCHDB-620:


So, when a public version of sharding lands, oversharding (hosting multiple 
partitions of a DB on a single server) will be a useful way to keep servers busy.  
I've seen view builds on oversharded Cloudant machines push EC2 ephemeral disk 
utilization above 95% and CPU utilization (including IOwait) above 180% on an 
m1.large.
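
A minimal sketch of the oversharding idea, under stated assumptions: this is not Cloudant or CouchDB code, the names shard_of, partition, build_view and build_sharded are hypothetical, the "city" field is invented for the example, and Python threads merely stand in for independent partitions that would each run their own view build on a real server.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def shard_of(doc_id, n_shards):
    # A stable hash of the document id decides which partition owns the doc.
    return zlib.crc32(doc_id.encode("utf-8")) % n_shards


def partition(docs, n_shards):
    shards = [[] for _ in range(n_shards)]
    for doc in docs:
        shards[shard_of(doc["_id"], n_shards)].append(doc)
    return shards


def build_view(docs):
    # Stand-in for one partition's view build: sorted (key, doc id) rows.
    return sorted((doc["city"], doc["_id"]) for doc in docs)


def build_sharded(docs, n_shards=4):
    # Each partition builds its own index concurrently; results are merged
    # into one sorted list at the end.
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        parts = list(pool.map(build_view, partition(docs, n_shards)))
    return sorted(row for part in parts for row in part)


docs = [{"_id": "doc%d" % i, "city": "city%d" % (i % 7)} for i in range(500)]
assert build_sharded(docs) == build_view(docs)
```

The merged result matches a single unsharded build, so sharding changes only how the work is spread out, not what the view contains.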




[jira] Commented: (COUCHDB-620) Generating views is extremely slow - makes CouchDB hard to use with non-trivial number of docs

2010-01-11 Thread Paul Joseph Davis (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798734#action_12798734
 ] 

Paul Joseph Davis commented on COUCHDB-620:
---

Much of CouchDB is written specifically so it does not saturate a single 
server's resources. That was done quite on purpose so as to not peg an entire 
server that may be serving clients. Granted, view generation is still slowish, 
though getting better.

I'm becoming more convinced that to get to the truly fast view generation that 
everyone wants we're going to need to change some pretty fundamental things in 
view storage and generation. I don't think that's necessarily bad if it can be 
done simply, but it's just not something anyone's gone and tried AFAIK. I was 
looking at a couple different storage things but nothing was stable enough for 
me to even do the work to try integrating.

Also, patches welcome.




[jira] Commented: (COUCHDB-620) Generating views is extremely slow - makes CouchDB hard to use with non-trivial number of docs

2010-01-11 Thread Roger Binns (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798711#action_12798711
 ] 

Roger Binns commented on COUCHDB-620:
-

I currently have a 4 core machine (and had 2 cores until a few weeks ago).  Less 
than one core in total is used between couchjs and beam.smp (erlang).  Even if 
the changes in COUCHDB-495 make it go twice as fast, the performance is still 
abysmal - I'd still be measuring view generation time in hours and I'd still 
have mostly idle CPU and I/O.  Even running one fewer couchjs than there are 
cores would be fine if you are that worried.  (The ideal number would depend on 
the division of labour during view generation between erlang and couchjs.)

In summary, please saturate something - CPU, RAM or I/O - when doing view 
generation.  Anything less than that means that the full resources of the 
machine are not being used and view generation takes unnecessarily long to 
complete.  I'd really like to stop measuring the time taken in hours.




[jira] Commented: (COUCHDB-620) Generating views is extremely slow - makes CouchDB hard to use with non-trivial number of docs

2010-01-11 Thread Adam Kocoloski (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798691#action_12798691
 ] 

Adam Kocoloski commented on COUCHDB-620:


Hi Roger, some of this work has already been implemented in trunk.  In 
particular, the next release uses separate work queues for mapping and reducing 
documents, so couchjs spends much less time idle.

I'm not sure adding extra couchjs instances will be much of a win in general, 
as for many simple views couchjs can more than keep up with what the DB feeds 
it, and JS engines are only getting faster.  There's also a balancing act 
between handing cores to beam.smp and handing them to couchjs.  Erlang can 
certainly make use of multiple cores when the DB is experiencing a bunch of 
concurrent requests, and I worry that all the context switches involved with 1 
couchjs/core might actually hurt performance.  I'd love to see a patch and some 
benchmarks, though.

See COUCHDB-495 for some of the view server performance work.
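
For readers unfamiliar with the idea of separate work queues for mapping and reducing, here is a minimal sketch.  It is an assumption-laden illustration, not the trunk implementation: the names mapper, reducer and build are hypothetical, the view simply counts documents per "type", and Python threads stand in for the erlang processes involved.

```python
import queue
import threading
from collections import Counter

DONE = object()  # sentinel marking the end of the mapped rows


def mapper(docs, map_q):
    # Map stage: emit one (key, value) row per doc onto the map queue.
    for doc in docs:
        map_q.put((doc["type"], 1))
    map_q.put(DONE)


def reducer(map_q, totals):
    # Reduce stage: consume rows as they arrive instead of waiting for
    # the whole map phase to finish.
    while True:
        item = map_q.get()
        if item is DONE:
            break
        key, value = item
        totals[key] += value


def build(docs, depth=100):
    map_q = queue.Queue(maxsize=depth)
    totals = Counter()
    workers = [
        threading.Thread(target=mapper, args=(docs, map_q)),
        threading.Thread(target=reducer, args=(map_q, totals)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return dict(totals)


assert build([{"type": "a"}, {"type": "b"}, {"type": "a"}]) == {"a": 2, "b": 1}
```

Because the two stages are decoupled by a queue, neither has to sit idle waiting for the other to finish a full pass.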




[jira] Commented: (COUCHDB-620) Generating views is extremely slow - makes CouchDB hard to use with non-trivial number of docs

2010-01-11 Thread Roger Binns (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798612#action_12798612
 ] 

Roger Binns commented on COUCHDB-620:
-

There is no parallelizing at all.  My comment does say that :-)

Although doing it in parallel (or by number of CPU cores) will improve things, 
there is still lots more to be done.  (Going from 4 hours to 2 hours is still 
far too long.)  The various latencies add up a lot.  While CouchDB is 
considering which document to get view info for next, the couchjs process is 
sitting there idle.  While couchjs is processing a doc, CouchDB sits there 
idle.  Each side doing work while the other side is also working will 
significantly reduce the latencies and increase document processing throughput.
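
To put numbers on how the latencies add up, the per-document cost can be derived from the totals in the issue description (10 million documents, about 4 hours to build the views, under 10 minutes to insert).  The helper name per_doc_ms is made up for this illustration.

```python
def per_doc_ms(total_seconds, n_docs):
    # Average wall-clock time spent per document, in milliseconds.
    return total_seconds * 1000.0 / n_docs


N_DOCS = 10_000_000

view_ms = per_doc_ms(4 * 3600, N_DOCS)     # 4 hours of view generation
halved_ms = per_doc_ms(2 * 3600, N_DOCS)   # the "2 hours" case
insert_ms = per_doc_ms(10 * 60, N_DOCS)    # 10 minutes to insert the docs

print(view_ms, halved_ms, insert_ms)  # 1.44 0.72 0.06
```

So roughly 1.4 ms per document is enough to turn into hours at this scale, and even the halved figure is an order of magnitude more than the per-document cost of inserting.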




[jira] Commented: (COUCHDB-620) Generating views is extremely slow - makes CouchDB hard to use with non-trivial number of docs

2010-01-10 Thread Dirkjan Ochtman (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798602#action_12798602
 ] 

Dirkjan Ochtman commented on COUCHDB-620:
-

I was recently thinking about this and was wondering if CouchDB tries to 
parallelize view indexing across different CPUs at all. It seems the whole 
map-reduce paradigm was invented exactly to make it easy to run concurrently.
