Late to the discussion, but here's my 2 cents: depending on your virtualization software, disk accesses can suck. On a "hosted" hypervisor, you have to rely on the host to schedule your disk accesses. Disk I/O is scheduled in the guest, potentially goes through an emulation layer in the hypervisor, and is then scheduled again in the host. Furthermore, there can be significant latency switching between the host and the guest. If the disk accesses are small and random, this can cause the slowdown you are observing. Finally, your guest is not always scheduled in, since to the host it's just like any other process, so the guest gets less CPU time than it would on real hardware, which adds to the total wall-clock time of the computation.
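
If you want a rough number for the "small and random" disk access case, something like the Python sketch below can be run once inside the guest and once on real hardware. It is only an illustration: the helper names (time_random_reads, make_test_file) are made up for this example, it says nothing about how CouchDB itself touches the disk, and the OS page cache will flatter the results unless the test file is much larger than RAM.

```python
import os
import random
import tempfile
import time


def time_random_reads(path, block_size=4096, count=1000, seed=0):
    """Return the seconds taken to read `count` blocks at random offsets."""
    size = os.path.getsize(path)
    if size < block_size:
        raise ValueError("file is smaller than one block")
    rng = random.Random(seed)  # fixed seed so runs are comparable
    last_block = (size - block_size) // block_size
    start = time.perf_counter()
    # buffering=0 avoids Python-level buffering; the OS cache still applies.
    with open(path, "rb", buffering=0) as f:
        for _ in range(count):
            f.seek(rng.randint(0, last_block) * block_size)
            f.read(block_size)
    return time.perf_counter() - start


def make_test_file(size_bytes):
    """Create a throwaway file of random bytes and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size_bytes))
    return path


if __name__ == "__main__":
    path = make_test_file(16 * 1024 * 1024)
    try:
        elapsed = time_random_reads(path)
        print(f"1000 random 4 KiB reads: {elapsed * 1000:.1f} ms")
    finally:
        os.remove(path)
```

Comparing the two timings (same file size, same seed) gives a first hint of how much the hypervisor's I/O path costs for this access pattern, though a real view build is the benchmark that counts.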

I'm not saying that virtualization sucks, as it has many important uses (e.g. VMotion), and some of these issues may be mitigated with proper paravirtualization, but in the end you should still run benchmarks to see if your workload is suited to the hypervisor you are considering.

On Tue, Jul 8, 2008 at 6:53 AM, Brad King <[EMAIL PROTECTED]> wrote:
> Following up on this. After moving to real hardware my view index time
> for the same data set dropped from 25 minutes to 6 minutes, so
> definitely was a factor. If there any other optimizations I can make
> I'd love to know what they are. Thanks.
>
> On Thu, Jul 3, 2008 at 9:35 AM, Brad King <[EMAIL PROTECTED]> wrote:
>> That would be fantastic, but it sounds like other users are seeing
>> performance similar to what I see. When you say tuning and
>> optimizations, are you talking about code changes in future versions
>> of couchdb or parameters we can change now? VM is definitely a
>> variable. I probably should try this out on real hardware too and
>> compare.
>>
>> On Wed, Jul 2, 2008 at 7:30 PM, Damien Katz <[EMAIL PROTECTED]> wrote:
>>> This sounds really slow, like somethings wrong. 25 minutes to process 300k
>>> means ~500 docs sec, or each document takes 2ms. That's a really long time
>>> CPU wise.
>>>
>>> Assuming it's not another VM bug, we should be able about to get that down
>>> to under minute with some tuning, and probably closer to 10 secs after
>>> serious optimizations.
>>>
>>> -Damien
>>>
>>>
>>> On Jul 2, 2008, at 6:28 PM, Chris Anderson wrote:
>>>
>>>> On Wed, Jul 2, 2008 at 3:08 PM, Paul Davis <[EMAIL PROTECTED]>
>>>> wrote:
>>>>>
>>>>> I'd have to go back and double check, but off the top of my head 25
>>>>> min for 300K docs seems about like what I was getting. Ie, not orders
>>>>> of magnitude slower or anything.
>>>>
>>>> In my experience, views generate about 1/2 as fast as that, if not
>>>> more slowly. My views are often quite complex with a lot of internal
>>>> looping and multiple emits, so that probably explains it. In short,
>>>> the times you're reporting seem reasonable.
>>>>
>>>> The bottleneck (based on my extremely unscientific use of top) doesn't
>>>> seem to be the view server, but rather CouchDB's beam process, which
>>>> as I understand it, is busy sorting the results as they come back from
>>>> the view server. So the quickest route to parallelizing this may be to
>>>> manually partition your data across CouchDB instances, generate the
>>>> views, and query them in parallel, merging the results in your
>>>> application.
>>>>
>>>> I don't actually plan to do all that work until my insert rate
>>>> eclipses CouchDB's view generation speed. :)
>>>>
>>>> Once upon a time there was a feature to return the available results
>>>> of a view, even while generation is still occurring. The feature has
>>>> fallen by the wayside, and it would be non-trivial to turn it back on,
>>>> according to Damien on IRC. Maybe if it would be useful to enough
>>>> people, we'll see it again.
>>>>
>>>> --
>>>> Chris Anderson
>>>> http://jchris.mfdz.com
