On Jul 2, 2008, at 16:17, Brad King wrote:

Just to post some results here from working with around 300K docs. I
changed the view to emit only the doc ID, and index time went down to
about 25 minutes versus an hour for the same dataset.
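
Roughly, the change amounted to dropping the document from the emitted
value (the doc ID still comes back with every view row anyway),
something like:

  // before: copies the whole document into the view
  function(doc) { emit(doc.entityobject.SKU, doc); }

  // after: emit only the key; fetch the document by its ID later if needed
  function(doc) { emit(doc.entityobject.SKU, null); }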

I then converted the largest text field to an attachment, and things
went downhill from there. I deleted the db and started the upload, but
repeatedly got random 500 server errors with no real way to know what
is happening or why. Also, the DB size as reported by Futon seemed to
fluctuate wildly as I was adding documents. And I mean wildly, like
anywhere from 1.2G and then back down to 144M. Weird. I don't get a
very warm fuzzy feeling about the stability of using attachments right
now. Ideally, I don't want to use them anyway; I'd prefer to have the
fields all inline and have the database handle these docs as-is. I
don't see these as huge documents (2 to 5K) compared to what I would
store in something like Berkeley DB XML, just for comparison's sake,
so I'm hoping it's a goal of the project to handle these effectively,
even when several million documents are added.
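
For what it's worth, the conversion itself was nothing exotic; the HTML
field just moved into the inline _attachments structure, roughly like
this (names only for illustration):

  {
    "entityobject": { "SKU": "..." },
    "_attachments": {
      "description.html": {
        "content_type": "text/html",
        "data": "<base64-encoded HTML body>"
      }
    }
  }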

This doesn't sound right at all. Can you make sure you use the
very latest SVN version or the 0.8 release, and completely
new databases? Also, just to clarify: do you emit the doc into
the view payload, as in emit(doc._id, doc)? Or are you just doing
emit(null, null) to get only the doc IDs that matter to you and
then fetch the documents later? I have had the latter setup running
without any problems across ~2 million documents in a database.
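
In code, that second setup is simply:

  // emit no payload at all; every view row still carries the doc's _id,
  // so the matching documents can be fetched afterwards as needed
  function(doc) { emit(null, null); }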


As always, thanks for the help.

Thanks for the problem report.

Cheers
Jan
--





On Tue, Jul 1, 2008 at 9:26 AM, Brad King <[EMAIL PROTECTED]> wrote:
Thanks for the tips. I'll start scaling back the data I'm returning
and see if it improves. The largest field is an HTML description of an
inventory item, which seems like a good candidate for a binary
attachment, but I need to be able to do full-text searches on this
data eventually (hopefully with the Lucene integration), so I'll
probably start by just not including the document data in the views.
We've had some success with Lucene independent of CouchDB, so I'm
pleased you guys are integrating this.

On Sat, Jun 21, 2008 at 8:39 AM, Damien Katz <[EMAIL PROTECTED]> wrote:
Part of the problem is that you are storing copies of the documents in
the btree. If the documents are big, it takes longer to compute on them,
and if the results (emit(...)) are big or numerous, then you'll be
spending most of your time in I/O.

My advice is to not emit the document into the view and, if you can, to
get the documents smaller in general. If the data can be stored as a
binary attachment, then that too will give you a performance improvement.

-Damien

On Jun 20, 2008, at 4:51 PM, Brad King wrote:

Thanks, yes, it's currently at 357M and growing!

On Fri, Jun 20, 2008 at 4:49 PM, Chris Anderson <[EMAIL PROTECTED]> wrote:

Brad,

You can look at

ls -lha /usr/local/var/lib/couchdb/.my-dbname_design/

to see the view size growing...

It won't tell you when it's done, but it will give you hope that
progress is happening.

Chris

On Fri, Jun 20, 2008 at 1:45 PM, Brad King <[EMAIL PROTECTED]> wrote:

I have about 350K documents in a database, typically around 5K each. I
created and saved a view which simply looks at one field in the
document. I called the view for the first time with a key that should
match only one document, and it's been awaiting a response for about
45 minutes now.

{
  "sku": {
    "map": "function(doc) { emit(doc.entityobject.SKU, doc); }"
  }
}

Is this typical, or is there some optimizing to be done on either my
view or the server? I'm also running on a VM, so that may have some
effect, but smaller databases seem to be performing pretty well.
Insert times to set this up were actually really good, I thought, at
4,000 to 5,000 documents per minute running from my laptop.




--
Chris Anderson
http://jchris.mfdz.com





