Re: View Filter
On Wed, May 13, 2009 at 09:41:29AM -0500, Zachary Zolton wrote:

> So, this sounds like a big win for those who like to store many document types in the same database with a type discriminator field.

... but only if all views in the same design doc are filtered by the same set of types. That is, you can only use it to exclude documents which are not used by *any* view. Therefore the benefit is for:

(1) people who are storing large documents in CouchDB but not indexing them at all (I guess this is possible, e.g. if the doc ids are well-known or stored in other documents, but this isn't the most common way of working)

(2) people who have a separate design document for each type of object. They would most likely get the same or better performance benefit from a single design document holding all their views.

I also think there are other pinch-points in view generation which need working on, although perhaps none is as quick a win as this one. For example, on my old Thinkpad X30 (mobile P3 1.2GHz), I can insert a set of 1300 documents in ~2 secs using _bulk_docs, yet the first view request (generating ~6000 keys) takes around 35 seconds to respond.

Regards, Brian.
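Brian's point (2), consolidating views into one design document so each document is passed to the view server once and indexed by every view in a single pass, can be sketched as follows. This is a hypothetical illustration; the design doc id, view names, and 'type' values are made up:

```python
import json

# Hypothetical sketch: one design document holding views for several
# document types. Because all views live in the same design doc, each
# document goes to the view server once and is indexed by every view
# in the same pass. The id, view names, and types are invented.
design_doc = {
    "_id": "_design/app",
    "views": {
        "posts_by_date": {
            "map": "function(doc) { if (doc.type == 'post') emit(doc.date, null); }"
        },
        "users_by_name": {
            "map": "function(doc) { if (doc.type == 'user') emit(doc.name, null); }"
        },
    },
}

# The JSON body you would PUT to /dbname/_design/app:
body = json.dumps(design_doc)
```

Splitting these views across two design docs would instead send every document to the view server once per design doc.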
Allow overridden request methods
That's a great point, DELETE does often get ignored. I like the idea of having a reserved property in json, but it still relies on your ability to push json to couchdb. Somehow the core REST interface should allow this as well. (Once again, sorry if this doesn't fit in the thread right, I sent a message to the list admin, so hopefully I can get this worked out :P )
Re: Allow overridden request methods
On Thu, May 14, 2009 at 06:45:28AM -0500, Jared Scheel wrote: That's a great point, DELETE does often get ignored. I like the idea of having a reserved property in json, but it still relies on your ability to push json to couchdb. So, you want to be able to replace a PUT with a POST, in some part of the API where the body is neither JSON nor a HTML FORM upload. The only example I can think of is using PUT to upload an attachment, and I thought there was already a POST multipart/form-data alternative for that. What else have I forgotten?
Re: Allow overridden request methods
On Thu, May 14, 2009 at 07:08:21AM -0500, Jared Scheel wrote: Hmm, didn't think about that. I guess you will need to have some kind of request body anyways. Do you think that the request method should be set in the data though, or should it be set in the header, since the method/header defines how the data should be interpreted? I know that's just splitting hairs, but there was some concern about keeping the api as clean as possible. I think the whole point is to support minimal HTTP clients. If they don't support PUT or DELETE, there's a good chance they don't support custom HTTP headers either.
Re: Allow overridden request methods
Does anyone have an example of an HTTP client that doesn't support changing or adding headers? I certainly can't think of any. Limiting request methods is one thing, but not allowing you to modify headers is pretty crippling.

-Mikeal

On May 14, 2009, at 6:29 AM, Brian Candler wrote:

> On Thu, May 14, 2009 at 07:08:21AM -0500, Jared Scheel wrote:
>
>> Hmm, didn't think about that. I guess you will need to have some kind of request body anyways. Do you think that the request method should be set in the data though, or should it be set in the header, since the method/header defines how the data should be interpreted? I know that's just splitting hairs, but there was some concern about keeping the api as clean as possible.
>
> I think the whole point is to support minimal HTTP clients. If they don't support PUT or DELETE, there's a good chance they don't support custom HTTP headers either.
Re: chunkify profiling (was Re: Patch to couch_btree:chunkify)
Hi Paul,

On May 13, 2009, at 4:01 PM, Paul Davis wrote:

> Adam, No worries about the delay. I'd agree that the first graph doesn't really show much more than *maybe* we can say the patch reduces the variability a bit.

Agreed. The variance on the wallclock measurements is frustrating. I started a thread about CPU time profiling on erlang-questions, as it seems to be automatically disabled on Linux and not even implemented on OS X. I think wallclock time is generally the right way to go, but if we're testing two implementations of an algorithm that doesn't touch the disk then CPU time seems like the better metric to me.

> On the second graph, I haven't the faintest why that'd be as such. I'll have to try and set up fprof and see if I can figure out what exactly is taking most of the time.

I should clean up and post the code I used to make these measurements. I wanted to profile just the couch_db_updater process, so I disabled hovercraft's delete/create DB calls at the start of lightning() and did them in my test code (so the updater PID didn't change on me). Here's the fprof analysis from a trunk run -- in this case I believe it was 10k docs @ 1k bulk:

http://friendpaste.com/16xwiZuqwWrqYXQeaBS0fx

There's a lot of detail there, but if you stare at it awhile a few details start to emerge:

- 11218 ms total time spent in the emulator (line 13)
- 10028 ms in couch_db_updater:update_docs_int/3 and funs called by it (line 47)
- 3682 ms in couch_btree:add_remove/3 (line 70)
- 1941 ms in couch_btree:lookup/2 (line 104)
- 1262 ms in couch_db_updater:flush_trees/3 (line 265)
- 951 ms in couch_db_updater:merge_rev_trees/7 (line 322)
- 910 ms in couch_db_updater:commit_data/2 (line 330)

and so on. Each of those numbers is an ACC that includes all the time spent in functions called by that function. The five functions I listed have basically zero overlap, so I'm not double-counting. You can certainly drill deeper and see which functions take the most time inside add_remove/3, etc.
> Perhaps we're looking at the wrong thing by reducing term_to_binary. You did say that most of the time was spent in size/1 as opposed to term_to_binary the other day, which is hard to believe at best.

Agreed, that's pretty difficult to believe. Here's what I saw -- I defined a function

    chunkify_sizer(Elem, AccIn) ->
        Size = size(term_to_binary(Elem)),
        {{Elem, Size}, AccIn + Size}.

Here's the profile for that function:

    %                                         CNT    ACC      OWN
    {[{{couch_btree,'-chunkify/1-fun-0-',2}, 22163, 268.312, 154.927}],
     { {couch_btree,chunkify_sizer,2},       22163, 268.312, 154.927},
     [{{erlang,term_to_binary,1},            22163, 108.671, 108.671},
      {garbage_collect,                          9,   4.714,   4.714}]}.

60% of the time is spent in the body of chunkify_sizer, and only 40% in term_to_binary. I've never seen size/1 show up explicitly in the profile, but if I define a dummy wrapper

    get_size(Bin) ->
        erlang:size(Bin).

then get_size/1 will show up with a large OWN time, so I conclude that any time spent sizing a binary gets charged to OWN.

You know BERT better than I do -- you said the size of a binary is stored in its header, correct?

> I'll put this on the weekend agenda. Until I can show that it's consistently faster I'll hold off. For reference, when you say 2K docs in batches of 1K, did you mean 200K?

No, I meant 2k (2 calls to _bulk_docs). 200k would have generated a multi-GB trace and I think fprof:profile() would have melted my MacBook processing it. YMMV ;-)

> Also, note to self, we should check speeds for dealing with uuids too to see if the non-ordered mode makes a difference.

Agreed. At the moment fprof seems much better suited to identifying hot spots in code than comparing alternative implementations of a function. Best thing I've come up with so far is comparing ratios of time spent in the function (as in Figure 2 above).

> Paul
>
> On Wed, May 13, 2009 at 3:33 PM, Adam Kocoloski kocol...@apache.org wrote:
>
>> Sorry for the delay on this front. I ran hovercraft:lightning 20 times each with and without Paul's patch. Each run inserted 2k docs in batches of 1000. Here are two plots showing the effect of the patch:
>>
>> http://dl.getdropbox.com/u/237885/insert_rate.png
>> http://dl.getdropbox.com/u/237885/chunkify_fraction.png
>>
>> The first plot histograms the insert rate for the two scenarios*. I don't really see much of a difference. The second plot uses fprof to plot the fraction of time the couch_db_updater process spent in chunkify and any functions called by chunkify. For those familiar with fprof, it's the ratio of ACC for couch_btree:chunkify/2 divided by OWN for the updater pid. If fprof is to be believed, the trunk code is almost 2x faster.
>>
>> Adam
>>
>> * the reason the insert rate is so low is because fprof
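For readers without the Erlang source handy: couch_btree:chunkify splits a list of btree elements into size-bounded chunks, which is why it needs the per-element serialized size that chunkify_sizer computes. A loose Python analogue of the idea follows; the JSON serialization and threshold here are illustrative only, since CouchDB actually sizes elements with size(term_to_binary(Elem)) and its own internal threshold:

```python
import json

def chunkify(items, threshold=1024):
    """Split items into chunks whose serialized size stays near
    `threshold` bytes. Illustrative analogue of couch_btree:chunkify;
    the real code measures size(term_to_binary(Elem)) per element."""
    chunks, current, current_size = [], [], 0
    for item in items:
        item_size = len(json.dumps(item).encode("utf-8"))
        # Start a new chunk once the current one would exceed the threshold.
        if current and current_size + item_size > threshold:
            chunks.append(current)
            current, current_size = [], 0
        current.append(item)
        current_size += item_size
    if current:
        chunks.append(current)
    return chunks
```

The per-element serialization call is the hot spot being profiled above, since it runs once for every element of every node being rewritten.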
Re: chunkify profiling (was Re: Patch to couch_btree:chunkify)
> You know BERT better than I do -- you said the size of a binary is stored in its header, correct?

I'm not sure now. It may only get length information when being sent across the wire.

>> I'll put this on the weekend agenda. Until I can show that it's consistently faster I'll hold off. For reference, when you say 2K docs in batches of 1K, did you mean 200K?
>
> No, I meant 2k (2 calls to _bulk_docs). 200k would have generated a multi-GB trace and I think fprof:profile() would have melted my MacBook processing it. YMMV ;-)

I thought you knew the guys at CERN ;) Thanks for writing this up, and do please post the code somewhere. This weekend I'll take a bit of time to see if I can weasel anything better out of the fprof stuff.

Paul Davis
Re: Ordering of keys into reduce function
On Thu, May 14, 2009 at 9:30 AM, Brian Candler b.cand...@pobox.com wrote: On Wed, May 13, 2009 at 10:08:05AM -0400, Paul Davis wrote: (*) An alternative would be to do two queries: startkey=aaa&limit=1 and endkey=bbb&limit=1&descending=true. I would like to avoid two queries, and I'd also like this functionality for group_level=n, such that within each group I know the minimum and maximum key. You mean the minimum and maximum value? No, I mean the minimum and maximum keys (in some key range aaa to bbb). Then emit your keys as values too :D
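The "emit your keys as values too" suggestion means a reduce function can track the smallest and largest key per group. A hypothetical sketch of such a reduce, modeled on CouchDB's (keys, values, rereduce) calling convention but written in Python rather than as a Javascript view for illustration:

```python
def minmax_reduce(keys, values, rereduce=False):
    """Sketch: track the minimum and maximum key seen, assuming the
    map function emitted each key as its value too, i.e.
    emit(doc.key, doc.key). On rereduce, values is a list of
    previously computed [min, max] pairs."""
    if rereduce:
        flat = [v for pair in values for v in pair]
    else:
        flat = values
    return [min(flat), max(flat)]
```

With group_level=n, each group's reduce value would then be the [min, max] key pair for that group, avoiding the two bracketing queries.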
[jira] Commented: (COUCHDB-321) Futon breaks when used with a reverse proxy
[ https://issues.apache.org/jira/browse/COUCHDB-321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709473#action_12709473 ]

Jack Moffitt commented on COUCHDB-321:
--------------------------------------

Mike, that looks great. Perhaps you could attach a modified patch that moves the logic to $.ajaxSetup?

> Futon breaks when used with a reverse proxy
> -------------------------------------------
>
> Key: COUCHDB-321
> URL: https://issues.apache.org/jira/browse/COUCHDB-321
> Project: CouchDB
> Issue Type: Bug
> Components: Administration Console
> Affects Versions: 0.9
> Environment: Affects all platforms.
> Reporter: Jack Moffitt
> Priority: Minor
> Attachments: futon.patch
>
> It is often convenient to reverse proxy CouchDB at a URL like /couch. Unfortunately, while CouchDB will work perfectly in this situation, Futon cannot handle it, as jquery.couch.js uses absolute URLs to access all CouchDB functions. I've attached a small patch that fixes this problem by:
> 1. Adding a urlPrefix attribute to $.couch which it uses to construct its URLs.
> 2. Adding logic to the futon.js constructor that figures out a suitable prefix and sets $.couch.urlPrefix to use this.
> Any client code that makes use of $.couch will need to do something similar. Since only the application and the administrator will know what the prefix should be or how to deduce it, I didn't really know of a better way to handle this.

--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
Re: View Filter
On Thu, May 14, 2009 at 09:53:14AM -0500, Zachary Zolton wrote:

>> (1) people who are storing large documents in CouchDB but not indexing them at all (I guess this is possible, e.g. if the doc ids are well-known or stored in other documents, but this isn't the most common way of working)

The proposal would exclude a document from *all* views in a particular design doc. So you're only going to get a benefit from this if you have a large number of documents (or a number of large documents) which are not required to be indexed in any view in that design doc.

I do agree, though, that only being able to filter at the design doc level limits the utility of view filtering. And it's reasonable, given that (as I understand it) each document is already only passed once to the view server, in order to be indexed by all the views in that design document.

> Given that a design doc is supposed to be an application's view of the database, would we want to encourage folks to make a different design doc for each type of data they store in the database? My gut says one design doc per application, but I could be all mixed up!

I have been ending up with views which run across *all* the documents in a database - for example, a generic search box which lets the user type in a search term and hit any matching type of object. Having a single design document holding all my views means that each document only needs to be sent once to the view server.

Regards, Brian.
Re: Allow overridden request methods
On Thu, May 14, 2009 at 07:23:50AM -0700, Mikeal Rogers wrote:

> Does anyone have an example of an HTTP client that doesn't support changing or adding headers?

I was thinking of a web browser, with Javascript disabled, submitting a FORM. But the other point is right - you can have a FORM which POSTs to _bulk_docs.
Re: Re: Allow overridden request methods
Sorry guys, I'm really new to CouchDB, but it seems like _bulk_docs is side-stepping the issue by using something in a way it wasn't really intended to be used. Shouldn't there be some fairly simple and decoupled way to set the proper request method if you are limited by your client? It seems easy to say: if your client doesn't support a request method, simply POST the request and set the x-http-method-override header to the real request method you want to use. This way you don't have to change anything else in your process; you just have to add the appropriate header.

Thanks!

-Jared

On May 14, 2009 3:24pm, Brian Candler b.cand...@pobox.com wrote:

> On Thu, May 14, 2009 at 07:23:50AM -0700, Mikeal Rogers wrote:
>
>> Does anyone have an example of an HTTP client that doesn't support changing or adding headers?
>
> I was thinking of a web browser, with Javascript disabled, submitting a FORM. But the other point is right - you can have a FORM which POSTs to _bulk_docs.
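As a sketch of what the override header proposal would look like from a client (note: CouchDB does not implement this header as of this thread, and the URL, database, and revision below are made up):

```python
import json
import urllib.request

# Hypothetical: tunnel a DELETE through POST for a client that can
# only POST. A server honouring X-HTTP-Method-Override would treat
# this request as DELETE /mydb/some_doc?rev=1-abc. CouchDB does not
# implement this header; the URL and revision are invented.
req = urllib.request.Request(
    "http://127.0.0.1:5984/mydb/some_doc?rev=1-abc",
    data=json.dumps({}).encode("utf-8"),
    method="POST",
    headers={"X-HTTP-Method-Override": "DELETE"},
)
# The request is only constructed here, not sent.
```

The rest of the client code stays unchanged; only the header is added, which is the decoupling Jared is after.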
Re: View Filter
Moreover, many of my attempts to have different types of docs in one database (for joins, etc) have ended up with my moving them into separate databases. It's been pretty easy (most of the time) to do that work in my Ruby code!
Re: View Filter
On 15/05/2009 4:47 AM, Brian Candler wrote:

>> On Thu, May 14, 2009 at 09:53:14AM -0500, Zachary Zolton wrote:
>>
>>> (1) people who are storing large documents in CouchDB but not indexing them at all (I guess this is possible, e.g. if the doc ids are well-known or stored in other documents, but this isn't the most common way of working)
>
> The proposal would exclude a document from *all* views in a particular design doc. So you're only going to get a benefit from this if you have a large number of documents (or a number of large documents) which are not required to be indexed in any view in that design doc.

Yep - and that is the point. Consider Jan's example, where it was filtering on doc['type']. If a database had (say) 10 potential values of 'type', then all filters that only care about a single type will only care about 1 in 10 of those documents.

Taking this to its extreme, we tested Jan's patch on a view which matches very few documents in a large database. Rebuilding that view with a filter was 18 times faster than without the filter. We put this down to the fact the filter managed to avoid the JSON encode/decode step for the vast majority of the docs in the database. IOW, on my test database, 6 minutes is spent before the filters can actually do anything (ie, that is just the JSON processing), whereas using the filter to avoid that JSON step brings it down to 20 seconds. So while not everyone will be able to see such significant speedups, many may find it extremely useful.

> And it's reasonable, given that (as I understand it) each document is already only passed once to the view server, in order to be indexed by all the views in that design document.

I agree there is lots that can and should be done to speed up views that do indeed care about most of the docs - such views spend relatively less time in the JSON encode step and more time in the interpreter.
As an experiment, I ported one of our views that does look at most of the docs from Javascript to an Erlang view, and the performance increase was far more modest (20% maybe). I suspect the Javascript interpreter is faster than Erlang, so I suspect that there will be a level of view complexity where using Javascript *increases* view performance over Erlang, even when factoring in the JSON processing...

Cheers,
Mark
Re: View Filter
Drat... I may have just come from a place where knowing how to keep my doc types in separate databases, and being able to speed up the map-reduce churn of a reduce-with-group query using view filters, would have saved me a TON of work! Urgh... At worst, I'll put it in my blog... :^(

On Thu, May 14, 2009 at 8:25 PM, Mark Hammond skippy.hamm...@gmail.com wrote:

> On 15/05/2009 4:47 AM, Brian Candler wrote:
>
>>> On Thu, May 14, 2009 at 09:53:14AM -0500, Zachary Zolton wrote:
>>>
>>>> (1) people who are storing large documents in CouchDB but not indexing them at all (I guess this is possible, e.g. if the doc ids are well-known or stored in other documents, but this isn't the most common way of working)
>>
>> The proposal would exclude a document from *all* views in a particular design doc. So you're only going to get a benefit from this if you have a large number of documents (or a number of large documents) which are not required to be indexed in any view in that design doc.
>
> Yep - and that is the point. Consider Jan's example, where it was filtering on doc['type']. If a database had (say) 10 potential values of 'type', then all filters that only care about a single type will only care about 1 in 10 of those documents.
>
> Taking this to its extreme, we tested Jan's patch on a view which matches very few documents in a large database. Rebuilding that view with a filter was 18 times faster than without the filter. We put this down to the fact the filter managed to avoid the JSON encode/decode step for the vast majority of the docs in the database. IOW, on my test database, 6 minutes is spent before the filters can actually do anything (ie, that is just the JSON processing), whereas using the filter to avoid that JSON step brings it down to 20 seconds. So while not everyone will be able to see such significant speedups, many may find it extremely useful.
>> And it's reasonable, given that (as I understand it) each document is already only passed once to the view server, in order to be indexed by all the views in that design document.
>
> I agree there is lots that can and should be done to speed up views that do indeed care about most of the docs - such views spend relatively less time in the JSON encode step and more time in the interpreter. As an experiment, I ported one of our views that does look at most of the docs from Javascript to an Erlang view, and the performance increase was far more modest (20% maybe). I suspect the Javascript interpreter is faster than Erlang, so I suspect that there will be a level of view complexity where using Javascript *increases* view performance over Erlang, even when factoring in the JSON processing...
>
> Cheers,
> Mark