Re: View Filter

2009-05-14 Thread Brian Candler
On Wed, May 13, 2009 at 09:41:29AM -0500, Zachary Zolton wrote:
 So, this sounds like a big win for those who like to store many
document types in the same database with a type discriminator field.

... but only if all views in the same design doc are filtered by the same
set of types. That is, you can only use it to exclude documents which are
not used by *any* view. Therefore the benefit is for:

(1) people who are storing large documents in CouchDB but not indexing them
at all (I guess this is possible, e.g. if the doc ids are well-known or
stored in other documents, but this isn't the most common way of working)

(2) people who have a separate design document for each type of object.
They would most likely get the same or better performance benefit by having
a single design document with all their views.
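
For what (2) collapses into concretely: a single design document can hold views for several types, each map function filtering on the type discriminator itself. A sketch (field and view names here are illustrative, not from the thread):

```javascript
// Sketch: one design doc holding several type-filtered views.
// Each map function ignores documents of other types.
var designDoc = {
  _id: "_design/app",
  views: {
    posts_by_date: {
      map: function (doc) {
        if (doc.type === "post") emit(doc.date, null);
      }.toString()
    },
    users_by_name: {
      map: function (doc) {
        if (doc.type === "user") emit(doc.name, null);
      }.toString()
    }
  }
};
```

Every document is still sent to the view server once per design doc, which is exactly why a design-doc-level filter can only exclude documents that no view in that doc wants.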

I also think there are other pinch-points in view generation which need
working on, although perhaps they are not as quick wins as this one.

For example, on my old Thinkpad X30 (mobile P3 1.2GHz), I can insert a set
of 1300 documents in ~2 secs using _bulk_docs. However the first view
request (generating ~6000 keys) takes around 35 seconds to respond.

Regards,

Brian.


Allow overridden request methods

2009-05-14 Thread Jared Scheel
That's a great point, DELETE does often get ignored. I like the idea
of having a reserved property in the JSON, but it still relies on your
ability to push JSON to CouchDB. Somehow the core REST interface
should allow this as well. (Once again, sorry if this doesn't fit in
the thread right; I sent a message to the list admin, so hopefully I
can get this worked out :P )


Re: Allow overridden request methods

2009-05-14 Thread Brian Candler
On Thu, May 14, 2009 at 06:45:28AM -0500, Jared Scheel wrote:
 That's a great point, DELETE does often get ignored. I like the idea
 of having a reserved property in json, but it still relies on your
 ability to push json to couchdb.

So, you want to be able to replace a PUT with a POST, in some part of the
API where the body is neither JSON nor an HTML form upload.

The only example I can think of is using PUT to upload an attachment, and I
thought there was already a POST multipart/form-data alternative for that.

What else have I forgotten?


Re: Allow overridden request methods

2009-05-14 Thread Brian Candler
On Thu, May 14, 2009 at 07:08:21AM -0500, Jared Scheel wrote:
 Hmm, didn't think about that. I guess you will need to have some kind
 of request body anyways. Do you think that the request method should
 be set in the data though, or should it be set in the header, since
 the method/header defines how the data should be interpreted? I know
 that's just splitting hairs, but there was some concern about keeping
 the api as clean as possible.

I think the whole point is to support minimal HTTP clients. If they don't
support PUT or DELETE, there's a good chance they don't support custom HTTP
headers either.


Re: Allow overridden request methods

2009-05-14 Thread Mikeal Rogers
Does anyone have an example of an HTTP client that doesn't support
changing or adding headers?

I certainly can't think of any. Limiting request methods is one thing
but not allowing you to modify headers is pretty crippling.


-Mikeal

On May 14, 2009, at 6:29 AM, Brian Candler wrote:


On Thu, May 14, 2009 at 07:08:21AM -0500, Jared Scheel wrote:

Hmm, didn't think about that. I guess you will need to have some kind
of request body anyways. Do you think that the request method should
be set in the data though, or should it be set in the header, since
the method/header defines how the data should be interpreted? I know
that's just splitting hairs, but there was some concern about keeping
the api as clean as possible.


I think the whole point is to support minimal HTTP clients. If they don't
support PUT or DELETE, there's a good chance they don't support custom HTTP
headers either.




Re: chunkify profiling (was Re: Patch to couch_btree:chunkify)

2009-05-14 Thread Adam Kocoloski

Hi Paul,

On May 13, 2009, at 4:01 PM, Paul Davis wrote:


Adam,

No worries about the delay. I'd agree that the first graph doesn't
really show much, other than that *maybe* the patch reduces the
variability a bit.


Agreed.  The variance on the wallclock measurements is frustrating.  I  
started a thread about CPU time profiling on erlang-questions, as it  
seems to be automatically disabled on Linux and not even implemented  
on OS X.  I think wallclock time is generally the right way to go, but  
if we're testing two implementations of an algorithm that doesn't  
touch the disk then CPU time seems like the better metric to me.



On the second graph, I haven't the faintest why that'd be as such.
I'll have to try and setup fprof and see if I can figure out what
exactly is taking most of the time.


I should clean up and post the code I used to make these  
measurements.  I wanted to just profile the couch_db_updater process,  
so I disabled hovercraft's delete/create DB calls at the start of  
lightning() and did them in my test code (so the updater PID didn't  
change on me).


Here's the fprof analysis from a trunk run -- in this case I believe  
it was 10k docs @ 1k bulk:


http://friendpaste.com/16xwiZuqwWrqYXQeaBS0fx

There's a lot of detail there, but if you stare at it awhile a few  
details start to emerge:


- 11218 ms total time spent in the emulator (line 13)
  - 10028 ms in couch_db_updater:update_docs_int/3 and funs called by it (line 47)
    - 3682 ms in couch_btree:add_remove/3 (line 70)
    - 1941 ms in couch_btree:lookup/2 (line 104)
    - 1262 ms in couch_db_updater:flush_trees/3 (line 265)
    -  951 ms in couch_db_updater:merge_rev_trees/7 (line 322)
    -  910 ms in couch_db_updater:commit_data/2 (line 330)

and so on.  Each of those numbers is an ACC that includes all the time  
spent in functions called by that function.  The five functions I  
listed have basically zero overlap, so I'm not double-counting.  You  
can certainly drill deeper and see which functions take the most time  
inside add_remove/3, etc.



Perhaps we're looking at the wrong
thing by reducing term_to_binary. You did say the other day that most of
the time was spent in size/1 as opposed to term_to_binary, which is hard
to believe at best.


Agreed, that's pretty difficult to believe.  Here's what I saw -- I  
defined a function


chunkify_sizer(Elem, AccIn) ->
    Size = size(term_to_binary(Elem)),
    {{Elem, Size}, AccIn+Size}.

Here's the profile for that function

%                                          CNT      ACC       OWN
{[{{couch_btree,'-chunkify/1-fun-0-',2}, 22163, 268.312, 154.927}],
 { {couch_btree,chunkify_sizer,2},       22163, 268.312, 154.927},    %
 [{{erlang,term_to_binary,1},            22163, 108.671, 108.671},
  {garbage_collect,                          9,   4.714,   4.714}]}.


60% of the time is spent in the body of chunkify_sizer, and only 40%  
in term_to_binary.  I've never seen size/1 show up explicitly in the  
profile, but if I define a dummy wrapper


get_size(Bin) ->
    erlang:size(Bin).

then get_size/1 will show up with a large OWN time, so I conclude that  
any time spent sizing a binary gets charged to OWN.


You know BERT better than I do -- you said the size of a binary is  
stored in its header, correct?



I'll put this on the weekend agenda. Until I can show that it's
consistently faster I'll hold off.

For reference, when you say 2K docs in batches of 1K, did you mean  
200K?


No, I meant 2k (2 calls to _bulk_docs).  200k would have generated a  
multi-GB trace and I think fprof:profile() would have melted my  
MacBook processing it.  YMMV ;-)



Also, note to self, we should check speeds for dealing with uuids too
to see if the non-ordered mode makes a difference.


Agreed.  At the moment fprof seems much better suited to identifying  
hot spots in code than comparing alternative implementations of a  
function.  Best thing I've come up with so far is comparing ratios of  
time spent in the function (as in Figure 2 above).



Paul

On Wed, May 13, 2009 at 3:33 PM, Adam Kocoloski kocol...@apache.org wrote:
Sorry for the delay on this front.  I ran hovercraft:lightning 20 times each
with and without Paul's patch.  Each run inserted 2k docs in batches of 1000.
Here are two plots showing the effect of the patch:

http://dl.getdropbox.com/u/237885/insert_rate.png
http://dl.getdropbox.com/u/237885/chunkify_fraction.png

The first plot histograms the insert rate for the two scenarios*.  I don't
really see much of a difference.  The second plot uses fprof to plot the
fraction of time the couch_db_updater process spent in chunkify and any
functions called by chunkify.  For those familiar with fprof, it's the ratio
of ACC for couch_btree:chunkify/2 divided by OWN for the updater pid.  If
fprof is to be believed, the trunk code is almost 2x faster.

Adam

* the reason the insert rate is so low is because fprof 

Re: chunkify profiling (was Re: Patch to couch_btree:chunkify)

2009-05-14 Thread Paul Davis

 You know BERT better than I do -- you said the size of a binary is stored in
 its header, correct?


I'm not sure now. It may only get length information when being sent
across the wire.

 I'll put this on the weekend agenda. Until I can show that its
 consistently faster I'll hold off.

 For reference, when you say 2K docs in batches of 1K, did you mean 200K?

 No, I meant 2k (2 calls to _bulk_docs).  200k would have generated a
 multi-GB trace and I think fprof:profile() would have melted my MacBook
 processing it.  YMMV ;-)

I thought you knew the guys at CERN ;)

Thanks for writing this up and do please post code somewhere. This
weekend I'll take a bit of time to see if I can weasel anything better
out of the fprof stuff.

Paul Davis


Re: Ordering of keys into reduce function

2009-05-14 Thread Paul Davis
On Thu, May 14, 2009 at 9:30 AM, Brian Candler b.cand...@pobox.com wrote:
 On Wed, May 13, 2009 at 10:08:05AM -0400, Paul Davis wrote:
  (*) An alternative would be to do two queries: startkey=aaa&limit=1 and
  endkey=bbb&limit=1&descending=true. I would like to avoid two queries, and
  I'd also like this functionality for group_level=n, such that within each
  group I know the minimum and maximum key.
 

 You mean the minimum and maximum value?

 No, I mean the minimum and maximum keys (in the key range aaa to bbb).


Then emit your keys as values too :D
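
A minimal sketch of that suggestion (hypothetical view code, not from the thread): emit the key as the value too, then reduce down to a [min, max] pair so each group_level=n row reports its own key range:

```javascript
// Map: emit the key as the value too, so the reduce can see it.
// "date" is an illustrative key field.
function map(doc) {
  emit(doc.date, doc.date);
}

// Reduce: collapse values to a [min, max] pair. On rereduce the
// incoming values are themselves [min, max] pairs from earlier passes.
function reduce(keys, values, rereduce) {
  if (rereduce) {
    var min = values[0][0], max = values[0][1];
    for (var i = 1; i < values.length; i++) {
      if (values[i][0] < min) min = values[i][0];
      if (values[i][1] > max) max = values[i][1];
    }
    return [min, max];
  }
  var min = values[0], max = values[0];
  for (var i = 1; i < values.length; i++) {
    if (values[i] < min) min = values[i];
    if (values[i] > max) max = values[i];
  }
  return [min, max];
}
```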


[jira] Commented: (COUCHDB-321) Futon breaks when used with a reverse proxy

2009-05-14 Thread Jack Moffitt (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709473#action_12709473
 ] 

Jack Moffitt commented on COUCHDB-321:
--

Mike, that looks great.  Perhaps you could attach a modified patch that moves 
the logic to $.ajaxSetup?

 Futon breaks when used with a reverse proxy
 ---

 Key: COUCHDB-321
 URL: https://issues.apache.org/jira/browse/COUCHDB-321
 Project: CouchDB
  Issue Type: Bug
  Components: Administration Console
Affects Versions: 0.9
 Environment: Affects all platforms.
Reporter: Jack Moffitt
Priority: Minor
 Attachments: futon.patch


 It is often convenient to reverse proxy CouchDB at a url like /couch.  
 Unfortunately, while CouchDB will work perfectly in this situation, Futon 
 cannot handle it as jquery.couch.js uses absolute URLs to access all CouchDB 
 functions.
 I've attached a small patch that fixes this problem by:
 1. Adding a urlPrefix attribute to $.couch which it uses to construct its 
 URLs.
 2. Adding logic to the futon.js constructor that figures out a suitable 
 prefix and sets $.couch.urlPrefix to use this.
 Any client code that makes use of $.couch will need to do something similar.  
 Since only the application and the administrator will know what the prefix 
 should be or how to deduce it, I didn't really know of a better way to handle 
 this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: View Filter

2009-05-14 Thread Brian Candler
On Thu, May 14, 2009 at 09:53:14AM -0500, Zachary Zolton wrote:
 (1) people who are storing large documents in CouchDB but not indexing them
 at all (I guess this is possible, e.g. if the doc ids are well-known or
 stored in other documents, but this isn't the most common way of working)

The proposal would exclude a document from *all* views in a particular
design doc. So you're only going to get a benefit from this if you have a
large number of documents (or a number of large documents) which are not
required to be indexed in any view in that design doc.

 I do agree, though, that only being able to filter at the design doc
 level limits the utility of view filtering.

And it's reasonable, given that (as I understand it) each document is
already only passed once to the view server, in order to be indexed by all
the views in that design document.

 Given that a design doc is
 supposed to be an application's view of the database, would we want
 to encourage folks to make a different design doc for each type of
 data they store in the database? My gut says one design doc per
 application —but I could be all mixed up!

I have been ending up with views which run across *all* the documents in a
database - for example, a generic search box which lets the user type in a
search term and hit any matching type of object. Having a single design
document holding all my views means that each document only needs to be sent
once to the view server.
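
Such a cross-type search view might look like the following sketch (hypothetical field names; the stub emit is only there so the map function can be exercised outside CouchDB):

```javascript
// Stub emit that collects rows, so the map function can run standalone.
var rows = [];
function emit(key, value) { rows.push([key, value]); }

// Map function for a generic search box: index every title word of
// every document, whatever its type.
function searchMap(doc) {
  var words = (doc.title || "").split(/\s+/);
  for (var i = 0; i < words.length; i++) {
    if (words[i]) emit(words[i].toLowerCase(), doc.type);
  }
}
```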

Regards,

Brian.


Re: Allow overridden request methods

2009-05-14 Thread Brian Candler
On Thu, May 14, 2009 at 07:23:50AM -0700, Mikeal Rogers wrote:
 Does anyone have an example of an HTTP client that doesn't support  
 changing or adding headers?

I was thinking of a web browser, with Javascript disabled, submitting a
FORM.

But the other point is right - you can have a FORM which POSTS to
_bulk_docs.


Re: Re: Allow overridden request methods

2009-05-14 Thread jared3d
Sorry guys, I'm really new to CouchDB, but it seems like _bulk_docs is
side-stepping the issue by using something in a way it wasn't really
intended to be used. Shouldn't there be some fairly simple and decoupled
way to set the proper request method if you are limited by your client? It
seems easy to say: if your client doesn't support a request method, simply
POST the request and set the X-HTTP-Method-Override header to the real
request method you want to use. This way you don't have to change anything
else in your process; you just have to add the appropriate header.


Thanks!
-Jared

On May 14, 2009 3:24pm, Brian Candler b.cand...@pobox.com wrote:

 On Thu, May 14, 2009 at 07:23:50AM -0700, Mikeal Rogers wrote:
  Does anyone have an example of an HTTP client that doesn't support
  changing or adding headers?

 I was thinking of a web browser, with Javascript disabled, submitting a
 FORM.

 But the other point is right - you can have a FORM which POSTS to
 _bulk_docs.

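
Server-side, the proposed header could be honoured along these lines (a sketch only: CouchDB had no such feature at the time, and the header name is the proposal's, not an existing API):

```javascript
// Sketch: compute the effective request method, letting a POST carry
// an X-HTTP-Method-Override header for clients that can't send
// PUT/DELETE directly. Only POST may be overridden.
function effectiveMethod(req) {
  var override = req.headers["x-http-method-override"];
  if (req.method === "POST" && override) {
    return override.toUpperCase();
  }
  return req.method;
}
```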

Re: View Filter

2009-05-14 Thread Zachary Zolton
Moreover, many of my attempts to have different types of docs in one
database (for joins, etc) have ended up with my moving them into
separate databases. It's been pretty easy (most of the time) to do
that work in my Ruby code!


Re: View Filter

2009-05-14 Thread Mark Hammond

On 15/05/2009 4:47 AM, Brian Candler wrote:

On Thu, May 14, 2009 at 09:53:14AM -0500, Zachary Zolton wrote:

(1) people who are storing large documents in CouchDB but not indexing them
at all (I guess this is possible, e.g. if the doc ids are well-known or
stored in other documents, but this isn't the most common way of working)


The proposal would exclude a document from *all* views in a particular
design doc. So you're only going to get a benefit from this if you have a
large number of documents (or a number of large documents) which are not
required to be indexed in any view in that design doc.


Yep - and that is the point.  Consider Jan's example, where it was 
filtering on doc['type'].  If a database had (say) 10 potential values 
of 'type', then all filters that only care about a single type will only 
care about 1 in 10 of those documents.


Taking this to its extreme, we tested Jan's patch on a view which 
matches very few documents in a large database.  Rebuilding that view 
with a filter was 18 times faster than without the filter.  We put this 
down to the fact the filter managed to avoid the json encode/decode step 
for the vast majority of the docs in the database.  IOW, on my test 
database, 6 minutes is spent before the filters can actually do anything 
(ie, that is just the json processing), whereas using the filter to 
avoid that json step brings it down to 20 seconds.


So while not everyone will be able to see such significant speedups, 
many may find it extremely useful.



And it's reasonable, given that (as I understand it) each document is
already only passed once to the view server, in order to be indexed by all
the views in that design document.


I agree there is lots that can and should be done to speed up views that 
do indeed care about most of the docs - such views spend less time 
relatively in the json encode step and more time in the interpreter.  As 
an experiment, I ported one of our views that does look at most of the 
docs from javascript to an Erlang view, and the performance increase was far 
more modest (20% maybe).  I suspect the javascript interpreter is faster 
than erlang, so there may be a level of view complexity 
where using javascript *increases* view performance over erlang, even 
when factoring in the json processing...


Cheers,

Mark


Re: View Filter

2009-05-14 Thread Zachary Zolton
Drat... I may have just come from a place where knowing how to keep
my doc types in separate databases —and being able to speed up the
map-reduce churn of a reduce-with-group query with view
filters— would have saved me a TON of work!

Urgh... At worst, I'll put it in my blog...  :^(
