Re: multiview on github

Paul Davis Tue, 21 Sep 2010 08:38:11 -0700

Norman,

Sorry its taken me so long to review this code. In its current form I
would have to -1 adding the current implementation to trunk for a
couple reasons. I'm roughly +0 on the general outline of the algorithm
for future inclusion, but I'll discuss that below.

The biggest issue that jumps out is that its unbounded in its use of
memory. If I'm reading this code correctly, each view/spatial/fti
query grabs its entire list of document id's and creates a record that
stores this list. Then you create a ring of processes that then copies
these lists possibly multiple times and in the worst way as the larger
the list, the more times its copied. Then inside the ring the queries
are being re-run for each test of an id being present which is
confuses me because they could be using the list of id's that were
calculated during the calls to multiview:query_view_count/3. Granted I
could be reading this wrong, but its a bit hard to follow in places.
Also, at least for fti and views, you don't actually need to enumerate
the entire thing to get a row count as they can both report a count
efficiently. I'm not sure about spatial, but even if it can't yet, I
would imagine it could be implemented.

And now for a list of nits about mechanics:

The source code for this patch is completely unlike anything else in
CouchDB. There are lots of differences that add up to make this alone
reason to prevent it from entering into trunk:

The file headers in source files should be removed and replaced with
ASF license headers.
Source files must be less than eighty columns wide.
You've accidentally committed local_dev.ini and etc/init/couchdb.
I'd like to see more tests in the futon tests.
If this ends up in trunk, it will not be able to depend on the spatial
and fti handlers existing if they're not also in trunk. This might be
solvable with an abstraction that can be dynamically added if they're
present.
AFAICT, error reporting doesn't seem to exist, and it looks like
there's a lot of new surface area for generating errors.
The supervisor/gen_server pattern that's going on here doesn't appear
to have a reason. As in, I can't see a reason the gen_server even
needs to exist. Just make the multiview:query call from the HTTP
process.
In Erlang, the term Node generally refers to a remote VM. Using the
variable Node in your query ring code confused me greatly until I
realized it was just pid's.
You should generally avoid raw message passing in Erlang. Using a
gen_server for each of the different ring members depending on query
type would be more appropriate.
If you are using gen_servers, you should fill out more of the
callbacks to do meaningful things, ie, logging and/or dying on
unexpected messages. Silently ignoring that sort of thing can lead to
very hard to track bugs.
Module names should be prefixed with couchdb_ if they're going into
trunk as part of couchdb.
I'm not sure I like the generous use of pmap and friends. I understand
that it'd ideally reduce latency, but at the burden of reducing a
node's ability to handle concurrency. Not sure on the best solution to
this though.
In the couple places that have the big case statements for handling
each type of view query, I'd transform those into functions to make
things easier to follow.
Support for view parameters is limited to startkey and endkey. At the
very least, start_docid and end_docid should be supported. The other
various parameters affecting collation should also probably be
supported. limit and count would be nice. I'm sure there are probably
others too, but there are also ones that probably don't need to be
included.
How should reduces be handled, if at all? I don't see them being
handled now, but I can assure you that people will want some sort of
support if this goes into trunk.
Passing the view groups between processes does not seem like a good
idea. I'd have to look back at the view_group code to double check
that though.

Now I'll stop bitching and tell you that there is actually some hope
and I'm intrigued where this could go.

The current algorithm structure you have is pretty interesting. I
think with a couple improvements it would go along way.

If I were to write this, I would start by cleaning up the row counts
code to give a quicker response without iterating each query. Once you
have the row counts, for every query except the largest, iterate over
the output to generate a bloom filter of id's. contained in that view
query. Then to send data to the client you just iterate over the
largest query checking that the id is in each of the bloom filters.

For a NIF version of Bloom filters, check here:
http://github.com/basho/ebloom There's also a blog post by Jonathan
Ellis from the Cassandra group that gives some pretty good details on
Bloom filters: 
http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
Also there's probably a wikipedia page.

This layout also gives the ability to perform future optimizations
that would make things even more quicker. For instance, bloom filters
could be pre-computed using a built-in reduce function. Adding a
special call for "row_count_and_filter" would then be super quick to
get the information you need before iterating just one view which
could be the smallest view at that point for extra win.

Using a bloom filter check for each query also gives the ability to
more easily do complex boolean checks among the various view queries.

As to whether this should go into trunk, I was talking with Volker the
other day I was a bit surprised when he mentioned that he wasn't sure
he wanted the spatial indexer in trunk. He was working on making the
indexer an external plugin for couchdb without requiring a complete
fork. I think if we put in the effort to make plugins like this easy
to bundle and install then there might not be much of a reason to
include more things in trunk. And if we do go with the plugin route,
I'd still want these and couchdb-lucene in a contrib section so that
people feel safer using them so they get tested more so that they can
eventually go into trunk as built-in features.

And I'm going to stop typing now.

Paul Davis

On Fri, Jul 30, 2010 at 5:39 PM, Norman Barker <[email protected]> wrote:
> Hi,
>
> a very initial version of the multiview is at
> http://github.com/normanb/couchdb-multiview for discussion.
>
> The views are intersected by using a ring of processes where each node
> in the ring represents a view as follows;
>
> % send an id from the start list to the next node in the ring, if the
> id is in adjacent node then this node sends to the next ring node ....
> % if the id gets all round the ring and back to the start node then it
> has intersected all queries and should be included. The nodes in the
> ring
> % should be sorted in size from small to large for this to be effective
> %
> % In addition send the initial id list round in parallel
>
> this is implemented in the couch_query_ring module.
>
> I have a couple of questions
>
> 1) in the module multiview, is there a quicker way to find the counts
> from startkey to endkey rather than iterating?
> 2) In the module couch_query_ring is there a quicker way to test for
> inclusion rather than iterating?
> 3) Finally, if I hit this concurrently I get an exception,
>
> [error] [<0.201.0>] Uncaught error in HTTP request: {exit,
>                                 {noproc,
>                                  {gen_server,call,
>
> (so ignore my previous email, I am able to trap the msg)
>
> I am going to look into (3) but if you have seen this before.
>
> I am developing on windows, but also test on linux I will work on
> getting a linux makefile, but the Makefile.win should be a start.
>
> Any help and comments appreciated.
>
> Norman
>

Re: multiview on github

Reply via email to