[Patch] striped queries/faceted search

2009-02-10 Thread Frederik Fix

Hi all,

for the project I'm working on right now, I needed the ability to run
a reduce over non-contiguous ranges along the index. In order to
achieve this I have implemented striped queries, where you can specify
multiple startkey/endkey ranges in a single request. As a nice side
effect this allows faceted search (for discrete keys). Here's an example:


Say I have the following map function:

function() {
   emit([doc.rooms, doc.price], doc);
}

where doc.rooms and doc.price are both integers. Now let's say I want
to find every document with a number of rooms between 2 and 4 and a
price between 100 and 1000. I can then do the following query:


db.view(my_view, {}, {stripes: [
  {startkey: [2, 100], endkey: [2, 1000]},
  {startkey: [3, 100], endkey: [3, 1000]},
  {startkey: [4, 100], endkey: [4, 1000]}
]});


If the view includes a reduce function, that works too.


As you can probably see, this patch introduces a change to the JS API
(but not the HTTP API). The keys parameter is now a hash which can take
either a keys param or a stripes param. The keys param works as before.
The stripes param takes an array of hashes, each having startkey and
endkey keys.
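
For what it's worth, here is a small sketch of how the stripes array
from the example above could be generated for a discrete facet. This is
not part of the patch; facetStripes is a made-up helper that only
assumes the patched db.view API described in this mail:

// Hypothetical helper: build one stripe per discrete facet value,
// sharing a common range on the second key component.
function facetStripes(facetValues, lo, hi) {
  var stripes = [];
  for (var i = 0; i < facetValues.length; i++) {
    stripes.push({startkey: [facetValues[i], lo],
                  endkey: [facetValues[i], hi]});
  }
  return stripes;
}

// Equivalent to the query above:
db.view(my_view, {}, {stripes: facetStripes([2, 3, 4], 100, 1000)});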



The state of the patch is still somewhat raw, with no error checking
on the stripes part of the API. Furthermore, it might be useful to
extend the limit, skip and descending options to the stripes.


The patch is against the current trunk version (rev 742925) and all  
tests pass.


I'd appreciate some feedback on the implementation and maybe some info  
on how to proceed in integrating this into CouchDB.


Frederik














[jira] Updated: (COUCHDB-244) Striped queries

2009-02-10 Thread Frederik Fix (JIRA)

 [ 
https://issues.apache.org/jira/browse/COUCHDB-244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frederik Fix updated COUCHDB-244:
-

Attachment: striped_queries.diff

 Striped queries
 ---

 Key: COUCHDB-244
 URL: https://issues.apache.org/jira/browse/COUCHDB-244
 Project: CouchDB
  Issue Type: New Feature
  Components: Database Core
Reporter: Frederik Fix
 Attachments: striped_queries.diff


 I have implemented striped queries, where you can specify multiple 
 startkey/endkey ranges in a single request. As a nice side effect this allows 
 faceted search (for discrete keys). [...]

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [Patch] striped queries/faceted search

2009-02-10 Thread Frederik Fix

Just noticed that the attachment got stripped.

I've added an issue to JIRA here:

https://issues.apache.org/jira/browse/COUCHDB-244


Frederik


On 10 Feb 2009, at 13:00, Jan Lehnardt wrote:


Hi Frederik,

On 10 Feb 2009, at 11:54, Frederik Fix wrote:

The patch is against the current trunk version (rev 742925) and all  
tests pass.


what patch? :)

Feel free to open a JIRA ticket and attach the patch there if this
mailing list doesn't let you post attachments.

https://issues.apache.org/jira/browse/COUCHDB

Cheers
Jan
--





[jira] Created: (COUCHDB-244) Striped queries

2009-02-10 Thread Frederik Fix (JIRA)
Striped queries
---

 Key: COUCHDB-244
 URL: https://issues.apache.org/jira/browse/COUCHDB-244
 Project: CouchDB
  Issue Type: New Feature
  Components: Database Core
Reporter: Frederik Fix
 Attachments: striped_queries.diff

I have implemented striped queries, where you can specify multiple 
startkey/endkey ranges in a single request. As a nice side effect this allows 
faceted search (for discrete keys). [...]

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Stats Patch API Discussion

2009-02-10 Thread Damien Katz


On Feb 10, 2009, at 10:19 AM, Jan Lehnardt wrote:


Hi,

Alex and I are working on our stats package patch and the last
bigger issue is the API. It is just exposing a bunch of values by
keys, but as usual, the devil is in the details.

Let me explain.

There are two types of counters. Hit Counters, which record
things like the number of requests. They increase monotonically
each time a request hits CouchDB. This is useful for counting
stuff. Cool.

Then there are Absolute Value Counters (for lack of a better
term) that collect absolute values, like the number of milliseconds
a request took to complete. To create a meaningful metric out
of this type of counter, we need to create averages. There's little
value in recording each individual request (it could still do that
in the access logs) for monitoring reports. So we keep some
aggregate values (min, max, mean, stddev, count (count being
the number of times this counter was called)).
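
As an aside, these running aggregates can be maintained without storing
individual samples. A minimal sketch (a standard Welford-style online
update; not the patch's actual code):

// Sketch: keep (min, max, mean, stddev, count) incrementally.
function newAggregate() {
  return {min: null, max: null, mean: 0, stddev: 0, count: 0, _m2: 0};
}

function record(agg, value) {
  agg.count += 1;
  agg.min = (agg.min === null) ? value : Math.min(agg.min, value);
  agg.max = (agg.max === null) ? value : Math.max(agg.max, value);
  var delta = value - agg.mean;            // Welford's online update
  agg.mean += delta / agg.count;
  agg._m2 += delta * (value - agg.mean);
  agg.stddev = (agg.count > 1) ? Math.sqrt(agg._m2 / (agg.count - 1)) : 0;
}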

Complexity++

Say you have a CouchDB running for a month. You change some
things in your app or in CouchDB and you'd like to know how this
affected your response time. To effectively see anything you'd have
to restart CouchDB (and lose all stats) or wait a month. If you
want to see problems coming up in your monitoring, you need
finer-grained time ranges to look at.

To make this a little more useful, Alex and I introduced time ranges.
These are an additional set of aggregates that get reset every 1, 5
and 15 minutes. This should be familiar to you from server load
averages. You can get the aggregate values for four time ranges:

- Between now and the beginning of time (when CouchDB was
 started).
- Between now and 60 seconds ago.
- Between now and 300 seconds ago.
- Between now and 900 seconds ago.

These ranges are hardcoded now, but they can be made configurable
at a later time.

The API would look like this:

GET /_stats/couchdb/request_time

{
  "couchdb": {
    "request_time": {
      "description": "Aggregated request time spent in CouchDB since the beginning of time",
      "min": 20,
      "max": 20,
      "mean": 20,
      "stddev": 20,
      "count": 7,
      "range": 0 // 0 means since day zero.
    }
  }
}

To get the aggregate stats for the last minute:

GET /_stats/couchdb/request_time?range=1

{
  "couchdb": {
    "request_time": {
      "description": "Aggregated request time spent in CouchDB since 1 minute ago",
      "min": 20,
      "max": 20,
      "mean": 20,
      "stddev": 20,
      "count": 7,
      "range": 1 // minute
    }
  }
}

Or more generic:

GET /_stats/couchdb/request_time?range=$range

{
  "couchdb": {
    "request_time": {
      "description": "Aggregated request time spent in CouchDB since $range minutes ago",
      "min": 20,
      "max": 20,
      "mean": 20,
      "stddev": 20,
      "count": 7,
      "range": $range // minute
    }
  }
}

This seems reasonable. The actual naming of range and other
keys can be changed, as can the description text.


Complexity--

Remember Hit Counters? Yes, strictly speaking, CouchDB shouldn't
want to collect any averages there, since our monitoring solution
would take care of that. But then, the four time-range counters are
available anyway, and we could just as well populate them. Let's
say every second:

GET /_stats/httpd/requests[?$resolution=[1,5,15]]

{
  "httpd": {
    "requests": {
      "description": "Number of requests per second in the last $resolution minutes",
      "min": 20,
      "max": 20,
      "mean": 20,
      "stddev": 20,
      "count": 7,
      "range": $range // minute
    }
  }
}

count would be the raw counter for the stats, and the rest meaningful
aggregates.

Per second is an arbitrary choice again and can be made configurable,
if needed. To know at what frequency stats are collected, there's a
new member in the list of aggregates:

{
  "httpd": {
    "requests": {
      "description": "Number of requests per $frequency seconds in the last $resolution minutes",
      "min": 20,
      "max": 20,
      "mean": 20,
      "stddev": 20,
      "count": 7,
      "range": $range, // minute
      "frequency": 1 // second
    }
  }
}

Alex and I tried a couple of different approaches to get here:
different URLs for the different types of counters and aggregates,
adding members in different places, with and without description, and
a whole lot more, but we sure haven't seen all permutations.

This solution offers a unified URL format and a human-readable as
well as computer-parseable way to determine what kind of counter
you're dealing with.

To just get all stats you can do a

GET /_stats/

and get a huge JSON object back that includes all of the above for all
resolutions that are currently collected.
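
To make the intended use concrete: a monitoring client could poll one
of the endpoints above and turn the monotonically increasing count into
a rate. A sketch (field names as proposed above, getJSON standing in
for whatever HTTP client you use):

// Sketch: derive requests/second from two polls of the proposed
// /_stats/httpd/requests endpoint, taken `seconds` apart.
function requestRate(sample1, sample2, seconds) {
  return (sample2.httpd.requests.count
          - sample1.httpd.requests.count) / seconds;
}

var N = 60; // sampling interval in seconds
var before = getJSON("/_stats/httpd/requests");
// ... N seconds pass ...
var after = getJSON("/_stats/httpd/requests");
var rps = requestRate(before, after, N);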

Is there anything that does not make sense or is too complicated?

The goal was to create a simple, minimal API for a minimal set
of useful statistics and Alex and I hope to have found this by
now. But if you can see how this could be further simplified,
let us know :)

Alex and I are also open to completely different approaches to getting
the data out of CouchDB.

We're looking for a few things in this thread:

- A sanity check to know we're not completely off.
- A summary (for) 

Re: Stats Patch API Discussion

2009-02-10 Thread Jan Lehnardt


On 10 Feb 2009, at 16:47, Zachary Zolton wrote:


Jan,

So you're saying I could run some test, and then hit:

GET /_stats/couchdb/request_time?range=$SOME_MINUTES

And then, make some changes, and run the same test:

GET /_stats/couchdb/request_time?range=$SOME_MINUTES

To detect the delta in performance caused by my changes?!?



Exactly, but where $SOME_MINUTES is hardcoded to 1, 5, or 15
to start with. If you have any reporting and graphing tool connected,
you'd have pretty pictures, too :)



That's
about the level of performance tuning I'm comfortable doing in
PostgreSQL, but it's all over HTTP instead. Nice!


Thanks,
Jan
--






Cheers,

Zach


On Tue, Feb 10, 2009 at 9:19 AM, Jan Lehnardt j...@apache.org wrote:

[...]

Eunit

2009-02-10 Thread Jan Lehnardt

Hi,

the previously mentioned stats patch introduces EUnit*-style unit
tests for Erlang code. I believe this is useful for the rest of CouchDB
as well. There is a simple test runner in test/ that includes a few
tests for the couch_config* modules but that was never meant to
be a permanent solution.

* http://svn.process-one.net/contribs/trunk/eunit/doc/overview-summary.html


EUnit is the de-facto unit testing tool for Erlang applications and
it is even included in the latest distributions of Erlang/OTP. It's
far from perfect, but I think CouchDB would benefit from adopting
it, to gain and encourage writing standardized test cases for
CouchDB modules.

The one caveat with EUnit is that it is released under the LGPL.
I am not a lawyer but the consensus on The Net is that writing
test-cases against the EUnit API and conditionally including
eunit.hrl to include the API does not mean that the test code itself
must be released under the terms of the LGPL. If anyone is
familiar with this, can you comment on whether this is correct?

Technically, the EUnit tests for each module can be in the same
file as the module's functions or in a separate directory. I'd opt
for separating tests into their own directory, but I don't feel strongly
about it.

The tests should not interfere with production code. For this reason,
there is a compile-time switch to enable tests. We can wire this up
to `make test` so that EUnit-enabled CouchDB modules are built
locally and then tested. For `make install`, the compile-time switch
to enable tests would be off.

Some modules have a few tests inline; once we have the EUnit
infrastructure in place, it'd be a good idea to modify the existing
tests to fit in.

Provided that there are no licensing issues, is there anything that
would speak against adding EUnit tests to CouchDB?

--

For now, the EUnit patch is entangled in the statistics patch, but
we could separate it out into its own patch. Would that be something
that the community is interested in? Also, somebody please clear up
the legal issue :)


Cheers
Jan
--



Re: Roadmap discussion

2009-02-10 Thread Paul Davis
On Tue, Feb 10, 2009 at 11:46 AM, Kerr Rainey kerr.rai...@gmail.com wrote:
 Is there still interest in stabilising  a native erlang interface?


 --
 Kerr


Definitely. I was contemplating this a bit the other day. I wonder if
it wouldn't be beneficial to create a couch_api.erl and just define an
erlang api that maps to what other client libraries look like. Then if
someone wants to peek into the internals they're free and we can
maintain that we only support compatibility on that one file.

Anyway, just an idle thought.

HTH,
Paul Davis


Re: Eunit

2009-02-10 Thread Gianugo Rabellino
On Tue, Feb 10, 2009 at 5:52 PM, Jan Lehnardt j...@apache.org wrote:
 The one caveat with EUnit is that it is released under the LGPL.
 I am not a lawyer but the consensus on The Net is that writing
 test-cases against the EUnit API and conditionally including
 eunit.hrl to include the API does not mean that the test code itself
 must be released under the terms of the LGPL. If anyone is
 familiar with this, can you comment on whether this is correct?

Best for you would be to send a question to legal-disc...@a.o. With my
conservative hat on, I'm a bit concerned about LGPL virality in
namespaced languages, and most definitely concerned with distribution
of EUnit itself (I reckon this is not necessary as EUnit is part of
OTP now?).

-- 
Gianugo Rabellino
Sourcesense, making sense of Open Source: http://www.sourcesense.com
(blogging at http://www.rabellino.it/blog/)


Re: Roadmap discussion

2009-02-10 Thread Kevin Jackson
* Full Text Search interface
 - We've had basically working patches for this floating around for a while.
 - It seems simple enough, we just need someone who is comfortable in
Java to step up to the plate and write a Lucene adapter. (Thanks!)

I'm more than happy to look at this when I get time. I've been
wondering where to start hacking on couch, and we use solr at work
(currently), so I would be able to justify some work time on it too
Kev


Re: Roadmap discussion

2009-02-10 Thread Robert Newson
I've made some progress on this, fwiw;

http://github.com/rnewson/couchdb-lucene

B.

On Tue, Feb 10, 2009 at 12:27 PM, Kevin Jackson foamd...@gmail.com wrote:
 [...]



Re: 0.9.0 Delay or Release?

2009-02-10 Thread Zachary Zolton
@Kerr

"that 0.9 does not imply next release is 1.0."

Yeah, I was originally confused by that too!


But, then I re-read this:

http://en.wikipedia.org/wiki/Software_versioning#Software_versioning_schemes


And, now I'm cool as a cucumber, WRT having 0.10 or even 0.1000...!
LOL, it helps when I RTFM, I guess.

:^P


Cheers,

Zach


Re: Roadmap discussion

2009-02-10 Thread Chris Anderson
On Tue, Feb 10, 2009 at 8:53 AM, Paul Davis paul.joseph.da...@gmail.com wrote:
 [...]


I've been interfacing with the raw Erlang API for a commercial
project. It works like a charm, the only trouble being that it isn't
documented, and that it could change out from under me with no
warning. (Although the second caveat isn't as bad as it sounds,
because it probably won't change much.)

From my experience, I'm having a hard time seeing how any additional
code could help make the Erlang API official. The project I'm
working on has a very specific data model (no updates, lots of
parallel attachment writing, using the HTTP API for everything but the
critical path...) and using the Erlang API has allowed me to cut out a
lot of code paths (eg rev checking etc). Doing this wouldn't be safe
for a general purpose API, but when you are interfacing in Erlang,
you're not using a general purpose API anyway.

I'm happy to have an Erlang API, but maybe it should wait til sometime
after 0.9. I think the best way to ensure that it's maintained as
stable would be to have an Erlang integration suite, which could
double as documentation. It certainly wouldn't hurt to have more
Erlang tests, so maybe we can file this feature under testing for now,
and hope we get an Erlang test suite created by interested parties.
Once we have the test suite we'll know what the Erlang API is.

-- 
Chris Anderson
http://jchris.mfdz.com


Re: Eunit

2009-02-10 Thread Michael McDaniel
On Tue, Feb 10, 2009 at 06:19:01PM +0100, Jan Lehnardt wrote:

 On 10 Feb 2009, at 18:11, Gianugo Rabellino wrote:

 [...]

 Best for you would be to send a question to legal-disc...@a.o. With my
 conservative hat on, I'm a bit concerned about LGPL virality in
 namespaced languages, and most definitely concerned with distribution
 of EUnit itself (I reckon this is not necessary as EUnit is part of
 OTP now?).

 Thanks, I'll check with legal-discuss@ when this list agrees on adding
 EUnit support. Bundling EUnit is not necessary as of the latest OTP
 release and for earlier releases you need to install it manually or you
 can't run `make test` which is not too much of a problem, I'd say.
 Thanks
 Jan

 I am in agreement with adding EUnit support to CouchDB.

 I find EUnit useful when writing or changing code.

 Also, I support a separate test directory from the src

 ../doc
 ../ebin
 ../include
 ../priv
 ../src
 ../test

 (though I know CouchDB does not presently follow that OTP 
  recommended directory structure)


 I use EUnit in a simple manner based on techniques derived
 from two erlang-questions threads,

 http://www.nabble.com/I-Hate-Unit-Testing...-td21697138.html

 http://www.nabble.com/Lightweight-test-driven-development-and-unit-testing-td21704767.html

 e.g.

 %  the module to test

 -module(my_mod).
 -export( [public_funs ...] ).
 %  note: the included test file must not declare its own module
 -include("../test/my_mod_test.erl").  % yes, .erl

  blah blah blah all the module funs both exported and private

 % end of my_mod.erl


 and ...

 %  the testing module (../test/my_mod_test.erl)

 -ifdef(test).
 -include_lib("eunit/include/eunit.hrl").

 first_test()  -> ?assert( first_fun(arg) =:= true ).
 second_test() -> ?assert( second_fun(badarg) =:= {error, badarg} ).

   blah blah blah remaining tests

 -else.
 -endif.

 % end of my_mod_test.erl



 When compiling, simply do not -define( test ) and the my_mod_test
 module is not even included.  Note that this technique allows
 testing of private funs in my_mod without having to export the
 private funs.

 I also use an Emakefile such as ...
 %%
 %% to make from the command line do the following:
 %%    erl -make
 %% to run tests from the command line do the following:
 %%    erl -pa ../ebin -eval 'eunit:test(my_mod, [verbose]), init:stop().'
 %%
 {'*', [
    {outdir, "../ebin"}
   ,{i, "../include"}
   ,{i, "../test"}
   ,debug_info
   ,strict_record_tests
   ,netload
   %,{d, debug} %% uncomment for debug
   %,{d, test}  %% uncomment for dev/test, do touch ../test/*
  ]
 }.



~Michael

---
Portland, Oregon, USA
http://autosys.us


Re: Eunit

2009-02-10 Thread Jan Lehnardt

Hi Michael,

On 10 Feb 2009, at 19:29, Michael McDaniel wrote:


[...]

I am in agreement with adding EUnit support to CouchDB.

I find EUnit useful when writing or changing code.

Also, I support a separate test directory from the src

../doc
../ebin
../include
../priv
../src
../test
(though I know CouchDB does not presently follow that OTP
 recommended directory structure)



I don't want to complicate the proposed patches. I'm not for
or against this, but we should think about it separately.
Maybe open a JIRA ticket that includes a short description of
the benefits, so we don't forget about this?



[...]

When compiling, simply do not -define( test ) and the my_mod_test
module is not even included.  Note that this technique allows
testing of private funs in my_mod without having to export the
private funs.

I also use Emakefile such as ...


Since we're using good old make, this would look a little different
in practice, but it would do the same thing :)

Thanks for chiming in, and for the tips.

Cheers
Jan
--



Re: Roadmap discussion

2009-02-10 Thread Kerr Rainey
2009/2/10 Michael McDaniel couc...@autosys.us:

  ... also, an Erlang API that skips the

   JSON <-> native Erlang terms

  translation overhead.  Being as term translation is not necessary
  when talking 'directly' with the CDB engine
  (e.g. couch_query_servers:map_docs/2 could skip the JSON <-> term()
  translation if the view engine reads/writes native Erlang terms)

Interesting.  I'd certainly consider this another level further than
what I was thinking of, or indeed would be thinking of using.  There
are probably a few levels at which couch functionality could be
exposed natively.

I wonder how much doing this kind of bypassing for a native erlang
view engine would complicate the code?  Or would it give another clean
layer?

--
Kerr


View Intersections

2009-02-10 Thread Paul Davis
I've been contemplating implementing a new feature that I've been
wanting for a while. There's been some talk of implementing view
intersections for a bit now, so I figured I'd try and give a summary of
what the feature would entail in terms of functionality and then the
necessary bits required for an implementation.

So the original idea for view intersections was exactly what the name
entails: Show me the intersection between two views for a given set of
view query parameters. After thinking about different methods of
implementation I think we can extend this to be more powerful and
generally applicable.

Major Hurdle 1


The first necessary bit of groundwork would be to implement an
optional value index on views. The more I thought about intersecting
views, the more I realized it was starting to look pointless. Ignoring
something along the lines of group_level=N, in that we can join on
array prefixes, all views being joined would require exactly the same
key. Which raises the question: why not just create one view that emits
the output of the two you want intersected?

I couldn't get past this for a long time until I heard Chris Anderson
pondering adding a btree index to the values in a view. The obvious
downfalls of the extra space and computation usage are there, but
making it optional should solve any qualms in that respect.

Given an index on a value we're now able to chain together arbitrary
views using either the key or value as well as limit the intersection
by any combination of key and value.

As a side benefit, we would also get the select views by value
restriction as well. I'm thinking it'd be as transparent as adding a
[start|end]value and [start|end]value_docid set of URL parameters. I
haven't followed this train of thought too far into the code yet, but
something approximating that should be fairly doable. A thought occurs
that limiting view results by both key and value could be interesting
in terms of implementation. Not sure if I'd force it through the
intersection API or not.
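
For illustration only (these parameters are the proposal above, not an
existing API), a by-value selection might then look like:

GET /db/_design/app/_view/by_tag?startvalue="a"&endvalue="z"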

Caveats that come to mind are that this would break binary
compatibility for all generated views. It wouldn't require a
dump/reload, but it might come as a surprise to people upgrading that
all their views are regenerating.

Major Hurdle 2


Implementing the view intersection API. First off, it probably needs a
new name. Once we have intersections working, unions, subtractions,
and the NxM one whose name escapes me (cross product floats up but
sounds not right) should be trivially implementable.

The underlying implementation for this is basically a large merge sort
running over the view btrees. If you read about the merge step in
map/reduce/merge, that's basically what I've got in my head.

The biggest issue that I've found in getting this implemented
(excluding a value index) is that I'd need to write a new btree
traversal method that used iterators instead of a fold mechanism. This
shouldn't be overly difficult to implement.
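
To make the merge idea concrete, here is a toy sketch, with sorted
arrays of [key, docid] rows standing in for btree iterators and cmp
standing in for CouchDB's view collation. It illustrates the approach,
not the planned Erlang code:

// Pull-style iterator over sorted rows; this is the interface the
// merge needs, and what a fold can't easily provide.
function rowIterator(rows) {
  var i = 0;
  return {
    peek: function() { return i < rows.length ? rows[i] : null; },
    next: function() { return i < rows.length ? rows[i++] : null; }
  };
}

// Intersect two sorted row streams by key: advance whichever side
// is behind, emit a row when the keys collate as equal.
function intersect(a, b, cmp) {
  var out = [];
  while (a.peek() !== null && b.peek() !== null) {
    var c = cmp(a.peek()[0], b.peek()[0]);
    if (c < 0) a.next();
    else if (c > 0) b.next();
    else { out.push(a.next()); b.next(); }
  }
  return out;
}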

Beyond that, then, it's basically up to the HTTP interface in parameter
parsing and error checking. For passing parameters I'm thinking along
the lines of a posted JSON body (Preemptive: any RESTafarians should
reference the long discussion on multi-get before writing about how
this isn't RESTful).

Also, not sure if it's obvious, but I'd plan on allowing arbitrarily
nested conditions, i.e., intersection(union(a, b), c) types of
operations. There's a subtle detail in the sort order, and thus the
corresponding btree traversal, that might come into play there. I can
punt and make the entire request use one sort order; that is, in the
previous example you can't specify different sort directions for the
two nested operations, because you'd get a (presumably) zero overlap
in the end. I'm pretty sure that if we force all btrees to be
traversed in the same direction for each request we don't lose any
functionality, though.

Comments
=

That's the general outline I've got in my head right now. I'm pretty
sure I can see 95% of the implementation, but it's possible I'm
missing a finer detail somewhere. If you've got questions or comments
let's hear them. If there's no general objection then I can probably
get to starting an implementation at the end of this week.

Thanks,
Paul Davis


Re: Stats Patch API Discussion

2009-02-10 Thread Noah Slater
CouchDB is designed so that it can crash and restart. What possibilities are
there for having statistics persisted between runs, for this reason?

-- 
Noah Slater, http://tumbolia.org/nslater


Re: Stats Patch API Discussion

2009-02-10 Thread Paul Davis
On Tue, Feb 10, 2009 at 8:11 PM, Noah Slater nsla...@apache.org wrote:
 CouchDB is designed so that it can crash and restart. What possibilities are
 there for having statistics persisted between runs, for this reason?


I'd argue that we should let the stats collection packages deal with
persisting anything that needs it. I tend to agree with Jan's earlier
comments that this should be about generating data and we leave the
pretty graphs to dedicated software.

 --
 Noah Slater, http://tumbolia.org/nslater


HTH,
Paul Davis


Re: View Intersections

2009-02-10 Thread Jeff Hinrichs - DMT
Just a few comments to get things started.

On Tue, Feb 10, 2009 at 5:59 PM, Paul Davis paul.joseph.da...@gmail.com wrote:
 I've been contemplating implementing a new feature that I've been
 wanting for awhile. There's been  some talk of implementing view
 intersections for a bit now so I figured I'd try and give a summary of
 what the feature would entail in terms of functionality and then the
 necessary bits required for an implementation.

 So the original idea for view intersections was exactly what the name
 entails: Show me the intersection between two views for a given set of
 view query parameters. After thinking about different methods of
 implementation I think we can extend this to be more powerful and
 generally applicable.

 Major Hurdle 1
 

 The first necessary bit of ground work would be to implement an
 optional value index on views. The more I thought about intersecting
 views the more I realized it was starting to look pointless. Ignoring
 something along the lines of group_level=N in that we can join on
 array prefixes, all views being joined would require exactly the same
 key. Which begs the question, why not just create 1 view that emits
 the output of the two you want intersected.

 I would argue that returning a simple list of docids that meet the
requirement should suffice -- in fact, the views a and b need not be
homogeneous, so returning anything beyond docids could end up being a
bigger problem than the intersection itself.

For instance, say we want the intersection of the documents who have
both blue and fuzzy tags so we use
a = /_view/tags/byval?key=blue
b = /_view/tags/byval?key=fuzzy

intersection(a,b)

Now we want to limit that to things named Harold.

c=/_view/name/first?key=Harold

intersection(intersection(a,b),c)

Which gives us a list of docid's that contain Blue, Fuzzy things named Harold.

However, the values returned by view a and view b are the same,
while the values returned by view c might be completely different.
So returning a view with varying values might not be very helpful.
(This is where I am not seeing why returning anything more than a list
of docids would be appropriate. Of course I am most likely missing the
point.) Only returning intersections of similar views would not be as
interesting as returning intersections of dissimilar views.



 [...]

 Beyond that, then, it's basically up to the HTTP interface in parameter
 parsing and error checking. For passing parameters I'm thinking along
 the lines of a posted JSON body (Preemptive: any RESTafarians should
 reference the long discussion on multi-get before writing about how
 this isn't RESTful).

 Posting JSON documents seems to be required and beyond argument, given
the technical size limits of a GET request.


 [...]

Re: View Intersections

2009-02-10 Thread Paul Davis
On Tue, Feb 10, 2009 at 10:19 PM, Jeff Hinrichs - DMT
dunde...@gmail.com wrote:
 [...]

[jira] Created: (COUCHDB-245) Couch uses the erlang stdlib module regexp, which is deprecated and set to be removed. It should use the module re instead.

2009-02-10 Thread alisdair sullivan (JIRA)
Couch uses the erlang stdlib module regexp, which is deprecated and set to be 
removed. It should use the module re instead.
---

 Key: COUCHDB-245
 URL: https://issues.apache.org/jira/browse/COUCHDB-245
 Project: CouchDB
  Issue Type: Bug
  Components: Infrastructure
Affects Versions: 0.7.2, 0.8, 0.8.1
 Environment: regexp is set to be removed from stdlib when R15 is 
released.
Reporter: alisdair sullivan


Couch uses the erlang stdlib module regexp, which is deprecated and set to be 
removed. It should use the module re instead. re is not a drop-in replacement 
for regexp; it operates on and returns binary strings instead of native 
strings. 

Affects files couch_config.erl, couch_config_writer.erl, couch_httpd.erl, 
couch_httpd_server.erl, couch_log.erl and couch_server.erl. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (COUCHDB-245) Couch uses the erlang stdlib module regexp, which is deprecated and set to be removed. It should use the module re instead.

2009-02-10 Thread alisdair sullivan (JIRA)

 [ 
https://issues.apache.org/jira/browse/COUCHDB-245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

alisdair sullivan updated COUCHDB-245:
--

Affects Version/s: 0.9

 Couch uses the erlang stdlib module regexp, which is deprecated and set to be 
 removed. It should use the module re instead.
 ---

 Key: COUCHDB-245
 URL: https://issues.apache.org/jira/browse/COUCHDB-245
 Project: CouchDB
  Issue Type: Bug
  Components: Infrastructure
Affects Versions: 0.7.2, 0.8, 0.8.1, 0.9
 Environment: regexp is set to be removed from stdlib when R15 is 
 released.
Reporter: alisdair sullivan
   Original Estimate: 2h
  Remaining Estimate: 2h

 [...]

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: View Intersections

2009-02-10 Thread Jeff Hinrichs - DMT
On Tue, Feb 10, 2009 at 9:58 PM, Paul Davis paul.joseph.da...@gmail.com wrote:
 [...]

Helping out

2009-02-10 Thread David Van Couvering
Hello.  I've been following CouchDB from the sidelines for a while but
haven't been able to put much time into it.

Recently, however, Sun laid me off, and I thought this would be a good
opportunity to get a little more engaged.

No better way, IMHO, than to help out with the project.  FYI, I'm already a
committer to Apache Derby, although I haven't been active there in the past
few years.

I was looking at your road map and it looked like you want to get a lot of
documentation written.  I was thinking that would be a great way for me to
start learning CouchDB.  Is there a specific document that you would like
me to try my hand at?  Also, what are your processes, technologies and
standards around documentation?

I can also start poking around at your bug list and perhaps offer some
patches to get my feet wet.  Is there anything in particular that you would
like someone to focus on?  I don't have an Erlang background, although I'm
interested in learning.  My background is server-side Java and databases,
for the most part.

I look forward to hearing from you.  Meanwhile I'll try to get a build going
and see how that goes.

All the best,

David

-- 
David W. Van Couvering
http://davidvancouvering.blogspot.com


[jira] Created: (COUCHDB-246) allow customization of external process timeout

2009-02-10 Thread Robert Newson (JIRA)
allow customization of external process timeout
---

 Key: COUCHDB-246
 URL: https://issues.apache.org/jira/browse/COUCHDB-246
 Project: CouchDB
  Issue Type: Bug
  Components: Database Core
Affects Versions: 0.9
Reporter: Robert Newson
Priority: Blocker
 Fix For: 0.9



If an external process takes too long to respond, it is killed. The timeout is 
quite short (a few seconds) and is not configurable from .ini files today.

couchdb-lucene could use this ability as the first attempt to sort on a field 
in a large index is slow while it builds a cache. With the timeout, it's killed 
and the partial work is lost.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (COUCHDB-246) allow customization of external process timeout

2009-02-10 Thread Antony Blakey (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12672520#action_12672520
 ] 

Antony Blakey commented on COUCHDB-246:
---

I have this problem with _externals that implement lazy view-update semantics, 
the problem being that the amount of work required for the _external to catch 
up to the current update_seq is unknown. I experimented with a solution that 
allows the external to return a keep-alive message to the server, which doesn't 
return a value to the client but does stop the server from killing the external.

I got distracted and didn't complete that work, but I think this is a better 
solution than a fixed timeout. The problem with a timeout is that it doesn't 
account for machine performance or load, or the possibly highly variable amount 
of work that the external needs to do on a per-request basis, whereas a 
keep-alive more correctly captures what you want, e.g. that the external 
process is making progress. Such a keep-alive could specify a timeout value, so 
that the external process could control the definition of failure according to 
how often it will send keep-alives, but that might be an unnecessary 
complication.
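
A toy sketch of the keep-alive idea (generic JavaScript, not CouchDB's
actual _external protocol): each keep-alive pushes the deadline forward,
so failure comes to mean "no progress" rather than "too slow overall".

// Reset-on-progress watchdog: onDead fires only if no keepAlive()
// call arrives within timeoutMs of the previous one.
function watchdog(timeoutMs, onDead) {
  var timer = setTimeout(onDead, timeoutMs);
  return {
    keepAlive: function() { // call whenever the external reports progress
      clearTimeout(timer);
      timer = setTimeout(onDead, timeoutMs);
    },
    done: function() { clearTimeout(timer); }
  };
}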

 allow customization of external process timeout
 ---

 Key: COUCHDB-246
 URL: https://issues.apache.org/jira/browse/COUCHDB-246
 Project: CouchDB
  Issue Type: Bug
  Components: Database Core
Affects Versions: 0.9
Reporter: Robert Newson
Priority: Blocker
 Fix For: 0.9


 [...]

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Helping out

2009-02-10 Thread Paul Davis
On Wed, Feb 11, 2009 at 12:27 AM, David Van Couvering
da...@vancouvering.com wrote:
 [...]


David,

It's awesome to hear your interest especially given your recent situation.

Re: Documentation

As far as I'm aware the only guidelines in terms of documentation are
to put things on the wiki. I would say that if a specific section of
CouchDB interests you, start learning the code base from that aspect
and add good wiki information on it. I know that I, for one, am not
the most vigilant in keeping things in sync.

Another aspect would be documenting Erlang documentation best
practices. It doesn't sound as sexy, but getting a good set of rules
for native Erlang documentation would be a Good Thing™. There have
been attempts at getting autogenerated docs. Having a good distillation
of rules as well as a working build integration with the website would
be an awesome advancement.

Re: Patches

The two biggest suggestions I have would be to start reading code via
the *_httpd_*.erl sources. In terms of behavior, these have the most
documentation as well as being a very logical root point to start
tracing code paths. If you have something that tickles your fancy, it's
fun to follow an HTTP request all the way to disk. I took a shining to
view generation and ended up reading through the btree code. There's
lots of the seductive, "No fucking way it can be this easy" type of
code that makes the internals fun to read through.

My other suggestion is fairly closely related. Start walking through
the list of bugs that are blocking 0.9 and see what you're
comfortable dealing with. I'd definitely suggest adding comments to
bugs or popping onto IRC if you find something approachable. JIRA is a
PITA when it comes to assigning things, so I spend a good chunk of my
time trying to remember if someone on the ML or IRC claimed progress
or ongoing work.

For reference, Jan has an awesome page setup that will get you the
list of blocking issues for 0.9 at [1]. Hopefully he'll keep it
updated beyond the 0.9 release.

[1] http://jan.prima.de/fuckjira.html

HTH and welcome to the community,
Paul Davis