[Patch] striped queries/faceted search
Hi all, for the project I'm working on right now, I needed the ability to run a reduce over non-contiguous ranges along the index. To achieve this I have implemented striped queries, where you can specify multiple startkey/endkey ranges in a single request. As a nice side effect this allows faceted search (for discrete keys).

Here's an example. Say I have the following map function:

    function() { emit([doc.rooms, doc.price], doc); }

where doc.rooms and doc.price are both integers. Now let's say I want to find every document with a number of rooms between 2 and 4 and a price between 100 and 1000. I can then do the following query:

    db.view(my_view, {}, {stripes: [{startkey: [2, 100], endkey: [2, 1000]},
                                    {startkey: [3, 100], endkey: [3, 1000]},
                                    {startkey: [4, 100], endkey: [4, 1000]}]});

If the view included a reduce function, that would work too.

As you can probably see, this patch introduces a change to the JS API (but not the HTTP API). The keys parameter is now a hash which can take either a keys or a stripes member. The keys member works as before. The stripes member takes an array of hashes, each having startkey and endkey keys.

The state of the patch is still somewhat raw, with no error checking on the stripes part of the API. Furthermore, it might be useful to extend the limit, skip and descending options to the stripes.

The patch is against the current trunk version (rev 742925) and all tests pass. I'd appreciate some feedback on the implementation and maybe some info on how to proceed in integrating this into CouchDB.

Frederik
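For illustration, here is how the stripes array for such a facet query could be built programmatically. This is a minimal sketch against the patched db.view() signature described above; buildStripes is a hypothetical helper, not part of the patch:

    // Hypothetical helper: one contiguous [facet, range] stripe per
    // discrete facet value, so a [rooms, price] view can be queried
    // over several room counts at once.
    function buildStripes(facetValues, rangeStart, rangeEnd) {
      return facetValues.map(function(v) {
        return {startkey: [v, rangeStart], endkey: [v, rangeEnd]};
      });
    }

    // Rooms 2 through 4, price 100 to 1000: equivalent to the query above.
    db.view(my_view, {}, {stripes: buildStripes([2, 3, 4], 100, 1000)});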
[jira] Updated: (COUCHDB-244) Striped queries
[ https://issues.apache.org/jira/browse/COUCHDB-244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frederik Fix updated COUCHDB-244:
---------------------------------
Attachment: striped_queries.diff

Striped queries
---------------
Key: COUCHDB-244
URL: https://issues.apache.org/jira/browse/COUCHDB-244
Project: CouchDB
Issue Type: New Feature
Components: Database Core
Reporter: Frederik Fix
Attachments: striped_queries.diff

[...]

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
Re: [Patch] striped queries/faceted search
Just noticed that the attachment got stripped. I've added an issue to JIRA here: https://issues.apache.org/jira/browse/COUCHDB-244

Frederik

On 10 Feb 2009, at 13:00, Jan Lehnardt wrote:

Hi Frederik,

On 10 Feb 2009, at 11:54, Frederik Fix wrote: The patch is against the current trunk version (rev 742925) and all tests pass.

what patch? :) Feel free to open a JIRA ticket and attach the patch there if this mailing list doesn't let you post attachments. https://issues.apache.org/jira/browse/COUCHDB

Cheers
Jan
--
[jira] Created: (COUCHDB-244) Striped queries
Striped queries
---------------
Key: COUCHDB-244
URL: https://issues.apache.org/jira/browse/COUCHDB-244
Project: CouchDB
Issue Type: New Feature
Components: Database Core
Reporter: Frederik Fix
Attachments: striped_queries.diff

[...]

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
Re: Stats Patch API Discussion
On Feb 10, 2009, at 10:19 AM, Jan Lehnardt wrote:

Hi,

Alex and I are working on our stats package patch and the last bigger issue is the API. It is just exposing a bunch of values by keys, but as usual, the devil is in the details. Let me explain.

There are two types of counters. Hit counters record things like the number of requests; they increase monotonically each time a request hits CouchDB. This is useful for counting stuff. Cool.

Then there are absolute-value counters (for lack of a better term) that collect absolute values, like the number of milliseconds a request took to complete. To create a meaningful metric out of this type of counter, we need to compute averages. There's little value in recording each individual request (it could still do that in the access logs) for monitoring reports, so we keep some aggregate values: min, max, mean, stddev, and count (count being the number of times this counter was called).

Complexity++

Say you have a CouchDB running for a month. You change some things in your app or in CouchDB and you'd like to know how this affected your response time. To effectively see anything you'd have to restart CouchDB (and lose all stats) or wait a month. If you want to see problems coming up in your monitoring, you need finer-grained time ranges to look at. To make this a little more useful, Alex and I introduced time ranges. These are an additional set of aggregates that get reset every 1, 5 and 15 minutes. This should be familiar to you from server load. You can get the aggregate values for four time ranges:

- Between now and the beginning of time (when CouchDB was started).
- Between now and 60 seconds ago.
- Between now and 300 seconds ago.
- Between now and 900 seconds ago.

These ranges are hardcoded for now, but they can be made configurable at a later time. The API would look like this:

    GET /_stats/couchdb/request_time

    {
      couchdb: {
        request_time: {
          description: "Aggregated request time spent in CouchDB since the beginning of time",
          min: 20,
          max: 20,
          mean: 20,
          stddev: 20,
          count: 7,
          range: 0 // 0 means since day zero
        }
      }
    }

To get the aggregate stats for the last minute:

    GET /_stats/couchdb/request_time?range=1

    {
      couchdb: {
        request_time: {
          description: "Aggregated request time spent in CouchDB since 1 minute ago",
          min: 20,
          max: 20,
          mean: 20,
          stddev: 20,
          count: 7,
          range: 1 // minutes
        }
      }
    }

Or, more generically:

    GET /_stats/couchdb/request_time?range=$range

    {
      couchdb: {
        request_time: {
          description: "Aggregated request time spent in CouchDB since $range minutes ago",
          min: 20,
          max: 20,
          mean: 20,
          stddev: 20,
          count: 7,
          range: $range // minutes
        }
      }
    }

This seems reasonable. The actual naming of range and other keys can be changed, as can the description text.

Complexity--

Remember hit counters? Yes, strictly speaking, CouchDB shouldn't want to collect any averages there, since our monitoring solution would take care of that. But then, there are the four time-range counters available and we could just as well populate them too. Let's say every second:

    GET /_stats/httpd/requests[?$resolution=[1,5,15]]

    {
      couchdb: {
        requests: {
          description: "Number of requests per second in the last $resolution minutes",
          min: 20,
          max: 20,
          mean: 20,
          stddev: 20,
          count: 7,
          range: $range // minutes
        }
      }
    }

count would be the raw counter for the stats, and the rest meaningful aggregates. Per second is an arbitrary choice again and can be made configurable, if needed.

To know at what frequency stats are collected, there's a new member in the list of aggregates:

    {
      couchdb: {
        requests: {
          description: "Number of requests per $frequency seconds in the last $resolution minutes",
          min: 20,
          max: 20,
          mean: 20,
          stddev: 20,
          count: 7,
          range: $range, // minutes
          frequency: 1   // seconds
        }
      }
    }

Alex and I tried a couple of different approaches to get here: different URLs for the different types of counters and aggregates, adding members in different places, with and without description, and a whole lot more, but we surely haven't seen all permutations. This solution offers a unified URL format and a human-readable as well as computer-parseable way to determine what kind of counter you're dealing with.

To just get all stats you can do a GET /_stats/ and get a huge JSON object back that includes all of the above for all resolutions that are currently collected.

Is there anything that does not make sense or is too complicated? The goal was to create a simple, minimal API for a minimal set of useful statistics, and Alex and I hope to have found this by now. But if you can see how this could be further simplified, let us know :) Alex and I are also open to completely different approaches to get the data out of CouchDB.

We're looking for a few things in this thread:

- A sanity check to know we're not completely off.
- A summary (for)
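As a quick taste of consuming the proposed API, a minimal sketch from the JS test shell; CouchDB.request is the helper the couch.js test suite uses, and the URL and field names follow the proposal above (none of them final):

    // Everything at once: all counters, all currently collected ranges.
    var all = JSON.parse(CouchDB.request("GET", "/_stats/").responseText);

    // Or one counter, one window: the last 60 seconds of request times.
    var resp = CouchDB.request("GET", "/_stats/couchdb/request_time?range=1");
    var rt = JSON.parse(resp.responseText).couchdb.request_time;

    // count is how often the counter fired in that window; together with
    // mean it recovers the total time spent serving requests.
    var totalMs = rt.mean * rt.count;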
Re: Stats Patch API Discussion
On 10 Feb 2009, at 16:47, Zachary Zolton wrote:

Jan, so you're saying I could run some test, and then hit:

    GET /_stats/couchdb/request_time?range=$SOME_MINUTES

and then make some changes, run the same test, hit the same URL again, and detect the delta in performance caused by my changes?!?

Exactly, but where $SOME_MINUTES is hardcoded to 1, 5, or 15 to start with. If you have any reporting and graphing tool connected, you'd have pretty pictures, too :)

That's about the level of performance tuning I'm comfortable with doing in PostgreSQL, but it's all over HTTP instead. Nice!

Thanks,
Jan
--

Cheers,
Zach

On Tue, Feb 10, 2009 at 9:19 AM, Jan Lehnardt j...@apache.org wrote: [...]
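That before-and-after workflow, as a minimal sketch from the JS test shell; again CouchDB.request is the couch.js helper, and the field names follow the proposal upthread rather than any final API:

    function requestTimeMean() {
      var resp = CouchDB.request("GET", "/_stats/couchdb/request_time?range=1");
      return JSON.parse(resp.responseText).couchdb.request_time.mean;
    }

    var before = requestTimeMean();
    // ... deploy the change, re-run the same test load ...
    var after = requestTimeMean();
    // A negative delta means the change made requests faster on average.
    var deltaMs = after - before;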
Eunit
Hi,

The previously mentioned stats patch introduces EUnit*-style unit tests for Erlang code. I believe this is useful for the rest of CouchDB as well. There is a simple test runner in test/ that includes a few tests for the couch_config* modules, but that was never meant to be a permanent solution.

* http://svn.process-one.net/contribs/trunk/eunit/doc/overview-summary.html

EUnit is the de-facto unit testing tool for Erlang applications and it is even included in the latest distributions of Erlang/OTP. It's far from perfect, but I think CouchDB would benefit from adopting it, to gain standardized test cases for CouchDB modules and to encourage writing more of them.

The one caveat with EUnit is that it is released under the LGPL. I am not a lawyer, but the consensus on the Net is that writing test cases against the EUnit API and conditionally including eunit.hrl to pull in that API does not mean that the test code itself must be released under the terms of the LGPL. If anyone is familiar with this, can you comment on whether this is correct?

Technically, the EUnit tests for each module can live in the same file as the module's functions or in a separate directory. I'd opt for separating tests into their own directory, but I don't feel strongly about it.

The tests should not interfere with production code. For this reason, there is a compile-time switch to enable tests. We can wire this up to `make test` so that EUnit-enabled CouchDB modules are built locally and then tested. For `make install`, the compile-time switch to enable tests would be off.

Some modules have a few tests inline; once we have the EUnit infrastructure in place, it'd be a good idea to modify the existing tests to fit in.

Provided that there are no licensing issues, is there anything that would speak against adding EUnit tests to CouchDB?

--

For now, the EUnit patch is entangled in the statistics patch, but we could separate it into its own patch. Would that be something that the community is interested in? Also, somebody please clear the legal issue :)

Cheers
Jan
--
Re: Roadmap discussion
On Tue, Feb 10, 2009 at 11:46 AM, Kerr Rainey kerr.rai...@gmail.com wrote: Is there still interest in stabilising a native erlang interface? -- Kerr

Definitely. I was contemplating this a bit the other day. I wonder if it wouldn't be beneficial to create a couch_api.erl and just define an Erlang API that maps to what other client libraries look like. Then if someone wants to peek into the internals they're free to, and we can maintain that we only support compatibility on that one file. Anyway, just an idle thought.

HTH,
Paul Davis
Re: Eunit
On Tue, Feb 10, 2009 at 5:52 PM, Jan Lehnardt j...@apache.org wrote: The one caveat with EUnit is that it is released under the LGPL. [...] If anyone is familiar with this, can you comment on whether this is correct?

The best thing for you would be to send a question to legal-disc...@a.o. With my conservative hat on, I'm a bit concerned about LGPL virality in namespaced languages, and most definitely concerned with distribution of EUnit itself (I reckon this is not necessary as EUnit is part of OTP now?).

--
Gianugo Rabellino
Sourcesense, making sense of Open Source: http://www.sourcesense.com
(blogging at http://www.rabellino.it/blog/)
Re: Roadmap discussion
* Full Text Search interface - We've had basically working patches for this floating around for a while. - It seems simple enough, we just need someone who's comfortable in Java to step up to the plate and write a Lucene adapter. (Thanks!)

I'm more than happy to look at this when I get time. I've been wondering where to start hacking on couch, and we currently use Solr at work, so I would be able to justify some work time on it too.

Kev
Re: Roadmap discussion
I've made some progress on this, FWIW: http://github.com/rnewson/couchdb-lucene

B.

On Tue, Feb 10, 2009 at 12:27 PM, Kevin Jackson foamd...@gmail.com wrote: [...]
Re: 0.9.0 Delay or Release?
@Kerr: that 0.9 does not imply the next release is 1.0. Yeah, I was originally confused by that too! But then I re-read this: http://en.wikipedia.org/wiki/Software_versioning#Software_versioning_schemes

And now I'm cool as a cucumber WRT having 0.10 or even 0.1000...! LOL, it helps when I RTFM, I guess. :^P

Cheers,
Zach
Re: Roadmap discussion
On Tue, Feb 10, 2009 at 8:53 AM, Paul Davis paul.joseph.da...@gmail.com wrote: [...] I wonder if it wouldn't be beneficial to create a couch_api.erl and just define an erlang api that maps to what other client libraries look like. Then if someone wants to peek into the internals they're free and we can maintain that we only support compatibility on that one file.

I've been interfacing with the raw Erlang API for a commercial project. It works like a charm, the only trouble being that it isn't documented and that it could change out from under me with no warning. (Although the second caveat isn't as bad as it sounds, because it probably won't change much.) From my experience, I'm having a hard time seeing how any additional code could help make the Erlang API official.

The project I'm working on has a very specific data model (no updates, lots of parallel attachment writing, using the HTTP API for everything but the critical path...), and using the Erlang API has allowed me to cut out a lot of code paths (e.g. rev checking). Doing this wouldn't be safe for a general-purpose API, but when you are interfacing in Erlang, you're not using a general-purpose API anyway.

I'm happy to have an Erlang API, but maybe it should wait until sometime after 0.9. I think the best way to ensure that it's maintained as stable would be to have an Erlang integration suite, which could double as documentation. It certainly wouldn't hurt to have more Erlang tests, so maybe we can file this feature under testing for now, and hope we get an Erlang test suite created by interested parties. Once we have the test suite, we'll know what the Erlang API is.

--
Chris Anderson
http://jchris.mfdz.com
Re: Eunit
On Tue, Feb 10, 2009 at 06:19:01PM +0100, Jan Lehnardt wrote:

On 10 Feb 2009, at 18:11, Gianugo Rabellino wrote: [...] The best thing for you would be to send a question to legal-disc...@a.o. [...]

Thanks, I'll check with legal-discuss@ when this list agrees on adding EUnit support. Bundling EUnit is not necessary as of the latest OTP release, and for earlier releases you need to install it manually or you can't run `make test`, which is not too much of a problem, I'd say.

Thanks
Jan

^ I am in agreement with adding EUnit support to CouchDB. I find EUnit useful when writing or changing code. Also, I support a test directory separate from the src:

../doc
../ebin
../include
../priv
../src
../test

(though I know CouchDB does not presently follow that OTP-recommended directory structure)

I use EUnit in a simple manner, based on techniques derived from two erlang-questions threads:

http://www.nabble.com/I-Hate-Unit-Testing...-td21697138.html
http://www.nabble.com/Lightweight-test-driven-development-and-unit-testing-td21704767.html

e.g.

    %% the module to test
    -module(my_mod).
    -export([...]).  % public funs

    -include("../test/my_mod_test.erl").  % yes, .erl

    %% ... all the module funs, both exported and private ...

    %% end of my_mod.erl

and ...

    %% the testing "module": ../test/my_mod_test.erl, included into
    %% my_mod above, so it carries no -module attribute of its own
    -ifdef(test).
    -include_lib("eunit/include/eunit.hrl").

    first_test() ->
        ?assertMatch(true, first_fun(arg)).

    second_test() ->
        ?assertMatch({error, badarg}, second_fun(badarg)).

    %% ... remaining tests ...

    -endif.
    %% end of my_mod_test.erl

When compiling, simply do not -define(test) and the my_mod_test code is not even included. Note that this technique allows testing of private funs in my_mod without having to export them.

I also use an Emakefile such as ...

    %% to make, from the command line do the following:
    %%   erl -make
    %% to run tests from the command line do the following:
    %%   erl -pa ../ebin -eval 'eunit:test(my_mod, [verbose]), init:stop().'
    {'*', [ {outdir, "../ebin"}
          , {i, "../include"}
          , {i, "../test"}
          , debug_info
          , strict_record_tests
          , netload
          %% , {d, debug}  %% uncomment for debug
          , {d, test}      %% uncomment for dev/test, do touch ../test/*
          ]}.

~Michael

---
Portland, Oregon, USA
http://autosys.us
Re: Eunit
Hi Michael,

On 10 Feb 2009, at 19:29, Michael McDaniel wrote: [...] I am in agreement with adding EUnit support to CouchDB. I find EUnit useful when writing or changing code. Also, I support a test directory separate from the src [...] (though I know CouchDB does not presently follow that OTP-recommended directory structure)

I don't want to complicate the proposed patches. I'm not for or against this, but we should think about it separately. Maybe open a JIRA ticket, so we don't forget about this, that includes a short description of the benefits?

[...] When compiling, simply do not -define(test) and the my_mod_test code is not even included. Note that this technique allows testing of private funs in my_mod without having to export them. I also use an Emakefile such as ... [...]

Since we're using good old make, this would look a little different in practice, but would do the same thing :)

Thanks for chiming in and the tips.

Cheers
Jan
--
Re: Roadmap discussion
2009/2/10 Michael McDaniel couc...@autosys.us: ... also, an Erlang API that skips the JSON <-> native Erlang term translation overhead, given that term translation is not necessary when talking 'directly' with the CDB engine (e.g. couch_query_servers:map_docs/2 could skip the JSON <-> term() translation if the view engine reads/writes native Erlang terms).

Interesting. I'd certainly consider this another level further than what I was thinking of, or indeed would be thinking of using. There are probably a few levels at which couch functionality could be exposed natively. I wonder how much doing this kind of bypassing for a native Erlang view engine would complicate the code? Or would it give another clean layer?

--
Kerr
View Intersections
I've been contemplating implementing a new feature that I've been wanting for a while. There's been some talk of implementing view intersections for a bit now, so I figured I'd try to give a summary of what the feature would entail in terms of functionality, and then the necessary bits required for an implementation.

The original idea for view intersections was exactly what the name entails: show me the intersection between two views for a given set of view query parameters. After thinking about different methods of implementation, I think we can extend this to be more powerful and generally applicable.

Major Hurdle 1

The first necessary bit of groundwork would be to implement an optional value index on views. The more I thought about intersecting views, the more I realized it was starting to look pointless. Ignoring something along the lines of group_level=N, in that we can join on array prefixes, all views being joined would require exactly the same key. Which begs the question: why not just create one view that emits the output of the two you want intersected?

I couldn't get past this for a long time, until I heard Chris Anderson pondering adding a btree index to the values in a view. The obvious downsides of the extra space and computation are there, but making it optional should solve any qualms in that respect. Given an index on a value, we're now able to chain together arbitrary views using either the key or value, as well as limit the intersection by any combination of key and value.

As a side benefit, we would also get the "select views by value" restriction. I'm thinking it'd be as transparent as adding a [start|end]value and [start|end]value_docid set of URL parameters. I haven't followed this train of thought too far into the code yet, but something approximating that should be fairly doable. A thought occurs that limiting view results by both key and value could be interesting in terms of implementation. Not sure if I'd force it through the intersection API or not.

Caveats that come to mind are that this would break binary compatibility for all generated views. It wouldn't require a dump/reload, but it might come as a surprise to people upgrading that all their views are regenerating.

Major Hurdle 2

Implementing the view intersection API. First off, it probably needs a new name: once we have intersections working, unions, subtractions, and the NxM one whose name escapes me (cross product floats up but sounds not right) should be trivially implementable. The underlying implementation is basically a large merge sort running over the view btrees. If you read about the merge step in map/reduce/merge, that's basically what I've got in my head.

The biggest issue that I've found in getting this implemented (excluding a value index) is that I'd need to write a new btree traversal method that uses iterators instead of a fold mechanism. This shouldn't be overly difficult to implement. Beyond that, it's basically up to the HTTP interface in parameter parsing and error checking. For passing parameters I'm thinking along the lines of a posted JSON body (preemptive: any RESTafarians should reference the long discussion on multi-get before writing about how this isn't RESTful). Also, not sure if it's obvious, but I'd plan on allowing arbitrarily nested conditions, i.e. intersection(union(a, b), c) types of operations.

There's a subtle detail in the sort order, and thus the corresponding btree traversal, that might come into play there. I can punt and make the entire request use one sort order; that is, the previous example couldn't specify different sort directions for the two nested operations, because you'd get a (presumably) zero overlap in the end. I'm pretty sure that if we force all btrees to be traversed in the same direction for each request, we don't lose any functionality though.

Comments

That's the general outline I've got in my head right now. I'm pretty sure I can see 95% of the implementation, but it's possible I'm missing a finer detail somewhere. If you've got questions or comments, let's hear them. If there's no general objection then I can probably get to starting an implementation at the end of this week.

Thanks,
Paul Davis
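To make the parameter-passing idea concrete, a hedged sketch of what such a posted JSON body might look like; every name here (the _intersect endpoint, the op/operands members, the example views) is invented for illustration, not part of the proposal:

    // Nested condition intersection(union(a, b), c) as a POSTable body.
    var body = {
      op: "intersection",
      operands: [
        {op: "union", operands: [
          {view: "places/by_tag", startkey: "a", endkey: "m"},
          {view: "places/by_owner", key: "davisp"}
        ]},
        {view: "places/by_rating", startkey: 4, endkey: 5}
      ]
    };

    // Posted as JSON, since the nesting outgrows a GET query string;
    // CouchDB.request is the JS test-suite helper.
    CouchDB.request("POST", "/db/_intersect", {
      headers: {"Content-Type": "application/json"},
      body: JSON.stringify(body)
    });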
Re: Stats Patch API Discussion
CouchDB is designed so that it can crash and restart. What possibilities are there for having statistics persisted between runs, for this reason? -- Noah Slater, http://tumbolia.org/nslater
Re: Stats Patch API Discussion
On Tue, Feb 10, 2009 at 8:11 PM, Noah Slater nsla...@apache.org wrote: CouchDB is designed so that it can crash and restart. What possibilities are there for having statistics persisted between runs, for this reason?

I'd argue that we should let the stats collection packages deal with persisting anything that needs it. I tend to agree with Jan's earlier comments that this should be about generating data, and we leave the pretty graphs to dedicated software.

HTH,
Paul Davis
Re: View Intersections
Just a few comments to get things started.

On Tue, Feb 10, 2009 at 5:59 PM, Paul Davis paul.joseph.da...@gmail.com wrote: [...] all views being joined would require exactly the same key. Which begs the question, why not just create 1 view that emits the output of the two you want intersected. [...]

I would argue that returning a simple list of docids that meet the requirement should suffice -- in fact, the views a and b need not be homogeneous, so returning anything beyond docids could end up being a bigger problem than the intersection itself. For instance, say we want the intersection of the documents that have both blue and fuzzy tags, so we use:

    a = /_view/tags/byval?key=blue
    b = /_view/tags/byval?key=fuzzy

    intersection(a,b)

Now we want to limit that to things named Harold:

    c = /_view/name/first?key=Harold

    intersection(intersection(a,b),c)

which gives us a list of docids that contain blue, fuzzy things named Harold. However, while the values returned by view a and view b are the same, the values returned by view c might be completely different. So returning a view with varying values might not be very helpful. (This is where I am not seeing why returning anything more than a list of docids would be appropriate. Of course I am most likely missing the point.) Only returning intersections of similar views would not be as interesting as returning intersections of dissimilar views.

[...] For passing parameters I'm thinking along the lines of a posted JSON body (preemptive: any RESTafarians should reference the long discussion on multi-get before writing about how this isn't RESTful). [...]

Posting JSON documents seems to be required, and beyond argument, given the technical size limits of a GET request.

[...]
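Concretely, the docid-only result Jeff describes might look like this; a hedged sketch, with a response shape invented for illustration that simply mirrors the usual view-row layout minus keys and values:

    // Response to intersection(intersection(a,b),c): just the ids of
    // the blue, fuzzy things named Harold. Callers fetch docs or view
    // values in a follow-up request if they need more than ids.
    {
      "total_rows": 2,
      "rows": [
        {"id": "harold_001"},
        {"id": "harold_002"}
      ]
    }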
Re: View Intersections
On Tue, Feb 10, 2009 at 10:19 PM, Jeff Hinrichs - DMT dunde...@gmail.com wrote: [...]
[jira] Created: (COUCHDB-245) Couch uses the erlang stdlib module regexp, which is deprecated and set to be removed. It should use the module re instead.
Couch uses the erlang stdlib module regexp, which is deprecated and set to be removed. It should use the module re instead.
---------------------------------------------------------------------------------------------------------------------------
Key: COUCHDB-245
URL: https://issues.apache.org/jira/browse/COUCHDB-245
Project: CouchDB
Issue Type: Bug
Components: Infrastructure
Affects Versions: 0.7.2, 0.8, 0.8.1
Environment: regexp is set to be removed from stdlib when R15 is released.
Reporter: alisdair sullivan

Couch uses the erlang stdlib module regexp, which is deprecated and set to be removed. It should use the module re instead. re is not a drop-in replacement for regexp; it operates on and returns binary strings instead of native strings. Affects files couch_config.erl, couch_config_writer.erl, couch_httpd.erl, couch_httpd_server.erl, couch_log.erl and couch_server.erl.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Updated: (COUCHDB-245) Couch uses the erlang stdlib module regexp, which is deprecated and set to be removed. It should use the module re instead.
[ https://issues.apache.org/jira/browse/COUCHDB-245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

alisdair sullivan updated COUCHDB-245:
--------------------------------------
Affects Version/s: 0.9

Couch uses the erlang stdlib module regexp, which is deprecated and set to be removed. It should use the module re instead.
---------------------------------------------------------------------------------------------------------------------------
Key: COUCHDB-245
URL: https://issues.apache.org/jira/browse/COUCHDB-245
Project: CouchDB
Issue Type: Bug
Components: Infrastructure
Affects Versions: 0.7.2, 0.8, 0.8.1, 0.9
Environment: regexp is set to be removed from stdlib when R15 is released.
Reporter: alisdair sullivan
Original Estimate: 2h
Remaining Estimate: 2h

[...]

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
Re: View Intersections
On Tue, Feb 10, 2009 at 9:58 PM, Paul Davis paul.joseph.da...@gmail.com wrote: [...]
Helping out
Hello. I've been following CouchDB from the sidelines for a while but haven't been able to put much time into it. Recently, however, Sun laid me off, and I thought this would be a good opportunity to get a little more engaged. No better way, IMHO, than to help out with the project. FYI, I'm already a committer to Apache Derby, although I haven't been active there in the past few years.

I was looking at your road map, and it looked like you want to get a lot of documentation written. I was thinking that would be a great way for me to start learning CouchDB. Is there a specific document that you would like me to try my hand at? Also, what are your processes, technologies and standards around documentation?

I can also start poking around at your bug list and perhaps offer some patches to get my feet wet. Is there anything in particular that you would like someone to focus on? I don't have an Erlang background, although I'm interested in learning. My background is server-side Java and databases, for the most part.

I look forward to hearing from you. Meanwhile I'll try to get a build going and see how that goes.

All the best,

David

--
David W. Van Couvering
http://davidvancouvering.blogspot.com
[jira] Created: (COUCHDB-246) allow customization of external process timeout
allow customization of external process timeout
-----------------------------------------------
Key: COUCHDB-246
URL: https://issues.apache.org/jira/browse/COUCHDB-246
Project: CouchDB
Issue Type: Bug
Components: Database Core
Affects Versions: 0.9
Reporter: Robert Newson
Priority: Blocker
Fix For: 0.9

If an external process takes too long to respond, it is killed. The timeout is quite short (a few seconds) and is not configurable from .ini files today. couchdb-lucene could use this ability, as the first attempt to sort on a field in a large index is slow while it builds a cache. With the timeout, it's killed and the partial work is lost.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Commented: (COUCHDB-246) allow customization of external process timeout
[ https://issues.apache.org/jira/browse/COUCHDB-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672520#action_12672520 ]

Antony Blakey commented on COUCHDB-246:
---------------------------------------

I have this problem with _externals that implement lazy-view update semantics, the problem being that the amount of work required for the _external to catch up to the current update_seq is unknown. I experimented with a solution that allows the external to return a keep-alive message to the server, which doesn't return a value to the client but does stop the server from killing the external. I got distracted and didn't complete that work, but I think this is a better solution than a fixed timeout.

The problem with a timeout is that it doesn't account for machine performance or load, or the possibly highly variable amount of work that the external needs to do on a per-request basis, whereas a keep-alive more correctly captures what you want, i.e. that the external process is making progress. Such a keep-alive could specify a timeout value, so that the external process could control the definition of failure according to how often it will send keep-alives, but that might be an unnecessary complication.

[...]

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
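As a hedged illustration of the keep-alive idea: _external processes speak line-delimited JSON over stdin/stdout, but the message shapes below are entirely invented, as is the work loop; no keep-alive message exists today:

    // Sketch: a long-running external periodically emits a progress
    // line to reset the server's kill timer, then sends its real
    // response when done. caughtUp() and processNextBatch() are
    // hypothetical stand-ins for the external's actual work.
    while (!caughtUp()) {
      processNextBatch();
      print(JSON.stringify({keepalive: true})); // resets the timeout
    }
    print(JSON.stringify({code: 200, json: {ok: true}})); // actual reply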
Re: Helping out
On Wed, Feb 11, 2009 at 12:27 AM, David Van Couvering da...@vancouvering.com wrote: [...]

David,

It's awesome to hear your interest, especially given your recent situation.

Re: Documentation

As far as I'm aware, the only guidelines in terms of documentation are to put things on the wiki. I would say that if a specific section of CouchDB interests you, start learning the code base from that aspect and add good wiki information on it. I know that I, for one, am not the most vigilant in keeping things in sync.

Another aspect to documentation would be documenting the Erlang documentation best practices. It doesn't sound as sexy, but getting a good set of rules for native Erlang documentation would be a Good Thing™. There have been attempts at getting autogenerated docs. Having a good distillation of rules as well as a working build integration with the website would be an awesome advancement.

Re: Patches

The two biggest suggestions I have would be to start reading code via the *_httpd_*.erl sources. In terms of behavior, these have the most documentation as well as being a very logical root point to start tracing code paths. If you have something that tickles your fancy, it's fun to follow an HTTP request all the way to disk. I took a shining to view generation and ended up reading through the btree code. There's lots of the seductive "no fucking way it can be this easy" type of code that makes the internals fun to read through.

My other suggestion is fairly closely related: start walking through the list of bugs that are blocking for 0.9 and see what you're comfortable dealing with. I'd definitely suggest adding comments to bugs or popping on IRC if you find something approachable. JIRA is a PITA when it comes to assigning things, so I spend a good chunk of my time trying to remember if someone on the ML or IRC claimed progress or ongoing work. For reference, Jan has an awesome page set up that will get you the list of blocking issues for 0.9 at [1]. Hopefully he'll keep it updated beyond the 0.9 release.

[1] http://jan.prima.de/fuckjira.html

HTH and welcome to the community,
Paul Davis