On Fri, Mar 23, 2012 at 11:48 AM, Logan Bell <[email protected]> wrote:
> Would anyone be opposed if I fleshed out the documentation around the
> following links to explain a couple patterns that his e-mail chain reminded
> me of when I first started Lucy?

You've identified a common question, all right, and I think addressing it in
our official documentation would be a nice improvement. :)

> The documents in question are:
> http://incubator.apache.org/lucy/docs/perl/Lucy/Search/IndexSearcher.html
> http://incubator.apache.org/lucy/docs/perl/Lucy/Docs/Tutorial/BeyondSimple.html

It might be a little tricky to integrate this into IndexSearcher's reference
docs, so I would advocate either integrating it into the Tutorial, or perhaps
better yet, writing a short Cookbook entry and linking to it from the
Tutorial. Not every Cookbook entry has to be as long as CustomQuery or
CustomQueryParser!

> It's not clear how to obtain all documents associated with a query and that
> the num_wanted value defaulted to 10. I would like to give an example of
> how one might get all results and also update the IndexSearcher
> documentation to mention that num_wanted is defaulted to 10 (with an offset
> of 0).

The reason we haven't documented this idiom before is because we don't really
want to encourage people to use it -- users should be shunted towards a best
practice of paging through hits. The memory consumed during search when you
say "give me *all* matches" scales with index size, and can get out of control
with large indexes.

Nevertheless, it's such a common question that we ought to make it easy to
find the answer.

>   my $doc_count = $searcher->doc_max;
>   my $hits = $searcher->hits(  # returns a Hits object, not a hit count
>       query      => 'foo',
>       num_wanted => $doc_count,
>   );

IMO, this code sample would be improved by using "$doc_max" as the variable
name. As a matter of coding style, I think it's desirable to associate the
name of the variable with the name of the method where the value came from.

But more importantly, "Doc_Count" is actually an IndexReader method which does
something slightly different from "Doc_Max":

    /** Return the maximum number of documents available to the reader, which
     * is also the highest possible internal document id. Documents which
     * have been marked as deleted but not yet purged from the index are
     * included in this count.
     */
    public abstract int32_t
    Doc_Max(IndexReader *self);

    /** Return the number of documents available to the reader, subtracting
     * any that are marked as deleted.
     */
    public abstract int32_t
    Doc_Count(IndexReader *self);

Doc_Max() is what you want whenever you're allocating space to hold document
numbers, like we are here.

Marvin Humphrey
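
To illustrate the sizing point above, here is a minimal toy sketch in C. It
does not use Lucy's API: ToyReader and every toy_* function are hypothetical
names invented for this example, and it assumes document ids run from 1 up to
doc_max, so that slot 0 goes unused. It only shows why storage indexed by
document id has to be sized from the doc_max-style value and not from the
doc_count-style value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical toy reader, NOT Lucy's IndexReader. */
typedef struct {
    int32_t doc_max;  /* highest document id ever assigned */
    bool   *deleted;  /* deleted[id] is true once the doc is marked deleted */
} ToyReader;

/* Create a reader holding doc ids 1..doc_max, none deleted. */
ToyReader *toy_reader_new(int32_t doc_max) {
    ToyReader *reader = malloc(sizeof(ToyReader));
    if (reader == NULL) { return NULL; }
    reader->doc_max = doc_max;
    reader->deleted = calloc((size_t)doc_max + 1, sizeof(bool));
    if (reader->deleted == NULL) { free(reader); return NULL; }
    return reader;
}

void toy_reader_free(ToyReader *reader) {
    if (reader != NULL) {
        free(reader->deleted);
        free(reader);
    }
}

/* Mark a document as deleted.  Its id stays reserved until a purge. */
void toy_delete(ToyReader *reader, int32_t doc_id) {
    assert(doc_id >= 1 && doc_id <= reader->doc_max);
    reader->deleted[doc_id] = true;
}

/* Highest possible doc id; deleted-but-unpurged docs still count. */
int32_t toy_doc_max(const ToyReader *reader) {
    return reader->doc_max;
}

/* Number of documents that are not marked as deleted. */
int32_t toy_doc_count(const ToyReader *reader) {
    int32_t count = 0;
    for (int32_t id = 1; id <= reader->doc_max; id++) {
        if (!reader->deleted[id]) { count++; }
    }
    return count;
}

/* Allocate zeroed per-document slots addressable by any valid doc id.
 * Sizing this from toy_doc_count() would leave the highest ids out of
 * bounds as soon as any document had been deleted. */
int32_t *toy_alloc_slots(const ToyReader *reader) {
    return calloc((size_t)toy_doc_max(reader) + 1, sizeof(int32_t));
}
```

With five documents of which two are marked deleted, toy_doc_count() reports
3 while toy_doc_max() still reports 5, and id 5 remains a valid index into
the array returned by toy_alloc_slots().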
