On Fri, Mar 23, 2012 at 1:33 PM, Marvin Humphrey <[email protected]> wrote:
> On Fri, Mar 23, 2012 at 11:48 AM, Logan Bell <[email protected]> wrote:
> > Would anyone be opposed if I fleshed out the documentation around the
> > following links to explain a couple patterns that his e-mail chain
> > reminded me of when I first started Lucy?
>
> You've identified a common question, all right, and I think addressing it in
> our official documentation would be a nice improvement. :)
>
> > The documents in question are:
> >
> > http://incubator.apache.org/lucy/docs/perl/Lucy/Search/IndexSearcher.html
> >
> > http://incubator.apache.org/lucy/docs/perl/Lucy/Docs/Tutorial/BeyondSimple.html
>
> It might be a little tricky to integrate this into IndexSearcher's reference
> docs, so I would advocate either integrating it into the Tutorial, or perhaps
> better yet, writing a short Cookbook entry and linking to it from the
> Tutorial. Not every Cookbook entry has to be as long as CustomQuery or
> CustomQueryParser!

+1

> > It's not clear how to obtain all documents associated with a query and that
> > the num_wanted value defaulted to 10. I would like to give an example of
> > how one might get all results and also update the IndexSearcher
> > documentation to mention that num_wanted is defaulted to 10 (with an offset
> > of 0).
>
> The reason we haven't documented this idiom before is because we don't really
> want to encourage people to use it -- users should be shunted towards a best
> practice of paging through hits.
>
> The memory consumed during search when you say "give me *all* matches" scales
> with index size, and can get out of control with large indexes.
>
> Nevertheless, it's such a common question that we ought to make it easy
> to find the answer.

Agreed - perhaps with a stern caveat/warning that this is not advocated for
large indexes. Surely paging is what we ultimately want.
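
For reference, paging could look roughly like the following. This is only a
minimal, untested sketch: for_each_hit is a hypothetical helper of my own, not
part of Lucy, and it assumes nothing beyond a searcher whose hits() method
accepts query, offset, and num_wanted and returns an object with a next()
method, as IndexSearcher and Hits do.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lucy): visit every match for a query one
# page at a time instead of requesting all hits in a single call.
# $searcher is assumed to behave like a Lucy::Search::IndexSearcher.
sub for_each_hit {
    my ( $searcher, $query, $page_size, $callback ) = @_;
    die "page_size must be positive" unless $page_size > 0;

    my $offset = 0;
    my $seen   = 0;
    while (1) {
        my $hits = $searcher->hits(
            query      => $query,
            offset     => $offset,
            num_wanted => $page_size,
        );

        # Drain the current page, handing each hit to the caller.
        my $on_page = 0;
        while ( my $hit = $hits->next ) {
            $callback->($hit);
            $on_page++;
        }
        $seen += $on_page;

        # A short (or empty) page means there is nothing left to fetch.
        last if $on_page < $page_size;
        $offset += $page_size;
    }
    return $seen;
}

# Example use; the index path and the "title" field are placeholders:
#   my $searcher = Lucy::Search::IndexSearcher->new( index => '/path/to/index' );
#   for_each_hit( $searcher, 'foo', 10, sub { print $_[0]->{title}, "\n" } );
```

Each page is a fresh search, so this fits the usual case of showing one page
at a time better than it fits exporting an entire index.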
> > my $doc_count = $searcher->doc_max;
> > my $hits = $searcher->hits(    # returns a Hits object, not a hit count
> >     query      => 'foo',
> >     num_wanted => $doc_count,
> > );
>
> IMO, this code sample would be improved by using "$doc_max" as the variable
> name. As a matter of coding style, I think it's desirable to associate the
> name of the variable with the name of the method where the value came from.
> But more importantly, "Doc_Count" is actually an IndexReader method which does
> something slightly different from "Doc_Max":
>
>     /** Return the maximum number of documents available to the reader, which
>      * is also the highest possible internal document id. Documents which
>      * have been marked as deleted but not yet purged from the index are
>      * included in this count.
>      */
>     public abstract int32_t
>     Doc_Max(IndexReader *self);
>
>     /** Return the number of documents available to the reader, subtracting
>      * any that are marked as deleted.
>      */
>     public abstract int32_t
>     Doc_Count(IndexReader *self);
>
> Doc_Max() is what you want whenever you're allocating space to hold document
> numbers, like we are here.

Sure, probably a better var name. However, this raises another question, both
for me and potentially for the documentation: is it possible to obtain all
documents excluding the ones marked for deletion?

Thanks!
Logan

> Marvin Humphrey
