On Fri, Sep 09, 2011 at 09:03:55AM +0200, goran kent wrote:
> On Thu, Sep 8, 2011 at 10:05 PM, Marvin Humphrey <[email protected]> 
> wrote:
> > If you like, take a look at LucyX::Remote::SearchServer and
> > LucyX::Remote::SearchClient.  They're implemented in pure Perl, so Perl
> > programmers generally find them easier to grok than other parts of the
> > library.
> 
> Presumably though the Perl interface is calling the super-fast C
> library so there's no performance penalty, right?

Correct -- the LucyX::Remote classes are just small RPC wrappers.

> > Lucy supports sorting on the lexical value of text fields.  You can use
> > sprintf() or the like to stringify floats so that field values sort 
> > lexically
> > in the same order as they would numerically.
> 
> Ignorant question:  stringify during indexing, or searching?

You must stringify during indexing.

Here's example indexing code where both the "date" and "rainfall" fields are
prepared using sprintf() so that they will sort in the desired order:

    my %doc = (
        location => $location,
        date     => sprintf("%0.4d-%0.2d-%0.2d", $year, $month, $day),
        rainfall => sprintf("%07.4f", $cm_rainfall), # e.g. 01.1482
    );
    $indexer->add_doc(\%doc);

Of course you have to choose an adequate precision so you don't overflow and
screw up the sort order. :)
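For instance, here's a quick standalone sanity check (plain Perl, nothing Lucy-specific) showing that zero-padded sprintf output sorts lexically in the same order as the underlying numbers -- note I've widened the format to 8 characters here so a value over 100 doesn't overflow:

```perl
use strict;
use warnings;

# Verify that zero-padded stringified floats sort lexically in the
# same order as they would numerically.
my @rainfall    = ( 12.5, 1.1482, 101.25, 0.75 );
my @stringified = map { sprintf( "%08.4f", $_ ) } @rainfall;

my @lexical = sort @stringified;                    # plain string sort
my @numeric = map { sprintf( "%08.4f", $_ ) }
              sort { $a <=> $b } @rainfall;         # numeric sort

print "@lexical\n";
print( ( "@lexical" eq "@numeric" ) ? "orders agree\n" : "orders differ\n" );
# prints "orders agree"
```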

> 1.  Since my search results are always ranked by my own sort-field,
> does Lucy skip unimportant docs - ie those with too-low sort-score,
> unless minimum results are not found?

I'm not sure what kind of an optimization you have in mind, but Lucy doesn't
really skip docs in the way that you're describing.

During searching, Lucy proceeds from matching doc id to matching doc id.
Matches are deposited into a Collector; if the Collector requires a score,
then a score will be calculated for each matching doc as needed.

The Collector which gets used 99% of the time is a SortCollector, which keeps
matches in a priority queue.  Poor matches either fall out the bottom of the
queue, or once the queue is full, fail to displace an existing match.
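To make that concrete, here's a toy sketch of the "bounded queue of top matches" idea in plain Perl -- a sorted array standing in for Lucy's actual C priority queue, with made-up doc ids and scores:

```perl
use strict;
use warnings;

# Keep only the top-K matches.  Once the queue is full, a new match
# must beat the current worst score to displace it.
my $K = 3;
my @queue;    # elements: [ $doc_id, $score ], sorted by score descending

sub collect {
    my ( $doc_id, $score ) = @_;
    if ( @queue < $K ) {
        push @queue, [ $doc_id, $score ];          # queue not yet full
    }
    elsif ( $score > $queue[-1][1] ) {
        $queue[-1] = [ $doc_id, $score ];          # displace the worst match
    }
    else {
        return;                                    # match falls out the bottom
    }
    @queue = sort { $b->[1] <=> $a->[1] } @queue;
}

collect( @$_ )
    for ( [ 1, 0.2 ], [ 2, 0.9 ], [ 3, 0.5 ], [ 4, 0.1 ], [ 5, 0.7 ] );
print join( ",", map { $_->[0] } @queue ), "\n";    # prints "2,5,3"
```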

By the time we determine that a match wasn't good enough to keep, we've
already done most of the work.  So I think the answer to your question is: no,
Lucy does not "skip unimportant docs".  There are optimizations in place which
"skip" certain chunks of index data, but they are concerned with lowering the
cost of proceeding from match to match, not with hopping over "unimportant"
matches.

> 2.  Highlighting - does it use regex, perl, or pure-C - I've found in
> some libraries that highlighting can impact heavily on performance.
> ie, search returns sub-second, but building the snippets and
> highlighting for a typical 25-50 item page can take 0.5s to a second or
> more on its own (particularly if there are hundreds of highlighted
> terms).

Lucy's highlighter is coded in C and uses positional information built up at
index time.
 
> 3.  Short-circuiting search:  let's say the index has 100m docs.
> Since I only intend ever displaying a max of (say) 500 results (broken
> into pages of 10/25/50 each), does Lucy grind away at the entire index
> to proudly let me know that it found 50,000 results, even though I'm
> only displaying a small subset, or does it take cognizance of my
> intent to only display 500, find the top 500 (eg, based on my custom
> sort field which is a relevance-indicator), but still report that it
> "found" 50,000 (ie, 50k possible results) results?  Sorry, I know this
> is a rather staggered/not-well-presented question, so I hope I made
> sense.

The algorithm I think you're reaching for is sometimes called "early
termination".  The way it works: once the desired number of documents has
been collected, the Collector throws an exception to break out of the
collection loop, and the Searcher catches that exception.
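Here's the control-flow pattern in plain Perl -- a hypothetical collect callback and a die/eval pair standing in for the real Collector and Searcher:

```perl
use strict;
use warnings;

my $wanted = 500;
my @hits;

# Hypothetical collect callback: gather matches, then throw a
# control-flow exception once we have enough.
my $collect = sub {
    my ($doc_id) = @_;
    push @hits, $doc_id;
    die "ENOUGH_HITS\n" if @hits >= $wanted;
};

# Stand-in for the real matching loop over all doc ids.
eval {
    $collect->($_) for 1 .. 1_000_000;
};
die $@ if $@ && $@ ne "ENOUGH_HITS\n";    # rethrow anything unexpected

print scalar(@hits), "\n";    # prints "500"
```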

However, I think I spy a difficulty in your specification: you want to "find
the top 500" documents, but you don't want to search all documents.  If you
stop early, how do you know that no document yet to come scores higher than
the ones you've seen so far?

The general approach that people take with early termination is to ensure that
documents are iterated from "most important" to "least important" when
searching.  That means controlling the order in which they appear in the
index -- for instance by queueing documents according to pagerank prior to
indexing.  The limitations of such an approach are:

  1) It is not friendly to index updates.
  2) You can only pre-sort the index on *one* sort criterion.
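A minimal sketch of the queueing idea, assuming a hypothetical static "pagerank" score per document -- sort before indexing so that lower doc ids correspond to more important documents:

```perl
use strict;
use warnings;

# Documents with a hypothetical static importance score.
my @docs = (
    { title => 'a', pagerank => 0.3 },
    { title => 'b', pagerank => 0.9 },
    { title => 'c', pagerank => 0.6 },
);

# Queue documents from most important to least important.
my @queued = sort { $b->{pagerank} <=> $a->{pagerank} } @docs;

# With a real Lucy indexer you would now add them in this order:
# $indexer->add_doc($_) for @queued;
print join( ",", map { $_->{title} } @queued ), "\n";    # prints "b,c,a"
```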

Marvin Humphrey
