Greets,

(I'm cc'ing this to [email protected], because I think Lucy should follow the same design principles described in this post.)

KinoSearch is spinning off a few modules, to cut down on the core size and complexity. For the time being, they will continue to be distributed with the KinoSearch tarball, but eventually they will become separate distributions.

KinoSearch::Search::SearchServer and KinoSearch::Search::SearchClient have moved to KSx::Remote::SearchServer and KSx::Remote::SearchClient. Eventually, they will be distributed under KSx::Remote.

The rationale for breaking out SearchServer/SearchClient is that there are many ways to have machines interconnect: the Socket/faked-up-RPC approach taken by SearchClient/SearchServer, the XML approach used by Solr, etc. For core, it is only crucial that the messages that have to be sent over the network be serializable using *some* technique -- it's not important what technique is chosen.
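To make the "serializable using *some* technique" point concrete, here is a minimal sketch of a round trip using Perl's core Storable module. The hash below is a hypothetical stand-in for a query, not KinoSearch's actual Query representation, and Storable is just one technique among many (it is not necessarily what SearchClient/SearchServer use).

```perl
use strict;
use warnings;
use Storable qw( nfreeze thaw );

# Hypothetical stand-in for a query; real KinoSearch Query objects differ.
my $query = {
    type  => 'TermQuery',
    field => 'category',
    term  => 'sweet',
};

# Client side: turn the structure into a byte string in network byte order.
my $wire_bytes = nfreeze($query);

# Server side: rebuild an independent copy from the bytes.
my $received = thaw($wire_bytes);
```

Any other encoding (XML, JSON, a hand-rolled format) would serve equally well, so long as both ends agree on it.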

The other spinoff is Filter. KinoSearch::Search::Filter, KinoSearch::Search::QueryFilter, and KinoSearch::Search::PolyFilter have all been removed; their functionality is now encapsulated in KSx::Search::Filter, which has been refactored as a subclass of Query. The last filter subclass, KinoSearch::Search::RangeFilter, has been replaced by a new core class, KinoSearch::Search::RangeQuery (which behaves similarly to Lucene's ConstantScoreRangeQuery with a fixed score of 0).

The standard KS search methods no longer take a 'filter' argument. Here's the new Filter API in action:

  my %category_filters;
  for my $category (qw( sweet sour salty bitter )) {
    my $cat_query = KinoSearch::Search::TermQuery->new(
      field => 'category',
      term  => $category,
    );
    $category_filters{$category} = KSx::Search::Filter->new(
      query => $cat_query,
    );
  }

  while ( my $cgi = CGI::Fast->new ) {
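    # NOTE: this is a raw query string; it needs to become a Query
    # object (e.g. via a query parser) before going into the ANDQuery.
    my $user_query = $cgi->param('q');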
    my $filter = $category_filters{$cgi->param('category')};
    my $and_query = KinoSearch::Search::ANDQuery->new;
    $and_query->add_child($user_query);
    $and_query->add_child($filter);
    my $hits = $searcher->search( query => $and_query );
    ...

Filter is moving outside of core because it is essentially nothing more than a caching optimization. Logically, the following code would produce exactly the same results as the code above:

  while ( my $cgi = CGI::Fast->new ) {
    my $user_query = $cgi->param('q');
    my $category_query = KinoSearch::Search::TermQuery->new(
      field => 'category',
      term  => $cgi->param('category'),
    );
    $category_query->set_boost(0);
    my $and_query = KinoSearch::Search::ANDQuery->new;
    $and_query->add_child($user_query);
    $and_query->add_child($category_query);
    my $hits = $searcher->search( query => $and_query );
    ...

The only significant differences are that the Filter runs its query only once and caches the result, and that it can't be serialized and sent over the network in a search cluster (because the search results are cached in a BitVector, which is too big to send).
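For anyone who wants to see the caching idea in isolation, here is a toy sketch in plain Perl. It does not use KinoSearch at all: the @doc_category array, build_filter_bits(), and filter_hits() are all made up for illustration, and a Perl bit string built with vec() stands in for the BitVector. The actual KSx::Search::Filter internals may well differ.

```perl
use strict;
use warnings;

# Hypothetical toy index: array position is the doc id, value is the category.
my @doc_category = qw( sweet sour sweet salty bitter sweet );

my $runs = 0;    # how many times the category "query" has been executed

# Execute the category "query" and record matches, one bit per document.
sub build_filter_bits {
    my ($category) = @_;
    $runs++;
    my $bits = '';
    for my $doc_id ( 0 .. $#doc_category ) {
        vec( $bits, $doc_id, 1 ) = $doc_category[$doc_id] eq $category ? 1 : 0;
    }
    return $bits;
}

my %filter_cache;

# Keep only hits whose bit is set; the bit string is built on first use only.
sub filter_hits {
    my ( $category, @hits ) = @_;
    $filter_cache{$category} = build_filter_bits($category)
        unless exists $filter_cache{$category};
    return grep { vec( $filter_cache{$category}, $_, 1 ) } @hits;
}

# Two "searches" with different user hits, same category filter.
my @first  = filter_hits( 'sweet', 0, 1, 2, 3 );
my @second = filter_hits( 'sweet', 3, 4, 5 );
```

The second call reuses the cached bits instead of re-running the category query. Since the cache costs one bit per document, an index of ten million documents would mean roughly 1.25 MB per cached filter, which illustrates why shipping it over the network with every request is unattractive.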

Lucene provides classes called RemoteCachingWrapperFilter and FilterManager that address the problem of filter caching in search clusters, and whose functionality might eventually end up in either KSx::Remote or KSx::Search::Filter. Again, though, they are caching optimizations with serialization limitations and as such belong outside of core.

I thought about keeping Filter as an abstract base class, and putting the actual functionality into KSx::Search::QueryFilter or something like that. However, after reviewing the various Filter subclasses in both Lucene's core and contrib, it looked to me as though nearly all of them (all except for the SpanFilter subclasses, which would need to be different anyway) could be realized using either ordinary Queries or Queries in conjunction with this new implementation of Filter.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
