On Sat, Jun 13, 2009 at 01:08:39PM -0700, [email protected] wrote:
> I’m trying to use QueryParser with WildCardQuery/RegexpTermQuery. I  
> have a subclass of QueryParser that overrides the make_term_query  
> method to return a regexp query (a Perl subclass of Query). When I  
> search for some words, it works fine, but with others it crashes (bus  
> error). 

> The attached script crashes during the search, 

Crap. :(

This is a refcounting error, and while it's easy to fix in your module, a
general fix is going to be costly.

The immediate cause is that you're passing in a new Tally object each time,
rather than keeping a reference around.  What you should do instead is cache a
Tally object at object construction and just mod its score.  That's faster,
because there's less object creation/destruction -- and in this case, it will
stop the segfaulting.
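
Here's a minimal sketch of the pattern (the class name is made up and the
score value is a placeholder; the actual change to your module is in the diff
below):

    package My::CachedTallyMatcher;    # illustrative name
    use base 'KinoSearch::Search::Matcher';
    use KinoSearch::Search::Tally;
    use Hash::Util::FieldHash::Compat 'fieldhashes';
    fieldhashes \my( %tally );

    sub new {
        my ( $class, %args ) = @_;
        my $self = $class->SUPER::new(%args);
        $tally{$self} = KinoSearch::Search::Tally->new;    # cached once, up front
        return $self;
    }

    sub tally {
        my $self  = shift;
        my $tally = $tally{$self};    # reuse the cached Tally...
        $tally->set_score(1.0);       # ...and just mod its score (placeholder value)
        return $tally;                # the Matcher holds a reference the whole time
    }

    1;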

The root of the problem is that you're passing in the last refcount for that
Tally object -- but the callback for Matcher_Tally() doesn't know that it needs
to take ownership of that refcount right away.  Perl "mortalizes" that scalar,
and once that happens, it becomes the walking dead -- it has an SvREFCNT of 0,
but destruction is getting delayed for a little bit so that the caller can
decide what it wants to do with it.  Unfortunately, the callback code isn't
claiming it in time, DESTROY is getting called, and when the calling code tries
to access members within the Tally struct, it's too late -- that memory has
already been freed.

The most "correct" fix is to have Matcher_Tally() return an "incremented
Tally*" rather than a "Tally*".  That causes the autogenerated callback code to
invoke INCREF before the Perl callback wrapper finishes, ensuring that the object will
live on afterwards.

But then everything that calls Matcher_Tally() will have to take ownership of
a refcount, and Tally() is an inner loop method so we don't want to be
constantly incrementing and decrementing refcounts.  :(

This was an unanticipated use case because the code in RegexpTermScorer::tally()
is wasteful.  It never occurred to me that someone would return a new Tally()
on every scoring loop.  But explaining why you need to avoid doing that
requires explaining an excruciatingly complex memory management issue; we might
get away with it this time, but it's just not acceptable to guard against this
class of error with documentation rather than code.

For Lucene people reading along wondering what on earth a "Tally" is... it's a
struct that conveys scoring information.  Tally has a float "score" member
which is used by the default scoring system.  

There's a small cost involved with dereferencing "tally->score" instead of
using "score" directly, naturally.  [Cue Mike McCandless frowning in
disapproval.]  However, because Tally is a struct, it's extensible -- so it
can hold information about e.g. what positions matched.  That, in turn, makes
it possible to e.g. implement the proximity boosting algorithm that Page and
Brin described 11 years ago in their seminal paper.

It's ticked me off for a long time that we throw positional information from
the query string away; replacing Scorer.score() with Matcher_Tally() was
supposed to open the door to exploiting that information and improving
relevance -- for once.  One easy fix to the problem exposed by WildCardQuery
would be to completely zap Tally() and go back to the tried and true Score() --
but that means that we stay mired a decade behind in our exploitation of
well-known IR algorithms.  I don't want to do that, because there are
all kinds of things we can do to optimize for speed, but very few opportunities
for improving IR precision.  Crap Crap Crap.

> but if I  uncomment the ‘use KSx::Search::WildCardQuery’ it crashes while
> parsing the query. 

KS was trying to deserialize your Schema, which involves deserializing your
Analyzers.  It didn't find KSx::Analysis::StripAccents, so the deserializer
returned the wrong kind of object (a Hash).  The class conflict wasn't
detected by PolyAnalyzer's autogenerated Load() method, so a segfault happened
later when KS tried to call an Analyzer method on a Hash.

Fixed by r4791 and r4792 -- now we get a sensible class-not-loaded error
message.
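
On your end, the workaround is to make sure every class your Schema references
is loaded before the index gets opened.  A sketch (adjust to whatever your
Schema actually uses):

    # Load the Analyzer classes the Schema references up front, so
    # deserialization can find them instead of falling back to a Hash.
    use KSx::Analysis::StripAccents;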

> BTW, I’m using rev. 4669.

FYI, I just today changed QueryParser's factory method names from e.g.
"make_termquery" to "make_term_query".  (: Seems like the right move since
you'd accidentally used the new spelling in your post. :) You'll need to
change those in your QueryParser subclasses.
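
For instance, an override in a subclass would now look something like this
(the class name is made up and I'm hand-waving the argument handling; adjust
to match your actual subclass):

    package MyWildCardQueryParser;         # illustrative name
    use base 'KinoSearch::QueryParser';    # whichever QueryParser you subclass
    use KSx::Search::WildCardQuery;

    sub make_term_query {                  # formerly make_termquery
        my ( $self, %args ) = @_;          # assuming hash-style field/term args
        return KSx::Search::WildCardQuery->new(
            field => $args{field},
            term  => $args{term},
        );
    }

    1;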

Also, Scorer has been replaced by Matcher and Compiler_Make_Scorer() has been
replaced by Compiler_Make_Matcher().  I'd put some compatibility stubs into KS
trunk specifically so that the WildCardQuery 0.03 distro would still pass its
tests, but it turns out that you're going to need to release a new version
anyway.  Diff below.

Marvin Humphrey


--- KSx-Search-WildCardQuery-0.03-old/lib/KSx/Search/RegexpTermQuery.pm	2009-05-30 12:16:18.000000000 -0700
+++ KSx-Search-WildCardQuery-0.03/lib/KSx/Search/RegexpTermQuery.pm	2009-06-13 21:30:33.000000000 -0700
@@ -169,13 +169,13 @@
         = $raw_impact{$self} * $idf{$self} * $query_norm_factor;
 }
 
-sub make_scorer {
-    my ( $self, $reader ) = @_;
+sub make_matcher {
+    my $self = shift;
 
     return KSx::Search::RegexpTermScorer->new(
 #        posting_lists => $plists{$self},
-        similarity    => $self->get_similarity,
-   weight        => $self,
+        @_,
+        compiler        => $self,
     );
 }
 
@@ -212,26 +212,29 @@
 
 
 package KSx::Search::RegexpTermScorer;
-use base 'KinoSearch::Search::Scorer';
+use base 'KinoSearch::Search::Matcher';
 
 use KinoSearch::Search::Tally;
 
 use Hash::Util::FieldHash::Compat 'fieldhashes';
-fieldhashes\my(  %doc_nums, %pos, %wv,  %sim, %weight );
+fieldhashes\my(  %doc_nums, %pos, %wv,  %sim, %compiler, %tally );
 
 sub new {
    my ($class, %args) = @_;
 #  my $plists = delete $args{posting_lists};
-   my $weight = delete $args{weight};
-   my $self   = $class->SUPER::new(%args);
-   $sim{$self} = $args{similarity};
+    my $compiler   = delete $args{compiler};
+    my $reader     = delete $args{reader};
+    my $need_score = delete $args{need_score};
+    my $self       = $class->SUPER::new(%args);
+    $sim{$self}    = $compiler->get_similarity;
 
-   my $tfs = $tfs{$weight};
+    my $tfs = $tfs{$compiler};
    $doc_nums{$self} = [ sort { $a <=> $b } keys %$tfs ];
    
    $pos{$self} = -1;
-   $wv {$self} = $weight->get_value;
-   $weight{$self} = $weight;
+    $wv {$self} = $compiler->get_value;
+    $compiler{$self} = $compiler;
+    $tally{$self} = KinoSearch::Search::Tally->new;
    
    $self
 }
@@ -256,10 +259,10 @@
    my $doc_nums = $doc_nums{$self};
    return unless $pos < scalar @$doc_nums;
 
-   (my $tally = KinoSearch::Search::Tally->new)
+   (my $tally = $tally{$self})
     ->set_score(
        $wv{$self} * $sim{$self}->tf(
-           $tfs{$weight{$self}}{$$doc_nums[$pos]}
+           $tfs{$compiler{$self}}{$$doc_nums[$pos]}
        )
     );
 
