Hi Marvin,

On 25 Feb 2011, at 4:12 PM, Marvin Humphrey wrote:

> On Wed, Feb 23, 2011 at 05:14:49PM +0000, Andrew S. Townley wrote:
>> Well, actually, I want it for more than that.  For my particular needs, I
>> need to get the field name where the match occurred in the document, and
>> then I'd ideally like to have the start offset into that field and the
>> length of the match.
> 
>> The Ferret::Search::Hit gives me the document number and the score, but
>> that's it.  In whatever list format the results are actually in, I'd also
>> like to have the information I mentioned.  If you weren't storing the offset
>> information, then it would make sense for it not to be available, but if you
>> were, then I'd expect to have the whole thing right there.  I can't see how
>> there'd be a performance issue in providing this information.
> 
> You have to generate that information after the fact, by post-processing the
> Hits that come back.  Lucy, Lucene, and Ferret all have the same behavior in
> this regard.
> 
> Matching and scoring are highly abstracted for speed.  The matching engine
> does not scan raw document content, a la an RDBMS full table scan -- instead,
> it iterates over heavily optimized data structures devoid of introspection
> overhead.  At the end of a search, you will only have documents and scores --
> not sophisticated metadata about what part of the subquery matched and what
> parts didn't and how much each matching part contributed to the score.
> Keeping track of such metadata during the matching phase would be
> prohibitively expensive.

I can understand the need to abstract a lot of things for speed.  I'm no search 
expert, as I've said before, but I don't understand why, at the very least, the 
field information (e.g. the field name) can't be encoded in this data structure 
in such a way that it can be determined at match time.  Highlighting and 
offsets are a different matter, and I never thought it was doing a full-text 
scan or a table scan like an RDBMS.  If I wanted that, I'd just use regex 
searches (which I do in some cases for small datasets).

Obviously, I'm missing something here, but I don't see why it matters to keep 
track of fields at all if, when you get the match information back, you don't 
have to hand which field matched an "all fields" or "multiple fields" search 
query.  Obviously, actually finding the offsets is a much more expensive 
operation, and I'm OK with having to do that after the search is 
completed--even if I have to do my own matching without API support for 
highlighting.  However, that's only possible if I know what term and what field 
matched and don't have to effectively perform the search again on the document 
(which is what Ferret seems to require).
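For what it's worth, here's roughly what I mean by doing the offset-finding 
myself after the search, assuming I already know the matching term and field.  
Everything here is a stub for illustration--the stored field text is made up 
and no real Ferret or Lucy calls are involved:

```ruby
# Sketch: post-search offset recovery, given the matching term and field.
# The stored field content is stubbed; in practice it would come from the
# document's stored fields.

# Find [offset, length] pairs for every occurrence of `term` in `text`.
def term_spans(text, term)
  haystack = text.downcase
  needle   = term.downcase
  spans    = []
  pos      = 0
  while (idx = haystack.index(needle, pos))
    spans << [idx, needle.length]
    pos = idx + needle.length
  end
  spans
end

# Pretend this is the stored content of the field we know matched.
stored_field = "The quick brown fox jumps over the quick dog"

spans = term_spans(stored_field, "quick")
# => [[4, 5], [35, 5]]
```

This is cheap precisely because the term and field are already known; it's 
the "try every field" fallback that kills you.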

> In Lucy, our highlighting capabilities are powered by the Highlight_Spans()
> method, which is invoked on a derivative of the Query object:
> 
>    /** Return an array of Span objects, indicating where in the given
>     * field the text that matches the parent query occurs.  In this case,
>     * the span's offset and length are measured in Unicode code points.
>     * The default implementation returns an empty array.    
>     *   
>     * @param searcher A Searcher.
>     * @param doc_vec A DocVector.
>     * @param field The name of the field.
>     */  
>    public incremented VArray*
>    Highlight_Spans(Compiler *self, Searcher *searcher, 
>                    DocVector *doc_vec, const CharBuf *field);
> 
> Perhaps that might be of use for you.

This API has the same problem as Ferret--if I don't know which field matched, 
then I've got to try all the fields (maybe more than 20 in some cases) on the 
document.  If you need this information to display to users, then it doesn't 
matter how fast the search is if you're going to slow down the whole 
interaction by checking anywhere from 2 to N fields times the number of matches 
in the results chunk you're processing.
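To put a rough number on it: with the field unknown, every hit costs one 
Highlight_Spans-style call per field, so the work grows as fields × hits.  A 
toy sketch (the `probe` lambda below just counts calls--it isn't real Lucy or 
Ferret API):

```ruby
# Toy cost model: when the matching field is unknown, every hit has to be
# probed against every indexed field.  The stub just counts probes.
calls = 0
probe = lambda do |doc_id, field|
  calls += 1
  []  # pretend no spans were found in this field
end

fields = (1..20).map { |i| "field_#{i}" }  # e.g. 20 indexed fields
hits   = (1..50).to_a                      # a 50-document results chunk

hits.each do |doc_id|
  fields.each { |f| probe.call(doc_id, f) }
end

calls  # => 1000 probes for a single page of results
```

That's 1,000 highlight probes to render one page, versus 50 if the matching 
field came back with each hit.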

The advantage of the full-text search capabilities exposed via a query language 
like FQL (or whatever Lucy uses) is that you can effectively defer all of the 
introspection/heavy lifting of the searching and results matching to the 
underlying full-text system (or at least that's the way I see it).  If you then 
don't have enough information available to describe the matches efficiently, 
the only other option is to pre-process the query to see if any explicit fields 
are present and, if not, try all of the indexed fields to see if they happen to 
match (effectively performing the search again over the result set).
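The pre-processing I mean would look something like this: scan the query string 
for explicit `field:` prefixes and fall back to every indexed field when there 
are none.  (The `field:term` syntax assumed here is the usual convention; 
adjust for whatever the real parser accepts.)

```ruby
# Sketch: pick which fields to probe for highlighting.
# If the query names fields explicitly (field:term), probe only those;
# otherwise we're stuck probing every indexed field.
def fields_to_probe(query, indexed_fields)
  explicit = query.scan(/(\w+):/).flatten.uniq & indexed_fields
  explicit.empty? ? indexed_fields : explicit
end

indexed = %w[title body author tags]

fields_to_probe("title:ruby body:search", indexed)
# => ["title", "body"]
fields_to_probe("ruby search", indexed)
# => ["title", "body", "author", "tags"]
```

Even this only narrows the probing--it doesn't tell you which field actually 
matched, which is the information I keep coming back to.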

Maybe I'm using it wrong, or maybe I just don't get it, but these are the kinds 
of things I need to do.

[snip]

>> I tried to dig through the lucy SVN repository via the web UI, but I
>> couldn't really figure out what's there.  The code generator framework
>> you're using is something I haven't seen before, but at least it explains
>> why I couldn't find the Ruby bindings! :)
> 
> There's a short high-level introduction to the Lucy codebase here:
> 
>  
> http://svn.apache.org/repos/asf/incubator/lucy/trunk/core/Lucy/Docs/DevGuide.cfh

You weren't kidding about the "short" part! :)  Still, thanks for the pointer.  
I'd seen it earlier.

>> Presently tinkering with the Ferret internals since it seems like  there
>> ought to be a way to expose what I want (it's in the explain output)
> 
> That might work.  Most people use the Explanation API for tuning and
> troubleshooting, though; it might prove a little expensive or unwieldy for
> what you're doing.

After spending about 12-14 hours trying to get my head around the code and the 
way the searching worked, I gave up.  There wasn't a good, consistent API 
abstraction that allowed you to access the same information from the internals 
of the search code that was leveraged by the explain code--and the fact that 
explain is overloaded for each subclass, but not in a universal way, would've 
required more surgery than I was prepared to do at the C level given the time I 
have.  Jens took an alternative approach and implemented some changes at the 
Ruby level.  These helped, but they still required some tweaking to be usable 
from both the Searcher API and the Index API since, again, some of the 
information available for the index isn't available through the Searcher API.

For now, thanks to Jens' patch, I have the capability to do what I need to do 
with Ferret--even if it isn't as fast as it could be.  However, unless the same 
type of information is exposed at an API level in Lucy, the same kinds of 
workarounds would be required to use Lucy instead of Ferret for my application.

At least my Wednesday turned out not to be a totally wasted day after all! :)

Cheers for all the information,

ast
--
Andrew S. Townley <[email protected]>
http://atownley.org
