Re: Context specific summary with the search term

2001-10-26 Thread Steven J. Owens

Lee Mallabone wrote:
 Okay, I'm now not entirely certain how useful a generic solution will be
 to me, given the non-generic nature of the content I'm indexing. I think
 there are a lot of optimizations I can make that wouldn't be generic.

 Premature optimization is the root of all evil.  

 Seriously, though, one thing I see Doug say often is that
Lucene's indexing and searching are designed to be extremely fast.  He
often responds to questions about odd details - for example, the
classic "do a search and cache the search results for paging across
multiple web pages" - by saying to just use the brute-force approach
and rely on the speed of the Lucene index.  

 I like to say, I assume that there are people out there with a
lot more on the ball than me about things like optimization.  I try to
use their brains as much as possible :-).  For example, with
compilers, I assume the compiler writer knew a lot more about
optimization than I do.  People talk about the compiler not having the
human judgement to know what's best.  That's true, but the way to deal
with that is not to try to hand-optimize my code and outguess the
compiler (which will only confuse the compiler and prevent
it from doing what it was designed to do).  The compiler can best
optimize the program if I focus on making it clear what my intent is,
what the program is meant to do, in the structure of the code first.

 This leads to another optimization slogan that I remember reading
- algorithmic optimization is much better than spot optimization.  In
other words, before you try to figure out a faster way to do
something, figure out if you're doing the thing that accomplishes your
true goal in the fastest way.  And figure out how important that thing
is in the grand scheme of things.

Steven J. Owens
[EMAIL PROTECTED]



RE: Context specific summary with the search term

2001-10-23 Thread Lee Mallabone

On Mon, 2001-10-22 at 17:43, Doug Cutting wrote:
  I'm trying to implement this and should be able to contribute any
  successful results, but I need to produce context on a per-field basis.
 
 How did the title ever get indexed as the title?  Presumably you split the
 document into fields when it was indexed.  Similarly, if you re-tokenize
 things a field at a time then you should always know which field you are in,
 no?

I'm indexing HTML documents marked up with comments to indicate field
boundaries. So I'd typically have:

<!--field:section_title-->
blurb
<!--field:text-->
more blurb

and so on. The documents were indexed by looking for each field marker
and then adding the subsequent lines to the relevant field.
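[Ed.: that marker-splitting step might be sketched roughly as below. The `FieldSplitter` class and its regex are hypothetical illustrations of the convention described above, not code from the actual indexer.]

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: split a document marked up with <!--field:name-->
// comments into a map of field name -> field text.
public class FieldSplitter {
    private static final Pattern MARKER = Pattern.compile("<!--field:(\\w+)-->");

    public static Map<String, String> split(String doc) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = MARKER.matcher(doc);
        String name = null;   // field currently being collected
        int start = 0;        // where that field's text begins
        while (m.find()) {
            if (name != null) {
                // Text between the previous marker and this one belongs
                // to the previous field.
                fields.put(name, doc.substring(start, m.start()).trim());
            }
            name = m.group(1);
            start = m.end();
        }
        if (name != null) {
            fields.put(name, doc.substring(start).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        String doc = "<!--field:section_title-->\nblurb\n<!--field:text-->\nmore blurb";
        System.out.println(FieldSplitter.split(doc));
    }
}
```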

In order to obtain a generic solution for context generation, are you
suggesting I write a method that takes plain text (e.g., the text form
of the document) and a query, and assumes the plain text is in the
query's default field?

This doesn't seem quite as useful as getContext(HashSet queryTerms,
Reader originalDocument), which is what I was originally aiming towards.
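[Ed.: a minimal sketch of what such a method might look like, assuming a fixed character window and simple case-insensitive substring matching. The `ContextSketch` class and window size are invented for illustration; a real implementation would re-tokenize with the same analyzer used at indexing time.]

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the proposed getContext() signature: scan the
// raw document for the first occurrence of any query term and return a
// fixed-width window of text around it.
public class ContextSketch {
    private static final int WINDOW = 30; // chars of context on each side

    public static String getContext(Set<String> queryTerms, Reader originalDocument) {
        try {
            // Slurp the whole document into a string.
            StringBuilder sb = new StringBuilder();
            BufferedReader br = new BufferedReader(originalDocument);
            int c;
            while ((c = br.read()) != -1) sb.append((char) c);
            String text = sb.toString();
            String lower = text.toLowerCase();
            // Return a window around the first query term found.
            for (String term : queryTerms) {
                int pos = lower.indexOf(term.toLowerCase());
                if (pos >= 0) {
                    int from = Math.max(0, pos - WINDOW);
                    int to = Math.min(text.length(), pos + term.length() + WINDOW);
                    return text.substring(from, to);
                }
            }
            return ""; // no term found
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Set<String> terms = new HashSet<>();
        terms.add("lucene");
        System.out.println(getContext(terms,
                new StringReader("The quick search library lucene indexes fast.")));
    }
}
```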

Regards,

-- 
Lee Mallabone




RE: Context specific summary with the search term

2001-10-22 Thread Doug Cutting

 From: Lee Mallabone [mailto:[EMAIL PROTECTED]]
 
 I'm trying to implement this and should be able to contribute any
 successful results, but I need to produce context on a per-field basis.
 E.g. if I got a token hit in the text body of a document, but the first
 hit token was a word in the section title, I'd want to generate
 context around the token in the text body.

How did the title ever get indexed as the title?  Presumably you split the
document into fields when it was indexed.  Similarly, if you re-tokenize
things a field at a time then you should always know which field you are in,
no?

 I had been using a TokenStream to try this. However, Lucene's Token
 class doesn't seem to have any concept of fields (even when I
 tokenStream() a document that is in the index with a whole bunch of
 fields). Is there any reason for this? Moreover, any suggestions of
 how to find the information I need?
 
 The natural thing seems to be to have a field-aware token stream, but
 I'm not sure how I'd go about implementing that...
 
 Regards,
 
 -- 
 Lee Mallabone