Re: Context specific summary with the search term
Lee Mallabone wrote:
> Okay, I'm now not entirely certain how useful a generic solution will be
> to me, given the non-generic nature of the content I'm indexing. I think
> there are a lot of optimizations I can make that wouldn't be generic.

Early optimization is the root of all evil.

Seriously, though, one thing I see Doug say often is that Lucene's indexing and searching are designed to be extremely fast. He often responds to questions about odd details -- for example, the classic "do a search and cache the search results for paging across multiple web pages" -- by saying to just use the brute-force approach and rely on the speed of the Lucene index.

Like I say, I assume that there are people out there with a lot more on the ball than me about things like optimization, and I try to use their brains as much as possible :-). For example, with compilers, I assume the compiler writer knew a lot more about optimization than I do. People talk about the compiler not having the human judgement to know what's best. That's true, but the way to deal with that is not to hand-optimize my code and try to outguess the compiler (which will only confuse the compiler and prevent it from doing what it was designed to do). The compiler can best optimize the program if I focus first on making my intent -- what the program is meant to do -- clear in the structure of the code.

This leads to another optimization slogan that I remember reading: algorithmic optimization is much better than spot optimization. In other words, before you try to figure out a faster way to do something, figure out whether the thing you're doing accomplishes your true goal in the fastest way, and figure out how important that thing is in the grand scheme of things.

Steven J. Owens
[EMAIL PROTECTED]
RE: Context specific summary with the search term
On Mon, 2001-10-22 at 17:43, Doug Cutting wrote:
> > I'm trying to implement this and should be able to contribute any
> > successful results, but I need to produce context on a per-field basis.
>
> How did the title ever get indexed as the title? Presumably you split the
> document into fields when it was indexed. Similarly, if you re-tokenize
> things a field at a time then you should always know which field you are
> in, no?

I'm indexing HTML documents marked up with comments to indicate field boundaries, so I'd typically have:

<!--field:section_title-->
blurb
<!--field:text-->
more blurb

and so on. The documents were indexed by looking for each field marker and then adding the subsequent lines to the relevant field.

In order to obtain a generic solution for context generation, are you suggesting I write a method that takes plain text (e.g. the text form of a document) and a query, and assumes the plain text is in the query's default field? This doesn't seem quite as useful as:

getContext(HashSet queryTerms, Reader originalDocument);

which is what I was originally aiming towards.

Regards,
-- Lee Mallabone
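For concreteness, here is a minimal, self-contained sketch of what Lee describes: splitting a document on `<!--field:name-->` comment markers, plus a `getContext`-style method that returns a text window around the first matching query term. All names and the marker parsing are hypothetical illustrations from this thread, not Lucene API:

```java
import java.util.*;
import java.util.regex.*;

public class FieldContext {

    // Split a document on <!--field:name--> comment markers into a map of
    // field name -> field text (the hypothetical format from the thread).
    public static Map<String, String> splitFields(String doc) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = Pattern.compile("<!--field:(\\w+)-->").matcher(doc);
        String name = null;
        int start = 0;
        while (m.find()) {
            if (name != null) {
                fields.put(name, doc.substring(start, m.start()).trim());
            }
            name = m.group(1);   // field name from the marker
            start = m.end();     // field text begins after the marker
        }
        if (name != null) {
            fields.put(name, doc.substring(start).trim());
        }
        return fields;
    }

    // Return a window of `radius` characters around the first query term
    // found in `text`, or null if no term occurs. A real version would use
    // the same tokenization as the index rather than a raw substring search.
    public static String getContext(Set<String> queryTerms, String text, int radius) {
        String lower = text.toLowerCase();
        for (String term : queryTerms) {
            int i = lower.indexOf(term.toLowerCase());
            if (i >= 0) {
                int from = Math.max(0, i - radius);
                int to = Math.min(text.length(), i + term.length() + radius);
                return text.substring(from, to);
            }
        }
        return null;
    }
}
```

Because the fields are recovered per marker, `getContext` can then be run against one field's text at a time, which is the per-field behavior Lee is after.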
RE: Context specific summary with the search term
> From: Lee Mallabone [mailto:[EMAIL PROTECTED]]
> > I'm trying to implement this and should be able to contribute any
> > successful results, but I need to produce context on a per-field basis.
> > E.g. if I got a token hit in the text body of a document, but the first
> > hit token was a word in the section title, I'd want to generate context
> > around the token in the text body.
>
> How did the title ever get indexed as the title? Presumably you split the
> document into fields when it was indexed. Similarly, if you re-tokenize
> things a field at a time then you should always know which field you are
> in, no?

I had been using a TokenStream to try this. However, Lucene's Token class doesn't seem to have any concept of fields (even when I tokenStream() a document that is in the index with a whole bunch of fields). Is there any reason for this? Moreover, do you have any suggestions for how to find the information I need? The natural thing seems to be a field-aware token stream, but I'm not sure how I'd go about implementing that...

Regards,
-- Lee Mallabone