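Two things to watch:

1> Think about indexing the special page-end token with a position increment of 0 (see SynonymAnalyzer in Lucene In Action). The marker then shares a position with the last real term on the page instead of taking a slot of its own, which preserves the sense of phrases across page breaks.
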
2> Assembling the span query is tricky. Search the mail archive for SpanQuery to see an exchange I had with the originator of this concept. Suffice it to say that converting an ad-hoc query into a set of SpanQueries is not trivial, but it certainly is doable. You'd have a much easier time of it if you were able to control the queries and disallow ad-hoc queries. It all depends upon the requirements of the application. Any time you can avoid supporting arbitrary boolean logic for the user input, your job is easier <G>.

In any case, you should be able to run up a demo with simple queries that you control to prove out the methodology.

Best
Erick

On 5/23/07, Andreas Guther <[EMAIL PROTECTED]> wrote:

Erick,

Thank you very much for your response. That sounds very interesting. Let me do some experimenting to see if I fully understood your solution. Otherwise I will come back to you with more questions.

Andreas

-----Original Message-----
From: Erick Erickson [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, May 23, 2007 12:00 PM
To: java-user@lucene.apache.org
Subject: Re: How to filter fields with hits from result set

As luck would have it, I've done something very similar. What I had to do is index a special token at the end of each page. Then I could get the term offsets for each page.

Then I used one of the SpanQuery.getSpans to get all of the offsets of the hits throughout all of the pages. Now I have a list of all the offsets of the *last* term on each page and a list of the offsets of the hits. From these two lists I can know which pages have hits.

Best
Erick

On 5/23/07, Andreas Guther <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> If a search returns a document that has multiple fields with the same
> name, is there a way to filter only those fields that contain hits?
>
> Background:
>
> I am indexing documents and we store all content in our index for
> display reasons. We want to show only those pages containing hits. My
> first implementation was saving each page in a Lucene document. For
> performance reasons we are now looking into indexing the complete
> document as a single Lucene document.
>
> Every page is added to a field in the Lucene document named
> page-content. That means I end up with as many fields named
> page-content as the document has pages.
>
> My search now returns a single Lucene document, in contrast to my
> first approach with one page per Lucene document. My problem right now
> is: how can I limit the returned page-content fields to those field
> entries that contain hits? If I have hits on five pages of a document
> with 10 pages, I would like to have only the pages with the hits, not
> all.
>
> Is there anything in Lucene that limits the returned fields to fields
> with hits only?
>
> Thanks in advance,
>
> Andreas
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
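
For anyone following the thread, here is a minimal sketch of the bookkeeping Erick describes: record the position of the last term on each page, then map each hit position to the page it falls on. It is plain Java with no Lucene dependency, so it only illustrates the idea and is not the code from the original application. The class and method names (PageHits, pageEndPositions, pageOf, pagesWithHits) are made up for this example, and the hit positions are hard-coded stand-ins for what a span query would report. The sketch assumes the page-end marker is indexed with a position increment of 0, so it shares the position of the last real term on its page.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene.
class PageHits {

    // Position of the page-end marker for each page, in page order.
    // Ordinary terms advance the position by 1; the marker is assumed to have
    // a position increment of 0, so it sits on the last term of its page.
    static int[] pageEndPositions(List<List<String>> pages) {
        int[] ends = new int[pages.size()];
        int position = -1;
        for (int p = 0; p < pages.size(); p++) {
            position += pages.get(p).size(); // one position per real term
            ends[p] = position;              // marker adds no position of its own
        }
        return ends;
    }

    // Zero-based index of the first page whose end position is >= pos
    // (a lower-bound binary search). Returns pageEnds.length if pos lies
    // beyond the last page.
    static int pageOf(int[] pageEnds, int pos) {
        int lo = 0;
        int hi = pageEnds.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (pageEnds[mid] < pos) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Pages (zero-based) that contain at least one hit position.
    static SortedSet<Integer> pagesWithHits(int[] pageEnds, int[] hitPositions) {
        SortedSet<Integer> pages = new TreeSet<>();
        for (int pos : hitPositions) {
            int page = pageOf(pageEnds, pos);
            if (page < pageEnds.length) {
                pages.add(page);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        List<List<String>> pages = Arrays.asList(
                Arrays.asList("lucene", "span", "queries"),
                Arrays.asList("are", "useful"),
                Arrays.asList("for", "page", "hits"));
        int[] ends = pageEndPositions(pages);
        int[] hits = {1, 7}; // stand-in for start positions reported by a span query
        System.out.println(Arrays.toString(ends));      // [2, 4, 7]
        System.out.println(pagesWithHits(ends, hits));  // [0, 2]
    }
}
```

The sketch maps each hit by a single position. A phrase hit that straddles a page break has a start on one page and an end on the next, so a real implementation would probably want to look up both ends of each span and report both pages.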