Hi Reuven,

In my haste last night, I pointed you at the wrong fields on Token. You need to set the position, not the offsets, to create inter-paragraph gaps. So you want Token.setPositionIncrement() for that approach, or Analyzer.getPositionIncrementGap() if you use the multi-field approach.
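
To make the gap idea concrete, here is a minimal sketch in plain Java that only models the arithmetic; it does not call Lucene. The class name, the method names, and the gap size of 1000 are all made up for illustration. The one thing taken from Lucene is the rule that a token's position is the previous position plus its position increment, with the first token landing on position 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParagraphGapSketch {
    // Hypothetical gap size: it only needs to exceed the largest slop
    // you expect to use in a proximity query.
    static final int PARAGRAPH_GAP = 1000;

    /** Position increment for every token, in document order. */
    public static List<Integer> increments(List<List<String>> paragraphs) {
        List<Integer> out = new ArrayList<>();
        for (int p = 0; p < paragraphs.size(); p++) {
            for (int t = 0; t < paragraphs.get(p).size(); t++) {
                // The first token of every paragraph after the first carries the gap.
                boolean opensLaterParagraph = p > 0 && t == 0;
                out.add(opensLaterParagraph ? PARAGRAPH_GAP + 1 : 1);
            }
        }
        return out;
    }

    /** Turns increments into absolute positions; the first token lands on 0. */
    public static List<Integer> positions(List<Integer> increments) {
        List<Integer> out = new ArrayList<>();
        int position = -1;
        for (int increment : increments) {
            position += increment;
            out.add(position);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> paragraphs = Arrays.asList(
                Arrays.asList("quick", "brown", "fox"),
                Arrays.asList("lazy", "dog"));
        // prints [0, 1, 2, 1003, 1004]
        System.out.println(positions(increments(paragraphs)));
    }
}
```

With these positions, "fox" and "lazy" are 1001 positions apart, so a proximity query with a slop well under the gap should only match terms from the same paragraph. In a real TokenStream you would call setPositionIncrement() with the larger value on the first Token of each new paragraph.
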

You will likely have performance problems with Documents that have thousands of fields, so I would not recommend that approach. Are you only matching paragraphs rather than whole documents? If so, another approach would be to make each paragraph a separate document. Then you could store the document and paragraph IDs in separate fields and have all the information you want.

If you need whole-document matching but want the paragraph number of each match, one approach might be to use SpanQuery together with a position encoding of paragraph numbers. E.g., place your paragraphs starting at positions 0, 10000, 20000, 30000, ... Then from the positions on the spans you find, you can identify which paragraph you are in.

I'm sure you can come up with many other ways to represent this information as well.

Hope this helps,

Chuck

Reuven Ivgi wrote on 10/02/2006 11:27 PM:
> Hello,
> To be more precise, the basic entity I am using is a document, each with
> paragraphs which may be up to few thousands. I need the proximity search
> within a paragraph, yet, I want to get as a search result the paragraph
> number also. Maybe, defining each paragraph as separate field is the
> best way
> What do you think?
> Thanks in advance
>
> Reuven Ivgi
>
> -----Original Message-----
> From: Chuck Williams [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, October 03, 2006 10:58 AM
> To: java-dev@lucene.apache.org
> Subject: Re: Define end-of-paragraph
>
> Reuven Ivgi wrote on 10/02/2006 09:32 PM:
>> I want to divide a document to paragraphs, still having proximity search
>> within each paragraph
>>
>> How can I do that?
>
> Is your issue that you want the paragraphs to be in a single document,
> but you want to limit proximity search to find matches only within a
> single paragraph? If so, you could parse your document into paragraphs
> and when generating tokens for it place large gaps at the paragraph
> boundaries. Each Token in lucene has a startOffset and endOffset that
> you can set as you generate Tokens inside TokenStream.next() for the
> TokenStream returned by your Analyzer. Those classes and methods are
> all in org.apache.lucene.analysis. Or alternatively, you could make
> each paragraph a separate field value and use
> Analyzer.getPositionIncrementGap() to achieve essentially the same thing
> (except that your Documents could get unwieldy if you have many
> paragraphs).
>
> If this is not what you are trying to do, then please explain your
> objectives precisely.
>
> Good luck,
>
> Chuck
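
For the position-encoding idea (paragraphs starting at positions 0, 10000, 20000, ...), mapping a match position back to a paragraph number is just integer division. The sketch below is plain Java with made-up names and does not call Lucene. It assumes every paragraph has fewer than 10000 tokens, and that you pass in the positions of the first and last token of a match, however you obtain them from your spans.

```java
public class ParagraphPositionSketch {
    // Stride from the example above; assumes no paragraph reaches 10000 tokens.
    static final int PARAGRAPH_STRIDE = 10000;

    /** Position at which the given paragraph (0-based) starts. */
    public static int startPosition(int paragraphIndex) {
        return paragraphIndex * PARAGRAPH_STRIDE;
    }

    /** Paragraph (0-based) that contains the given token position. */
    public static int paragraphOf(int position) {
        return position / PARAGRAPH_STRIDE;
    }

    /** True if a match from firstPosition to lastPosition (both inclusive) stays in one paragraph. */
    public static boolean sameParagraph(int firstPosition, int lastPosition) {
        return paragraphOf(firstPosition) == paragraphOf(lastPosition);
    }

    public static void main(String[] args) {
        System.out.println(paragraphOf(20017));          // prints 2
        System.out.println(sameParagraph(9998, 10001));  // prints false
    }
}
```

The stride does double duty here: it encodes the paragraph number, and, as long as paragraphs are much shorter than the stride, it also leaves a gap large enough to keep proximity matches from crossing a paragraph boundary.
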