Re: Define end-of-paragraph

Chuck Williams Tue, 03 Oct 2006 14:39:28 -0700

Hi Reuven,

In my haste last night, I pointed you at the wrong fields on Token. You
need to set the position to create inter-paragraph gaps, not the
offsets, so you want Token.setPositionIncrement() for that approach, or
Analyzer.getPositionIncrementGap() if you use the multi-field approach.
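
To make the position arithmetic concrete, here is a minimal sketch that uses no Lucene classes at all; it only models the increments you would pass to Token.setPositionIncrement(). The class name, method names, and the gap of 1000 are my own assumptions for illustration. Pick any gap larger than the biggest slop you expect to query with.

```java
import java.util.ArrayList;
import java.util.List;

/** Models token positions only; it does not call into Lucene. */
final class ParagraphGapSketch {
    // Hypothetical gap size; it only needs to exceed any slop used in queries.
    static final int PARAGRAPH_GAP = 1000;

    /** Returns one position increment per token, in document order. */
    static List<Integer> increments(List<List<String>> paragraphs) {
        List<Integer> result = new ArrayList<>();
        for (int p = 0; p < paragraphs.size(); p++) {
            List<String> tokens = paragraphs.get(p);
            for (int t = 0; t < tokens.size(); t++) {
                // The first token of every paragraph after the first jumps the gap.
                boolean firstOfLaterParagraph = p > 0 && t == 0;
                result.add(firstOfLaterParagraph ? PARAGRAPH_GAP + 1 : 1);
            }
        }
        return result;
    }

    /** Converts increments to absolute positions; the first token lands on 0. */
    static List<Integer> positions(List<Integer> increments) {
        List<Integer> result = new ArrayList<>();
        int position = -1;
        for (int increment : increments) {
            position += increment;
            result.add(position);
        }
        return result;
    }
}
```

With two paragraphs of two tokens each, the last token of the first paragraph and the first token of the second end up more than PARAGRAPH_GAP positions apart, so a proximity query with a smaller slop cannot match across the boundary.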


You will likely have performance problems with Documents that have
thousands of fields, so I would not recommend that approach. Are you
only matching paragraphs rather than whole documents? If so, another
approach would be to make each paragraph a separate document. Then you
could store document and paragraph IDs in separate fields and have all
the information you want.
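
As a rough sketch of the paragraph-per-document layout, the following splits one source text into records that each carry a document ID and a paragraph ID. It is plain Java with no Lucene calls; the class names, field names, and the assumption that paragraphs are separated by blank lines are all mine. Each record would then be indexed as its own Lucene Document.

```java
import java.util.ArrayList;
import java.util.List;

/** One index entry per paragraph (hypothetical layout). */
final class ParagraphRecord {
    final String docId;      // ID of the source document
    final int paragraphId;   // zero-based paragraph number within that document
    final String text;       // paragraph text to be analyzed and searched

    ParagraphRecord(String docId, int paragraphId, String text) {
        this.docId = docId;
        this.paragraphId = paragraphId;
        this.text = text;
    }
}

final class ParagraphSplitter {
    /** Splits on blank lines (an assumption) and skips empty chunks. */
    static List<ParagraphRecord> split(String docId, String fullText) {
        List<ParagraphRecord> records = new ArrayList<>();
        int paragraphId = 0;
        for (String chunk : fullText.split("\\n\\s*\\n")) {
            String trimmed = chunk.trim();
            if (!trimmed.isEmpty()) {
                records.add(new ParagraphRecord(docId, paragraphId++, trimmed));
            }
        }
        return records;
    }
}
```

A hit on any such record gives you both IDs directly, at the cost of losing whole-document matching.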

If you need whole document matching, but want the paragraph number of
matches, one approach might be to use SpanQuery's together with a
position-encoding of paragraph numbers. E.g., place your paragraphs
starting at positions 0, 10000, 20000, 30000, ... Then from the
positions on the spans you find, you can identify what paragraph you are in.
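
The decoding step is just integer division. Here is a minimal sketch, again without Lucene classes; the names are hypothetical, and it assumes every paragraph has fewer than 10000 tokens and that a span's end position is exclusive (one past the last matched position).

```java
/** Maps between paragraph numbers and token positions (illustrative only). */
final class ParagraphPositionCodec {
    // Paragraph k starts at position k * PARAGRAPH_STRIDE.
    static final int PARAGRAPH_STRIDE = 10000;

    /** Position at which the first token of the given paragraph is placed. */
    static int startPosition(int paragraphIndex) {
        if (paragraphIndex < 0) {
            throw new IllegalArgumentException("paragraphIndex must be >= 0");
        }
        return paragraphIndex * PARAGRAPH_STRIDE;
    }

    /** Paragraph number that contains the given token position. */
    static int paragraphOf(int position) {
        if (position < 0) {
            throw new IllegalArgumentException("position must be >= 0");
        }
        return position / PARAGRAPH_STRIDE;
    }

    /** True if a span [start, end) lies entirely inside one paragraph. */
    static boolean withinOneParagraph(int spanStart, int spanEnd) {
        // spanEnd is assumed exclusive, so the last matched position is spanEnd - 1.
        return spanEnd > spanStart
                && paragraphOf(spanStart) == paragraphOf(spanEnd - 1);
    }
}
```

For example, a span starting at position 20005 falls in paragraph 2, and a span running from 9998 to 10002 crosses a boundary and can be discarded if you only want within-paragraph matches.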

I'm sure you can come up with many other ways to represent this
information as well.

Hope this helps,

Chuck


Reuven Ivgi wrote on 10/02/2006 11:27 PM:
> Hello,
> To be more precise, the basic entity I am using is a document, each with
> up to a few thousand paragraphs. I need proximity search within a
> paragraph, yet I also want to get the paragraph number as part of the
> search result. Maybe defining each paragraph as a separate field is the
> best way.
> What do you think?
> Thanks in advance
>
> Reuven Ivgi
>
> -----Original Message-----
> From: Chuck Williams [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, October 03, 2006 10:58 AM
> To: java-dev@lucene.apache.org
> Subject: Re: Define end-of-paragraph
>
>
> Reuven Ivgi wrote on 10/02/2006 09:32 PM:
>> I want to divide a document to paragraphs, still having proximity search
>> within each paragraph
>>
>> How can I do that?
>
> Is your issue that you want the paragraphs to be in a single document,
> but you want to limit proximity search to find matches only within a
> single paragraph?  If so, you could parse your document into paragraphs
> and when generating tokens for it place large gaps at the paragraph
> boundaries.  Each Token in lucene has a startOffset and endOffset that
> you can set as you generate Tokens inside TokenStream.next() for the
> TokenStream returned by your Analyzer.  Those classes and methods are
> all in org.apache.lucene.analysis.  Or alternatively, you could make
> each paragraph a separate field value and use
> Analyzer.getPositionIncrementGap() to achieve essentially the same thing
> (except that your Documents could get unwieldy if you have that many
> paragraphs).
>
> If this is not what you are trying to do, then please explain your
> objectives precisely.
>
> Good luck,
>
> Chuck
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]

