Re: Lucene search question

Steven D. Majewski Tue, 13 Nov 2007 09:00:06 -0800


On Nov 13, 2007, at 7:21 AM, Cláudio Fernandes wrote:

Hello all,

I don't know if this is a somehow naive question, but here we go:
Does Lucene support index by sections? Like having a text document with
three sections divided by XML tags indexed in a way we could do a search
by work and section. Does Lucene itself support this kind of indexing or
should it be used with other engines like Cocoon?

Thanks in advance for your time,


Depends on what you mean by sections.
If your document divides up simply into fixed fields:
     <title>...</title>, <author>...</author> , <body>...</body>
or:  <part1>...</part1>, <part2>...</part2>, <part3>...</part3>
then you can make those into fields of your lucene index.
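
A rough sketch of the fixed-field case, in plain Python rather than the Lucene
API. The element names, the sample text, and the helper names
sections_to_fields and search are all made up for illustration. Each top-level
section becomes one named field of a single record, which is the shape you
would then hand to Lucene as one document with one field per section:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample file with a fixed set of top-level sections.
SAMPLE = """<doc>
  <title>On Indexing</title>
  <author>A. Writer</author>
  <body>Lucene stores each named field separately.</body>
</doc>"""


def sections_to_fields(xml_text, field_names=("title", "author", "body")):
    """Return {field name: text} for a fixed set of top-level sections."""
    root = ET.fromstring(xml_text)
    fields = {}
    for name in field_names:
        element = root.find(name)
        # A missing section becomes an empty field instead of raising an error.
        fields[name] = "".join(element.itertext()).strip() if element is not None else ""
    return fields


def search(fields, field_name, word):
    """Toy per-field lookup: is `word` a whitespace-separated token of that field?"""
    tokens = fields.get(field_name, "").lower().split()
    return word.lower() in tokens


fields = sections_to_fields(SAMPLE)
print(search(fields, "body", "lucene"))   # word occurs in the body section
print(search(fields, "title", "lucene"))  # same word does not occur in the title
```

A search "by word and section" is then just a lookup restricted to one field
name, which is what a per-field query does in a real index.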

But if there aren't a fixed number of sections, then fields probably won't
work. Lucene doesn't itself handle nesting or inclusion, so finding
text within some arbitrary div or finding the div holding the text
is not so straightforward. However, lucene has a flexible notion
of what a 'document' is. ( Basically, it's whatever unit you feed
it as a document. ) So if this is what you need, you might be able
to make each <div> into a "document" rather than each file.

If you were indexing a large TEI text and wanted to return a particular
chapter where the text was found, you could make each chapter a 'document',
and each document would have indexed fields to store the common header
info as well as the file name containing the chapter.
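
To make the chapter-as-document idea concrete, here is a minimal sketch in
plain Python (again not the Lucene API). The markup is a simplified, TEI-like
sample without namespaces, and the file name and the function
chapters_as_documents are hypothetical. Every chapter becomes its own record
that repeats the shared header fields and remembers which file it came from:

```python
import xml.etree.ElementTree as ET

# Simplified, TEI-like sample; real TEI uses namespaces and a richer header.
SAMPLE_TEI = """<TEI>
  <teiHeader><title>Sample Work</title><author>A. Writer</author></teiHeader>
  <text>
    <div type="chapter" n="1">The whale appears at dawn.</div>
    <div type="chapter" n="2">The ship sails at dusk.</div>
  </text>
</TEI>"""


def chapters_as_documents(xml_text, file_name):
    """Return one record per chapter, each carrying the common header info."""
    root = ET.fromstring(xml_text)
    header = root.find("teiHeader")
    common = {
        "title": header.findtext("title", default=""),
        "author": header.findtext("author", default=""),
        "file": file_name,
    }
    docs = []
    for div in root.iter("div"):
        if div.get("type") != "chapter":
            continue
        doc = dict(common)  # copy the shared fields into every chapter record
        doc["chapter"] = div.get("n", "")
        doc["body"] = "".join(div.itertext()).strip()
        docs.append(doc)
    return docs


docs = chapters_as_documents(SAMPLE_TEI, "sample.xml")
# A hit now identifies the chapter and the file, not just the whole text.
hits = [(d["file"], d["chapter"]) for d in docs if "whale" in d["body"].lower().split()]
print(hits)
```

The cost of this layout is that the header text is stored once per chapter,
but each search hit points directly at the unit you want to return.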

Lucene is great at finding documents, but not quite as good at finding
things IN documents. The index contains pointers to the terms, but they are
pointers to a token in the parsed token stream, so to find a character index
into a file, you have to (I believe) run the text thru the tokenizer again.
( But the lucene API gives you access to everything, even if it's not simple
or easy. I think there are some new features in the latest version that can
make this sort of thing easier, but I haven't yet figured out how to use
them. )
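
The re-tokenizing step can be sketched as follows. This is plain Python with a
simple regular-expression tokenizer standing in for whatever analyzer built the
index; it is an assumption for illustration, not how Lucene tokenizes, and the
approach only works if the same tokenization is used both times. Given a token
position (the kind of pointer the index holds), it walks the token stream again
and recovers the character span in the original text:

```python
import re

# Stand-in tokenizer: runs of word characters. An assumption, not Lucene's analyzer.
TOKEN = re.compile(r"\w+")


def tokens_with_offsets(text):
    """Yield (position, term, start_char, end_char) for each token in the text."""
    for position, match in enumerate(TOKEN.finditer(text)):
        yield position, match.group().lower(), match.start(), match.end()


def char_span_for_position(text, wanted_position):
    """Re-run the tokenizer and return the character span of one token position."""
    for position, _term, start, end in tokens_with_offsets(text):
        if position == wanted_position:
            return start, end
    return None  # position is past the end of the token stream


text = "Call me Ishmael. Some years ago"
span = char_span_for_position(text, 2)  # third token, counting from zero
print(span, text[span[0]:span[1]])
```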

-- Steve Majewski
( Not much of a lucene expert, but I've spent some time figuring out the
difference between document indexers like lucene and text indexers like
xpat/opentext. )







