On Nov 13, 2007, at 7:21 AM, Cláudio Fernandes wrote:
Hello all,
I don't know if this is a somewhat naive question, but here we go:
Does Lucene support indexing by sections? For example, a text document
with three sections divided by XML tags, indexed in a way that lets us
search by work and section. Does Lucene itself support this kind of
indexing, or should it be used with other engines like Cocoon?
Thanks in advance for your time,
Depends on what you mean by sections.
If your document divides up simply into fixed fields:
<title>...</title>, <author>...</author>, <body>...</body>
or: <part1>...</part1>, <part2>...</part2>, <part3>...</part3>
then you can make those into fields of your Lucene index.
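The fixed-field case can be sketched roughly as follows. This is plain
Python, not Lucene code: the sample record, FIELD_NAMES, to_fields and
search are all made-up names for illustration, and the dict stands in
for a Lucene document with one field per section.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample record with a fixed set of sections.
SAMPLE = (
    "<doc><title>Moby Dick</title>"
    "<author>Herman Melville</author>"
    "<body>Call me Ishmael.</body></doc>"
)

# Assumed field list; it must be known in advance for this approach to work.
FIELD_NAMES = ("title", "author", "body")


def to_fields(xml_text):
    """Map each fixed section to one named field."""
    root = ET.fromstring(xml_text)
    return {name: (root.findtext(name) or "").strip() for name in FIELD_NAMES}


def search(records, field, word):
    """Return indexes of records whose given field contains the word."""
    word = word.lower()
    return [
        i
        for i, rec in enumerate(records)
        if word in re.findall(r"\w+", rec[field].lower())
    ]
```

Searching "by word and section" then just means naming the field to look
in, e.g. search(records, "body", "Ishmael").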
But if there isn't a fixed number of sections, then fields probably
won't work. Lucene doesn't itself handle nesting or inclusion, so
finding text within some arbitrary div, or finding the div holding the
text, is not so straightforward. However, Lucene has a flexible notion
of what a 'document' is. (Basically, it's whatever unit you feed it as
a document.) So if this is what you need, you might be able to make
each <div> into a "document" rather than each file.
If you were indexing a large TEI text and wanted to return the
particular chapter where the text was found, you could make each
chapter a 'document', and each document would have indexed fields to
store the common header info as well as the name of the file
containing the chapter.
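A rough sketch of the chapter-per-document idea, again in plain Python
rather than Lucene itself. The sample file is a heavily simplified,
made-up stand-in for TEI (real TEI uses namespaces and a richer header),
and chapter_documents and find_chapters are hypothetical helpers.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified TEI-like input; not valid TEI.
SAMPLE = """<TEI>
  <teiHeader><title>Walden</title><author>Henry David Thoreau</author></teiHeader>
  <text>
    <div type="chapter" n="1"><head>Economy</head><p>I lived alone in the woods.</p></div>
    <div type="chapter" n="2"><head>Sounds</head><p>The whistle of the locomotive.</p></div>
  </text>
</TEI>"""


def chapter_documents(xml_text, file_name):
    """Turn one file into one 'document' (a dict of fields) per chapter."""
    root = ET.fromstring(xml_text)
    header = root.find("teiHeader")
    # Header info and file name are copied into every chapter document.
    common = {
        "title": header.findtext("title"),
        "author": header.findtext("author"),
        "file": file_name,
    }
    docs = []
    for div in root.iter("div"):
        if div.get("type") != "chapter":
            continue
        doc = dict(common)
        doc["chapter"] = div.get("n")
        # Flatten the chapter's text and normalise whitespace.
        doc["text"] = " ".join(" ".join(div.itertext()).split())
        docs.append(doc)
    return docs


def find_chapters(docs, word):
    """Return (file, chapter) pairs for chapters containing the word."""
    word = word.lower()
    return [
        (d["file"], d["chapter"])
        for d in docs
        if word in re.findall(r"\w+", d["text"].lower())
    ]
```

Because every chapter document carries the file name and header fields,
a hit identifies both the chapter and where it came from without any
notion of nesting in the index.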
Lucene is great at finding documents, but not quite as good at finding
things IN documents. The index contains pointers to the terms, but they
are pointers to a token in the parsed token stream, so to find a
character index into a file, you have to (I believe) run the text
through the tokenizer again.
(But the Lucene API gives you access to everything, even if it's not
simple or easy. I think there are some new features in the latest
version that can make this sort of thing easier, but I haven't yet
figured out how to use them.)
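To illustrate the difference between token positions and character
offsets, here is a small sketch in plain Python. It does not use or
mirror the Lucene API; the regex tokenizer, build_positions and
char_offsets are assumptions made up for the example.

```python
import re

# Assumed stand-in tokenizer: runs of word characters.
TOKEN = re.compile(r"\w+")


def build_positions(text):
    """Index each term by its ordinal position in the token stream."""
    index = {}
    for pos, match in enumerate(TOKEN.finditer(text)):
        index.setdefault(match.group().lower(), []).append(pos)
    return index


def char_offsets(text, positions):
    """Re-run the same tokenizer to map token positions to (start, end)."""
    wanted = set(positions)
    return [
        match.span()
        for pos, match in enumerate(TOKEN.finditer(text))
        if pos in wanted
    ]
```

The index alone only says "token number N"; getting back to a place in
the file needs the original text and the same tokenizer, e.g.
char_offsets(text, build_positions(text)["whale"]).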
-- Steve Majewski
(Not much of a Lucene expert, but I've spent some time figuring out the
difference between document indexers like Lucene and text indexers like
xpat/opentext.)