Mark Miller wrote:
Depends on the work you want to do. If you want to highlight a simple
XML doc the approach would be to extract all of the text elements and
run them through the highlighter and then correctly update them. That
would be mostly simple DOM manipulation.
OK.
I guess there will be some details that need special attention. One case
that springs to mind is the occurrence of words that in the original
document are broken up by encoding, like "en<hyphen/>coding" or
"<em>mid</em>range".
The same approach should work with any format but the difficulty in
modifying the text may increase. If you can pull the text out
appropriately it would seem you could put it back in though, or modify
it in place as you might with the DOM.
Do you know if tools (classes) for "appropriate" extraction from "my"
file formats already exist in Lucene? I.e, something that not just
extracts the text, but keeps track of its position in the original?
I saw POI <http://jakarta.apache.org/poi/> mentioned in a posting on
this list. Perhaps a solution for Word documents can be based on POI.
- Øystein -
- Mark
Oystein Reigem wrote:
Hi,
I want to implement fulltext search on a collection of documents. I
try to figure out which system is the better choice - eXist, or
Lucene, or some combination of the two. I have some knowledge of
eXist, but don't know too much about Lucene.
I'd like to display the result of a search as a list of
excerpts/snippets with highlighted search words. When the user clicks
an item in the result list to bring up the document in full, I'd like
to have search words highlighted in the full document as well.
The document collection is very diverse. There are pure text
documents and well-formed XML and HTML documents, but unfortunately
also HTML documents that are not quite well-formed, Word documents
and PDFs. Many of the formats go beyond what eXist and Lucene can
handle, and I realise some conversion, or text extraction, is
necessary. As far as I know Lucene can only index and search pure
text (and fields), so the documents must be run through appropriate
filters extracting the text (and field values). Afterwards fulltext
search is possible.
But what about highlighting? I know it is possible to get
highlighting in the pure text version, but what about the original
document, when the original document is something else than pure
text, e.g, a simple XML document? Is it at all possible to get the
search words tagged in the XML document?
I assume not, but ask anyway. :-)
Cheers,
- Øystein -
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--
Øystein Reigem, The department of culture, language and information technology (Aksis), Allegt
27, N-5007 Bergen, Norway. Tel: +47 55 58 32 42. Fax: +47 55 58 94 70. E-mail: <[EMAIL
PROTECTED]>. Home tel: +47 56 14 06 11. Mobile: +47 97 16 96 64. Home e-mail: <[EMAIL
PROTECTED]>. Aksis home page: <www.aksis.uib.no>.