Re: Highlighting of original documents

Oystein Reigem Tue, 13 Mar 2007 08:22:49 -0800

Mark Miller wrote:

Depends on the work you want to do. If you want to highlight a simple XML doc the approach would be to extract all of the text elements and run them through the highlighter and then correctly update them. That would be mostly simple DOM manipulation.
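
To make the DOM approach concrete, here is a minimal sketch in Python rather than Java, using only the standard library. It is not Lucene code: a plain whole-word regex stands in for the real highlighter, and the function name `highlight_text_nodes` and the `<hit>` wrapper element are assumptions made up for this illustration.

```python
import re
from xml.dom import minidom


def highlight_text_nodes(xml_source, term, tag="hit"):
    """Wrap each whole-word occurrence of `term` found inside a single
    text node in a <tag> element (hypothetical element name)."""
    doc = minidom.parseString(xml_source)
    # Stand-in for a real highlighter: case-insensitive whole-word match.
    pattern = re.compile(r"\b%s\b" % re.escape(term), re.IGNORECASE)

    def text_nodes(node):
        for child in list(node.childNodes):
            if child.nodeType == child.TEXT_NODE:
                yield child
            else:
                yield from text_nodes(child)

    # Collect the text nodes first, because the loop below edits the tree.
    for node in list(text_nodes(doc)):
        text = node.data
        matches = list(pattern.finditer(text))
        if not matches:
            continue
        parent = node.parentNode
        pos = 0
        for m in matches:
            if m.start() > pos:
                # Unmatched text before the hit stays a plain text node.
                parent.insertBefore(doc.createTextNode(text[pos:m.start()]), node)
            hit = doc.createElement(tag)
            hit.appendChild(doc.createTextNode(m.group(0)))
            parent.insertBefore(hit, node)
            pos = m.end()
        if pos < len(text):
            parent.insertBefore(doc.createTextNode(text[pos:]), node)
        # The original text node has been replaced by the pieces above.
        parent.removeChild(node)
    return doc.documentElement.toxml()


print(highlight_text_nodes("<p>Lucene can index text. Try lucene.</p>", "lucene"))
```

Because each text node is matched on its own, a word that is split by markup is not found by this sketch.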

OK.

I guess there will be some details that need special attention. One case that springs to mind is the occurrence of words that in the original document are broken up by encoding, like "en<hyphen/>coding" or "<em>mid</em>range".
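
One possible way to handle such split words, sketched in Python with the standard library only: join the text nodes in document order, search the joined string, and map each match back to the pieces of the text nodes it covers. The function name `find_across_nodes` is hypothetical, and the sketch assumes that adjacent text nodes really belong to the same word, which will not hold across block-level elements.

```python
from xml.dom import minidom


def find_across_nodes(xml_source, term):
    """For each match of `term` in the joined text, return the list of
    (parent tag name, matched piece) pairs that make up the match."""
    if not term:
        return []
    doc = minidom.parseString(xml_source)
    pieces = []  # (start offset in the joined text, text node)
    joined = []
    offset = 0

    def walk(node):
        nonlocal offset
        for child in node.childNodes:
            if child.nodeType == child.TEXT_NODE:
                pieces.append((offset, child))
                joined.append(child.data)
                offset += len(child.data)
            else:
                walk(child)

    walk(doc)
    text = "".join(joined)
    results = []
    start = text.find(term)
    while start != -1:
        end = start + len(term)
        segments = []
        for node_start, node in pieces:
            node_end = node_start + len(node.data)
            # Overlap between the match and this text node, if any.
            lo, hi = max(start, node_start), min(end, node_end)
            if lo < hi:
                piece = node.data[lo - node_start:hi - node_start]
                segments.append((node.parentNode.tagName, piece))
        results.append(segments)
        start = text.find(term, end)
    return results


print(find_across_nodes("<p>a <em>mid</em>range tool</p>", "midrange"))
```

Each returned segment could then be wrapped separately, so that the existing markup stays well-formed.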

The same approach should work with any format but the difficulty in modifying the text may increase. If you can pull the text out appropriately it would seem you could put it back in though, or modify it in place as you might with the DOM.

Do you know if tools (classes) for "appropriate" extraction from "my" file formats already exist in Lucene? I.e., something that does not just extract the text, but also keeps track of its position in the original?
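
To illustrate what "keeps track of its position in the original" could mean, here is a naive Python sketch. It is not an existing Lucene class. The names `extract_with_offsets` and `source_spans` are hypothetical, and the tag-stripping regex is an assumption that ignores comments, CDATA sections and entity references.

```python
import re

# Naive tokenizer: either a tag or a run of text. An assumption for this
# sketch only; a real filter would use a proper parser for the format.
TOKEN = re.compile(r"<[^>]*>|[^<]+")


def extract_with_offsets(source):
    """Return (plain_text, offsets), where offsets[i] is the index in
    `source` of the character plain_text[i]."""
    chars = []
    offsets = []
    for m in TOKEN.finditer(source):
        if m.group(0).startswith("<"):
            continue  # markup contributes no text
        for i, ch in enumerate(m.group(0)):
            chars.append(ch)
            offsets.append(m.start() + i)
    return "".join(chars), offsets


def source_spans(source, term):
    """Map each match of `term` in the extracted text back to one or more
    contiguous (start, end) ranges in the original source."""
    if not term:
        return []
    text, offsets = extract_with_offsets(source)
    spans = []
    start = text.find(term)
    while start != -1:
        runs = []
        for pos in offsets[start:start + len(term)]:
            if runs and runs[-1][1] == pos:
                runs[-1][1] = pos + 1  # extends the current range
            else:
                runs.append([pos, pos + 1])  # markup intervened: new range
        spans.append([tuple(r) for r in runs])
        start = text.find(term, start + len(term))
    return spans


source = "<p><em>mid</em>range</p>"
print(source_spans(source, "midrange"))
```

A match that is interrupted by markup comes back as several ranges, one per uninterrupted run of characters in the original.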

I saw POI <http://jakarta.apache.org/poi/> mentioned in a posting on this list. Perhaps a solution for Word documents can be based on POI.


- Øystein -

- Mark

Oystein Reigem wrote:
Hi,
I want to implement fulltext search on a collection of documents. I try to figure out which system is the better choice - eXist, or Lucene, or some combination of the two. I have some knowledge of eXist, but don't know too much about Lucene.
I'd like to display the result of a search as a list of excerpts/snippets with highlighted search words. When the user clicks an item in the result list to bring up the document in full, I'd like to have search words highlighted in the full document as well.
The document collection is very diverse. There are pure text documents and well-formed XML and HTML documents, but unfortunately also HTML documents that are not quite well-formed, Word documents and PDFs. Many of the formats go beyond what eXist and Lucene can handle, and I realise some conversion, or text extraction, is necessary. As far as I know Lucene can only index and search pure text (and fields), so the documents must be run through appropriate filters extracting the text (and field values). Afterwards fulltext search is possible.
But what about highlighting? I know it is possible to get highlighting in the pure text version, but what about the original document, when the original document is something else than pure text, e.g., a simple XML document? Is it at all possible to get the search words tagged in the XML document?
I assume not, but ask anyway. :-)

Cheers,

- Øystein -



--
Øystein Reigem, The department of culture, language and information technology (Aksis),
Allegt 27, N-5007 Bergen, Norway. Tel: +47 55 58 32 42. Fax: +47 55 58 94 70.
E-mail: <[EMAIL PROTECTED]>. Home tel: +47 56 14 06 11. Mobile: +47 97 16 96 64.
Home e-mail: <[EMAIL PROTECTED]>. Aksis home page: <www.aksis.uib.no>.
