Oh, good luck on this! I've had similar issues and have just thrown up my hands. How do you expect to be able to correlate a word in the index with the bounding box in the OCR? I'm not sure this is a solved problem unless your OCR is *very* regular and clean. Even if you can calculate the ordinal position of the word, you'd be hosed if the OCR image was, say, slightly tilted at scan time.
Or do you have information about where on the page every word is that you somehow store and retrieve for highlighting purposes? Because I don't think this is a problem that Lucene/Solr can solve, unless I just completely fail to understand things. Which wouldn't be the first time...

Unless you have data telling you where each word appears in the original OCR, I don't know how you'd go from knowing that a word appeared on a page to being able to calculate its bounding box. And if you *do* have this info, you don't need Lucene/Solr to know what to highlight; all you need to know is which words appear on which pages. That's a non-trivial thing to get back from Lucene/Solr, but at least it's doable.

Or I'm just completely off base here...

Best
Erick

On Nov 30, 2007 4:02 PM, Owens, Martin <[EMAIL PROTECTED]> wrote:
>
> Hello everyone,
>
> We're working to replace the old Linux version of dtSearch with
> Lucene/Solr, using HTTP requests for our Perl side and Java for the
> indexing.
>
> The functionality that is causing the most problems is the highlighting.
> We're not storing the text in Solr (only indexing it), and we need to
> highlight an image file (OCR). So what we really need is to request from
> Solr the word indexes of the matches; we then tie these up to the OCR
> image and create HTML boxes to do the highlighting.
>
> The text is also multi-page, with each page separated by Ctrl-L page
> breaks. Should we handle the paging ourselves, or can Solr tell us which
> page the match happened on too?
>
> Thanks for your help,
>
> Best Regards, Martin Owens
>
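
For what it's worth, here is a rough Python sketch of the "handle the paging yourselves" route, under some loud assumptions: that you can get the ordinal word position of each match out of the index somehow, that your tokenizer agrees with a plain whitespace split (no stop-word removal or position gaps), and that your OCR output already gives you one bounding box per word in reading order. The function names and the `ocr_boxes` layout are made up for illustration; nothing here is a Lucene/Solr API.

```python
from bisect import bisect_right

FORM_FEED = "\f"  # Ctrl-L page break


def page_word_offsets(text):
    """Return the global ordinal of the first word on each page."""
    offsets = []
    total = 0
    for page in text.split(FORM_FEED):
        offsets.append(total)
        # Assumption: whitespace tokenization matches the index analyzer.
        total += len(page.split())
    return offsets


def locate(word_index, offsets):
    """Map a global word ordinal to (page, ordinal within page), both 0-based."""
    page = bisect_right(offsets, word_index) - 1
    return page, word_index - offsets[page]


def highlight_box(word_index, offsets, ocr_boxes):
    """Look up the box for a matched word.

    ocr_boxes is a hypothetical structure: one list per page, each holding
    (x, y, width, height) tuples in the same order as the words on that page.
    """
    page, local = locate(word_index, offsets)
    return page, ocr_boxes[page][local]


if __name__ == "__main__":
    text = "alpha beta\fgamma delta epsilon\fzeta"
    boxes = [
        [(10, 10, 40, 12), (60, 10, 40, 12)],
        [(10, 10, 50, 12), (70, 10, 45, 12), (125, 10, 60, 12)],
        [(10, 10, 35, 12)],
    ]
    offsets = page_word_offsets(text)
    # Word 3 overall is "delta", the second word on the second page.
    print(highlight_box(3, offsets, boxes))
```

If the tokenizer and the OCR word order disagree even slightly (hyphenation, dropped stop words, a skewed scan that reorders lines), the ordinals drift and the boxes land on the wrong words, which is exactly the worry above.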