Oh, good luck on this! I've had similar issues and have just thrown up my
hands. How do you expect to be able to correlate a word in the index
with the bounding box in the OCR? I'm not sure this is a solved problem
unless your OCR is *very* regular and clean. Even if you can calculate
the ordinal position of the word, you'd be hosed if the OCR image was,
say, slightly tilted at scan time.

Or do you have information about where on the page every word is that
you somehow store and retrieve for highlighting purposes?

Because I don't think this is a problem that Lucene/Solr can solve, unless
I just completely fail to understand things. Which wouldn't be the first
time...

Unless you have data telling you where each word appears in the original
OCR, I don't know how you'd go from knowing that a word appeared on a page
to being able to calculate its bounding box. And if you *do* have this
info, you don't need Lucene/Solr to know what to highlight; all you need
to know is which words appear on which pages. That is a non-trivial thing
to get back from Lucene/Solr, but at least it's doable.
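
To make that concrete, here is a rough Python sketch of the lookup I have in
mind. It assumes things the original post doesn't confirm: that you keep,
per page, a list of word boxes in reading order (the ocr_boxes layout below
is made up), and that splitting the text on whitespace yields the same word
count as the OCR data. The names word_positions and boxes_for are
hypothetical helpers, not anything Lucene/Solr provides.

```python
from typing import Dict, List, Tuple

# Assumed box layout: (left, top, width, height) in image pixels.
Box = Tuple[int, int, int, int]


def word_positions(text: str) -> Dict[str, List[Tuple[int, int]]]:
    """Map each lowercased word to (page, ordinal-on-page) pairs.

    Pages are separated by form feed characters (Ctrl-L, "\f").
    Pages and ordinals are both zero-based.
    """
    positions: Dict[str, List[Tuple[int, int]]] = {}
    for page_no, page in enumerate(text.split("\f")):
        for ordinal, word in enumerate(page.split()):
            positions.setdefault(word.lower(), []).append((page_no, ordinal))
    return positions


def boxes_for(term: str, text: str,
              ocr_boxes: Dict[int, List[Box]]) -> List[Box]:
    """Return the stored box for every occurrence of term.

    ocr_boxes[page][ordinal] is assumed to hold the box of the
    ordinal-th word on that page.
    """
    hits = word_positions(text).get(term.lower(), [])
    return [ocr_boxes[page][ordinal] for page, ordinal in hits]


# Tiny made-up example: two pages, three words each.
text = "the quick fox\fthe lazy dog"
ocr_boxes = {
    0: [(10, 10, 30, 12), (45, 10, 50, 12), (100, 10, 30, 12)],
    1: [(10, 10, 30, 12), (45, 10, 40, 12), (90, 10, 30, 12)],
}
dog_boxes = boxes_for("dog", text, ocr_boxes)  # one hit, on the second page
```

Note that nothing here touches the image itself, so a tilted scan only
matters if the stored boxes were wrong to begin with. The fragile part is
keeping the word count in the text and in the box data in step, for example
around punctuation and hyphenated words.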

Or I'm just completely off base here.....

Best
Erick




On Nov 30, 2007 4:02 PM, Owens, Martin <[EMAIL PROTECTED]> wrote:

>
> Hello everyone,
>
> We're working to replace the old Linux version of dtSearch with
> Lucene/Solr, using the http requests for our perl side and java for the
> indexing.
>
> The functionality that is causing the most problems is the highlighting.
> We're not storing the text in Solr (only indexing it), and we need to
> highlight an image file (OCR). What we really need is to request from
> Solr the word indexes of the matches; we then tie these to the OCR image
> and create HTML boxes to do the highlighting.
>
> The text is also multi-page; each page is separated by Ctrl-L page breaks.
> Should we handle the paging ourselves, or can Solr tell us which page the
> match happened on too?
>
> Thanks for your help,
>
> Best Regards, Martin Owens
>
