Corey,
  I have one off the wall approach that may or may not work for you.
If you convert your scanned images to PDF then use something like
Acrobat to convert those PDFs into PDFs with hidden text (The OCR
data). You can then tell Acrobat Reader via XML what to highlight when
your user opens the PDF.
  Not sure if that helps you but may give you some alternate ideas.

Richard


On 6/2/05, Corey Keith <[EMAIL PROTECTED]> wrote:
> Hi,
> 
> I am involved in a project which is trying to provide searching and hit 
> highlighting on the scanned image of historical newspapers.  We have an XML 
> based OCR format.  A sample is below.  We need to index the CONTENT attribute 
> of the String element which is the easy part.  We would like to be able find 
> the "hits" within this XML document in order to use the positioning 
> information to draw the highlight boxes on the image.  It doesn't make a lot 
> of sense to just extract the CONTENT and index that because we loose the 
> positioning information.  My second thought was to make a custom analyzer 
> which dropped everything except for the content element and then used the 
> highlighting class in the sandbox to reanalyze the XML document and mark the 
> hits.  With the marked hits in the XML we could find the position information 
> and draw on the image.  Has anyone else worked with OCR information and 
> lucene.  What was your approach?  Does this approach seem sound?  Any 
> recommendations?
> 
> Thanks, Corey
> 
>      <TextLine HEIGHT="2307.0" WIDTH="2284.0" HPOS="1316.0" VPOS="123644.0">
>       <String STYLEREFS="ID4" HEIGHT="1922.0" WIDTH="244.0" HPOS="1316.0" 
> VPOS="123644.0" CONTENT="The" WC="1.0"/>
>       <SP WIDTH="-244.0" HPOS="1560.0" VPOS="123644.0"/>
>       <String STYLEREFS="ID4" HEIGHT="1914.0" WIDTH="424.0" HPOS="1664.0" 
> VPOS="123711.0" CONTENT="female" WC="1.0"/>
>       <SP WIDTH="184.0" HPOS="1480.0" VPOS="123644.0"/>
>       <String STYLEREFS="ID4" HEIGHT="2174.0" WIDTH="240.0" HPOS="2192.0" 
> VPOS="123711.0" CONTENT="lays" WC="1.0"/>
>       <SP WIDTH="104.0" HPOS="2088.0" VPOS="123711.0"/>
>       <String STYLEREFS="ID4" HEIGHT="1981.0" WIDTH="360.0" HPOS="2528.0" 
> VPOS="123711.0" CONTENT="about" WC="1.0"/>
>       <SP WIDTH="236.0" HPOS="2292.0" VPOS="123711.0"/>
>       <String STYLEREFS="ID4" HEIGHT="1855.0" WIDTH="216.0" HPOS="3000.0" 
> VPOS="123770.0" CONTENT="140" WC="1.0"/>
>       <SP WIDTH="112.0" HPOS="2888.0" VPOS="123711.0"/>
>       <String STYLEREFS="ID4" HEIGHT="1729.0" WIDTH="284.0" HPOS="3316.0" 
> VPOS="124223.0" CONTENT="eggs" WC="1.0"/>
>       <SP WIDTH="100.0" HPOS="3216.0" VPOS="123770.0"/>
>      </TextLine>
> 
> 
> 
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to