Re: Indexing and Hit Highlighting OCR Data

Erik Hatcher Fri, 03 Jun 2005 02:49:22 -0700


On Jun 2, 2005, at 9:02 PM, Chris Hostetter wrote:

This is a pretty interesting problem.  I envy you.

I would avoid the existing highlighter for your purposes -- highlighting
in token space is a very different problem from "highlighting" in 2D
space.

Based on the XML sample you provided, it looks like your XML files
are already a "tokenized" form of the original OCR data -- by which I
mean the page has already been tokenized into words whose positions are
recorded.

I would parse these XML docs to generate two things:

    1) a stream of words for analysis/filtering (ie: stop words, stemming,
       synonyms)
    2) a data structure mapping words to lists of positions (ie: if the
       same word appears in multiple places, list the word once, followed
       by each set of coordinates)

Use #1 in the usual way, and add a serialized form of #2 to your index as
a Stored Keyword -- at query time, the words from your initial query can
be looked up in that data structure to find the regions to "highlight".
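
The parsing step described above could be sketched roughly as follows. This
is only an illustration in Python, not Lucene code: the function name
parse_ocr, the (HPOS, VPOS, WIDTH, HEIGHT) tuple layout, and the choice to
lowercase the map keys are all assumptions. In practice the keys would need
to match whatever terms the analyzer actually produces.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Two String elements taken from the sample in the original question.
SAMPLE = """<TextLine HEIGHT="2307.0" WIDTH="2284.0" HPOS="1316.0" VPOS="123644.0">
  <String STYLEREFS="ID4" HEIGHT="1922.0" WIDTH="244.0" HPOS="1316.0" VPOS="123644.0" CONTENT="The" WC="1.0"/>
  <SP WIDTH="-244.0" HPOS="1560.0" VPOS="123644.0"/>
  <String STYLEREFS="ID4" HEIGHT="1914.0" WIDTH="424.0" HPOS="1664.0" VPOS="123711.0" CONTENT="female" WC="1.0"/>
</TextLine>"""


def parse_ocr(xml_text):
    """Return (words, positions) for one OCR XML fragment.

    words     -- the CONTENT values in document order (item #1, to be analyzed)
    positions -- lowercased word -> list of (HPOS, VPOS, WIDTH, HEIGHT) boxes
                 (item #2; lowercasing is an assumed normalization)
    """
    root = ET.fromstring(xml_text)
    words = []
    positions = defaultdict(list)
    for el in root.iter("String"):
        word = el.get("CONTENT")
        if not word:
            continue  # skip String elements that carry no text
        box = tuple(float(el.get(k)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        words.append(word)
        positions[word.lower()].append(box)
    return words, dict(positions)


words, positions = parse_ocr(SAMPLE)
```

Here words would feed the normal analysis chain, while positions is the
structure that gets serialized and stored alongside the document.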

Chris - that is a great recommendation. I second it. The only minor
thing I'll add is that you probably should use an unindexed field for
#2 rather than literally a Field.Keyword - no point in indexing it as
you would never search on that data structure.
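
To make the stored-but-not-searched idea concrete, here is a small Python
sketch of serializing the word-to-positions map into a single string (the
value that would go into the unindexed, stored field) and reading it back
at query time. JSON as the serialization format, the function names, and
the box layout of [HPOS, VPOS, WIDTH, HEIGHT] are assumptions made for
illustration; the coordinates come from the sample in the original
question.

```python
import json


def serialize_positions(positions):
    """Turn the word -> boxes map into one string suitable for a stored field."""
    return json.dumps(positions, sort_keys=True)


def find_highlight_boxes(stored_value, query_terms):
    """Look up each query term in the stored map and collect its boxes."""
    positions = json.loads(stored_value)
    boxes = []
    for term in query_terms:
        # Assumed normalization: keys were lowercased when the map was built.
        boxes.extend(tuple(b) for b in positions.get(term.lower(), []))
    return boxes


# Hypothetical map for one page, boxes as [HPOS, VPOS, WIDTH, HEIGHT].
positions = {
    "eggs": [[3316.0, 124223.0, 284.0, 1729.0]],
    "female": [[1664.0, 123711.0, 424.0, 1914.0]],
}
stored = serialize_positions(positions)
boxes = find_highlight_boxes(stored, ["Eggs"])
```

Since the lookup only ever happens on documents that already matched the
query, the stored string is read back from the hit and never searched
itself.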


    Erik

: I am involved in a project which is trying to provide searching
: and hit highlighting on the scanned images of historical
: newspapers. We have an XML based OCR format. A sample is below.
: We need to index the CONTENT attribute of the String element, which
: is the easy part. We would like to be able to find the "hits" within
: this XML document in order to use the positioning information to
: draw the highlight boxes on the image. It doesn't make a lot of
: sense to just extract the CONTENT and index that, because we lose
: the positioning information. My second thought was to make a
: custom analyzer which dropped everything except for the content
: element and then used the highlighting class in the sandbox to
: reanalyze the XML document and mark the hits. With the marked hits
: in the XML we could find the position information and draw on the
: image. Has anyone else worked with OCR information and Lucene?
: What was your approach? Does this approach seem sound? Any
: recommendations?
:
: Thanks, Corey
:
:      <TextLine HEIGHT="2307.0" WIDTH="2284.0" HPOS="1316.0" VPOS="123644.0">
:       <String STYLEREFS="ID4" HEIGHT="1922.0" WIDTH="244.0" HPOS="1316.0" VPOS="123644.0" CONTENT="The" WC="1.0"/>
:       <SP WIDTH="-244.0" HPOS="1560.0" VPOS="123644.0"/>
:       <String STYLEREFS="ID4" HEIGHT="1914.0" WIDTH="424.0" HPOS="1664.0" VPOS="123711.0" CONTENT="female" WC="1.0"/>
:       <SP WIDTH="184.0" HPOS="1480.0" VPOS="123644.0"/>
:       <String STYLEREFS="ID4" HEIGHT="2174.0" WIDTH="240.0" HPOS="2192.0" VPOS="123711.0" CONTENT="lays" WC="1.0"/>
:       <SP WIDTH="104.0" HPOS="2088.0" VPOS="123711.0"/>
:       <String STYLEREFS="ID4" HEIGHT="1981.0" WIDTH="360.0" HPOS="2528.0" VPOS="123711.0" CONTENT="about" WC="1.0"/>
:       <SP WIDTH="236.0" HPOS="2292.0" VPOS="123711.0"/>
:       <String STYLEREFS="ID4" HEIGHT="1855.0" WIDTH="216.0" HPOS="3000.0" VPOS="123770.0" CONTENT="140" WC="1.0"/>
:       <SP WIDTH="112.0" HPOS="2888.0" VPOS="123711.0"/>
:       <String STYLEREFS="ID4" HEIGHT="1729.0" WIDTH="284.0" HPOS="3316.0" VPOS="124223.0" CONTENT="eggs" WC="1.0"/>
:       <SP WIDTH="100.0" HPOS="3216.0" VPOS="123770.0"/>
:      </TextLine>
:
:
:



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



