Hello,

Our pages are images of handwritten Arabic text, so OCR is not possible.
We will be extracting the text during pre-processing and storing the words and
their (x, y) coordinates in a database. Would your process apply to our images?

> Step 1:
> To send the extracted text content from a text PDF to Solr, use a low-level
> PDF converter such as poppler-utils (pdftotext or pdftohtml) to correctly get
> the coordinates and page number of each word. Store it in a separate file as
> a word map. This word map will contain a mapping between page + coordinates
> and the occurrence number of each word.

Can we generate a word map manually? Is it used by Solr, and does it require a
specific format?
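
To make the question concrete, here is a minimal Python sketch of how I imagine
a manually generated word map. Everything in it is an assumption on my part:
the record layout (word, page, x, y), the function name build_word_map, and the
idea that the list index serves as the occurrence number. It is not a format
that Solr is known to require.

```python
from collections import defaultdict

def build_word_map(records):
    """Build a word map from (word, page, x, y) records given in reading order.

    Returns {word: [(page, x, y), ...]}; the list index is the 0-based
    occurrence number of that word in the document (an assumed convention).
    """
    word_map = defaultdict(list)
    for word, page, x, y in records:
        word_map[word].append((page, x, y))
    return dict(word_map)

# Hypothetical rows as they might come out of our database.
records = [
    ("كتاب", 1, 120, 40),
    ("قلم", 1, 80, 40),
    ("كتاب", 2, 60, 300),
]
word_map = build_word_map(records)
```

With this layout, word_map["كتاب"][1] would be the page and coordinates of the
second occurrence of that word.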

> Step 2:
> The Solr highlighter needs to be changed to return each word and its
> occurrence number in the text document, rather than the character offsets
> of each hit.

How is this done? I read the Solr highlighting wiki, but I don't see how this
can be done.
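
In case changing the highlighter itself is not required, the only alternative I
can think of is converting character offsets to occurrence numbers outside
Solr. The Python sketch below is just that workaround, not a highlighter
change. It assumes whitespace tokenization, which may not match the analyzer
configured in Solr, and the function name occurrence_at is hypothetical.

```python
import re

def occurrence_at(text, start_offset):
    """Return (word, n) for the token covering start_offset in text.

    n is the 0-based count of earlier identical tokens. Tokens are runs of
    non-whitespace characters (an assumption; Solr's analyzer may differ).
    Returns None if the offset falls on whitespace or outside the text.
    """
    counts = {}
    for match in re.finditer(r"\S+", text):
        word = match.group()
        n = counts.get(word, 0)
        if match.start() <= start_offset < match.end():
            return word, n
        counts[word] = n + 1
    return None

text = "كتاب قلم كتاب"
hit = occurrence_at(text, text.rindex("كتاب"))
```

Here hit identifies the second occurrence of the word, which is the kind of
value I understand Step 2 to need.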

> Step 3:
> Combine the Solr output with the word map created in step 1 to generate the
> page and coordinates in the original PDF document, which can then be
> highlighted by any viewer.

Can I get more information about how to do this?
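
My current guess at the combination is a plain lookup, sketched in Python
below. The word map layout ({word: [(page, x, y), ...]} indexed by occurrence
number), the hit format (word, occurrence), and the function name locate_hits
are all my assumptions, so please correct me if the real process differs.

```python
def locate_hits(hits, word_map):
    """Translate (word, occurrence) hits into (page, x, y) locations.

    Hits whose word or occurrence number is missing from the word map are
    skipped rather than raising an error.
    """
    locations = []
    for word, n in hits:
        positions = word_map.get(word, [])
        if 0 <= n < len(positions):
            locations.append(positions[n])
    return locations

# Hypothetical word map and hits for illustration only.
word_map = {
    "كتاب": [(1, 120, 40), (2, 60, 300)],
    "قلم": [(1, 80, 40)],
}
hits = [("كتاب", 1), ("قلم", 0), ("ورقة", 0)]
locations = locate_hits(hits, word_map)
```

The resulting locations would then be handed to the viewer to draw the
highlights on the page images.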

Thanks!
