Hello,

Our pages are images of handwritten Arabic text, so OCR is not an option. Instead, we will extract the text during pre-processing and store each word with its (x, y) coordinates in a database. Would your process apply to our images?

> Step 1:
> For sending the extracted text content from text pdf to solr, use a low level
> pdf converter such as poppler-utils (pdftotext or pdftohtml) to correctly get
> the coordinates and page no. of each word. Store it in a separate file as word
> map. This word map will contain page+coordinates mapping to occurrence
> number for word.

Can we generate the word map manually? Is it read by Solr, and does it require a specific format?

> Step 2:
> Solr highlighter needs to be changed to get the word and their occurrence
> number in the text document, rather than the character offsets for each hit.

How is this done? I read the Solr highlighting wiki, but I don't see how to do it.

> Step 3:
> Combine the solr output to the word map created in step 1 and the pdf page
> and coordinates can be generated for original pdf document which can be
> highlighted by any viewer.

Can I get more information about how to do this?

Thanks!
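
P.S. To make my questions about steps 1 and 3 concrete, here is a rough Python sketch of how I currently picture them, in case I have misunderstood. This is only my guess and not your actual format: the names WordBox, build_word_map and locate_hits and the sample rows are invented for illustration. I am also assuming that "occurrence number" means the n-th appearance of a word in the document (counting from 1), and that step 2 can hand back (word, occurrence number) pairs.

```python
from collections import defaultdict
from typing import NamedTuple


class WordBox(NamedTuple):
    """Hypothetical word-map entry: where one word sits in the original document."""
    page: int
    x: int
    y: int


def build_word_map(records):
    """Step 1 (as I understand it): build the word map.

    records: iterable of (word, page, x, y) tuples in reading order,
    e.g. rows read from our database.
    Returns {(word, occurrence_number): WordBox}, where occurrence_number
    is 1 for the first appearance of that word, 2 for the second, and so on.
    """
    seen = defaultdict(int)
    word_map = {}
    for word, page, x, y in records:
        seen[word] += 1
        word_map[(word, seen[word])] = WordBox(page, x, y)
    return word_map


def locate_hits(word_map, hits):
    """Step 3 (as I understand it): join search hits with the word map.

    hits: iterable of (word, occurrence_number) pairs, which is what I assume
    the modified highlighter from step 2 would return.
    Pairs that are not in the word map are skipped.
    """
    return [word_map[hit] for hit in hits if hit in word_map]


# Made-up sample rows; real rows would hold the Arabic words themselves.
records = [
    ("kitab", 1, 120, 80),
    ("qalam", 1, 60, 80),
    ("kitab", 2, 200, 310),
]
word_map = build_word_map(records)
print(locate_hits(word_map, [("kitab", 2)]))
# prints [WordBox(page=2, x=200, y=310)]
```

If this is roughly right, then the only piece I am missing is step 2, i.e. how to get Solr to return the occurrence number instead of character offsets.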