Cescy wrote:
Hi,
I am developing a pdf search engine, just use in local computer to search
massive pdf documents.
I used pdfbox+lucene to index and search, and then I have to display the
context to the user in pdf file in user interface. HOW CAN I ACHIEVE THIS???
I have completed a project to do the exact same thing. I put the pdf
text in XML files. Then after I do a Lucene search I read the text from
the XML files. I do not store the text in the Lucene index. That would
bloat the index and slow down my searches. FYI -- I use PDFBox to
extract the "searchable" text and I use tesseract (OCR) to extract the
text from the images within the PDFs. In order to make tesseract work
correctly I have to use ImageMagick to do many modification to the
images so that tesseract can OCR them correctly. Image modification/OCR
is a slow process and it is extremely resource intensive (CPU
utilization specifically -- Disk IO to a lesser extent).
As far as displaying the extracted text I would use an AJAX framework
that would provide a nice pop-up view of the text. This pop-up should
also have built in paging. I use Lucene's built in hi-lighting of
matches as well.
Oh almost forgot -- I use PDFBox to extract the images from the PDFs.
James
THX
--
James J. Wilson II
Systems Engineer
U.S. District Court
District of New Mexico
333 Lomas Blvd., NW
Albuquerque, NM 87102
Phone: (505) 348-2081
Fax: (505) 348-2028
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org