James Wilson <james_wil...@nmcourt.fed.us> wrote:

> I have completed a project to do the exact same thing. I put the PDF
> text in XML files. Then after I do a Lucene search I read the text
> from the XML files. I do not store the text in the Lucene index; that
> would bloat the index and slow down my searches. FYI -- I use PDFBox
> to extract the "searchable" text, and I use tesseract (OCR) to
> extract the text from the images within the PDFs. In order to make
> tesseract work correctly I have to use ImageMagick to do many
> modifications to the images so that tesseract can OCR them correctly.
> Image modification/OCR is a slow process and it is extremely resource
> intensive (CPU utilization specifically -- disk IO to a lesser
> extent).
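A rough sketch of that extract-then-index step, assuming the Lucene
3.x / PDFBox 1.x APIs of that era (the file and field names here are
made up for illustration): Store.NO indexes the text for search
without keeping a copy of it in the index, and a small stored field
records where to re-read the full text from after a hit.

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    public class IndexPdf {
        public static void main(String[] args) throws Exception {
            // Pull the "searchable" text layer out with PDFBox.
            PDDocument pdf = PDDocument.load(new File("mydoc.pdf"));
            String text = new PDFTextStripper().getText(pdf);
            pdf.close();

            // (Write `text` out to mydoc.xml here, however you like.)

            IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("index")),
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
            Document doc = new Document();
            // Indexed but not stored: searchable without bloating the index.
            doc.add(new Field("contents", text,
                              Field.Store.NO, Field.Index.ANALYZED));
            // Stored pointer back to the external XML holding the text.
            doc.add(new Field("xmlpath", "mydoc.xml",
                              Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
            writer.close();
        }
    }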
I've built a pipeline in UpLib (open source at http://uplib.parc.com/)
to extract both the page images and the text (along with wordboxes,
font size, etc.) from PDFs, along with various metadata items. It also
includes a converter (ToPDF) which will convert Web pages, Word,
PowerPoint, email, etc. to PDF first, and then do the extraction.
Running

    uplib-add-document --noupload mydoc

will create a temporary directory with all the pieces in it and output
the name of that directory to stdout.

> As far as displaying the extracted text I would use an AJAX framework
> that would provide a nice pop-up view of the text. This pop-up should
> also have built-in paging. I use Lucene's built-in highlighting of
> matches as well.

Actually, with HTML and CSS you can do just what "searchable PDF"
does. Put the text in an HTML file, using "span" tags with absolute
positioning and the special color "transparent". Use CSS to make the
page image the "background-image" for the HTML, and you have a
browser-displayable object which looks like a page image with
selectable text.

Bill
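P.S. Here's a minimal sketch of that overlay, one HTML file per page.
The image name, page size, and word positions are made-up values; in
practice they'd come from the wordboxes and font sizes the extractor
emits.

    <!DOCTYPE html>
    <html>
    <head>
    <style>
      /* The scanned page image becomes the background of the page div. */
      .page {
        position: relative;
        width: 612px; height: 792px;        /* hypothetical page at 72dpi */
        background-image: url("page1.png"); /* hypothetical image name */
      }
      /* Each word is invisible but selectable text, absolutely
         positioned over the spot where it appears in the image. */
      .page span {
        position: absolute;
        color: transparent;
        font-size: 12px;                    /* would match the extraction */
      }
    </style>
    </head>
    <body>
      <div class="page">
        <span style="left: 72px; top: 90px;">Hello</span>
        <span style="left: 121px; top: 90px;">world</span>
      </div>
    </body>
    </html>

Selecting text or using the browser's find-in-page then hits the
transparent spans, so the page behaves just like a "searchable PDF".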