I have a bunch of PDF files that have had an OCR package run against them. The problem is that the package adds the text to the normal page content and tries to position each piece of recognized text at the location in the image where it was found, so the text is mixed in with lots of positioning and other information. I'd like to extract all of the text as one block and add it back as a single item, probably an annotation. There are lots of tools to extract text from a PDF, but the ones I have found are all web based or use a GUI to do one file at a time. I want to run this against a directory full of PDFs and have it process all of them.
Does anyone know of such a tool, or have one already written?
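To make the request concrete, here is a rough Python sketch of the batch loop I am picturing. It is only an illustration under assumptions: `extract_directory` is a name I made up, and `extract_text` is a placeholder for whatever PDF library call actually pulls the text out of one file (it is not implemented here). The sketch writes each block of text to a `.txt` file next to the PDF; it does not attempt the step of adding the text back into the PDF as an annotation.

```python
from pathlib import Path
from typing import Callable, Dict


def extract_directory(
    directory: Path,
    extract_text: Callable[[Path], str],
) -> Dict[str, str]:
    """Run extract_text on every *.pdf in directory; return {filename: text}.

    extract_text is a hypothetical hook for a real PDF library; this
    function only handles the "do the whole directory" part.
    """
    results: Dict[str, str] = {}
    # glob("*.pdf") is not recursive and is case-sensitive on most Unix systems.
    for pdf_path in sorted(directory.glob("*.pdf")):
        # Join the positioned fragments into one whitespace-normalized block.
        # Assumption: losing the original line breaks is acceptable.
        text = " ".join(extract_text(pdf_path).split())
        # Save the block as a plain-text sidecar file next to the PDF.
        pdf_path.with_suffix(".txt").write_text(text, encoding="utf-8")
        results[pdf_path.name] = text
    return results
```

Calling `extract_directory(Path("scans"), my_extractor)` would process every PDF directly inside a (hypothetical) `scans` directory, where `my_extractor` takes a path and returns that file's text as a string.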