On 04/11/14 12:17, Gary Roach wrote:
> On 11/01/2014 06:35 PM, Scott Ferguson wrote:
>> On 31/10/14 11:47, Gary Roach wrote:
>>> Hi all,
>>>
>>> Problem: I am working on an archiving project and wish to archive
>>> documents to searchable pdf files but can't seem to figure out how to
>>> proof read and correct the text overlay. Any suggestions.
<snipped>
>>
>>
> This whole process is new to me and I am struggling to get my feet on
> the ground.

I /thought/ I knew what I was up against when I first worked with
Tesseract[*1], given my previous experience with several very large OCR
projects. Wrong! :(
Then, after my first Tesseract OCR project, I /thought/ I was better
informed[*2].... (sigh). :)
Hence my questions about constraints.

[*1] built my own auto-book scanner
[*2] worked on a project where volunteers had previously "scanned"
documents and "tried" to use Tesseract. :/

> I just came to the same conclusion about trying to proof
> pdf's instead of using the raw tiff files. Thank you for the list of
> alternatives to Tesseract.

They are not "alternatives" to Tesseract - just alternative "interfaces"
to the Tesseract engine.

> I will check them out. I am a bit unsure about
> the "Tesseract tool set" and need to do more research into this area.
> One of the hardest things about developing a new skill set for
> computers is finding the correct software and documentation. I'm still
> working on this.

Though I don't know the specifics of the project, may I suggest,
resources allowing, the following approach:-
;scan the pages as high-quality PNG images - keep the PNG originals[*1]
;try various processing methods before converting to TIFF (to get the
clearest separation of 2 colours)
;keep track of the various image versions[*2] - you'll find the
scan/convert/OCR/edit process is iterative
;feed TIFF to tesseract using the management interface of your choice -
create an index of the fonts used in the books you are processing; if
there's more than a couple of pages of a font-type, spend some time on
teaching tesseract the font (much quicker than post-editing every
misread).

[*1] Especially useful for last edit layout checking.
[*2] I found Digikam invaluable for this purpose.

>
> Thanks
>
> Gary R.
>
>

Hope that helps, I'd be very interested in the outcome if you wouldn't
mind contacting me offlist.
Kind regards

-- 
"Turns out you can't back a winner in the Gish Gallop"
~ disappointed punter
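P.S. For the "feed TIFF to tesseract" step suggested above, here is a rough sketch of calling the tesseract command line directly from Python. It is only an illustration, not the way any particular management interface does it. The function names, the file paths and the "eng" language are placeholders of mine, and it assumes a Tesseract build recent enough to ship the "pdf" output config.

```python
from pathlib import Path
import shutil
import subprocess


def tesseract_command(tiff_path, out_dir, lang="eng"):
    """Build the command line for one page.

    The helper name and the default language are illustrative assumptions.
    tesseract takes an output *base* name and appends '.pdf' itself.
    """
    tiff_path = Path(tiff_path)
    output_base = Path(out_dir) / tiff_path.stem
    # Usage: tesseract imagename outputbase [-l lang] [configfile]
    return ["tesseract", str(tiff_path), str(output_base), "-l", lang, "pdf"]


def ocr_page(tiff_path, out_dir, lang="eng"):
    """Run tesseract on one TIFF and return the searchable PDF's path."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cmd = tesseract_command(tiff_path, out_dir, lang)
    subprocess.run(cmd, check=True)
    return Path(cmd[2] + ".pdf")


if __name__ == "__main__":
    # Hypothetical layout: cleaned two-colour TIFFs in ./tiff, PDFs in ./pdf
    for page in sorted(Path("tiff").glob("*.tif")):
        print(ocr_page(page, "pdf"))
```

Keeping the command builder separate from the run step makes it easy to check what will be executed before letting it loose on a whole book, and to re-run single pages as the scan/convert/OCR/edit cycle repeats.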