Antonio Diaz Diaz wrote:
> Tony Maro wrote:
>
>> I've got a sample for you:
>> In the center of the page in large font is "SEPARATOR PAGE".
>> Not a single character is recognized, however it does try to interpret
>> the barcode above it.
>> I do know that if I crop the image to just the barcode and text, and
>> remove all the whitespace it reads it fine.
>
> The problem is a large black block to the right of the page. The image
> goes beyond the sheet of paper on the scanner.
>
> The solution is easy, use the option `-l1' or `-l2' to remove the block.
>
> `ocrad -l1 page.pbm'

Ah, thank you, that explains it.

There's not a way to limit the area of the page you're doing OCR on, is there? Like a zone OCR? I'm going for speed.

What I'm actually doing is trying to detect page rotation by doing OCR on the page one way, and if the ratio of letters to garbage isn't high enough I flip the page and OCR again. I've figured out I can do this on only around 1/4 of the page and still get accurate results, and the OCR doesn't take as long. I really only need to OCR either the middle of the page or the top left quarter of the page. Unfortunately, using ImageMagick for cropping is slow.

* I'm using tiffsplit to split around 500 pages into single pages
* I then call tifftopnm and convert a single page to PBM for processing
* I then use ImageMagick convert to crop the PBM into a temp file
* I run ocrad on it and check the produced text
* If the ratio of letters to garbage is greater than 1.8, I assume it's right and go on...
* If not, I rotate the PBM and crop again with mogrify
* I run ocrad on the rotated PBM and compare the text again
* If the ratio is better than the first try, I assume the page is upside down and rotate the original TIFF page
* When done, I reassemble all the TIFF pages using tiffcp

I actually rotate the PBM rather than use the rotation in ocrad so I can grab the opposite corner of the document prior to cropping. There's generally more text in the top left corner of the page, which leads to more accurate results.

For a single document of around 500 pages I've trimmed it down to just over 7 minutes to do the above checks, correct the orientation of any pages and reassemble the multi-page TIFF. Accuracy of rotation detection is around 98% with reasonable quality scans at 200 dpi. Most pages that should have been rotated and were not are really bad quality to begin with. Out of 500 pages only 2 got rotated that should not have been.
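
A minimal sketch of the ratio check from the steps above, in Python. What counts as a "letter" or as "garbage" is not spelled out above, so the definitions here are assumptions: a letter is any alphabetic character, and garbage is any other non-whitespace character (digits and punctuation included). The names letter_garbage_ratio and needs_rotation are hypothetical; only the 1.8 threshold comes from the steps above.

```python
from typing import Callable

# Threshold taken from the steps above; everything else is an assumption.
THRESHOLD = 1.8


def letter_garbage_ratio(text: str) -> float:
    """Ratio of alphabetic characters to other non-whitespace characters."""
    letters = sum(1 for ch in text if ch.isalpha())
    garbage = sum(1 for ch in text if not ch.isalpha() and not ch.isspace())
    if garbage == 0:
        # No garbage at all: treat any letters as a perfect score,
        # and empty output as the worst score.
        return float("inf") if letters else 0.0
    return letters / garbage


def needs_rotation(first_text: str, ocr_rotated: Callable[[], str]) -> bool:
    """Decide whether the page should be flipped.

    first_text  -- OCR output of the crop in its current orientation.
    ocr_rotated -- callable returning the OCR output of the rotated crop;
                   it is only called when the first pass scores too low,
                   so the second OCR run is skipped for good pages.
    """
    first = letter_garbage_ratio(first_text)
    if first > THRESHOLD:
        return False
    return letter_garbage_ratio(ocr_rotated()) > first
```

Passing the second OCR pass as a callable keeps the expensive rotate, crop and ocrad run off the path for pages that already pass the threshold.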

That's a huge boost considering around 40% of the pages are upside down when I start, and the original documents are in horrible shape. You couldn't even consider doing a true OCR and getting readable results on at least half the pages. Many are mostly handwritten as well, which of course doesn't OCR, but at least they are on forms that have some typed text.

Right now about 3 minutes of that 7 minutes is OCR processing, and 2 minutes is cropping, with the rest split amongst splitting, converting and reassembling. If I drop the cropping and try to OCR the entire page it jumps to over 11 minutes.

Bet you guys never thought ocrad would be used for that, eh? ;-)

So, anyone have an idea that might speed up the process? I'm already using an AMD 64 3200+ with a 64-bit kernel. Right now I'm at about 70 pages per minute, but I'd like to get it to 100 pages per minute processed. At that speed I still won't keep up with the scanners, but I should be able to catch back up every night, or at least over the weekend.

Yes, you read that right. I'll be processing as much as 150,000 pages per day on one server, and am designing this process so it could be clustered to handle more.

-Tony

_______________________________________________
Bug-ocrad mailing list
http://lists.gnu.org/mailman/listinfo/bug-ocrad
