This kind of variability is a known problem. It seems to occur when the image quality is insufficient or the font is far from the training data. We may find a general solution at some point, but for now the best approach is to retrain on the data you want to recognize.
Ray.

On Thu, May 28, 2009 at 3:31 AM, TC <tooca...@gmail.com> wrote:
>
> I am playing with scanning invoices to save on keystrokes and am
> currently evaluating Tesseract. I have two invoices from the same
> supplier. Both were scanned with the same settings through the same
> scanner. The quality of the paper documents is similar. The only
> difference between the two documents is the data content: numbers,
> products, prices.
>
> Running different images from the same source through Tesseract
> produces significantly different results.
>
> The first line of sample "A" is shown below:
>
> P FIBEMI; go; 1 _ 1005824227 `
>
> The first line of sample "B" is below:
>
> REMIT TO‘ _ 1005822166 "
>
> Another example is from sample "A":
>
> 1`0TAL AMOUNT DUE
>
> The same text from sample "B":
>
> TUTAL AMOUNT UUE
>
> My plan was to take the output and map it to a tab-delimited text file
> for subsequent processing. I have written a small Java program to
> parse the OCR output using string processing and pattern recognition
> to identify specific bits of data in the OCR output. For example, find
> the index of "REMIT TO" and then identify the substring of data using
> the index value of "REMIT TO".
>
> My problem is that in order to parse the output with any degree of
> predictability I need consistency in the OCR output. I am not getting
> that right now. The string "REMIT TO" is returned from OCR as
> " P FIBEMI; go; 1" and "REMIT TO‘".
>
> The scanner is set to grayscale at 300 dpi. I have tried black and
> white at 300 dpi with similar results.
>
> Are there significant variances between scanners in terms of image
> quality? I am using a Canon multifunction. If I were to go to an HP or
> something similar, would I get more consistent results?
>
> Any tips on improving consistency between the OCR
>
> Thanks.
>
> TC
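As a stopgap on the parsing side, the index-based lookup described in the quoted message could be made tolerant of a few misread characters. The following is only a minimal sketch, not TC's actual program: the class name LabelFinder, its method names, and the error budget of 2 are all hypothetical, and it swaps the exact indexOf match for a sliding-window Levenshtein (edit distance) comparison when the exact match fails. It does not make Tesseract any more consistent. It only absorbs small errors such as "TUTAL AMOUNT UUE", and it still rejects badly garbled lines such as "P FIBEMI; go; 1".

```java
public class LabelFinder {

    // Classic dynamic-programming Levenshtein distance between two strings.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the start index of the label in the line, or -1 if no
    // label-sized window is within maxErrors edits of the label.
    // An exact match is tried first, as in the original approach.
    static int findLabel(String line, String label, int maxErrors) {
        int exact = line.indexOf(label);
        if (exact >= 0) {
            return exact;
        }
        int best = -1;
        int bestDist = maxErrors + 1;
        for (int i = 0; i + label.length() <= line.length(); i++) {
            String window = line.substring(i, i + label.length());
            int d = editDistance(window, label);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    // Returns the trimmed text following the label, or null if not found.
    static String valueAfter(String line, String label, int maxErrors) {
        int idx = findLabel(line, label, maxErrors);
        if (idx < 0) {
            return null;
        }
        return line.substring(idx + label.length()).trim();
    }

    public static void main(String[] args) {
        // OCR samples taken from the message above; error budget is an assumption.
        System.out.println(findLabel("TUTAL AMOUNT UUE", "TOTAL AMOUNT DUE", 2));   // 0
        System.out.println(findLabel("1`0TAL AMOUNT DUE", "TOTAL AMOUNT DUE", 2));  // 1
        System.out.println(findLabel("P FIBEMI; go; 1 _ 1005824227 `", "REMIT TO", 2)); // -1
    }
}
```

Lines that come back as -1 would still need manual keying or a rescan, so this complements retraining rather than replacing it.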