This kind of variability is a known problem. It tends to occur when the
image quality is insufficient or the font is far from the training data.
We may find a general solution at some point, but for now the best
option is to retrain on the data you want to recognize.
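
Independently of retraining, the parsing step can be made more tolerant
of recognition errors. The following is only a minimal sketch, not part
of Tesseract: it assumes the labels you search for (such as "REMIT TO"
or "TOTAL AMOUNT DUE") are known in advance, and it replaces an exact
index lookup with an edit-distance match. The class name FuzzyLabel, the
method fuzzyIndexOf, and the maxErrors threshold are hypothetical names
chosen for illustration.

```java
class FuzzyLabel {
    // Levenshtein edit distance (insertions, deletions, substitutions).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Slides a window of label.length() characters over the line and
    // returns the start index of the window closest to the label, or -1
    // if no window is within maxErrors edits (assumed threshold).
    static int fuzzyIndexOf(String line, String label, int maxErrors) {
        int best = -1;
        int bestDist = maxErrors + 1;
        for (int i = 0; i + label.length() <= line.length(); i++) {
            String window = line.substring(i, i + label.length());
            int d = editDistance(window, label);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String label = "TOTAL AMOUNT DUE";
        // Sample "B" text: two substituted characters, match at index 0.
        System.out.println(fuzzyIndexOf("TUTAL AMOUNT UUE", label, 3));
        // Sample "A" text: stray leading character, match at index 1.
        System.out.println(fuzzyIndexOf("1`0TAL AMOUNT DUE", label, 3));
    }
}
```

A small threshold handles lightly damaged labels such as the "TOTAL
AMOUNT DUE" variants, but it will not rescue a line as badly garbled as
"P FIBEMI; go; 1", so it complements better input or retraining rather
than replacing them.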

Ray.

On Thu, May 28, 2009 at 3:31 AM, TC <tooca...@gmail.com> wrote:

>
> I am experimenting with scanning invoices to save on keystrokes and am
> currently evaluating Tesseract. I have two invoices from the same
> supplier, both scanned with the same settings on the same scanner.
> The quality of the paper documents is similar. The only difference
> between the two documents is the data content: numbers, products,
> prices.
>
> Running different images through Tesseract from the same source
> produces significantly different results.
>
> The first line of sample "A" is shown below:
>
>     P FIBEMI; go; 1 _ 1005824227  `
>
> The first line  of sample "B" is below:
>
>     REMIT TO‘ _ 1005822166             "
>
>
>
> Another example is from sample "A":
>
>     1`0TAL AMOUNT DUE
>
>
> The same text from sample "B":
>
>    TUTAL AMOUNT UUE
>
>
> My plan was to take the output and map it to a tab-delimited text file
> for subsequent processing. I have written a small Java program to
> parse the OCR output using string processing and pattern recognition
> to identify specific bits of data. For example, it finds the index of
> "REMIT TO" and then extracts the substring of data relative to that
> index.
>
> My problem is that, in order to parse the output with any degree of
> predictability, I need consistency in the OCR output, and I am not
> getting that right now. The string "REMIT TO" is returned from OCR as
> " P FIBEMI; go; 1" and "REMIT TO‘".
>
> The scanner is set to grayscale at 300 dpi. I have also tried black
> and white at 300 dpi with similar results.
>
> Are there significant variances between scanners in terms of image
> quality? I am using a Canon multifunction. If I were to switch to an
> HP or something similar, would I get more consistent results?
>
> Any tips on improving consistency between the OCR results?
>
> Thanks.
>
> TC