Can you please provide some example input images? That would make it much easier for other users to reproduce your problem and suggest a solution.
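In the meantime, and independent of the user-patterns question, one thing that may be worth trying is to post-process the OCR text using the ISIN structure described in the quoted message below. The following is only a rough sketch, not something tesseract or tess4j provides: the class name IsinRepair and its methods are made up for illustration, and it assumes the ISIN check digit follows the usual Luhn scheme with letters expanded to 10..35. The look-alike tables cover the confusions reported below (1/I, J/1, S/5) plus two guesses of mine (O/0, B/8). It only repairs the country code and the check digit, and tries dropping one spurious 'O' when the string is one character too long.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

final class IsinRepair {

    // Digits that may stand in for a letter in the two-letter country code.
    // 1 -> I comes from the report; the other entries are assumptions.
    private static final Map<Character, Character> TO_LETTER =
            Map.of('1', 'I', '0', 'O', '5', 'S', '8', 'B');

    // Letters that may stand in for the numeric check digit.
    // I, J and S come from the report; O and B are assumptions.
    private static final Map<Character, Character> TO_DIGIT =
            Map.of('I', '1', 'J', '1', 'O', '0', 'S', '5', 'B', '8');

    /** True if s has the ISIN shape and passes the Luhn check. */
    static boolean isValid(String s) {
        if (s == null || !s.matches("[A-Z]{2}[A-Z0-9]{9}[0-9]")) {
            return false;
        }
        // Expand every character to its numeric value (0..9, A=10 ... Z=35).
        StringBuilder digits = new StringBuilder();
        for (char c : s.toCharArray()) {
            digits.append(Character.getNumericValue(c));
        }
        // Luhn: starting from the right, double every second digit.
        int sum = 0;
        boolean doubleIt = false; // the rightmost digit is not doubled
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9;
                }
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    // Swap look-alikes in the country code and the check digit only.
    private static String fixPositions(String s) {
        char[] c = s.toCharArray();
        for (int i = 0; i < 2; i++) {
            c[i] = TO_LETTER.getOrDefault(c[i], c[i]);
        }
        int last = c.length - 1;
        c[last] = TO_DIGIT.getOrDefault(c[last], c[last]);
        return new String(c);
    }

    /** Returns a checksum-valid ISIN derived from raw, or empty if none is found. */
    static Optional<String> repair(String raw) {
        String s = raw.replaceAll("\\s", "").toUpperCase(Locale.ROOT);
        if (s.length() == 12) {
            String fixed = fixPositions(s);
            return isValid(fixed) ? Optional.of(fixed) : Optional.empty();
        }
        if (s.length() == 13) {
            // One character too many: try removing each 'O' in turn.
            for (int i = 0; i < s.length(); i++) {
                if (s.charAt(i) == 'O') {
                    String fixed = fixPositions(s.substring(0, i) + s.substring(i + 1));
                    if (isValid(fixed)) {
                        return Optional.of(fixed);
                    }
                }
            }
        }
        return Optional.empty();
    }
}
```

With this, IsinRepair.repair("IE0O0BG0J4C88") should give back IE00BG0J4C88, the example from the report. Confusions in the nine middle characters are not handled here, and a passing checksum does not prove the ISIN exists, so anything repaired this way should probably still be checked against your own reference data.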
Zdenko

On Thu, 23 Jun 2022 at 15:30, 'Stefan Bretzel' via tesseract-ocr <tesseract-ocr@googlegroups.com> wrote:

> Dear all,
>
> we are attempting to read bank statements with tesseract (via tess4j,
> version 4.6.0 using libtesseract 4.1.3). These statements are formalized
> letters where the crucial information for us appears at pre-defined
> locations. Among other information, we are interested in extracting the
> ISIN (International Securities Identification Number), which is an
> alphanumeric code consisting of a two-letter country code, nine arbitrary
> letters or digits and a numeric check digit.
>
> When attempting to extract this information with tesseract, we observe
> patterns of read errors such as:
>
> - zeros in the ISIN's padding appear as 0O combinations in tesseract's
>   output. For example, IE00BG0J4C88 in the document is read as
>   IE0O0BG0J4C88
> - the check digit is misread as a letter, e.g. I or J for 1, S for 5 etc.
> - letters in the country code (first two characters of the ISIN) are
>   misinterpreted as digits, e.g. 1E instead of IE, F1 instead of FI.
>
> These problems appear arbitrarily for such documents coming from different
> banks using different fonts. Preliminary tests using a user patterns file
> in which we specify a pattern for the ISIN have had no effect; the OCR
> result is exactly the same as without a custom pattern file. Our pattern
> file contains this line:
>
>     \A\A\c\c\c\c\c\c\c\c\c\d
>
> and we use it by setting the "user_patterns_file" variable like so:
>
>     Tesseract tesseract = new Tesseract();
>     tesseract.setTessVariable("user_patterns_file", "path/to/my.pattern");
>
> Anyhow, my questions:
>
> - Is this the correct way to configure user patterns with tess4j? Related
>   to that, do user patterns work when using tesseract 4.1.3 in LSTM mode
>   (as we do currently)? I am aware of a number of issues (see
>   https://github.com/tesseract-ocr/tesseract/issues/403 and
>   https://github.com/tesseract-ocr/tesseract/issues/960) and the PR
>   https://github.com/tesseract-ocr/tesseract/pull/2328 that attempted to
>   add it for LSTM, but I am not sure what the current status is.
> - Is using a pattern the right way to improve tesseract's accuracy for
>   alphanumeric identifiers like an ISIN? Does this yield positive results
>   even when the alphanumeric identifier is part of a longer text and not
>   the only thing present in the picture?
> - What other approaches exist to improve tesseract's accuracy when
>   recognizing alphanumeric characters? I am aware of user dictionaries,
>   but doubt this is a feasible approach for us given the large number of
>   existing ISINs (> 3 million).
>
> Thanks in advance for any hints,
> Stefan
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/d6756bbe-7d58-4bdd-98c6-f08ca91bd615n%40googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8y8W3i%3DWqp9Hrx%3DouNvQJ%3D-K8xZJbKFMgFHznpCPyh2mA%40mail.gmail.com.