Re: [tesseract-ocr] Extracting alphanumeric identifiers (ISINs)

Zdenko Podobny Thu, 23 Jun 2022 07:58:15 -0700

Can you please provide some example input images?
It would be much easier for other users to reproduce your problem and
suggest a solution.


Zdenko


On Thu, 23 Jun 2022 at 15:30, 'Stefan Bretzel' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> Dear all,
> we are attempting to read bank statements with tesseract (via tess4j,
> version 4.6.0 using libtesseract 4.1.3). These statements are formalized
> letters where the crucial information for us appears at pre-defined
> locations. Among other information, we are interested in extracting the
> ISIN (International Securities Identification Number), an alphanumeric
> code consisting of a two-letter country code, nine arbitrary letters or
> digits, and a numeric check digit.
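
Because of the check digit, a candidate string can be validated after OCR. Below is a minimal sketch in plain Java. It assumes the usual ISIN scheme: letters are expanded to 10..35, then a Luhn checksum is taken over all resulting digits. The class and method names are made up for illustration and are not part of tess4j.

```java
final class IsinCheck {
    private IsinCheck() {}

    /** True if s has the ISIN shape and its check digit passes the Luhn test. */
    static boolean isValidIsin(String s) {
        // Shape: two upper-case letters, nine letters or digits, one digit.
        if (s == null || !s.matches("[A-Z]{2}[A-Z0-9]{9}[0-9]")) {
            return false;
        }
        // Expand letters to their numeric value (A=10 ... Z=35); digits stay as-is.
        StringBuilder digits = new StringBuilder();
        for (char ch : s.toCharArray()) {
            digits.append(Character.getNumericValue(ch));
        }
        // Luhn: walking from the right, double every second digit,
        // subtract 9 from any result above 9, and sum everything.
        int sum = 0;
        boolean doubleIt = false;
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9;
                }
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }
}
```

With this, IsinCheck.isValidIsin("IE00BG0J4C88") returns true, while the misread "IE0O0BG0J4C88" is rejected on length alone.
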
>
> When attempting to extract this information with tesseract, we observe
> recurring read errors such as:
>
> - zeros in the ISIN's padding appear as 0O combinations in tesseract's
> output. For example IE00BG0J4C88 in the document is read as IE0O0BG0J4C88
> - the check-digit is misread as a letter. E.g. I or J for 1, S for 5 etc.
> - letters in the country code (first two characters of the ISIN) are
> misinterpreted as digits, e.g. 1E instead of IE, F1 instead of FI.
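
Misreads like the three listed above are position-dependent, so they can in principle be corrected after OCR, independently of tesseract's configuration. The following is only a sketch in plain Java. The confusion pairs (1/I, 0/O, 5/S, J for 1) are assumptions taken from the examples in this message, and the class and method names are hypothetical. A repair is accepted only if the result passes the ISIN checksum (Luhn over the letter-expanded digits), which is inlined so the block stands alone.

```java
import java.util.Locale;
import java.util.Optional;

final class IsinRepair {
    private IsinRepair() {}

    /** Shape test plus Luhn checksum over letters expanded to 10..35. */
    static boolean passesCheck(String s) {
        if (!s.matches("[A-Z]{2}[A-Z0-9]{9}[0-9]")) {
            return false;
        }
        StringBuilder digits = new StringBuilder();
        for (char ch : s.toCharArray()) {
            digits.append(Character.getNumericValue(ch));
        }
        int sum = 0;
        boolean doubleIt = false;
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleIt && (d *= 2) > 9) {
                d -= 9;
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    // Assumed digit-to-letter confusions for the country code positions.
    static char toLetter(char c) {
        switch (c) {
            case '1': return 'I';
            case '0': return 'O';
            case '5': return 'S';
            default:  return c;
        }
    }

    // Assumed letter-to-digit confusions for the check digit position.
    static char toDigit(char c) {
        switch (c) {
            case 'I': case 'J': return '1';
            case 'S': return '5';
            case 'O': return '0';
            default:  return c;
        }
    }

    /** Applies the position-specific substitutions to a 12-character string. */
    static String fixEnds(String s) {
        char[] c = s.toCharArray();
        c[0] = toLetter(c[0]);
        c[1] = toLetter(c[1]);
        c[11] = toDigit(c[11]);
        return new String(c);
    }

    /** Returns a repaired ISIN, or empty if no candidate passes the checksum. */
    static Optional<String> repair(String raw) {
        String s = raw.trim().toUpperCase(Locale.ROOT);
        if (s.length() == 12) {
            String fixed = fixEnds(s);
            return passesCheck(fixed) ? Optional.of(fixed) : Optional.empty();
        }
        if (s.length() == 13) {
            // One surplus character: try dropping each 'O' between the
            // country code and the check digit (the "0O" case).
            for (int i = 2; i < 12; i++) {
                if (s.charAt(i) == 'O') {
                    String fixed = fixEnds(s.substring(0, i) + s.substring(i + 1));
                    if (passesCheck(fixed)) {
                        return Optional.of(fixed);
                    }
                }
            }
        }
        return Optional.empty();
    }
}
```

For example, IsinRepair.repair("IE0O0BG0J4C88") and IsinRepair.repair("1E00BG0J4C88") both yield IE00BG0J4C88. A single check digit cannot catch every wrong repair, so the result is probably best treated as a filter rather than a guarantee.
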
>
> These problems appear unpredictably across documents from different banks
> using different fonts. Preliminary tests with a user patterns file in which
> we specify a pattern for the ISIN have had no effect: the OCR result is
> exactly the same as without a custom pattern file. Our pattern file
> contains this line:
>
> \A\A\c\c\c\c\c\c\c\c\c\d
>
> and we use it by setting the "user_patterns_file" variable like so:
>
> Tesseract tesseract = new Tesseract();
> tesseract.setTessVariable("user_patterns_file", "path/to/my.pattern");
>
> Anyhow, my questions:
>
> - is this the correct way to configure user patterns with tess4j? Related
> to that, do user patterns work when using tesseract 4.1.3 in LSTM mode (as
> we do currently)? I am aware of a number of issues (see
> https://github.com/tesseract-ocr/tesseract/issues/403 and
> https://github.com/tesseract-ocr/tesseract/issues/960) and the PR
> https://github.com/tesseract-ocr/tesseract/pull/2328 that attempted to
> add it for LSTM, but I am not sure what the current status is.
> - is using a pattern the right way to improve tesseract's accuracy for
> alphanumeric identifiers like an ISIN? Does this yield positive results
> even when the alphanumeric identifier is part of a longer text and not
> the only thing present in the picture?
> - what other approaches exist to improve tesseract's accuracy when
> recognizing alphanumeric characters? I am aware of user dictionaries, but
> doubt that this is a feasible approach for us given the large number of
> existing ISINs (> 3 million).
>
> Thanks in advance for any hints,
> Stefan

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8y8W3i%3DWqp9Hrx%3DouNvQJ%3D-K8xZJbKFMgFHznpCPyh2mA%40mail.gmail.com.
