To be more precise with my questions:

- Is the user-patterns functionality implemented in the tesserocr Python API of Tesseract?
- What exactly is the syntax for specifying user patterns with the tesserocr Python API? Is SetVariable() correct, and how are the path (Linux) and the attribute specified?
- Is there a default path where it looks for the *.patterns / *.user-patterns file?

With the attached code from my last message, I've tested different combinations, with and without a whitelist, and with different attributes and path notations, none of which was successful. If I use the following notation for user patterns, it has no effect on the results, regardless of the entries in the *.patterns file:

api.SetVariable('user_patterns_file', '/home/roman/Dev_d/playground/user_patterns/deu.patterns')

Has anyone (successfully) used user patterns with the tesserocr Python API of Tesseract?

Best wishes and thanks,
Roman

On Sat, 2 March 2024 at 13:08, Zdenko Podobny <zde...@gmail.com> wrote:

> Can you please elaborate on:
>
> Nevertheless, user patterns is not working in the way described above.
>
> Zdenko
>
> On Sat, 2 March 2024 at 10:45, Roman Seidel <roman.seide...@gmail.com> wrote:
>
>> Yes, sure, the input file is a snippet with a capital letter followed by 9 digits. The correct user pattern, corresponding to [1], is:
>>
>> ``\A\d\d\d\d\d\d\d\d\d``
>>
>> The result of Tesseract (psm 8) is fully correct. Nevertheless, user patterns is not working in the way described above.
>>
>> For instance, I have tried to extract only the capital character with user patterns (not with a whitelist), which is:
>>
>> \A
>>
>> In this case, the capital letter and all digits are returned by Tesseract.
>>
>> I've attached my input file and the corresponding Python snippet for reading and processing the image with tesserocr from [2].
>>
>> [1] https://github.com/tesseract-ocr/tesseract/blob/main/src/dict/trie.h#L197
>> [2] https://github.com/sirfz/tesserocr
>>
>> On Fri, 1 March 2024 at 18:59, René JM Clais <reneclai...@gmail.com> wrote:
>>
>>> Can you send an example of an input document and the Tesseract output, as well as what you would expect when using the pattern file?
>>>
>>> On Thu, 29 Feb 2024 at 21:40, Roman Seidel <roman.seide...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am currently trying to use user patterns with the PyTessBaseAPI from tesserocr [1].
>>>>
>>>> What I've done is initialize the API with:
>>>>
>>>> with PyTessBaseAPI(path='/usr/share/tesseract-ocr/4.00/tessdata', lang=LANGUAGE, psm=int(psm), oem=int(TOEM)) as api:
>>>>
>>>> and set the user patterns file with:
>>>>
>>>> api.SetVariable('user_patterns_file', '/home/roman/Dev_d/playground/user_patterns/deu.patterns')
>>>>
>>>> The user patterns file contains a pattern, e.g.:
>>>>
>>>> \A\A\A
>>>>
>>>> (which means three capital letters).
>>>>
>>>> The results are the same regardless of whether I use the user_patterns_file argument or not. This brings me to the question of whether tesserocr supports user (and word) patterns.
>>>>
>>>> My versions:
>>>>
>>>> tesserocr 2.6.2
>>>> tesseract 5.3.3
>>>> leptonica-1.83.1
>>>> libpng 1.6.34 : zlib 1.2.11
>>>>
>>>> Thanks a lot for your help and best wishes,
>>>> Roman
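
As a side check of the pattern syntax itself (independent of whether tesserocr actually loads the file), the character classes listed in the trie.h comment can be approximated with Python regular expressions. The sketch below is not part of Tesseract or tesserocr: the helper name user_pattern_to_regex is hypothetical, the classes are ASCII-only approximations (Tesseract decides via the language's unicharset), and the repetition marker \* is deliberately left out. It only shows which strings a pattern line is meant to describe.

```python
import re
import string

# Character classes as described in the comment block of trie.h.
# ASCII-only regex equivalents are an assumption of this sketch.
_CLASSES = {
    "c": "[A-Za-z]",                               # \c: alphabetic character
    "d": "[0-9]",                                  # \d: digit
    "n": "[A-Za-z0-9]",                            # \n: letter or digit
    "p": "[" + re.escape(string.punctuation) + "]",  # \p: punctuation
    "a": "[a-z]",                                  # \a: lower-case letter
    "A": "[A-Z]",                                  # \A: upper-case letter
}


def user_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate one user-pattern line into an approximate Python regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            # A backslash must introduce one of the known class letters.
            nxt = pattern[i + 1] if i + 1 < len(pattern) else ""
            if nxt not in _CLASSES:
                raise ValueError(f"unsupported escape: \\{nxt}")
            parts.append(_CLASSES[nxt])
            i += 2
        else:
            # Any other character stands for itself.
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


if __name__ == "__main__":
    full = user_pattern_to_regex(r"\A\d\d\d\d\d\d\d\d\d")
    print(bool(full.fullmatch("A123456789")))   # True: capital + 9 digits
    print(bool(full.fullmatch("a123456789")))   # False: lower-case first letter
    single = user_pattern_to_regex(r"\A")
    print(bool(single.fullmatch("A123456789")))  # False: pattern covers one letter only
```

This only confirms that \A\d\d\d\d\d\d\d\d\d describes the snippet and that \A alone does not; it says nothing about whether Tesseract applies the pattern file during recognition.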