[tesseract-ocr] make training does nothing when run

2021-01-07 Thread Keith M
I'm sure I'm making a beginner mistake here, but I'm struggling quite a bit. I've built straight from source, both version 4.1.1 and 5.0.0, on Ubuntu 18.04 and Ubuntu 20.04 (fresh install, never used, but properly updated). All exhibit the same behavior. I installed all the dependencies following

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-07 Thread Shree Devi Kumar
Are any of these vertical fonts? Encoding errors could be if the characters in training text are not in the unicharset. On Fri, Jan 8, 2021, 00:46 Kamui 7 wrote: > Looks like that fixed bug #1. Now it is able to successfully create 400 > pages. Do you have any ideas as to why the other 2 errors
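
The unicharset point above can be checked with a small script. This is a minimal sketch, assuming you have already extracted the set of characters covered by your unicharset into a string or set. The helper name `chars_not_covered` is hypothetical, and loading the unicharset file itself is not shown.

```python
def chars_not_covered(training_text, known_chars):
    """Return the sorted non-whitespace characters of training_text
    that do not appear in known_chars."""
    known = set(known_chars)
    missing = {
        ch for ch in training_text
        if not ch.isspace() and ch not in known
    }
    return sorted(missing)


if __name__ == "__main__":
    # Illustrative input only: "語" is absent from the known characters.
    print(chars_not_covered("日本語 abc", "日本abc"))
```

Any character this reports is a candidate cause of an encoding error, though multi-codepoint entries in a real unicharset would need extra handling.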

Re: [tesseract-ocr] Easily readable Russian not recognized in language app screenshot

2021-01-07 Thread Shree Devi Kumar
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract rus.png - -l rus+eng --tessdata-dir ~/tessdata_best
D 20:22 Э 5IN AROW 5IN AROW 5IN AROW Translate this sentence Translate this sentence Translate this sentence (0) Вопросы есть? (0) Вопросы есть? Вопросы есть? Апу questions Any questions Any questi

Re: [tesseract-ocr] Easily readable Russian not recognized in language app screenshot

2021-01-07 Thread 'd-ka' via tesseract-ocr
I still fail to understand why Tesseract performs so poorly. Isn’t it made for OCR in screenshots? Doesn’t it understand Russian at all? On Monday, November 2, 2020 at 5:45:41 PM UTC+1 d-ka wrote: > Well, that’d require much additional logic because the general layout > entails quite a diverse

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-07 Thread Kamui 7
Looks like that fixed bug #1. Now it is able to successfully create 400 pages. Do you have any ideas as to why the other 2 errors are occurring? On Thursday, January 7, 2021 at 11:28:12 AM UTC-6 shree wrote: > Your training text file is only 175 lines, so the rendered image fits in 4 > pages. Yo

Re: [tesseract-ocr] Removing colors

2021-01-07 Thread Zdenko Podobny
Unfortunately I am not aware of (maintained) python leptonica support (any volunteers?), but you can directly use leptonica&tesseract via cffi in python. See some examples : https://sk-spell.sk.cx/building-minimalistic-tesseract https://github.com/zdenop/SimpleTesseractPythonWrapper/blob/master/Sim

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-07 Thread Kamui 7
I replaced the training text with the one from the official langdata repo and now it seems to only produce 30 pages. Is there any place to get the training text that the official jpn.traineddata was trained on? I have also checked to make sure the fonts support English and Japanese as well. On

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-07 Thread Shree Devi Kumar
Your training text file is only 175 lines, so the rendered image fits in 4 pages. You need to use a larger text if you want more pages. Also check that your fonts support both English and Japanese as the text seems to have samples of both languages. On Thu, Jan 7, 2021, 22:40 Kamui 7 wrote: > I

Re: [tesseract-ocr] Tesseract Performance

2021-01-07 Thread Shree Devi Kumar
Or you may have an old version of data/ben/checkpoints/ben_checkpoint

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-07 Thread Kamui 7
I ran a find command from the root directory to search for the tesstrain script. It only found the script that I pulled from the latest tesseract git repo. My training script calls that specific tesstrain script using a relative path, so it couldn't be an older version. On Thursday, January

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-07 Thread Shree Devi Kumar
Old versions of tesstrain.sh used to limit training to 3 pages. Looks like you may have an old version in the path somewhere. On Thu, Jan 7, 2021 at 10:17 PM Kamui 7 wrote: > I have a script to train tesseract and I ran it on Arch Linux, Debian, and > even a docker container and they all produce
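
One way to check for a stray copy on the path is to list every directory in PATH that contains a file named tesstrain.sh. This is a minimal sketch. The helper name `find_on_path` is hypothetical, and it only inspects PATH, not scripts invoked through a relative or absolute path.

```python
import os
from pathlib import Path


def find_on_path(name, path_env=None):
    """Return every file called `name` in the PATH directories, in search order."""
    if path_env is None:
        path_env = os.environ.get("PATH", "")
    hits = []
    seen = set()
    for directory in path_env.split(os.pathsep):
        # Skip empty entries and directories listed more than once.
        if not directory or directory in seen:
            continue
        seen.add(directory)
        candidate = Path(directory) / name
        if candidate.is_file():
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    for hit in find_on_path("tesstrain.sh"):
        print(hit)
```

If more than one path is printed, the first one is the copy the shell would pick up when the script is called by name.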

[tesseract-ocr] Numerous different bugs while training jpn

2021-01-07 Thread Kamui 7
I have a script to train tesseract and I ran it on Arch Linux, Debian, and even a docker container and they all produce the same errors. I checked to make sure the script is correct as well. Bug 1: This happens when tesstrain runs text2image. The max pages parameter does not work at all. It en

Re: [tesseract-ocr] Tesseract Performance

2021-01-07 Thread Shree Devi Kumar
A segmentation fault usually occurs when you are not using the tessdata_best model as Start_model. On Thu, Jan 7, 2021, 20:13 Soumik Ranjan Dasgupta wrote: > Sorry, I attached the wrong log file. Please find the new one attached. > > On Thu, Jan 7, 2021 at 8:09 PM Soumik Ranjan Dasgupta < > ranjansou...@

Re: [tesseract-ocr] Tesseract Performance

2021-01-07 Thread Shree Devi Kumar
ModuleNotFoundError: No module named 'bidi'. Install python-bidi. On Thu, Jan 7, 2021, 15:45 Soumik Ranjan Dasgupta wrote: > Hi Shreeshrii, > > I took your command exactly as it is and ran it (made sure the > tessdata_best directory is present in $HOME > with best ben.traineddata) and ran into an
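
Before starting a long training run, it can help to confirm that the Python interpreter in use can actually see the bidi module. This is a minimal sketch using the standard library. The helper name `missing_modules` is hypothetical, and it checks top-level module names only.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "bidi" is the module named in the ModuleNotFoundError.
    missing = missing_modules(["bidi"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Run it with the same interpreter that the training scripts use, since a module installed for a different Python environment will still be reported as missing.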

Re: [tesseract-ocr] Removing colors

2021-01-07 Thread Deepak Sharma
Can you suggest an alternative to leptonica for "python & windows"? On Thursday, January 7, 2021 at 1:42:28 AM UTC+5:30 zdenop wrote: > try to play with the leptonica pixAutoPhotoinvert function[1]. > quick test with following C code snippets provided attached result: > > pix = leptonica.pi

Re: [tesseract-ocr] Tesseract Performance

2021-01-07 Thread Soumik Ranjan Dasgupta
Hi Shreeshrii, I took your command exactly as it is and ran it (made sure the tessdata_best directory is present in $HOME with best ben.traineddata) and ran into an extremely weird error. Here is the log: find data/ben-ground-truth -name '*.gt.txt' | xargs cat | sort | uniq > "data/ben/all-gt"
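
For readers unfamiliar with the first command in that log, the find, cat, sort and uniq pipeline gathers every unique ground-truth line into one file. This is a rough Python sketch of the same idea, not the tesstrain implementation. The helper name `collect_unique_lines` is hypothetical, and Python orders lines by code point, which may differ from the locale-dependent order of sort.

```python
from pathlib import Path


def collect_unique_lines(ground_truth_dir):
    """Return the sorted unique lines from all *.gt.txt files under ground_truth_dir."""
    lines = set()
    for gt_file in Path(ground_truth_dir).rglob("*.gt.txt"):
        lines.update(gt_file.read_text(encoding="utf-8").splitlines())
    return sorted(lines)


if __name__ == "__main__":
    # Directory name taken from the log above; prints nothing if it does not exist.
    for line in collect_unique_lines("data/ben-ground-truth"):
        print(line)
```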