My last question, "Is it possible to train osd?", does not really make sense. What I meant to ask was: "Is it possible to train or change the page segmentation so that it does something that is not covered by Tesseract's built-in psm modes?"

An Keilha wrote on Friday, 14 August 2020 at 12:25:30 UTC+2:

> Hello,
>
> I am using Tesseract 4.1.1 via the command line (input and output files
> are attached):
>
> tesseract DE000029711094U1-8.tif DE000029711094U1-8_tif-deu-best-bullets-missing -l deu --psm 3 hocr
>
> The traineddata from https://github.com/tesseract-ocr/tessdata_best is
> used.
>
> The problem with the result is that the numbers on the left (bullets) are
> missing (see the attached PageViewer screenshot).
>
> If I change the page segmentation from the default to "--psm 12" (for sparse
> text), the numbers are there, but the page segmentation is poor (because it
> is not actually sparse text). Moreover, I cannot use "--psm 12" in general,
> because some of the pages I run OCR on have layouts that can only be
> handled properly by "--psm 3".
>
> My configs/hocr file looks like this:
>
> tessedit_create_hocr 1
> hocr_font_info 1
>
> I have also tried setting parameters like:
>
> tessedit_zero_rejection 1
> tessedit_zero_kelvin_rejection 1
>
> None of this improved the recognition of the numbers on the left.
>
> What should I try next? Are there any parameters I should try? Is it
> possible to train osd?
>
> Regards
> Anne
>
> PS: I had to zip my files because Google won't let me upload them
> otherwise. :-)