[tesseract-ocr] Re: Tesseract ignores numbers/bullets

An Keilha Fri, 14 Aug 2020 08:22:15 -0700

My last question, "Is it possible to train osd?", does not really make much 
sense. What I meant to write was: "Is it possible to train or define a page 
segmentation mode that is not among the built-in psm modes of Tesseract?"


An Keilha wrote on Friday, 14 August 2020 at 12:25:30 UTC+2:

> Hello,
> I am using Tesseract 4.1.1 via the command line (input and output files 
> are attached):
>
> tesseract DE000029711094U1-8.tif DE000029711094U1-8_tif-deu-best-bullets-missing -l deu --psm 3 hocr
>
> The traineddata from https://github.com/tesseract-ocr/tessdata_best is 
> used.
>
> The problem with the result is that the numbers on the left (bullets) are 
> missing (see the attached PageViewer screenshot).
>
> If I change page segmentation from the default to "--psm 12" (for sparse 
> text) the numbers are there, but page segmentation is poor (because it is 
> not actually sparse text). Moreover, in general I cannot really use "--psm 
> 12", because some of the pages I do OCR on have layouts that can only 
> be properly handled by "--psm 3".
>
> My configs/hocr file looks like the following:
>
> tessedit_create_hocr 1
> hocr_font_info 1
>
> I have also tried setting parameters like:
>
> tessedit_zero_rejection 1
> tessedit_zero_kelvin_rejection 1
>
> Nothing improved the recognition of the numbers on the left.
>
> What should I try next? Are there any parameters I should try? Is it 
> possible to train osd?
>
> Regards
> Anne
>
> PS: I had to zip my files because Google won't let me upload them 
> otherwise. :-)
>
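
To make the difference between "--psm 3" and "--psm 12" easier to inspect, 
here is a minimal Python sketch. It is a diagnostic aid only and does not fix 
the missing numbers. It builds the same command line as in the quoted message, 
parses the ocrx_word elements from two hOCR outputs, and lists the words that 
appear only in the "--psm 12" result (for example the numbers on the left). 
The function names and the output base names are made up for illustration, 
and the sketch assumes the tesseract executable is on the PATH and writes 
its hOCR output to "<outputbase>.hocr".

```python
import re
import subprocess
from html.parser import HTMLParser


def build_command(image, out_base, psm, lang="deu"):
    """Return the tesseract command line used in the quoted message."""
    return ["tesseract", image, out_base, "-l", lang, "--psm", str(psm), "hocr"]


class WordParser(HTMLParser):
    """Collect (text, bbox) pairs from the ocrx_word spans of an hOCR file."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or ""):
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if m:
                self._bbox = tuple(int(v) for v in m.groups())
                self._text = []

    def handle_data(self, data):
        # Only text inside an open word span is collected.
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def parse_hocr_words(hocr_text):
    parser = WordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


def overlaps(a, b):
    """True if the two (x0, y0, x1, y1) boxes intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def words_only_in(candidate, reference):
    """Words in candidate whose box overlaps no word box in reference."""
    return [w for w in candidate
            if not any(overlaps(w[1], r[1]) for r in reference)]


def compare_modes(image):
    """Run --psm 3 and --psm 12, return the words found only with --psm 12."""
    results = {}
    for psm in (3, 12):
        out_base = f"compare-psm{psm}"   # hypothetical output base name
        subprocess.run(build_command(image, out_base, psm), check=True)
        with open(out_base + ".hocr", encoding="utf-8") as f:
            results[psm] = parse_hocr_words(f.read())
    return words_only_in(results[12], results[3])
```

Calling compare_modes("DE000029711094U1-8.tif") would return the words and 
bounding boxes that only the sparse-text mode found. Whether those boxes can 
be merged back into the "--psm 3" layout is a separate question that this 
sketch does not answer.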

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/6db5ffdc-63a9-4464-bc2c-51dcbf62a95dn%40googlegroups.com.
