[tesseract-ocr] Re: Tesseract recognition issues.

Yaofu Zhou Mon, 20 May 2024 22:10:36 -0700

This will be a project in itself, but one way to achieve your goal is to 
fine-tune the model on a custom training set:
1. Procedurally generate a set of images (a few thousand would be a good 
start) with similar content and varying numbers of dots, along with the 
corresponding text files that hold the ground truth for each image.
2. Fine-tune the specific Tesseract OCR model you are using (deu in your 
case) on the generated training set. The tesseract-ocr project on GitHub 
has a tool, "tesstrain", that can help with the training process.
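
As a rough illustration of step 1, here is a minimal Python sketch that 
generates only the ground-truth text side of such a training set. The title 
list, the dot-count range, the page-number range, the output directory and 
the toc_NNNNN file names are all made up for the example. The .gt.txt suffix 
is the ground-truth naming that tesstrain uses, as far as I know, but check 
its README. Rendering each line to a matching image (for example with 
Tesseract's text2image tool or an imaging library) is a separate step that 
is not shown here.

```python
import random
from pathlib import Path

# Hypothetical sample titles; replace them with entries typical of your documents.
TITLES = [
    "Cochemiea (Teil 1)",
    "Mammillaria (Teil 2)",
    "Literaturhinweise",
    "Aus der Praxis",
]


def make_toc_line(rng, min_dots=10, max_dots=60):
    """Return one synthetic table-of-contents line: title, dot leader, page number."""
    title = rng.choice(TITLES)
    dots = "." * rng.randint(min_dots, max_dots)
    page = rng.randint(1, 400)
    return f"{title} {dots} {page}"


def write_ground_truth(out_dir, count, seed=0):
    """Write `count` ground-truth files (one line each) and return the lines."""
    rng = random.Random(seed)  # fixed seed so the set can be regenerated
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    lines = []
    for i in range(count):
        line = make_toc_line(rng)
        # One text file per sample; the matching image would share the base name.
        (out / f"toc_{i:05d}.gt.txt").write_text(line + "\n", encoding="utf-8")
        lines.append(line)
    return lines


if __name__ == "__main__":
    # The directory name is an assumption; adjust it to your tesstrain setup.
    for text in write_ground_truth("toc-ground-truth", 5):
        print(text)
```

Varying the dot count per line is the point of the exercise: the model 
should see many different leader lengths followed by a correct page number.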
You should be able to achieve most of the project with the help of GPT or 
Claude.
Sorry if this is not the solution you were looking for.
On Tuesday, May 14, 2024 at 12:52:15 AM UTC-4 [email protected] wrote:


> I have some problems with trailing dots in a table of contents.
> The attached picture is recognized, but all the dots are interpreted as 
> random characters:
>
> example:
> where I have this text in the picture
> Cochemiea (Teil 1) .......................................... 3
>
> I get the following text after OCR
> Cochemiea (Teil 1) .....::: 2222 see essen eennseenneeneeener nen
>
> As one can see, starting from the dots the rest of the line is wrong; even 
> the last number, 3, is missing.
>
> Does anyone have an idea how to fix this?
>
> I use the following command:
> tesseract --dpi 300 -l deu --oem 1 Kakt_Sukk-1986-1_02.jpg 
> Kakt_Sukk-1986-1_02 txt
>
> on Linux Debian 12
>

