Hi Simon, yes, the instructions you can give to the segmentation step are
quite limited: mostly the PSM parameter and, I suppose, a few minor ones.
There is something about tables, but I've never used it and yours might be
too small for it to work. Yes, you should be able to see what is happening
by looking at the hOCR file.
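
For example (PSM 6 is just one option, pick whatever matches your layout):

 tesseract input.png output --psm 6 hocr

This writes output.hocr, an HTML file where every line and word carries
its bounding box.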

You could also try the attached script; it was made for the 4.x version but
might work with 5.x too. It draws boxes around letters according to the
tesseract output. I'm attaching the output for a simple text and for several
crops from your image: only in the clean one can you see the text boxes.
You can do the same from the hOCR file.
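
If you'd rather work from the hOCR file, a quick sketch like this extracts
the word boxes (it assumes tesseract's usual single-quoted attributes; a
real HTML parser would be more robust than a regex):

import re

# each hOCR word looks like:
# <span class='ocrx_word' id='word_1_1' title='bbox 436 300 561 347; x_wconf 96'>0,05</span>
with open("output.hocr", encoding="utf-8") as f:
    hocr = f.read()

pattern = r"class='ocrx_word'[^>]*title='bbox (\d+) (\d+) (\d+) (\d+)[^']*'[^>]*>([^<]*)<"
for m in re.finditer(pattern, hocr):
    x1, y1, x2, y2 = map(int, m.group(1, 2, 3, 4))
    print(m.group(5), (x1, y1, x2, y2))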

Yes, you still need to fine-tune for the new character. I was able to train
for up to 57k iterations while still improving the results on a test
dataset. You need to fine-tune with a training dataset that includes the
new symbols AND all the other symbols you expect to recognize.
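
With the tesstrain repo that looks roughly like this (names and paths are
placeholders, adjust them to your setup):

 make training MODEL_NAME=my_model START_MODEL=eng TESSDATA=~/tessdata_best MAX_ITERATIONS=57000

START_MODEL is what makes it a fine-tune rather than training from scratch.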


I'm not sure if you are using something like this:

 merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset "$@"

If so, you can replace it with:

 cp "$(TRAIN)/my.unicharset" "data/unicharset"

and the new model will output only the characters that are present in your
new dataset (for example, to discard lower-case letters, the < character,
%, !, #, etc.).

Also, if you do not need to recognize the < symbol, you could reuse it
rather than adding a completely new one. I mean that when you generate the
images with the "angle" symbol, you put < in the transcription. Maybe it
will help, maybe it won't.
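
If you do that, a trivial post-processing step can map it back, e.g.:

 text = text.replace("<", "\u2220")  # turn < back into the angularity symbol

(this of course assumes < never appears with its ordinary meaning in these
drawings).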



Bye

Lorenzo




On Sat, Nov 25, 2023 at 12:25 Simon <smong5...@gmail.com> wrote:

> Yes, in general I want to recognize this part, "< 0,05 A", except that the
> < is actually ∠, the character for angularity.
>
> The segmentation process of tesseract can't be edited, right? So you mean I
> would need to make a Tesseract-independent program that localizes the
> boxes, crops them out and feeds them to Tesseract? In that case I would
> still need to train Tesseract to recognize ∠, so I am still wondering how
> to train this sign properly.
>
> Since you asked whether the segmentation step is able to isolate it: I can
> check this by looking at the hOCR information, right?
>
>
>
> On Friday, November 24, 2023 at 10:45:14 UTC+1 Lorenzo Blz wrote:
>
>> Hi Simon,
>> if I understand correctly how tesseract works, it follows these steps:
>>
>> - it segments the image into lines of text
>> - it then takes each individual line and slides a small window, 1px wide
>> I think, over it, from one end to the other. For each step the model
>> outputs a prediction. The model, being a bidirectional LSTM, has some
>> memory of the previous and following pixel columns.
>> - all these predictions are converted into characters using beam search
>> (see the toy sketch right after this list)
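>>
>> Just to illustrate that last step, here is a toy greedy collapse of
>> per-column predictions (the real thing is a beam search over the full
>> probability matrix; this only shows the idea):
>>
>> preds = ["-", "0", "0", "-", ",", "-", "0", "0", "-", "5", "5"]  # "-" = blank
>> out = []
>> prev = None
>> for p in preds:
>>     if p != prev and p != "-":
>>         out.append(p)  # keep a symbol only when it changes, skip blanks
>>     prev = p
>> print("".join(out))  # -> 0,05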
>>
>> Please correct me if I got it wrong. So the first thing I think of,
>> looking at your picture, is the segmentation step. Do you want to read
>> the "< 0,05 A" block only? Is the segmentation step able to isolate it?
>> This is the first thing I would try to understand.
>> Also, your sample image for "<" has a very different angle from the one
>> before 0,05.
>>
>> In this case I would try a custom segmentation, looking for rectangular
>> boxes of a certain height, aspect ratio, etc., then crop these out (maybe
>> dropping the rectangular box and the black vertical lines) and feed them
>> to tesseract. This of course requires custom programming (a rough sketch
>> follows).
>>
>> This might give good results even without fine-tuning. I would try this
>> manually with GIMP first.
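>>
>> Something along these lines with OpenCV (all the thresholds are guesses
>> you would have to tune on your drawings; findContours returns two values
>> in OpenCV 4, three in OpenCV 3):
>>
>> import cv2
>>
>> img = cv2.imread("drawing.png", 0)
>> _, bw = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
>> contours, _ = cv2.findContours(bw, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
>> for c in contours:
>>     x, y, w, h = cv2.boundingRect(c)
>>     # keep only wide, box-like regions of a plausible text height
>>     if 20 < h < 80 and w > 3 * h:
>>         crop = img[y:y + h, x:x + w]
>>         cv2.imwrite("crop_%d_%d.png" % (x, y), crop)  # feed these to tesseract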
>>
>>
>> Also, I suppose you are not going to encounter a lot of wild fonts in
>> these kinds of diagrams. The more fonts you use, the harder the training.
>> I would focus on very few fonts, even one. I would start with exactly one
>> font and train on that, to see quickly whether my training setup/pipeline
>> is working and whether the training results carry over to the diagrams
>> later. If the model error rate is good on the individual text lines but
>> bad on the real images, it might be a segmentation problem that training
>> cannot fix. Or the problem might be the external box, which I suppose you
>> do not have in your generated data.
>>
>> Ideally, I would use real crops from these diagrams rather than images
>> from text2image.
>>
>> Also, distinguishing 0 from O across many fonts is very hard. Often you
>> have domain knowledge that can help you fix these errors in
>> post-processing; for example, 0,O5 can be easily spotted and fixed. You
>> can, for example, assume that each box contains only one kind of data and
>> guess the most likely one from this, or from the box sequence, etc.
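>>
>> For example, if a field is known to be numeric, a naive fix could be:
>>
>> import re
>>
>> raw = "0,O5"
>> if re.fullmatch(r"[\dO]+[.,][\dO]+", raw):  # numeric apart from O/0 mixups
>>     raw = raw.replace("O", "0")             # -> "0,05"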
>>
>> I got good results with 20k samples (real-world scanned docs, multiple
>> fonts). 10k seems reasonable; I also assume your output character set is
>> very small, like the digits, a few capital letters and a couple of
>> symbols (no %, ^, &, {, etc.).
>>
>>
>>
>> Lorenzo
>>
>> On Thu, Nov 23, 2023 at 10:16 Simon <smon...@gmail.com> wrote:
>>
>>
>>> If I need to train new characters that are not recognized by a default
>>> model, is fine-tuning the right approach in this case?
>>> One of these characters is the one for angularity: ∠
>>>
>>> These symbols appear in technical drawings and should be recognized in
>>> those, e.g. for the scenario in the following picture tesseract should
>>> recognize this symbol.
>>>
>>>
>>>
>>> [image: angularity.png]
>>>
>>> Also here is one of the pngs I tried to train with:
>>> [image: angularity_0_r0.jpg]
>>> They all look pretty similar to this one. Things that change are the
>>> angle, the proportion and the thickness of the lines. All examples have
>>> this 64x64 pixel box around it.
>>>
>>>
>>> Is fine-tuning the right approach for this scenario? I only find
>>> information about fine-tuning for specific fonts. Also, for fine-tuning
>>> the "tesstrain" repository would not be needed, as it is used for
>>> training from scratch, correct?
>>> On Wednesday, November 22, 2023 at 15:27:02 UTC+1 desal...@gmail.com
>>> wrote:
>>>
>>>> From my limited experience, you need a lot more data than that to train
>>>> from scratch. If you can't produce more data than that, you might first
>>>> try to fine-tune, and then train by removing the top layer of the best
>>>> model.
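>>>>
>>>> (For the top-layer approach, the tesseract training docs use
>>>> lstmtraining's --append_index/--net_spec options; very roughly, with
>>>> placeholder paths, and with the output size 111 depending on your
>>>> unicharset:
>>>>
>>>>  lstmtraining --continue_from eng.lstm --traineddata eng.traineddata \
>>>>    --append_index 5 --net_spec '[Lfx256 O1c111]' \
>>>>    --train_listfile list.txt --model_output my_model )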
>>>>
>>>> On Wednesday, November 22, 2023 at 4:46:53 PM UTC+3 smon...@gmail.com
>>>> wrote:
>>>>
>>>>> As it is not properly possible to combine my from-scratch traineddata
>>>>> with an existing one, I have decided to also train my traineddata
>>>>> model on numbers. Therefore I wrote a script which synthetically
>>>>> generates ground truth data with text2image (a rough example
>>>>> invocation follows the list).
>>>>> This script uses dozens of different fonts and creates numbers in the
>>>>> following formats:
>>>>> X.XXX
>>>>> X.XX
>>>>> X,XX
>>>>> X,XXX
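>>>>>
>>>>> One such invocation looks like this (font and paths are placeholders):
>>>>>
>>>>>  text2image --text=numbers.txt --outputbase=data/numbers_font1 \
>>>>>    --font='DejaVu Sans' --fonts_dir=/usr/share/fonts
>>>>>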
>>>>> I generated 10,000 files to train the numbers. But unfortunately
>>>>> numbers get recognized pretty poorly with the best model (most of the
>>>>> time only "0.", "0" or "0," gets recognized).
>>>>> So I wanted to ask whether this is not enough training data (ground
>>>>> truth) for proper recognition when I train several fonts.
>>>>> Thanks in advance for your help.
>>>>>

'''
Created on May 16, 2018

@author: trz
'''

import logging
import sys

import cv2
import tesserocr
from tesserocr import PyTessBaseAPI, RIL, iterate_level


# DEFAULT_LANG="ita15kf"
# DEFAULT_LANG="elett_16000"
DEFAULT_LANG = "eng"

colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]  # note: OpenCV expects BGR order
yellow = (255, 255, 0)
magenta = (255, 0, 255)


def recognize_text(raw_img):

    lang = DEFAULT_LANG
    psm_mode = tesserocr.PSM.SINGLE_BLOCK

    # the constructor already initializes the API, no separate Init() needed
    api = PyTessBaseAPI(lang=lang)
    logging.info("Tesseract version %s", api.Version())
    # api.SetPageSegMode(psm_mode)  # uncomment to force a single block

    # grayscale input: 1 byte per pixel, stride equal to the image width
    api.SetImageBytes(raw_img.tobytes(), raw_img.shape[1], raw_img.shape[0],
                      1, raw_img.shape[1])

    api.Recognize()
    ri = api.GetIterator()

    prev_symbol_end = -1
    prev_symbol = None
    draw_img = cv2.cvtColor(raw_img, cv2.COLOR_GRAY2BGR)
    res = ""
    for i, r in enumerate(iterate_level(ri, RIL.SYMBOL)):

        symbolBounds = r.BoundingBox(RIL.SYMBOL)
        conf = r.Confidence(RIL.SYMBOL)

        # rebuild the recognized text, adding separators at word/line starts
        new_word = r.IsAtBeginningOf(RIL.WORD)
        new_line = r.IsAtBeginningOf(RIL.TEXTLINE)
        if new_word:
            res += " "
        if new_line:
            res += "\n"

        if symbolBounds is None:
            print("symbolBounds is None", conf)
            continue

        symbol = r.GetUTF8Text(RIL.SYMBOL)

        res += symbol

        # box around the recognized symbol, cycling through three colors
        cv2.rectangle(draw_img, (symbolBounds[0], symbolBounds[1]),
                      (symbolBounds[2], symbolBounds[3]), colors[i % 3])
        # alternate ticks at the top/bottom edge to tell adjacent boxes apart
        if i % 2 == 0:
            cv2.rectangle(draw_img, (symbolBounds[0], 0), (symbolBounds[2], 1), yellow, 2)
        else:
            cv2.rectangle(draw_img, (symbolBounds[0], draw_img.shape[0] - 2),
                          (symbolBounds[2], draw_img.shape[0] - 2), magenta, 2)

        # horizontal gap between this symbol and the previous one
        curr_symbol_start = symbolBounds[0]
        if prev_symbol_end == -1:
            space_between_symbols = 0
        else:
            space_between_symbols = curr_symbol_start - prev_symbol_end
        print("From", prev_symbol, "to", symbol, space_between_symbols)
        symbol_h = symbolBounds[3] - symbolBounds[1]
        half_h = symbol_h // 2 + symbolBounds[1]
        th = 12  # gap threshold in pixels
        color = (200, 200, 200) if space_between_symbols < th else (0, 200, 0)

        # cv2.rectangle(draw_img, (prev_symbol_end, half_h - 2),
        #               (curr_symbol_start, half_h + 2), color, -1)

        prev_symbol_end = symbolBounds[2]
        prev_symbol = symbol


    print(res)

    # cv2.imwrite("boxes.jpg", draw_img)
    cv2.imshow("ocr boxes", draw_img)
    cv2.waitKey(0)

    return res


if __name__ == '__main__':
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)

    filename = sys.argv[1]
    img = cv2.imread(filename, 0)  # 0 = load as grayscale
    if img is None:
        sys.exit("Could not read image: %s" % filename)

    recognize_text(img)

