Re: [tesseract-ocr] LSTM-based training produces .box files with the same coordinates

Des Bw Wed, 01 Nov 2023 06:20:14 -0700

 
*1) Using synthetic data:*
What can you do if you do not have data that is confirmed to be accurate?
The only way around that I know of is to use synthetic data. That is, you 
generate the images from the texts using the text2image tool, and then train 
on those images. The accuracy of the resulting model is not going to be 
perfect because the actual data is messier than the synthetic data. But 
you can try different methods to get better accuracy: 
(a) fine-tuning from an existing network: that is, you can cut the top layer 
off a working model and train from there. 
(b) configuring text2image to add noise to the synthetic data so that 
it looks more like the actual images. 
(c) using a larger dataset, 
etc.
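
As a rough sketch of how the rendering step could be scripted, the snippet below builds text2image command lines for several exposure levels, so that each text is rendered more than once and not every image is perfectly clean. The file names, font name, and fonts directory are hypothetical placeholders, and it assumes a text2image build that accepts the --text, --outputbase, --font, --fonts_dir, --ptsize and --exposure flags. It only prints the commands; nothing is executed.

```python
import shlex


def text2image_command(text_file, output_base, font, fonts_dir, exposure=0, ptsize=12):
    """Build one text2image invocation as an argument list (not executed here)."""
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={output_base}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
        f"--ptsize={ptsize}",
        f"--exposure={exposure}",
    ]


def exposure_variants(text_file, output_base, font, fonts_dir, exposures=(-1, 0, 1)):
    """Return one command per exposure level, each with its own output base."""
    return [
        text2image_command(text_file, f"{output_base}.exp{e}", font, fonts_dir, exposure=e)
        for e in exposures
    ]


# Hypothetical inputs: replace with your own text file, font and paths.
for cmd in exposure_variants("plates.txt", "out/plates", "DejaVu Sans", "/usr/share/fonts"):
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the argument lists can be passed to subprocess.run once the paths are real.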


*2) The hOCR hack:*
I haven't tried this method myself, but I read on GitHub that Shree has 
some kind of hack (a script) that uses Tesseract's hOCR output:
https://github.com/tesseract-ocr/tesstrain/issues/7
(a) First, OCR the images with the standard model, writing the output in 
hOCR format. 
(b) He then breaks the hOCR output down into box, tif, and text files. 
(c) He then compares the text files with the images and manually corrects 
the faulty ones. 
This one also requires a lot of manual work because the standard model will 
miss a lot of characters. 
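
To give an idea of what step (b) involves, here is a minimal sketch that pulls the bounding box and recognized text of each line out of an hOCR file. This is not Shree's script; the class and function names are made up for illustration, and it assumes the default hOCR layout where each text line is a span with class ocr_line and a "bbox x0 y0 x1 y1" entry in its title attribute. hOCR measures y from the top of the image, while Tesseract box files measure it from the bottom, so a small helper flips the coordinates.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrLineParser(HTMLParser):
    """Collect (bbox, text) for every ocr_line span in an hOCR document."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished lines as ((x0, y0, x1, y1), text)
        self._depth = 0    # span nesting depth inside the current line
        self._bbox = None
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth > 0:          # a span nested inside the line (e.g. a word)
            self._depth += 1
            return
        attrs = dict(attrs)
        if attrs.get("class") == "ocr_line":
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._parts = []
                self._depth = 1

    def handle_endtag(self, tag):
        if tag != "span" or self._depth == 0:
            return
        self._depth -= 1
        if self._depth == 0:         # the ocr_line span itself just closed
            text = " ".join("".join(self._parts).split())
            self.lines.append((self._bbox, text))

    def handle_data(self, data):
        if self._depth > 0:
            self._parts.append(data)


def parse_hocr_lines(hocr_text):
    """Return a list of ((x0, y0, x1, y1), text), in hOCR (top-left) coordinates."""
    parser = HocrLineParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.lines


def to_bottom_left(bbox, image_height):
    """Convert a top-left-origin bbox to (left, bottom, right, top)."""
    x0, y0, x1, y1 = bbox
    return (x0, image_height - y1, x1, image_height - y0)
```

The extracted text of each line is what you would then check against the image in step (c), and the converted box gives the line-level coordinates.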

3) Alternatively, you can try other OCR engines such as *EasyOCR*. Some 
people say EasyOCR is better at reading those kinds of images, while Tesseract 
is better for scanned documents. 

On Wednesday, November 1, 2023 at 3:57:48 PM UTC+3 khanht...@khu.ac.kr 
wrote:

> Thank you for your responses. Regarding my question and referring to the 
> official documentation at  
> https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html , the 
> generated .box files for LSTM-based training have the *same coordinates* for 
> every character because they use line-level boxes instead of 
> character-level boxes.
> Also, I have a couple of concerns:
> 1) I'm working on license plate recognition and have 80K car plate images 
> with noise. Most of the .box files generated by lstmbox are incorrect 
> compared with ground truth text. Manually editing all these box files will 
> be very time-consuming. Do you have any suggestions to shorten the time?
> 2) Do I need to manually check all 80K box files to ensure the accuracy of 
> my training data?
>
> On Wednesday, November 1, 2023 at 9:21:36 PM UTC+9 desal...@gmail.com 
> wrote:
>
>> "Please note that box files generated using makebox config file are OK 
>> for training legacy models but not for LSTM training.". Makebox is the 
>> tool included inside tesseract to generate box files. It looks like that 
>> was used for the legacy model. For the current model, text2image is the way 
>> to do it.  
>>
>> On Wednesday, November 1, 2023 at 3:02:28 PM UTC+3 Des Bw wrote:
>>
>>>
>>> I don't know what you are trying to do. I am not familiar with this 
>>> method of box generation. But, I think the command you are running is 
>>> supposed to generate them with the same coordinates. Look at the example 
>>> here:  https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html
>>>
>>>
>>> On Wednesday, November 1, 2023 at 2:57:46 PM UTC+3 elvi...@gmail.com 
>>> wrote:
>>>
>>>> On 1 Nov 2023 at 11:51:27 AM, TRAN TRONG KHANH[학생](대학원 컴퓨터공학과) ‍ <
>>>> khanht...@khu.ac.kr> wrote:
>>>>
>>>>>
>>>> Are you trying to generate box files from the images (tif files)?
>>>>
>>>
