Re: [tesseract-ocr] Should box include surrounding space?

'Danny Wilson' via tesseract-ocr Wed, 18 Oct 2023 18:15:19 -0700

Because of some issues with licensed fonts not working with text2image, we 
wrote our own image and box file generator in Swift on the Mac.


We use that to generate a data set of 100,000 text lines and feed it into 
the regular training on Linux.

Using a non-licensed font, I checked what box text2image generated for the 
FULLWIDTH COMMA (should've done that earlier!)



So it looks like text2image uses the top base line for the top of the box, and 
the box extends only as far down as the lowest extent of the glyph. Such a box 
would differentiate between FULLWIDTH COMMA and COMMA if the font vertically 
centers FULLWIDTH COMMA.

If the font renders FULLWIDTH COMMA on the text baseline, then the model would 
get confused between FULLWIDTH COMMA and COMMA since both are down on the 
baseline.
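
A minimal sketch of the box geometry described above, in Python rather than the 
Swift generator. It assumes the observation means that the top edge of the box 
sits at the top of the text line and the bottom edge at the lowest inked pixel. 
The function names and the toy 8x8 bitmaps are hypothetical. Only the box-file 
line layout (glyph, left, bottom, right, top, page, with y measured up from the 
bottom of the image) is Tesseract's.

```python
def ink_extent(bitmap):
    """Return (left, top, right, bottom) of the inked pixels ('#').

    Coordinates use a top-left origin; right and bottom are exclusive.
    """
    rows = [y for y, row in enumerate(bitmap) if "#" in row]
    if not rows:
        raise ValueError("bitmap has no ink")
    cols = [x for row in bitmap for x, px in enumerate(row) if px == "#"]
    return min(cols), min(rows), max(cols) + 1, max(rows) + 1


def box_from_line_top(bitmap, line_top=0):
    """Assumed text2image-like box: top pinned to the line top,
    bottom at the lowest inked row, left/right tight to the ink."""
    left, _, right, bottom = ink_extent(bitmap)
    return left, line_top, right, bottom


def box_line(glyph, box, image_height, page=0):
    """Format one box-file line; box files measure y up from the bottom."""
    left, top, right, bottom = box
    return f"{glyph} {left} {image_height - bottom} {right} {image_height - top} {page}"


# Toy glyphs: the same comma shape, vertically centered vs. on the baseline.
CENTERED = [
    "........",
    "........",
    "........",
    "...##...",
    "....#...",
    "...#....",
    "........",
    "........",
]
BASELINE = [
    "........",
    "........",
    "........",
    "........",
    "........",
    ".##.....",
    "..#.....",
    ".#......",
]

FULLWIDTH_COMMA = "\uFF0C"
print(box_line(FULLWIDTH_COMMA, box_from_line_top(CENTERED), 8))  # ， 3 2 5 8 0
print(box_line(FULLWIDTH_COMMA, box_from_line_top(BASELINE), 8))  # ， 1 0 3 8 0
```

Under that assumption the centered glyph gets a box that stops 2 pixels above 
the bottom of the line, while the baseline glyph gets a full-height box, the 
same vertical extent a baseline COMMA would get.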

How does tesseract handle the whitespace to the left/right of a character?  Is 
there some kind of parameter to set, or would training with data containing 
both (baseline) FULLWIDTH COMMA and COMMA work?
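
One way to prepare such mixed data is sketched below in Python. It only shows 
how to balance text lines so that every comma code point from this thread 
appears both with and without a following space. Whether that is enough for the 
model to separate them is the open question. The word list and the function 
name are hypothetical; the characters are looked up by their Unicode names.

```python
import random
import unicodedata

# The three code points discussed in this thread, looked up by Unicode name.
COMMAS = [
    unicodedata.lookup("FULLWIDTH COMMA"),              # U+FF0C
    unicodedata.lookup("HALFWIDTH IDEOGRAPHIC COMMA"),  # U+FF64
    unicodedata.lookup("COMMA"),                        # U+002C
]


def make_comma_lines(words, count, seed=0):
    """Build `count` text lines of the form <word><comma>[space]<word>.

    Commas cycle with period 3 and the optional space with period 2, so
    every 6 consecutive lines cover each comma with and without a space.
    """
    rng = random.Random(seed)  # seeded so the data set is reproducible
    lines = []
    for i in range(count):
        comma = COMMAS[i % len(COMMAS)]
        pad = " " if i % 2 else ""
        lines.append(f"{rng.choice(words)}{comma}{pad}{rng.choice(words)}")
    return lines


# Hypothetical vocabulary; a real run would draw from the actual corpus.
sample_words = ["東京", "大阪", "23", "334"]
for line in make_comma_lines(sample_words, 6):
    print(line)
```

The resulting lines would then go through the same image and box generation as 
the rest of the training text.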

Danny



> On 18 Oct 2023, at 20:43, Des Bw <desaleg...@gmail.com> wrote:
> 
> You need a large amount of data. That is all. 
> If you can collect a lot of text lines that contain all those types of 
> commas, and produce the training material using text2image (synthetic data) 
> for each font, I am pretty sure Tesseract will learn all of them with no 
> problem. 
> 
> On Wednesday, October 18, 2023 at 12:35:01 PM UTC+3 Danny wrote:
>> There are a few "commas" used in CJK, which makes it complicated for me.
>> 
>> FULLWIDTH COMMA U+FF0C (link <https://www.compart.com/en/unicode/U+FF0C>) 
>> which might have the glyph in the center of the box or in the lower left 
>> corner depending on the font:
>> 
>>  
>> 
>> HALFWIDTH IDEOGRAPHIC COMMA U+FF64 (link 
>> <https://www.compart.com/en/unicode/U+FF64>) which (as far as I can tell) 
>> will always be in the bottom corner regardless of font. (used to enumerate 
>> sequences)
>> 
>> 
>> COMMA U+002C, (link <https://www.compart.com/en/unicode/U+002C>) which isn't 
>> part of formal CJK languages but in practice is used all the time
>> 
>> 
>> So I'd like to train to recognize the three types of commas so the OCR 
>> output matches the input images.  "FULLWIDTH COMMA" is a problem because 
>> the glyph position in the box is different depending on the font.  Hence my 
>> question "where and how big is the box?"
>> 
>> 
>> 
>> In the image above, lines 1, 2, and 3 are all FULLWIDTH COMMA but line 1 is 
>> a different font.  Line 4 is COMMA (U+002C) while line 5 is HALFWIDTH 
>> IDEOGRAPHIC COMMA U+FF64.
>> 
>> What's the best way to train given those types of input and the expected 
>> output?
>> 
>> Danny
>> On Wednesday, October 18, 2023 at 1:22:25 PM UTC+8 desal...@gmail.com 
>> wrote:
>>> If the space is included in the training across the board, the model might 
>>> not recognize the comma when it appears without space (as in numbers: 
>>> 23,334). 
>>> 
>>> On Wednesday, October 18, 2023 at 5:29:13 AM UTC+3 Danny wrote:
>>>> For purposes of training, I'm wondering if the box for a character should 
>>>> include the surrounding space. 
>>>> 
>>>> In particular for the CJK "FULLWIDTH COMMA", should the box be the red or 
>>>> green rectangle? 
>>>> 

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/7DA5E773-AD3C-4F93-B804-E44C58989F89%40mac.com.
