The technical term for these is "drop-caps" 
<https://en.wikipedia.org/wiki/Initial>, which is useful to know if you 
want to Google for it.

It's pretty dated now, but Ray Smith's 2007 description 
<https://tesseract-ocr.github.io/docs/tesseracticdar2007.pdf> of the line 
finding algorithm says: "Assuming that page layout analysis has already 
provided text regions of a roughly uniform text size, a simple percentile 
height filter *removes drop-caps* and vertically touching characters." 
[Emphasis added] So drop caps are deliberately set aside during line 
finding rather than treated as ordinary text.
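
To make the idea of a height filter concrete, here is a minimal sketch. It 
is not Tesseract's actual code: the function name, the use of the median as 
the percentile, and the 1.6 ratio are all assumptions chosen for 
illustration.

```python
from statistics import median


def split_by_height(boxes, max_ratio=1.6):
    """Separate boxes that are much taller than the typical box.

    boxes: list of (x, y, width, height) tuples, e.g. one per connected
           component in a text region.
    max_ratio: hypothetical threshold, not a value taken from Tesseract.
    Returns (normal, oversized).
    """
    if not boxes:
        return [], []
    # The median is the 50th percentile of the box heights.
    typical = median(h for _, _, _, h in boxes)
    limit = max_ratio * typical
    normal = [b for b in boxes if b[3] <= limit]
    oversized = [b for b in boxes if b[3] > limit]
    return normal, oversized


# Made-up boxes: one two-line-tall initial followed by ordinary letters.
blobs = [(0, 0, 40, 60), (45, 0, 12, 20), (60, 0, 12, 22),
         (75, 0, 11, 21), (90, 2, 12, 19)]
normal, oversized = split_by_height(blobs)
print(oversized)  # [(0, 0, 40, 60)]
```

Anything that lands in `oversized` is a drop-cap candidate, although 
vertically touching characters would end up there too.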

It looks like the commercial package OmniPage supports drop caps. Teaching 
Tesseract to recognize them would involve tweaking the internal 
segmentation and line finding algorithms, not additional training. Another 
approach would be to do your own segmentation to identify them and 
recognize them separately as single letters.
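
A minimal sketch of that second approach, assuming your own segmentation 
has already given you a bounding box for the drop cap. The helper names, 
the padding, and the file name are hypothetical; `--psm 10` is Tesseract's 
page segmentation mode for treating the image as a single character.

```python
def padded_crop(box, page_size, pad=4):
    """Grow an (x, y, width, height) box by pad pixels, clipped to the page.

    Returns (left, top, right, bottom), the rectangle convention most
    image libraries use for cropping.
    """
    x, y, w, h = box
    page_w, page_h = page_size
    left = max(0, x - pad)
    top = max(0, y - pad)
    right = min(page_w, x + w + pad)
    bottom = min(page_h, y + h + pad)
    return left, top, right, bottom


def tesseract_args(image_path, psm):
    """Build a tesseract command line that writes the text to stdout."""
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]


# Hypothetical drop-cap box on an 800x200 region.
crop = padded_crop((2, 3, 40, 60), page_size=(800, 200))
command = tesseract_args("drop_cap.png", psm=10)
print(crop)     # (0, 0, 46, 67)
print(command)  # ['tesseract', 'drop_cap.png', 'stdout', '--psm', '10']
```

You would crop the page to that rectangle with whatever image library you 
use, run the command on the crop (for example with `subprocess.run`), and 
put the letter back in front of the text recognized from the rest of the 
region.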

There's some general background which may be interesting/useful here: 
https://how-ocr-works.com/OCR/line-segmentation.html

Tom


On Wednesday, August 5, 2020 at 4:58:20 AM UTC-4 tlit...@gmail.com wrote:

> That's right, it's that initial "TO", and this is just a fraction of the 
> text; there are dozens of examples like "TO" on a single page. But since 
> it spans two lines, I assume there's nothing I can do?
>
> On Tuesday, August 4, 2020 at 7:39:21 PM UTC+2 zdenop wrote:
>
>> Not sure what you mean...
>>
>> tesseract big_low.jpeg - --psm 6
>> Warning: Invalid resolution 0 dpi. Using 70 instead.
>> FY, MINERS.—TO LET, ON LEASE, on such terms as may
>> be agreed on, the MINERALS in the ESTATE of KNOCKSHINNOCK, lying in
>> the parish of New Cumnock, and county of Ayr. Acdead vein has been lately 
>> discovered
>>
>> The problem is only with the initial TO, which IMO is caused by the T 
>> being two lines tall and followed by smaller letters.
>>
>> Zdenko
>>
>>
>> On Tue, Aug 4, 2020 at 13:07 tlit...@gmail.com <tlit...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Is it possible to train for the bigger letters at the beginning of 
>>> sentences? It seems that Tesseract always misses them.
>>>
>>> Thanks in advance.
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/tesseract-ocr/0f97a784-e8e4-4c05-8296-b95dc2211e78n%40googlegroups.com
>>>  
>>> <https://groups.google.com/d/msgid/tesseract-ocr/0f97a784-e8e4-4c05-8296-b95dc2211e78n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>>

