There are some interesting tools from Ben Lee's Newspaper Navigator project you 
might want to look at [1], and the OCR-D project includes both Tesseract and 
layout detection support [2]. A good and fairly recent overview of what's 
available for historical newspaper digitization projects can be found here [3]. 
If you want to work from hOCR files directly, you might be able to leverage 
font metrics to identify headlines and advertisements, since their text is 
typically larger than body text, but I think most approaches to newspaper 
segmentation work from the page images.
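
As a rough illustration of the font-metric idea, here is a minimal Python 
sketch that flags unusually large lines in an hOCR file. It assumes the 
line-level elements carry an "x_size" property in their title attribute (as 
Tesseract's hOCR output typically does) and falls back to the bbox height 
otherwise. The names HocrLines and find_large_lines and the 1.5x threshold 
are made up for this example and would need tuning on real pages.

```python
import re
from html.parser import HTMLParser
from statistics import median

# Line-level hOCR classes; treated here as one group (an assumption).
LINE_CLASSES = {"ocr_line", "ocr_header", "ocr_caption", "ocr_textfloat"}


def parse_title(title):
    """Turn an hOCR title string ('bbox 1 2 3 4; x_size 30') into a dict."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if fields:
            props[fields[0]] = fields[1:]
    return props


class HocrLines(HTMLParser):
    """Collect (size, text) for every line-level element in an hOCR file."""

    def __init__(self):
        super().__init__()
        self.lines = []     # finished lines as (size, text)
        self._size = None   # size of the line being read, None if outside
        self._tag = None    # tag name of the current line element
        self._depth = 0     # nesting depth of that tag inside the line
        self._words = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._size is not None:
            if tag == self._tag:
                self._depth += 1  # e.g. a word span nested in a line span
            return
        classes = set((attrs.get("class") or "").split())
        if not classes & LINE_CLASSES:
            return
        props = parse_title(attrs.get("title") or "")
        if "x_size" in props:
            size = float(props["x_size"][0])
        elif len(props.get("bbox", [])) == 4:
            # Fallback: use the line's bounding-box height as its size.
            size = float(props["bbox"][3]) - float(props["bbox"][1])
        else:
            return  # no usable metric on this line
        self._size, self._tag, self._depth, self._words = size, tag, 1, []

    def handle_endtag(self, tag):
        if self._size is None or tag != self._tag:
            return
        self._depth -= 1
        if self._depth == 0:
            self.lines.append((self._size, " ".join(self._words)))
            self._size = None

    def handle_data(self, data):
        if self._size is not None and data.strip():
            self._words.append(re.sub(r"\s+", " ", data.strip()))


def find_large_lines(hocr_text, ratio=1.5):
    """Return lines whose size exceeds `ratio` times the page median."""
    parser = HocrLines()
    parser.feed(hocr_text)
    parser.close()
    if not parser.lines:
        return []
    typical = median(size for size, _ in parser.lines)
    return [(s, t) for s, t in parser.lines if s > ratio * typical]
```

Calling find_large_lines(open("page.hocr", encoding="utf-8").read()) on one 
page would give candidate headline lines. This only tells you which lines are 
large, not where an article starts or ends, so it is a starting point at most.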

Best,

art
---
1. https://github.com/LibraryOfCongress/newspaper-navigator
2. https://ocr-d.github.io/
3. https://drops.dagstuhl.de/entities/document/10.4230/DagRep.12.7.112

From: tesseract-ocr@googlegroups.com <tesseract-ocr@googlegroups.com> On Behalf 
Of shacky
Sent: Saturday, November 25, 2023 10:55 AM
To: tesseract-ocr@googlegroups.com
Subject: [tesseract-ocr] Newspaper segmentation techniques

Hello everyone,
I'm using Tesseract to OCR some newspapers and it works very well.

I am researching how to automatically segment a newspaper page into 
individual articles by post-processing the generated hOCR files, and I found 
some academic papers that discuss neural networks and machine learning 
techniques.

I am writing because I am wondering whether there are any "de facto" 
standard techniques for this, or any ready-to-run programs that do this kind 
of post-processing after Tesseract.

I know this may not be strictly related to Tesseract, but I could not find 
a better place to ask.

Could you help me, please? Do you have any ideas or hints about how or where 
to start?

Thank you very much!
Bye
--
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAPz3gmk6TOOYUeXBMBN33eaC50bOdRY7c98oThXEpkyP8WBtig%40mail.gmail.com.

