There are some interesting tools from Ben Lee's Newspaper Navigator project you might want to look at [1], and the OCR-D project includes both Tesseract and layout detection support [2]. A good and fairly recent overview of what's available for historical newspaper digitization projects is in [3]. If you want to work from hOCR files directly, you might be able to leverage font metrics to identify headlines and advertisements, since their text is typically set larger than body text, but I think most approaches to newspaper segmentation work from the page images rather than from OCR output.
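To make the font-metrics idea a bit more concrete, here is a rough sketch of how one might flag unusually large lines in an hOCR file. It assumes the line elements carry a `bbox` and, where Tesseract provides it, an `x_size` property in their `title` attribute, and it falls back to the bounding-box height otherwise. The names `HocrLines` and `find_large_lines` and the `ratio=1.5` threshold are made up for illustration and would need tuning on real pages; this only finds headline candidates, it does not segment articles.

```python
from html.parser import HTMLParser
from statistics import median

# hOCR classes that Tesseract may use for line-level elements.
LINE_CLASSES = {"ocr_line", "ocr_header", "ocr_caption", "ocr_textfloat"}


def parse_title(title):
    """Split an hOCR title attribute into {property_name: [values]}."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


class HocrLines(HTMLParser):
    """Collect the size, bounding box, and text of each line-level span."""

    def __init__(self):
        super().__init__()
        self.lines = []       # dicts with "size", "bbox", "text"
        self._current = None  # line currently being read, if any
        self._depth = 0       # span nesting depth inside that line

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._current is not None:
            self._depth += 1  # a word span nested inside the line
            return
        attr = dict(attrs)
        if not set((attr.get("class") or "").split()) & LINE_CLASSES:
            return
        props = parse_title(attr.get("title") or "")
        bbox = tuple(int(v) for v in props.get("bbox", [])[:4])
        if len(bbox) != 4:
            return
        # Prefer x_size when present; otherwise use the box height (assumption).
        if "x_size" in props:
            size = float(props["x_size"][0])
        else:
            size = float(bbox[3] - bbox[1])
        self._current = {"size": size, "bbox": bbox, "text": ""}
        self._depth = 1

    def handle_endtag(self, tag):
        if tag != "span" or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self._current["text"] = " ".join(self._current["text"].split())
            self.lines.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data


def find_large_lines(hocr_text, ratio=1.5):
    """Return lines whose size is at least `ratio` times the page median."""
    parser = HocrLines()
    parser.feed(hocr_text)
    parser.close()
    if not parser.lines:
        return []
    typical = median(line["size"] for line in parser.lines)
    return [line for line in parser.lines if line["size"] >= ratio * typical]
```

You would call it with something like `find_large_lines(open("page.hocr", encoding="utf-8").read())` and then look at the `bbox` of each returned line to see where candidate headlines sit on the page.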
Best,
art

---
1. https://github.com/LibraryOfCongress/newspaper-navigator
2. https://ocr-d.github.io/
3. https://drops.dagstuhl.de/entities/document/10.4230/DagRep.12.7.112

From: tesseract-ocr@googlegroups.com <tesseract-ocr@googlegroups.com> On Behalf Of shacky
Sent: Saturday, November 25, 2023 10:55 AM
To: tesseract-ocr@googlegroups.com
Subject: [tesseract-ocr] Newspaper segmentation techniques

Hello everyone,

I'm using Tesseract to OCR some newspapers and it works very well.

I am researching how to automatically segment a newspaper page into single articles by post-processing the generated hOCR files, and I found some academic papers that discuss neural networks and machine learning techniques.

I am writing this message because I am wondering whether there are any "de facto" working techniques for this, or maybe some ready-to-run programs that do this kind of post-processing after Tesseract.

I know that this may not really be related to Tesseract, but I cannot find a better place to ask.

Could you help me please? Do you have any ideas or hints about how or where to start?

Thank you very much!
Bye

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAPz3gmk6TOOYUeXBMBN33eaC50bOdRY7c98oThXEpkyP8WBtig%40mail.gmail.com.