You might consider using the table's ruling lines to identify its columns with OpenCV. There is an example here [1] of removing lines, but the same approach can also be used to find the line coordinates. With the coordinates, you could then extract the columns of numbers and work through them from there. Your sample image is challenging, but my sense is that Tesseract could do a lot if you can segment the table into individual numbers and leverage the confidence values Tesseract reports for each recognized word. You would probably want a lot of very consistent layouts to justify the effort to do that.
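To make the idea concrete, here is a rough sketch of turning ruling lines into cell boxes. Note that it is not the OpenCV morphology approach from [1]: to keep it dependency-light it uses plain NumPy row/column sums (projection profiles) as a stand-in, which only works when the scan is already binarized and reasonably deskewed. The function names, the `min_fraction` threshold, and the `pad` margin are all my own assumptions, not anything from Tesseract or OpenCV.

```python
import numpy as np


def line_positions(binary, axis, min_fraction=0.8):
    """Return centre coordinates of ruling lines (hypothetical helper).

    binary: 2-D bool array, True where there is ink.
    axis=1 sums each row, so it finds horizontal lines (row indices);
    axis=0 sums each column, so it finds vertical lines (column indices).
    """
    profile = binary.sum(axis=axis)
    length = binary.shape[axis]
    # A row/column counts as a line if most of its pixels are ink.
    hits = np.flatnonzero(profile >= min_fraction * length)
    if hits.size == 0:
        return []
    # Thick lines cover several adjacent indices; merge each run into one.
    runs = np.split(hits, np.flatnonzero(np.diff(hits) > 1) + 1)
    return [int(run.mean()) for run in runs]


def cell_boxes(rows, cols, pad=2):
    """Return (top, bottom, left, right) for each cell between lines.

    pad (assumed value) trims a few pixels so the crop excludes the lines.
    """
    boxes = []
    for top, bottom in zip(rows, rows[1:]):
        for left, right in zip(cols, cols[1:]):
            boxes.append((top + pad, bottom - pad, left + pad, right - pad))
    return boxes


# Synthetic 2x2 table: thin outer border, thick middle horizontal line.
binary = np.zeros((61, 91), dtype=bool)
binary[[0, 60], :] = True
binary[29:32, :] = True
binary[:, [0, 45, 90]] = True

rows = line_positions(binary, axis=1)
cols = line_positions(binary, axis=0)
boxes = cell_boxes(rows, cols)
first_cell = binary[boxes[0][0]:boxes[0][1], boxes[0][2]:boxes[0][3]]
```

Each crop could then be passed to Tesseract on its own. If you use pytesseract, `image_to_data` returns a `conf` value per recognized word, which gives you something to filter or flag low-confidence cells on.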
Best,
art

---
1. https://stackoverflow.com/questions/60521925/how-to-detect-the-horizontal-and-vertical-lines-of-a-table-and-eliminate-the-noi

From: [email protected] <[email protected]> On Behalf Of Sean Pham
Sent: Thursday, October 3, 2024 1:17 PM
To: tesseract-ocr <[email protected]>
Subject: [tesseract-ocr] Table Extraction using Tesseract

Hello,

I am trying to use Tesseract OCR to read archaic climate data in the form of a table. I am new to this technology and would appreciate any guidance.

What I have tried so far: I am using packages like img2table to extract a table structure and then using Tesseract to identify text. There are various issues with both processes and the result is not very accurate.

Questions
1) Is this solution using Tesseract feasible?
2) Is there any process / technologies that may be beneficial for this use case?

Any insight / advice would be greatly appreciated!

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/09b8e0d0-cea5-4c86-95cf-f6b8e95fd0d4n%40googlegroups.com.

